linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v16 00/13] support "task_isolation" mode
@ 2017-11-03 17:04 Chris Metcalf
  2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

Here, finally, is a new spin of the task isolation work (v16), with
changes based on the issues that were raised at last year's Linux
Plumbers Conference and in the email discussion that followed.

This version of the patch series cleans up a number of areas that were
a little dodgy in the previous patch series.

- It no longer loops in the final code that prepares to return to
  userspace; instead, it sets things up in the prctl() and then
  validates when preparing to return to userspace, adjusting the
  syscall return value to -EAGAIN at that point if something doesn't
  line up quite correctly.

- We no longer support the NOSIG mode that let you freely call into
  the kernel multiple times while in task isolation.  This was always
  a little odd, since you really should be in sufficient control of
  task isolation code that you can explicitly stop isolation with a
  "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
  else.  It also made the semantics of migrating the task confusing.
  More importantly, removing that support means that the only path
  that sets up task isolation is the return from prctl(), which allows
  us to make the simplification above.

- We no longer try to signal the task isolation process from a remote
  core when we detect that we are about to violate its isolation.
  Instead, we just print a message there (and optionally dump stack),
  and rely on the eventual interrupt on the core itself to trigger the
  signal.  We are always in a safe context to generate a signal when
  we enter the kernel, unlike when we are deep in a call stack sending
  an SMP IPI or whatever.

- We notice the case of an unstable scheduler clock and return
  EINVAL rather than spinning forever with EAGAIN (suggestion from
  Francis Giraldeau).

- The prctl() call requires zeros for arg3/4/5 (suggestion from
  Eugene Syromiatnikov).

The kernel internal isolation API is also now cleaner, and I have
included kerneldoc APIs for all the interfaces so that it should be
easier to port it to additional architectures; in fact looking at
include/linux/isolation.h is a good place to start understanding the
overall patch set.

I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
for x86, since this version is sufficiently different to merit
re-review and re-testing.

Note that this is not rebased on top of Frederic's recent housekeeping
patch series, although it is largely orthogonal right now.  After
Frederic's patch series lands, task isolation is enabled with
"isolcpus=nohz,domain,CPUS".  We could add some shorthand for that
("isolcpus=full,CPUS"?) or just use it as-is.

The previous (v15) patch series is here:

https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetcalf@mellanox.com

This patch series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Some folks raised some good points at the LPC discussion and then in
email discussions that followed.  Rather than trying to respond to
everyone in a flurry of emails, I'll answer some questions here:


Why not just instrument user_exit() to raise the isolation-lost signal?

Andy pointed me in this direction.  The advantage is that you catch
*everything*, by definition.  There is a hook that can do this in the
current patch set, but you have to #define DEBUG_TASK_ISOLATION
manually to take advantage of it, because as written it has two issues:

1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0)
   because the user_exit hook kills you first.
2. You lose the ability to get much better diagnostics by waiting
   until you are further into kernel entry and know what you're doing.

You could work around #2 in several ways, but #1 is harder.  I looked
at x86 for a while, and although you could imagine this, you really
want to generate a lost-isolation signal on any syscall that isn't
that exact prctl(), and it's awkward to try to do all of that checking
before user_exit().  Since in any case we do want to have the more
specific probes at the various kernel entry points where we generate
the diagnostics, I felt like it wasn't the right approach to enable
as a compilation-time default.  I'm open to discussion on this though!


Can't we do all the exit-to-userspace work with irqs disabled?

In fact, it turns out that you can do lru_add_drain() with irqs
disabled, so that's what we're doing in the patch series now.

However, it doesn't seem possible to do the synchronous cancellation of
the vmstat deferred work with irqs disabled, though if there's a way,
it would be a little cleaner to do that; Christoph?  We can certainly
update the statistics with interrupts disabled via
refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I
just issue the cancellation during sys_prctl() call, and then if it
isn't synchronized by the time we are exiting to userspace, we just
jam in an EAGAIN and let userspace retry.  In practice, this doesn't
seem to ever happen.


What about using a per-cpu flag to stop doing new deferred work?

Andy also suggested we could structure the code to have the prctl()
set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
data, or lru page cache).  Then, we could flush those structures right
from the sys_prctl() call, and when we were returning to user space,
we'd be confident that there wasn't going to be any new work added.

With the current set of things that we are disabling for task
isolation, though, it didn't seem necessary.  Quiescing the vmstat
shepherd seems like it is generally pretty safe since we will likely
be able to sync up the per-cpu cache and kill the deferred work with
high probability, with no expectation that additional work will show
up.  And since we can flush the LRU page cache with interrupts
disabled, that turns out not to be an issue either.

I could imagine that if we have to deal with some new kind of deferred
work, we might find the per-cpu flag becomes a good solution, but for
now we don't have a good use case for that approach.


How about stopping the dyn tick?

Right now we try to stop it on return to userspace, but if we can't,
we just return EAGAIN to userspace.  In practice, what I see is that
usually the tick stops immediately, but occasionally it doesn't; in
this case I've always seen that nr_running is >1, presumably with some
temporary kernel worker threads, and the user code just needs to call
prctl() until those threads are done.  We could structure things with
a completion that we wait for, which is set by the timer code when it
finally does stop the tick, but this may be overkill, particularly
since we'll only be running this prctl() loop from userspace on cores
where we have no other useful work that we're trying to run anyway.


What about TLB flushing?

We talked about this at Plumbers and some of the email discussion also
was about TLB flushing.  I haven't tried to add it to this patch set,
because I really want to avoid scope creep; in any case, I think I
managed to convince Andy that he was going to work on it himself. :)
Paul McKenney already contributed some framework for such a patch, in
commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
->dynticks counter").

What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack).  Frederic suggested a while ago
forcing updates on cpustats was required as the last gating factor; do
we think that is still true?  Christoph was working on this at one
point - any progress from your point of view?


Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  Revert "sched/core: Drop the unused try_get_task_struct() helper
    function"
  Add try_get_task_struct_on_cpu() to scheduler for task isolation
  Add try_stop_full_tick() API for NO_HZ_FULL
  task_isolation: userspace hard isolation from kernel
  Add task isolation hooks to arch-independent code
  arch/x86: enable task isolation functionality
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation self test

Francis Giraldeau (1):
  arch/arm: enable task isolation functionality

 Documentation/admin-guide/kernel-parameters.txt    |   6 +
 arch/arm/Kconfig                                   |   1 +
 arch/arm/include/asm/thread_info.h                 |  10 +-
 arch/arm/kernel/entry-common.S                     |  12 +-
 arch/arm/kernel/ptrace.c                           |  10 +
 arch/arm/kernel/signal.c                           |  10 +-
 arch/arm/kernel/smp.c                              |   4 +
 arch/arm/mm/fault.c                                |   8 +-
 arch/arm64/Kconfig                                 |   1 +
 arch/arm64/include/asm/thread_info.h               |   5 +-
 arch/arm64/kernel/ptrace.c                         |  18 +-
 arch/arm64/kernel/signal.c                         |  10 +-
 arch/arm64/kernel/smp.c                            |   7 +
 arch/arm64/mm/fault.c                              |   5 +
 arch/tile/Kconfig                                  |   1 +
 arch/tile/include/asm/thread_info.h                |   2 +
 arch/tile/kernel/hardwall.c                        |   2 +
 arch/tile/kernel/irq.c                             |   3 +
 arch/tile/kernel/messaging.c                       |   4 +
 arch/tile/kernel/process.c                         |   4 +
 arch/tile/kernel/ptrace.c                          |  10 +
 arch/tile/kernel/single_step.c                     |   7 +
 arch/tile/kernel/smp.c                             |  21 +-
 arch/tile/kernel/time.c                            |   3 +
 arch/tile/kernel/unaligned.c                       |   4 +
 arch/tile/mm/fault.c                               |  13 +-
 arch/tile/mm/homecache.c                           |  11 +
 arch/x86/Kconfig                                   |   1 +
 arch/x86/entry/common.c                            |  14 +
 arch/x86/include/asm/apic.h                        |   3 +
 arch/x86/include/asm/thread_info.h                 |   8 +-
 arch/x86/kernel/smp.c                              |   2 +
 arch/x86/kernel/traps.c                            |   3 +
 arch/x86/mm/fault.c                                |   5 +
 drivers/clocksource/arm_arch_timer.c               |   2 +
 include/linux/isolation.h                          | 175 ++++++
 include/linux/sched.h                              |   4 +
 include/linux/sched/task.h                         |   3 +
 include/linux/tick.h                               |   1 +
 include/linux/vmstat.h                             |   4 +
 include/uapi/linux/prctl.h                         |   6 +
 init/Kconfig                                       |  28 +
 kernel/Makefile                                    |   1 +
 kernel/context_tracking.c                          |   2 +
 kernel/exit.c                                      |  13 +
 kernel/irq/irqdesc.c                               |   5 +
 kernel/irq_work.c                                  |   5 +-
 kernel/isolation.c                                 | 401 +++++++++++++
 kernel/sched/core.c                                |  11 +
 kernel/signal.c                                    |   2 +
 kernel/smp.c                                       |   6 +-
 kernel/sys.c                                       |   6 +
 kernel/time/tick-sched.c                           |  18 +
 mm/vmstat.c                                        |  19 +
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/task_isolation/Makefile    |   6 +
 tools/testing/selftests/task_isolation/config      |   1 +
 tools/testing/selftests/task_isolation/isolation.c | 643 +++++++++++++++++++++
 58 files changed, 1561 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c
 create mode 100644 tools/testing/selftests/task_isolation/Makefile
 create mode 100644 tools/testing/selftests/task_isolation/config
 create mode 100644 tools/testing/selftests/task_isolation/isolation.c

-- 
2.1.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
@ 2017-11-03 17:04 ` Chris Metcalf
  2018-03-18 14:22   ` Yury Norov
       [not found] ` <1509728692-10460-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Chris Metcalf @ 2017-11-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	Jonathan Corbet, linux-mm, linux-doc, linux-api, linux-kernel
  Cc: Chris Metcalf

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
nohz_full and isolcpus.  The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags.  When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced.  If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired.  In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting.  Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, tile,
arm, and arm64.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   6 +
 include/linux/isolation.h                       | 175 +++++++++++
 include/linux/sched.h                           |   4 +
 include/uapi/linux/prctl.h                      |   6 +
 init/Kconfig                                    |  28 ++
 kernel/Makefile                                 |   1 +
 kernel/context_tracking.c                       |   2 +
 kernel/isolation.c                              | 402 ++++++++++++++++++++++++
 kernel/signal.c                                 |   2 +
 kernel/sys.c                                    |   6 +
 10 files changed, 631 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..aaf278f2cfc3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4025,6 +4025,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..8189a772affd
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,175 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_syscall()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__); \
+	} while (0)
+
+/**
+ * _task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @do_interrupt: Whether we should generate an extra interrupt.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu.  The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened then just reporting only when the remote core is interrupted.
+ *
+ * The @do_interrupt flag, if true, causes the routine to not just print
+ * the diagnostic, but also to generate a reschedule interrupt to the
+ * remote core that is being interrupted.  This is necessary if the remote
+ * interrupt being diagnosed will not otherwise be visible to the remote
+ * core (e.g. a hypervisor service is being invoked on the remote core).
+ * Sending a reschedule will force the core to trigger the isolation signal
+ * and exit isolation mode.
+ *
+ * The task_isolation_remote() macro passes @do_interrupt as false, and the
+ * task_isolation_remote_interrupt() passes the flag as true.
+ */
+extern void _task_isolation_remote(int cpu, bool do_interrupt,
+				   const char *fmt, ...);
+#define task_isolation_remote(cpu, fmt, ...) \
+	_task_isolation_remote(cpu, false, fmt, ## __VA_ARGS__)
+#define task_isolation_remote_interrupt(cpu, fmt, ...) \
+	_task_isolation_remote(cpu, true, fmt, ## __VA_ARGS__)
+
+/**
+ * _task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remotes cpus that are about to be interrupted.
+ * @do_interrupt: Whether we should generate an extra interrupt.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of _task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void _task_isolation_remote_cpumask(const struct cpumask *mask,
+					   bool do_interrupt,
+					   const char *fmt, ...);
+#define task_isolation_remote_cpumask(cpumask, fmt, ...) \
+	_task_isolation_remote_cpumask(cpumask, false, fmt, ## __VA_ARGS__)
+#define task_isolation_remote_cpumask_interrupt(cpumask, fmt, ...) \
+	_task_isolation_remote_cpumask(cpumask, true, fmt, ## __VA_ARGS__)
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal.  The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal.  The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task) do {				\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation.  In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process.  But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+	task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_start(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_interrupt(int cpu,
+						   const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+						 const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask_interrupt(
+	const struct cpumask *mask, const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a7df4e558c..739d81a44e13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1096,6 +1096,10 @@ struct task_struct {
 	/* Used by LSM modules for access restriction: */
 	void				*security;
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags;  /* prctl */
+	unsigned short			task_isolation_state;
+#endif
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..5a954302fdea 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,10 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 78cb2461012e..500c2aeb49b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -474,6 +474,34 @@ endmenu # "CPU/Task time and stats accounting"
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+
+	 Allow userspace processes that place themselves on cores with
+	 nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	 to "isolate" themselves from the kernel.  Prior to returning to
+	 userspace, isolated tasks will arrange that no future kernel
+	 activity will interrupt the task while the task is running in
+	 userspace.  Attempting to re-enter the kernel while in this mode
+	 will cause the task to be terminated with a signal; you must
+	 explicitly use prctl() to disable task isolation before resuming
+	 normal use of the kernel.
+
+	 This "hard" isolation from the kernel is required for userspace
+	 tasks that are running hard real-time tasks in userspace, such as
+	 a high-speed network driver in userspace.  Without this option, but
+	 with NO_HZ_FULL enabled, the kernel will make a best-faith, "soft"
+	 effort to shield a single userspace process from interrupts, but
+	 makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index ed470aac53da..21c908c3083c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -111,6 +111,7 @@ obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 9ad37b9e44a7..df0c7a07c11f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -156,6 +157,7 @@ void __context_tracking_exit(enum ctx_state state)
 			if (state == CONTEXT_USER) {
 				vtime_user_exit(current);
 				trace_user_exit(0);
+				task_isolation_user_exit();
 			}
 		}
 		__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..e2baa21af03a
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,402 @@
+/*
+ *  linux/kernel/isolation.c
+ *
+ *  Implementation for task isolation.
+ *
+ *  Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1,	/* In userspace, isolated */
+	STATE_WARNED = 2	/* Like ISOLATED but console warning issued */
+};
+
+cpumask_var_t task_isolation_map;
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		cpumask_and(task_isolation_map, cpu_isolated_map,
+			    tick_nohz_full_mask);
+
+	return 0;
+}
+core_initcall(task_isolation_init)
+
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/* Disable task isolation in the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	p->task_isolation_flags = 0;
+	p->task_isolation_state = STATE_NORMAL;
+	clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc.  More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+	int error;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation disabled),
+	 * on our way out of the kernel from the prctl() that turned it on.
+	 * If we are exiting from the kernel in another state, it means we
+	 * made it back into the kernel without disabling task isolation,
+	 * and we should investigate how (and in any case disable task
+	 * isolation at this point).  We are clearly not on the path back
+	 * from the prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		stop_isolation(current);
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (cpumask_weight(&current->cpus_allowed) != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	current->task_isolation_state = STATE_ISOLATED;
+	return;
+
+error:
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg;
+
+		syscall_get_arguments(current, current_pt_regs(), 0, 1, &arg);
+		if (arg == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_warn("%s/%d (cpu %d): task_isolation lost due to syscall %d\n",
+		task->comm, task->pid, smp_processor_id(), syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ * We don't warn if we are in STATE_WARNED in case a remote cpu already
+ * reported that it was going to interrupt us, so we don't generate
+ * a lot of confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+	struct task_struct *task = current;
+	va_list args;
+	char buf[100];
+	bool do_warn;
+
+	/* RCU should have been enabled prior to this point. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+	/*
+	 * Avoid reporting interrupts that happen after we have prctl'ed
+	 * to enable isolation, but before we have returned to userspace.
+	 */
+	if (task->task_isolation_state == STATE_NORMAL)
+		return;
+
+	do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	/* Handle NMIs minimally, since we can't send a signal. */
+	if (in_nmi()) {
+		pr_err("%s/%d (cpu %d): in NMI; not delivering signal\n",
+			task->comm, task->pid, smp_processor_id());
+	} else {
+		send_isolation_signal(task);
+	}
+
+	if (do_warn) {
+		pr_warn("%s/%d (cpu %d): task_isolation lost due to %s\n",
+			task->comm, task->pid, smp_processor_id(), buf);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	stop_isolation(task);
+	if (do_warn) {
+		pr_warn("%s/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->pid, task_cpu(task));
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Return a task_struct pointer (with ref count bumped up) for the
+ * specified cpu if the task running on that cpu at this moment is in
+ * isolation mode and hasn't yet been warned, otherwise NULL.
+ * In addition, toggle the task state to WARNED in anticipation of
+ * doing a printk, and send a reschedule IPI if needed.
+ */
+static struct task_struct *isolation_task(int cpu, int do_interrupt)
+{
+	struct task_struct *p = try_get_task_struct_on_cpu(cpu);
+
+	if (p == NULL)
+		return NULL;
+
+	if (p->task_isolation_state != STATE_ISOLATED)
+		goto bad_task;
+
+	/*
+	 * If we are claiming to be delivering a remote interrupt to our
+	 * own task, this has to be a bug, since here we are already in the
+	 * kernel, and somehow we didn't reset to STATE_NORMAL.
+	 */
+	if (WARN_ON_ONCE(p == current)) {
+		stop_isolation(p);
+		goto bad_task;
+	}
+
+	p->task_isolation_state = STATE_WARNED;
+	if (do_interrupt)
+		smp_send_reschedule(cpu);
+
+	return p;
+
+bad_task:
+	put_task_struct(p);
+	return NULL;
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void _task_isolation_remote(int cpu, bool do_interrupt, const char *fmt, ...)
+{
+	struct task_struct *p;
+	va_list args;
+	char buf[200];
+
+	if (!is_isolation_cpu(cpu))
+		return;
+
+	p = isolation_task(cpu, do_interrupt);
+	if (p == NULL)
+		return;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	pr_warn("%s/%d (cpu %d): task_isolation lost due to %s by %s/%d on cpu %d\n",
+		p->comm, p->pid, cpu, buf,
+		current->comm, current->pid, smp_processor_id());
+	put_task_struct(p);
+	debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void _task_isolation_remote_cpumask(const struct cpumask *mask,
+				    bool do_interrupt, const char *fmt, ...)
+{
+	struct task_struct *p = NULL;
+	cpumask_var_t warn_mask;
+	va_list args;
+	char buf[200];
+	int cpu;
+
+	if (task_isolation_map == NULL ||
+	    !zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+		return;
+
+	for_each_cpu_and(cpu, mask, task_isolation_map) {
+		if (p)
+			put_task_struct(p);
+		p = isolation_task(cpu, do_interrupt);
+		if (p)
+			cpumask_set_cpu(cpu, warn_mask);
+	}
+	if (p == NULL)
+		goto done;
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	pr_warn("%s/%d %s %*pbl): task_isolation lost due to %s by %s/%d on cpu %d\n",
+		p->comm, p->pid,
+		cpumask_weight(warn_mask) == 1 ? "(cpu" : "etc (cpus",
+		cpumask_pr_args(warn_mask), buf,
+		current->comm, current->pid, smp_processor_id());
+	put_task_struct(p);
+	debug_dump_stack();
+
+done:
+	free_cpumask_var(warn_mask);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 800a18f77732..fa4786050431 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -40,6 +40,7 @@
 #include <linux/cn_proc.h>
 #include <linux/compiler.h>
 #include <linux/posix-timers.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -658,6 +659,7 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index 9aebc2935013..91ed79605768 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2385,6 +2386,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
       [not found] ` <1509728692-10460-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-11-06 15:38   ` Christopher Lameter
       [not found]     ` <bd5526d4-7b6e-3b3d-79d0-5d6567c0bdf8@mellanox.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Lameter @ 2017-11-06 15:38 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau,
	linux-mm-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, 3 Nov 2017, Chris Metcalf wrote:

> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph?  We can certainly
> update the statistics with interrupts disabled via
> refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I

Callig quiet_vmstat() gives you the folding of the statistics without
the numa page allocator page cleanup. Leaving some pages in the queues
should not be that much of an issue.

> What about that d*mn 1 Hz clock?
>
> It's still there, so this code still requires some further work before
> it can actually get a process into long-term task isolation (without
> the obvious one-line kernel hack).  Frederic suggested a while ago
> forcing updates on cpustats was required as the last gating factor; do
> we think that is still true?  Christoph was working on this at one
> point - any progress from your point of view?

Well if you still have the 1 HZ clock then you can simply defer the numa
remote page cleanup of the page allocator to that the time you execute
that tick.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
       [not found]     ` <bd5526d4-7b6e-3b3d-79d0-5d6567c0bdf8@mellanox.com>
@ 2017-11-07 17:10       ` Christopher Lameter
  2017-11-07 17:27         ` Chris Metcalf
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Lameter @ 2017-11-07 17:10 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-mm,
	linux-doc, linux-api, linux-kernel

On Mon, 6 Nov 2017, Chris Metcalf wrote:

> On 11/6/2017 10:38 AM, Christopher Lameter wrote:
> > > What about that d*mn 1 Hz clock?
> > >
> > > It's still there, so this code still requires some further work before
> > > it can actually get a process into long-term task isolation (without
> > > the obvious one-line kernel hack).  Frederic suggested a while ago
> > > forcing updates on cpustats was required as the last gating factor; do
> > > we think that is still true?  Christoph was working on this at one
> > > point - any progress from your point of view?
> > Well if you still have the 1 HZ clock then you can simply defer the numa
> > remote page cleanup of the page allocator to that the time you execute
> > that tick.
>
> We have to get rid of the 1 Hz tick, so we don't want to tie anything
> else to it...

Yes we want to get rid of the 1 HZ tick but the work on that could also
include dealing with the remove page cleanup issue that we have deferred.

Presumably we have another context there were we may be able to call into
the cleanup code with interrupts enabled.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-07 17:10       ` Christopher Lameter
@ 2017-11-07 17:27         ` Chris Metcalf
  2017-11-07 17:54           ` Christopher Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Metcalf @ 2017-11-07 17:27 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau,
	linux-mm-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 11/7/2017 12:10 PM, Christopher Lameter wrote:
> On Mon, 6 Nov 2017, Chris Metcalf wrote:
>
>> On 11/6/2017 10:38 AM, Christopher Lameter wrote:
>>>> What about that d*mn 1 Hz clock?
>>>>
>>>> It's still there, so this code still requires some further work before
>>>> it can actually get a process into long-term task isolation (without
>>>> the obvious one-line kernel hack).  Frederic suggested a while ago
>>>> forcing updates on cpustats was required as the last gating factor; do
>>>> we think that is still true?  Christoph was working on this at one
>>>> point - any progress from your point of view?
>>> Well if you still have the 1 HZ clock then you can simply defer the numa
>>> remote page cleanup of the page allocator to that the time you execute
>>> that tick.
>> We have to get rid of the 1 Hz tick, so we don't want to tie anything
>> else to it...
> Yes we want to get rid of the 1 HZ tick but the work on that could also
> include dealing with the remove page cleanup issue that we have deferred.
>
> Presumably we have another context there were we may be able to call into
> the cleanup code with interrupts enabled.

Right now for task isolation we run with interrupts enabled during the
initial sys_prctl() call, and call quiet_vmstat_sync() there, which 
currently
calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
there instead (and probably should) since we can handle dealing with
the pagesets at this time.  As we return to userspace we will test that
nothing surprising happened with vmstat; if so we jam an EAGAIN into
the syscall result value, but if not, we will be in userspace and won't need
to touch the vmstat counters until we next go back into the kernel.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-07 17:27         ` Chris Metcalf
@ 2017-11-07 17:54           ` Christopher Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christopher Lameter @ 2017-11-07 17:54 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Viresh Kumar, Catalin Marinas, Will Deacon,
	Andy Lutomirski, Daniel Lezcano, Francis Giraldeau, linux-mm,
	linux-doc, linux-api, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 984 bytes --]

On Tue, 7 Nov 2017, Chris Metcalf wrote:

> > Presumably we have another context there were we may be able to call into
> > the cleanup code with interrupts enabled.
>
> Right now for task isolation we run with interrupts enabled during the
> initial sys_prctl() call, and call quiet_vmstat_sync() there, which currently
> calls refresh_cpu_vm_stats(false).  In fact we could certainly pass "true"
> there instead (and probably should) since we can handle dealing with
> the pagesets at this time.  As we return to userspace we will test that
> nothing surprising happened with vmstat; if so we jam an EAGAIN into
> the syscall result value, but if not, we will be in userspace and won't need
> to touch the vmstat counters until we next go back into the kernel.

If you do it too early and there is another page allocator action then it
may potentially repopulate the caches. Since this is only draining the
caches for remote node allocation it may be rare and not that important.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
  2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
       [not found] ` <1509728692-10460-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-03-07 10:07 ` Yury Norov
  2018-07-12 12:29 ` Yury Norov
  3 siblings, 0 replies; 9+ messages in thread
From: Yury Norov @ 2018-03-07 10:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, linux-mm, linux-doc, linux-api, linux-kernel,
	Prasun Kapoor <Prasun.Kapo>

Hi Chris,

(CC Cavium people)

Thanks for your series.

On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.
> 
> This version of the patch series cleans up a number of areas that were
> a little dodgy in the previous patch series.
> 
> - It no longer loops in the final code that prepares to return to
>   userspace; instead, it sets things up in the prctl() and then
>   validates when preparing to return to userspace, adjusting the
>   syscall return value to -EAGAIN at that point if something doesn't
>   line up quite correctly.
> 
> - We no longer support the NOSIG mode that let you freely call into
>   the kernel multiple times while in task isolation.  This was always
>   a little odd, since you really should be in sufficient control of
>   task isolation code that you can explicitly stop isolation with a
>   "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
>   else.  It also made the semantics of migrating the task confusing.
>   More importantly, removing that support means that the only path
>   that sets up task isolation is the return from prctl(), which allows
>   us to make the simplification above.
> 
> - We no longer try to signal the task isolation process from a remote
>   core when we detect that we are about to violate its isolation.
>   Instead, we just print a message there (and optionally dump stack),
>   and rely on the eventual interrupt on the core itself to trigger the
>   signal.  We are always in a safe context to generate a signal when
>   we enter the kernel, unlike when we are deep in a call stack sending
>   an SMP IPI or whatever.
> 
> - We notice the case of an unstable scheduler clock and return
>   EINVAL rather than spinning forever with EAGAIN (suggestion from
>   Francis Giraldeau).
> 
> - The prctl() call requires zeros for arg3/4/5 (suggestion from
>   Eugene Syromiatnikov).
> 
> The kernel internal isolation API is also now cleaner, and I have
> included kerneldoc APIs for all the interfaces so that it should be
> easier to port it to additional architectures; in fact looking at
> include/linux/isolation.h is a good place to start understanding the
> overall patch set.
> 
> I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
> for x86, since this version is sufficiently different to merit
> re-review and re-testing.
> 
> Note that this is not rebased on top of Frederic's recent housekeeping
> patch series, although it is largely orthogonal right now.  After
> Frederic's patch series lands, task isolation is enabled with
> "isolcpus=nohz,domain,CPUS".  We could add some shorthand for that
> ("isolcpus=full,CPUS"?) or just use it as-is.
> 
> The previous (v15) patch series is here:
> 
> https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetcalf@mellanox.com
> 
> This patch series is available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
> 
> Some folks raised some good points at the LPC discussion and then in
> email discussions that followed.  Rather than trying to respond to
> everyone in a flurry of emails, I'll answer some questions here:
> 
> 
> Why not just instrument user_exit() to raise the isolation-lost signal?
> 
> Andy pointed me in this direction.  The advantage is that you catch
> *everything*, by definition.  There is a hook that can do this in the
> current patch set, but you have to #define DEBUG_TASK_ISOLATION
> manually to take advantage of it, because as written it has two issues:
> 
> 1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0)
>    because the user_exit hook kills you first.
> 2. You lose the ability to get much better diagnostics by waiting
>    until you are further into kernel entry and know what you're doing.
> 
> You could work around #2 in several ways, but #1 is harder.  I looked
> at x86 for a while, and although you could imagine this, you really
> want to generate a lost-isolation signal on any syscall that isn't
> that exact prctl(), and it's awkward to try to do all of that checking
> before user_exit().  Since in any case we do want to have the more
> specific probes at the various kernel entry points where we generate
> the diagnostics, I felt like it wasn't the right approach to enable
> as a compilation-time default.  I'm open to discussion on this though!
> 
> 
> Can't we do all the exit-to-userspace work with irqs disabled?
> 
> In fact, it turns out that you can do lru_add_drain() with irqs
> disabled, so that's what we're doing in the patch series now.
> 
> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph?  We can certainly
> update the statistics with interrupts disabled via
> refresh_cpu_vm_stats(false), but that's not sufficient.  For now, I
> just issue the cancellation during sys_prctl() call, and then if it
> isn't synchronized by the time we are exiting to userspace, we just
> jam in an EAGAIN and let userspace retry.  In practice, this doesn't
> seem to ever happen.
> 
> 
> What about using a per-cpu flag to stop doing new deferred work?
> 
> Andy also suggested we could structure the code to have the prctl()
> set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
> data, or lru page cache).  Then, we could flush those structures right
> from the sys_prctl() call, and when we were returning to user space,
> we'd be confident that there wasn't going to be any new work added.
> 
> With the current set of things that we are disabling for task
> isolation, though, it didn't seem necessary.  Quiescing the vmstat
> shepherd seems like it is generally pretty safe since we will likely
> be able to sync up the per-cpu cache and kill the deferred work with
> high probability, with no expectation that additional work will show
> up.  And since we can flush the LRU page cache with interrupts
> disabled, that turns out not to be an issue either.
> 
> I could imagine that if we have to deal with some new kind of deferred
> work, we might find the per-cpu flag becomes a good solution, but for
> now we don't have a good use case for that approach.
> 
> 
> How about stopping the dyn tick?
> 
> Right now we try to stop it on return to userspace, but if we can't,
> we just return EAGAIN to userspace.  In practice, what I see is that
> usually the tick stops immediately, but occasionally it doesn't; in
> this case I've always seen that nr_running is >1, presumably with some
> temporary kernel worker threads, and the user code just needs to call
> prctl() until those threads are done.  We could structure things with
> a completion that we wait for, which is set by the timer code when it
> finally does stop the tick, but this may be overkill, particularly
> since we'll only be running this prctl() loop from userspace on cores
> where we have no other useful work that we're trying to run anyway.
> 
> 
> What about TLB flushing?
> 
> We talked about this at Plumbers and some of the email discussion also
> was about TLB flushing.  I haven't tried to add it to this patch set,
> because I really want to avoid scope creep; in any case, I think I
> managed to convince Andy that he was going to work on it himself. :)
> Paul McKenney already contributed some framework for such a patch, in
> commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
> ->dynticks counter").
> 
> What about that d*mn 1 Hz clock?
> 
> It's still there, so this code still requires some further work before
> it can actually get a process into long-term task isolation (without
> the obvious one-line kernel hack).  Frederic suggested a while ago
> forcing updates on cpustats was required as the last gating factor; do
> we think that is still true?  Christoph was working on this at one
> point - any progress from your point of view?

I tested your series on ThunderX 2 machine. When I run 10 giga-ticks test,
it always passed. If I run for more, the test exits like this:

# time ./isolation 1000
/sys devices: OK (using task isolation cpu 100)
prctl unaffinitized: OK
prctl on cpu 0: OK
==> hello, world
test_off: OK
Received signal 11 successfully
test_segv: OK
test_fault: OK
test_fault (SIGUSR1): OK
test_syscall: OK
test_syscall (SIGUSR1): OK
test_schedule: OK
test_schedule (SIGUSR1): OK
testing task isolation jitter for 1000000000000 ticks
ERROR: Program unexpectedly entered kernel.
INFO: loop times:
  1 cycles (count: 128606844716)
  2 cycles (count: 31558352292)
  3 cycles (count: 4)
  29 cycles (count: 437)
  30 cycles (count: 1874)
  31 cycles (count: 221)
  57 cycles (count: 4)
  58 cycles (count: 6)
  59 cycles (count: 1)

real  15m58.643s
user  15m58.626s
sys   0m0.012s

I pass task_isolation_debug to boot parameters:
# cat /proc/cmdline 
BOOT_IMAGE=/boot/Image-isol nohz_full=100-110 isolcpus=100-110 task_isolation_debug root=UUID=75b9dd5b-58d8-4a50-8868-004cb7c1f25f ro text

But dmesg looks empty...

I investigate it, but at now I have no ideas what happens. Have you seen
that before?

Anyway, we are going to include your test in our scenario, with some
modifications. I've added --dry-run option to make it easier to
demonstrate the effect of isolation on jitter. If you like it, feel
free to use this change.

Tested-by: Yury Norov <ynorov@caviumnetworks.com>

>From 5c5823c1fc8441ab1486287679de77b8cea5154d Mon Sep 17 00:00:00 2001
From: Yury Norov <ynorov@caviumnetworks.com>
Date: Wed, 7 Mar 2018 02:41:08 +0300
Subject: [PATCH] isolation test: --dry-run mode

Add dry-run mode for better understanding the effect of isolation.
Also, make sanity checks on waitticks.

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 tools/testing/selftests/task_isolation/isolation.c | 47 +++++++++++++++++-----
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
index 9c0b49619b40..e90a6c85fe2a 100644
--- a/tools/testing/selftests/task_isolation/isolation.c
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -53,6 +53,7 @@
 #include <unistd.h>
 #include <stdlib.h>
 #include <fcntl.h>
+#include <limits.h>
 #include <assert.h>
 #include <string.h>
 #include <errno.h>
@@ -500,7 +501,7 @@ static void jitter_handler(int sig)
 	exit(exit_status);
 }
 
-static void test_jitter(unsigned long waitticks)
+static void test_jitter(unsigned long waitticks, int flags)
 {
 	u_int64_t start, last, elapsed;
 	int rc;
@@ -513,7 +514,8 @@ static void test_jitter(unsigned long waitticks)
 	rc = mlockall(MCL_CURRENT);
 	assert(rc == 0);
 
-	set_task_isolation(PR_TASK_ISOLATION_ENABLE |
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		set_task_isolation(PR_TASK_ISOLATION_ENABLE |
 			   PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
 
 	last = start = get_cycle_count();
@@ -537,26 +539,49 @@ static void test_jitter(unsigned long waitticks)
 	} while (elapsed < waitticks);
 
 	jitter_test_complete = true;
-	prctl_isolation(0);
+
+	if (flags & PR_TASK_ISOLATION_ENABLE)
+		prctl_isolation(0);
+
 	jitter_summarize();
 }
 
 int main(int argc, char **argv)
 {
 	/* How many billion ticks to wait after running the other tests? */
-	unsigned long waitticks;
+	long waitticks = 10;
+	int flags = PR_TASK_ISOLATION_ENABLE;
 	char buf[100];
 	char *result, *end;
 	FILE *f;
 
-	if (argc == 1)
-		waitticks = 10;
-	else if (argc == 2)
-		waitticks = strtol(argv[1], NULL, 10);
-	else {
-		printf("syntax: isolation [gigaticks]\n");
+	errno = 0;
+
+	switch (argc) {
+	case 1:
+		break;
+	case 2:
+		if (strcmp(argv[1], "--dry-run") == 0)
+			flags = 0;
+		else
+			waitticks = strtol(argv[1], NULL, 10);
+		break;
+	case 3:
+		if (strcmp(argv[1], "--dry-run") == 0)
+			flags = 0;
+
+		waitticks = strtol(argv[2], NULL, 10);
+		break;
+	default:
+		printf("syntax: isolation [--dry-run] [gigaticks]\n");
 		ksft_exit_fail();
 	}
+
+	if (errno || waitticks <= 0 || waitticks > (LONG_MAX / 1000000000)) {
+		printf("Gigaticks: bad number.\n");
+		ksft_exit_fail();
+	}
+
 	waitticks *= 1000000000;
 
 	/* Get a core from the /sys nohz_full device. */
@@ -637,7 +662,7 @@ int main(int argc, char **argv)
 		return exit_status;
 	}
 
-	test_jitter(waitticks);
+	test_jitter(waitticks, flags);
 
 	return exit_status;
 }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel
  2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
@ 2018-03-18 14:22   ` Yury Norov
  0 siblings, 0 replies; 9+ messages in thread
From: Yury Norov @ 2018-03-18 14:22 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
	Jonathan Corbet, linux-mm, linux-doc, linux-api, linux-kernel

Hi Chris,

On Fri, Nov 03, 2017 at 01:04:45PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
> 
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
> 
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
> 
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
> nohz_full and isolcpus.  The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags.  When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
> 
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced.  If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired.  In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
> 
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
> 
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
> 
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting.  Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.
> 
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.
> 
> Separate patches that follow provide these changes for x86, tile,
> arm, and arm64.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   6 +
>  include/linux/isolation.h                       | 175 +++++++++++
>  include/linux/sched.h                           |   4 +
>  include/uapi/linux/prctl.h                      |   6 +
>  init/Kconfig                                    |  28 ++
>  kernel/Makefile                                 |   1 +
>  kernel/context_tracking.c                       |   2 +
>  kernel/isolation.c                              | 402 ++++++++++++++++++++++++
>  kernel/signal.c                                 |   2 +
>  kernel/sys.c                                    |   6 +
>  10 files changed, 631 insertions(+)
>  create mode 100644 include/linux/isolation.h
>  create mode 100644 kernel/isolation.c
 
[...]

> + * This routine is called from syscall entry, prevents most syscalls
> + * from executing, and if needed raises a signal to notify the process.
> + *
> + * Note that we have to stop isolation before we even print a message
> + * here, since otherwise we might end up reporting an interrupt due to
> + * kicking the printk handling code, rather than reporting the true
> + * cause of interrupt here.
> + */
> +int task_isolation_syscall(int syscall)
> +{

All callers of this function call it like this:
        if (work & _TIF_TASK_ISOLATION) {
                if (task_isolation_syscall(regs->syscallno) ==
                                -1)
                        return -1;
        }

Would it make sense to move check of _TIF_TASK_ISOLATION flag inside
the function?

> +	struct task_struct *task = current;
> +
> +	if (is_acceptable_syscall(syscall)) {
> +		stop_isolation(task);
> +		return 0;
> +	}
> +
> +	send_isolation_signal(task);
> +
> +	pr_warn("%s/%d (cpu %d): task_isolation lost due to syscall %d\n",
> +		task->comm, task->pid, smp_processor_id(), syscall);
> +	debug_dump_stack();
> +
> +	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
> +	return -1;
> +}

Yury

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v16 00/13] support "task_isolation" mode
  2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
                   ` (2 preceding siblings ...)
  2018-03-07 10:07 ` Yury Norov
@ 2018-07-12 12:29 ` Yury Norov
  3 siblings, 0 replies; 9+ messages in thread
From: Yury Norov @ 2018-07-12 12:29 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
	Paul E. McKenney, Christoph Lameter, Viresh Kumar,
	Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
	Francis Giraldeau, Sunil Goutham, linux-mm, linux-doc, linux-api,
	linux-kernel

On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.

Hi Chris,

There's another possible way to break task isolation, by net subsystem.
See patch below.

Yury


>From 8025e9330bf06ce146d4ba96833aad6eafe24759 Mon Sep 17 00:00:00 2001
From: Yury Norov <ynorov@caviumnetworks.com>
Date: Sun, 8 Jul 2018 00:40:46 +0300
Subject: [PATCH] net: don't let user assign task isolation CPUs for RPS

Receive Packet Steering (RPS) subsystem distributes network traffic
handling to CPUs defined by user in
/sys/class/net/<dev>/queues/rx-<n>/rps_cpus.

If rps_cpus intersects with task_isolation_map, RPS may break task
isolation by assigning RPS work to CPU that runs isolated task.

In this patch user-provided rps_cpus map filtered to avoid that.

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 include/linux/isolation.h |  2 ++
 net/core/net-sysfs.c      | 13 +++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index f467545ad37d..b7f0a9085b13 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -14,6 +14,8 @@ struct task_struct;
 
 #ifdef CONFIG_TASK_ISOLATION
 
+extern cpumask_var_t task_isolation_map;
+
 /**
  * task_isolation_request() - prctl hook to request task isolation
  * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..18e576893984 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 
 #include <linux/capability.h>
 #include <linux/kernel.h>
+#include <linux/isolation.h>
 #include <linux/netdevice.h>
 #include <net/switchdev.h>
 #include <linux/if_arp.h>
@@ -727,6 +728,18 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+#ifdef CONFIG_TASK_ISOLATION
+	if (cpumask_intersects(mask, task_isolation_map)) {
+		char tmp[256];
+
+		pr_warn("RPS is not allowed on CPUs allocated for isolated tasks\n");
+
+		cpumask_andnot(mask, mask, task_isolation_map);
+		cpumap_print_to_pagebuf(1, tmp, mask);
+		pr_warn("RPS CPUs list is reduced to: %s\n", tmp);
+	}
+#endif
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-07-12 12:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-11-03 17:04 [PATCH v16 00/13] support "task_isolation" mode Chris Metcalf
2017-11-03 17:04 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Chris Metcalf
2018-03-18 14:22   ` Yury Norov
     [not found] ` <1509728692-10460-1-git-send-email-cmetcalf-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-11-06 15:38   ` [PATCH v16 00/13] support "task_isolation" mode Christopher Lameter
     [not found]     ` <bd5526d4-7b6e-3b3d-79d0-5d6567c0bdf8@mellanox.com>
2017-11-07 17:10       ` Christopher Lameter
2017-11-07 17:27         ` Chris Metcalf
2017-11-07 17:54           ` Christopher Lameter
2018-03-07 10:07 ` Yury Norov
2018-07-12 12:29 ` Yury Norov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).