* Re: [PATCH v16 00/13] support "task_isolation" mode
[not found] <1509728692-10460-1-git-send-email-cmetcalf@mellanox.com>
@ 2018-03-07 10:07 ` Yury Norov
[not found] ` <1509728692-10460-7-git-send-email-cmetcalf@mellanox.com>
2018-07-12 12:29 ` [PATCH v16 00/13] support "task_isolation" mode Yury Norov
2 siblings, 0 replies; 3+ messages in thread
From: Yury Norov @ 2018-03-07 10:07 UTC (permalink / raw)
To: Chris Metcalf
Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
Paul E. McKenney, Christoph Lameter, Viresh Kumar,
Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
Francis Giraldeau, linux-mm, linux-doc, linux-api, linux-kernel,
Prasun Kapoor, Narayana Prasad Athreya, Alex Belits,
Chandrakala Chavva
Hi Chris,
(CC Cavium people)
Thanks for your series.
On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.
>
> This version of the patch series cleans up a number of areas that were
> a little dodgy in the previous patch series.
>
> - It no longer loops in the final code that prepares to return to
> userspace; instead, it sets things up in the prctl() and then
> validates when preparing to return to userspace, adjusting the
> syscall return value to -EAGAIN at that point if something doesn't
> line up quite correctly.
>
> - We no longer support the NOSIG mode that let you freely call into
> the kernel multiple times while in task isolation. This was always
> a little odd, since you really should be in sufficient control of
> task isolation code that you can explicitly stop isolation with a
> "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
> else. It also made the semantics of migrating the task confusing.
> More importantly, removing that support means that the only path
> that sets up task isolation is the return from prctl(), which allows
> us to make the simplification above.
>
> - We no longer try to signal the task isolation process from a remote
> core when we detect that we are about to violate its isolation.
> Instead, we just print a message there (and optionally dump stack),
> and rely on the eventual interrupt on the core itself to trigger the
> signal. We are always in a safe context to generate a signal when
> we enter the kernel, unlike when we are deep in a call stack sending
> an SMP IPI or whatever.
>
> - We notice the case of an unstable scheduler clock and return
> EINVAL rather than spinning forever with EAGAIN (suggestion from
> Francis Giraldeau).
>
> - The prctl() call requires zeros for arg3/4/5 (suggestion from
> Eugene Syromiatnikov).
>
> The kernel internal isolation API is also now cleaner, and I have
> included kerneldoc APIs for all the interfaces so that it should be
> easier to port it to additional architectures; in fact looking at
> include/linux/isolation.h is a good place to start understanding the
> overall patch set.
>
> I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
> for x86, since this version is sufficiently different to merit
> re-review and re-testing.
>
> Note that this is not rebased on top of Frederic's recent housekeeping
> patch series, although it is largely orthogonal right now. After
> Frederic's patch series lands, task isolation is enabled with
> "isolcpus=nohz,domain,CPUS". We could add some shorthand for that
> ("isolcpus=full,CPUS"?) or just use it as-is.
>
> The previous (v15) patch series is here:
>
> https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetcalf@mellanox.com
>
> This patch series is available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane
>
> Some folks raised some good points at the LPC discussion and then in
> email discussions that followed. Rather than trying to respond to
> everyone in a flurry of emails, I'll answer some questions here:
>
>
> Why not just instrument user_exit() to raise the isolation-lost signal?
>
> Andy pointed me in this direction. The advantage is that you catch
> *everything*, by definition. There is a hook that can do this in the
> current patch set, but you have to #define DEBUG_TASK_ISOLATION
> manually to take advantage of it, because as written it has two issues:
>
> 1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0)
> because the user_exit hook kills you first.
> 2. You lose the ability to get much better diagnostics by waiting
> until you are further into kernel entry and know what you're doing.
>
> You could work around #2 in several ways, but #1 is harder. I looked
> at x86 for a while, and although you could imagine this, you really
> want to generate a lost-isolation signal on any syscall that isn't
> that exact prctl(), and it's awkward to try to do all of that checking
> before user_exit(). Since in any case we do want to have the more
> specific probes at the various kernel entry points where we generate
> the diagnostics, I felt like it wasn't the right approach to enable
> as a compilation-time default. I'm open to discussion on this though!
>
>
> Can't we do all the exit-to-userspace work with irqs disabled?
>
> In fact, it turns out that you can do lru_add_drain() with irqs
> disabled, so that's what we're doing in the patch series now.
>
> However, it doesn't seem possible to do the synchronous cancellation of
> the vmstat deferred work with irqs disabled, though if there's a way,
> it would be a little cleaner to do that; Christoph? We can certainly
> update the statistics with interrupts disabled via
> refresh_cpu_vm_stats(false), but that's not sufficient. For now, I
> just issue the cancellation during sys_prctl() call, and then if it
> isn't synchronized by the time we are exiting to userspace, we just
> jam in an EAGAIN and let userspace retry. In practice, this doesn't
> seem to ever happen.
>
>
> What about using a per-cpu flag to stop doing new deferred work?
>
> Andy also suggested we could structure the code to have the prctl()
> set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
> data, or lru page cache). Then, we could flush those structures right
> from the sys_prctl() call, and when we were returning to user space,
> we'd be confident that there wasn't going to be any new work added.
>
> With the current set of things that we are disabling for task
> isolation, though, it didn't seem necessary. Quiescing the vmstat
> shepherd seems like it is generally pretty safe since we will likely
> be able to sync up the per-cpu cache and kill the deferred work with
> high probability, with no expectation that additional work will show
> up. And since we can flush the LRU page cache with interrupts
> disabled, that turns out not to be an issue either.
>
> I could imagine that if we have to deal with some new kind of deferred
> work, we might find the per-cpu flag becomes a good solution, but for
> now we don't have a good use case for that approach.
>
>
> How about stopping the dyn tick?
>
> Right now we try to stop it on return to userspace, but if we can't,
> we just return EAGAIN to userspace. In practice, what I see is that
> usually the tick stops immediately, but occasionally it doesn't; in
> this case I've always seen that nr_running is >1, presumably with some
> temporary kernel worker threads, and the user code just needs to call
> prctl() until those threads are done. We could structure things with
> a completion that we wait for, which is set by the timer code when it
> finally does stop the tick, but this may be overkill, particularly
> since we'll only be running this prctl() loop from userspace on cores
> where we have no other useful work that we're trying to run anyway.
>
>
> What about TLB flushing?
>
> We talked about this at Plumbers and some of the email discussion also
> was about TLB flushing. I haven't tried to add it to this patch set,
> because I really want to avoid scope creep; in any case, I think I
> managed to convince Andy that he was going to work on it himself. :)
> Paul McKenney already contributed some framework for such a patch, in
> commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
> ->dynticks counter").
>
> What about that d*mn 1 Hz clock?
>
> It's still there, so this code still requires some further work before
> it can actually get a process into long-term task isolation (without
> the obvious one-line kernel hack). Frederic suggested a while ago
> forcing updates on cpustats was required as the last gating factor; do
> we think that is still true? Christoph was working on this at one
> point - any progress from your point of view?
I tested your series on ThunderX 2 machine. When I run 10 giga-ticks test,
it always passed. If I run for more, the test exits like this:
# time ./isolation 1000
/sys devices: OK (using task isolation cpu 100)
prctl unaffinitized: OK
prctl on cpu 0: OK
==> hello, world
test_off: OK
Received signal 11 successfully
test_segv: OK
test_fault: OK
test_fault (SIGUSR1): OK
test_syscall: OK
test_syscall (SIGUSR1): OK
test_schedule: OK
test_schedule (SIGUSR1): OK
testing task isolation jitter for 1000000000000 ticks
ERROR: Program unexpectedly entered kernel.
INFO: loop times:
1 cycles (count: 128606844716)
2 cycles (count: 31558352292)
3 cycles (count: 4)
29 cycles (count: 437)
30 cycles (count: 1874)
31 cycles (count: 221)
57 cycles (count: 4)
58 cycles (count: 6)
59 cycles (count: 1)
real 15m58.643s
user 15m58.626s
sys 0m0.012s
I pass task_isolation_debug to boot parameters:
# cat /proc/cmdline
BOOT_IMAGE=/boot/Image-isol nohz_full=100-110 isolcpus=100-110 task_isolation_debug root=UUID=75b9dd5b-58d8-4a50-8868-004cb7c1f25f ro text
But dmesg looks empty...
I investigate it, but at now I have no ideas what happens. Have you seen
that before?
Anyway, we are going to include your test in our scenario, with some
modifications. I've added --dry-run option to make it easier to
demonstrate the effect of isolation on jitter. If you like it, feel
free to use this change.
Tested-by: Yury Norov <ynorov@caviumnetworks.com>
From 5c5823c1fc8441ab1486287679de77b8cea5154d Mon Sep 17 00:00:00 2001
From: Yury Norov <ynorov@caviumnetworks.com>
Date: Wed, 7 Mar 2018 02:41:08 +0300
Subject: [PATCH] isolation test: --dry-run mode
Add dry-run mode for better understanding the effect of isolation.
Also, make sanity checks on waitticks.
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
tools/testing/selftests/task_isolation/isolation.c | 47 +++++++++++++++++-----
1 file changed, 36 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
index 9c0b49619b40..e90a6c85fe2a 100644
--- a/tools/testing/selftests/task_isolation/isolation.c
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -53,6 +53,7 @@
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
+#include <limits.h>
#include <assert.h>
#include <string.h>
#include <errno.h>
@@ -500,7 +501,7 @@ static void jitter_handler(int sig)
exit(exit_status);
}
-static void test_jitter(unsigned long waitticks)
+static void test_jitter(unsigned long waitticks, int flags)
{
u_int64_t start, last, elapsed;
int rc;
@@ -513,7 +514,8 @@ static void test_jitter(unsigned long waitticks)
rc = mlockall(MCL_CURRENT);
assert(rc == 0);
- set_task_isolation(PR_TASK_ISOLATION_ENABLE |
+ if (flags & PR_TASK_ISOLATION_ENABLE)
+ set_task_isolation(PR_TASK_ISOLATION_ENABLE |
PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
last = start = get_cycle_count();
@@ -537,26 +539,49 @@ static void test_jitter(unsigned long waitticks)
} while (elapsed < waitticks);
jitter_test_complete = true;
- prctl_isolation(0);
+
+ if (flags & PR_TASK_ISOLATION_ENABLE)
+ prctl_isolation(0);
+
jitter_summarize();
}
int main(int argc, char **argv)
{
/* How many billion ticks to wait after running the other tests? */
- unsigned long waitticks;
+ long waitticks = 10;
+ int flags = PR_TASK_ISOLATION_ENABLE;
char buf[100];
char *result, *end;
FILE *f;
- if (argc == 1)
- waitticks = 10;
- else if (argc == 2)
- waitticks = strtol(argv[1], NULL, 10);
- else {
- printf("syntax: isolation [gigaticks]\n");
+ errno = 0;
+
+ switch (argc) {
+ case 1:
+ break;
+ case 2:
+ if (strcmp(argv[1], "--dry-run") == 0)
+ flags = 0;
+ else
+ waitticks = strtol(argv[1], NULL, 10);
+ break;
+ case 3:
+ if (strcmp(argv[1], "--dry-run") == 0)
+ flags = 0;
+
+ waitticks = strtol(argv[2], NULL, 10);
+ break;
+ default:
+ printf("syntax: isolation [--dry-run] [gigaticks]\n");
ksft_exit_fail();
}
+
+ if (errno || waitticks <= 0 || waitticks > (LONG_MAX / 1000000000)) {
+ printf("Gigaticks: bad number.\n");
+ ksft_exit_fail();
+ }
+
waitticks *= 1000000000;
/* Get a core from the /sys nohz_full device. */
@@ -637,7 +662,7 @@ int main(int argc, char **argv)
return exit_status;
}
- test_jitter(waitticks);
+ test_jitter(waitticks, flags);
return exit_status;
}
--
2.14.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel
[not found] ` <1509728692-10460-7-git-send-email-cmetcalf@mellanox.com>
@ 2018-03-18 14:22 ` Yury Norov
0 siblings, 0 replies; 3+ messages in thread
From: Yury Norov @ 2018-03-18 14:22 UTC (permalink / raw)
To: Chris Metcalf
Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
Paul E. McKenney, Christoph Lameter, Viresh Kumar,
Catalin Marinas, Will Deacon, Andy Lutomirski, Michal Hocko,
Jonathan Corbet, linux-mm, linux-doc, linux-api, linux-kernel
Hi Chris,
On Fri, Nov 03, 2017 at 01:04:45PM -0400, Chris Metcalf wrote:
> The existing nohz_full mode is designed as a "soft" isolation mode
> that makes tradeoffs to minimize userspace interruptions while
> still attempting to avoid overheads in the kernel entry/exit path,
> to provide 100% kernel semantics, etc.
>
> However, some applications require a "hard" commitment from the
> kernel to avoid interruptions, in particular userspace device driver
> style applications, such as high-speed networking code.
>
> This change introduces a framework to allow applications
> to elect to have the "hard" semantics as needed, specifying
> prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
>
> The kernel must be built with the new TASK_ISOLATION Kconfig flag
> to enable this mode, and the kernel booted with an appropriate
> "nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
> nohz_full and isolcpus. The "task_isolation" state is then indicated
> by setting a new task struct field, task_isolation_flag, to the
> value passed by prctl(), and also setting a TIF_TASK_ISOLATION
> bit in the thread_info flags. When the kernel is returning to
> userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
> it calls the new task_isolation_start() routine to arrange for
> the task to avoid being interrupted in the future.
>
> With interrupts disabled, task_isolation_start() ensures that kernel
> subsystems that might cause a future interrupt are quiesced. If it
> doesn't succeed, it adjusts the syscall return value to indicate that
> fact, and userspace can retry as desired. In addition to stopping
> the scheduler tick, the code takes any actions that might avoid
> a future interrupt to the core, such as a worker thread being
> scheduled that could be quiesced now (e.g. the vmstat worker)
> or a future IPI to the core to clean up some state that could be
> cleaned up now (e.g. the mm lru per-cpu cache).
>
> Once the task has returned to userspace after issuing the prctl(),
> if it enters the kernel again via system call, page fault, or any
> other exception or irq, the kernel will kill it with SIGKILL.
> In addition to sending a signal, the code supports a kernel
> command-line "task_isolation_debug" flag which causes a stack
> backtrace to be generated whenever a task loses isolation.
>
> To allow the state to be entered and exited, the syscall checking
> test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
> clear the bit again later, and ignores exit/exit_group to allow
> exiting the task without a pointless signal being delivered.
>
> The prctl() API allows for specifying a signal number to use instead
> of the default SIGKILL, to allow for catching the notification
> signal; for example, in a production environment, it might be
> helpful to log information to the application logging mechanism
> before exiting. Or, the signal handler might choose to reset the
> program counter back to the code segment intended to be run isolated
> via prctl() to continue execution.
>
> In a number of cases we can tell on a remote cpu that we are
> going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
> In that case we generate the diagnostic (and optional stack dump)
> on the remote core to be able to deliver better diagnostics.
> If the interrupt is not something caught by Linux (e.g. a
> hypervisor interrupt) we can also request a reschedule IPI to
> be sent to the remote core so it can be sure to generate a
> signal to notify the process.
>
> Separate patches that follow provide these changes for x86, tile,
> arm, and arm64.
>
> Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 6 +
> include/linux/isolation.h | 175 +++++++++++
> include/linux/sched.h | 4 +
> include/uapi/linux/prctl.h | 6 +
> init/Kconfig | 28 ++
> kernel/Makefile | 1 +
> kernel/context_tracking.c | 2 +
> kernel/isolation.c | 402 ++++++++++++++++++++++++
> kernel/signal.c | 2 +
> kernel/sys.c | 6 +
> 10 files changed, 631 insertions(+)
> create mode 100644 include/linux/isolation.h
> create mode 100644 kernel/isolation.c
[...]
> + * This routine is called from syscall entry, prevents most syscalls
> + * from executing, and if needed raises a signal to notify the process.
> + *
> + * Note that we have to stop isolation before we even print a message
> + * here, since otherwise we might end up reporting an interrupt due to
> + * kicking the printk handling code, rather than reporting the true
> + * cause of interrupt here.
> + */
> +int task_isolation_syscall(int syscall)
> +{
All callers of this function call it like this:
if (work & _TIF_TASK_ISOLATION) {
if (task_isolation_syscall(regs->syscallno) ==
-1)
return -1;
}
Would it make sense to move check of _TIF_TASK_ISOLATION flag inside
the function?
> + struct task_struct *task = current;
> +
> + if (is_acceptable_syscall(syscall)) {
> + stop_isolation(task);
> + return 0;
> + }
> +
> + send_isolation_signal(task);
> +
> + pr_warn("%s/%d (cpu %d): task_isolation lost due to syscall %d\n",
> + task->comm, task->pid, smp_processor_id(), syscall);
> + debug_dump_stack();
> +
> + syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
> + return -1;
> +}
Yury
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v16 00/13] support "task_isolation" mode
[not found] <1509728692-10460-1-git-send-email-cmetcalf@mellanox.com>
2018-03-07 10:07 ` [PATCH v16 00/13] support "task_isolation" mode Yury Norov
[not found] ` <1509728692-10460-7-git-send-email-cmetcalf@mellanox.com>
@ 2018-07-12 12:29 ` Yury Norov
2 siblings, 0 replies; 3+ messages in thread
From: Yury Norov @ 2018-07-12 12:29 UTC (permalink / raw)
To: Chris Metcalf
Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, Andrew Morton,
Rik van Riel, Tejun Heo, Frederic Weisbecker, Thomas Gleixner,
Paul E. McKenney, Christoph Lameter, Viresh Kumar,
Catalin Marinas, Will Deacon, Andy Lutomirski, Daniel Lezcano,
Francis Giraldeau, Sunil Goutham, linux-mm, linux-doc, linux-api,
linux-kernel
On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote:
> Here, finally, is a new spin of the task isolation work (v16), with
> changes based on the issues that were raised at last year's Linux
> Plumbers Conference and in the email discussion that followed.
Hi Chris,
There's another possible way to break task isolation, by net subsystem.
See patch below.
Yury
From 8025e9330bf06ce146d4ba96833aad6eafe24759 Mon Sep 17 00:00:00 2001
From: Yury Norov <ynorov@caviumnetworks.com>
Date: Sun, 8 Jul 2018 00:40:46 +0300
Subject: [PATCH] net: don't let user assign task isolation CPUs for RPS
Receive Packet Steering (RPS) subsystem distributes network traffic
handling to CPUs defined by user in
/sys/class/net/<dev>/queues/rx-<n>/rps_cpus.
If rps_cpus intersects with task_isolation_map, RPS may break task
isolation by assigning RPS work to CPU that runs isolated task.
In this patch user-provided rps_cpus map filtered to avoid that.
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
include/linux/isolation.h | 2 ++
net/core/net-sysfs.c | 13 +++++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index f467545ad37d..b7f0a9085b13 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -14,6 +14,8 @@ struct task_struct;
#ifdef CONFIG_TASK_ISOLATION
+extern cpumask_var_t task_isolation_map;
+
/**
* task_isolation_request() - prctl hook to request task isolation
* @flags: Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..18e576893984 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
#include <linux/capability.h>
#include <linux/kernel.h>
+#include <linux/isolation.h>
#include <linux/netdevice.h>
#include <net/switchdev.h>
#include <linux/if_arp.h>
@@ -727,6 +728,18 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
return err;
}
+#ifdef CONFIG_TASK_ISOLATION
+ if (cpumask_intersects(mask, task_isolation_map)) {
+ char tmp[256];
+
+ pr_warn("RPS is not allowed on CPUs allocated for isolated tasks\n");
+
+ cpumask_andnot(mask, mask, task_isolation_map);
+ cpumap_print_to_pagebuf(1, tmp, mask);
+ pr_warn("RPS CPUs list is reduced to: %s\n", tmp);
+ }
+#endif
+
map = kzalloc(max_t(unsigned int,
RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
GFP_KERNEL);
--
2.17.1
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2018-07-12 12:29 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1509728692-10460-1-git-send-email-cmetcalf@mellanox.com>
2018-03-07 10:07 ` [PATCH v16 00/13] support "task_isolation" mode Yury Norov
[not found] ` <1509728692-10460-7-git-send-email-cmetcalf@mellanox.com>
2018-03-18 14:22 ` [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel Yury Norov
2018-07-12 12:29 ` [PATCH v16 00/13] support "task_isolation" mode Yury Norov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).