linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Frederic Weisbecker <fweisbec@gmail.com>
To: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Gilad Ben Yossef <giladb@ezchip.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>, Tejun Heo <tj@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Christoph Lameter <cl@linux.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Andy Lutomirski <luto@amacapital.net>,
	linux-doc@vger.kernel.org, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 04/13] task_isolation: add initial support
Date: Tue, 5 Jul 2016 16:41:09 +0200	[thread overview]
Message-ID: <20160705144107.GC5332@lerouge> (raw)
In-Reply-To: <25c4ace1-6903-abb3-59e9-aedc11ac32fc@mellanox.com>

On Fri, Jul 01, 2016 at 04:59:26PM -0400, Chris Metcalf wrote:
> On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:
> >
> >I just feel that quiescing, on the way back to user after an unwanted
> >interruption, is awkward. The quiescing should work once and for all
> >on return back from the prctl. If we still get disturbed afterward,
> >either the quiescing is buggy or incomplete, or something is on the
> >way that can not be quiesced.
> 
> If we are thinking of an initial implementation that doesn't allow any
> subsequent kernel entry to be valid, then this all gets much easier,
> since any subsequent kernel entry except for a prctl() syscall will
> result in a signal, which will turn off task isolation, and we will
> never have to worry about additional quiescing.  I think that's where
> we got from the discussion at the bottom of this email.

Right.

> 
> So for your question here, we're really just thinking about future
> directions as far as how to handle interrupts, and if in the future we
> add support for allowing syscalls and/or exceptions without leaving
> task isolation mode, then we have to think about how that interacts
> with interrupts.  The problem is that it's hard to tell, as you're
> returning to userspace, whether you're returning from an exception or
> an interrupt; you typically don't have that information available.  So
> from a purely ease-of-implementation perspective, we'd likely want to
> handle exceptions and interrupts the same way, and quiesce both.

Sure but what I don't understand is why do we need to quiesce more than
once (ie: at the prctl() call). Quiescing should be a single operation
that prevents from any further disturbance. Like offlining anything
we to other CPUs. And entering again in the kernel shouldn't break
that.

> 
> In general, I think it would also be a better explanation to users of
> task isolation to say "every enter/exit to the kernel is either an
> error that causes a signal, or it quiesces on return".  It's a simpler
> semantic, and I think it also is better for interrupts anyway, since
> it potentially avoids multiple interrupts to the application (whatever
> interrupted to begin with, plus potential timer interrupts later).
> 
> But that said, if we start with "pure strict" mode only, all of this
> becomes hypothetical, and we may in fact choose never to allow "safe"
> modes of entering the kernel.

Right. And starting with pure strict mode would be a good first step,
provided it is a mode you need.


> >>That's true, but I'd argue the behavior in that case should be that you can
> >>raise that kind of exception validly (so you can debug), and then you should
> >>quiesce on return to userspace so the application doesn't see additional
> >>exceptions.
> >
> >I don't see how we can quiesce such things.
> 
> I'm imagining task A is in dataplane mode, and task B wants to debug
> it by writing a breakpoint into its text.  When task A hits the
> breakpoint, it will enter the kernel, and hold there while task B
> pokes at it with ptrace.  When task A finally is allowed to return to
> userspace, it should quiesce before entering userspace in case any
> timer interrupts got scheduled (again, maybe due to softirqs or
> whatever, or random other kernel activity targeting that core while it
> was in the kernel, or whatever).  This is just the same kind of
> quiescing we do on return from the initial prctl().

Well again I think it shouldn't happen. Quiescing should be done once
and for all.


> With a "pure strict" mode it does get a little tricky, since we will
> end up killing task A as it comes back from its breakpoint.  We might
> just choose to say that task A should not enable task isolation if it
> is going to be debugged (some runtime switch).  This isn't really a
> great solution; I do kind of feel that the nicest thing to do is
> quiesce the task again at this point.  This feels like the biggest
> argument in favor of supporting a mode where a task-isolated task can
> safely enter the kernel for exceptions.  What do you think?

Yeah probably we'll need to introduce some sort of debugability. Allow
debug/trap exceptions only for example.


> >> - Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
> >>   on return to userspace, and asynchronous interrupts don't even cause a signal.
> >>   It's basically "best effort", just nohz_full plus the code that tries to get things
> >>   like LRU or vmstat to run before returning to userspace.  I think there isn't enough
> >>   "value add" to make this a separate mode, though.
> >
> >I can imagine HPC to be willing this mode.
> 
> Yes, perhaps.  I'm not convinced we want to target HPC without a much
> clearer sense of why this is better than nohz_full, though.  I fear
> people might think "task isolation" is better by definition and not
> think too much about it, but I'm really not sure it is better for the
> HPC use case, necessarily.

I don't know. Perhaps HPC could just consist in quiescing once for all
(offline everything that can) and not signal when there is a rare disturbance.

> >Otherwise perhaps just drop a warning.
> 
> Are you saying that we should printk a warning in the prctl() rather
> than returning an error in the case where it's not on a full dynticks
> cpu?  I could be convinced by that just to keep things consistent.

Yeah that's what I meant.


> How about doing it this way?  If you invoke prctl() with the default
> "strict" mode where any kernel entry results in a signal, the prctl()
> will be strict, and require you to be affinitized to a single, full
> dynticks cpu.

But if you do that, you need to do it properly and care about races against
affinity changes. It involves heavy synchronization against scheduler code.

I tend to think we shouldn't bother with that. If we enter in task isolation
mode on a non-nohz-full CPU, the task will be signalled and kicked out of task
isolation mode in the next tick that happens very soon after the prctl().


> But, if you enable the "allow syscalls" mode, then the prctl isn't
> strict either, since you can use syscalls to get into a state where
> you're not on a full dynticks cpu, and you just get a console warning
> if you enter task isolation on the wrong cpu.  (Of course, we may end
> up not doing the "allow syscalls" mode for the first version of this
> patch anyway, as we discuss below.)

Right.

> >Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
> 
> In this mode we don't worry about quiescing for interrupts, since we
> are generating a signal, and when you send a signal, you first have to
> disable task isolation mode to avoid getting into various bad states
> (sending too many signals, or worse, getting deadlocked because you
> are signalling the task BECAUSE it was about to receive a signal).  So
> we only quiesce after syscalls/exceptions.

Ok. And are you interested in such strict mode? :-)

If so it would be nice to start with just that and iterate on top of it.

> >Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
> >does strict pure isolation mode and have future flags for more granularity.
> 
> I think just implementing the basic _ENABLE mode with pure strict task
> isolation makes sense for now.  We can wait to enable syscalls or
> exceptions until we have a better use case.  Meanwhile, even without
> support for allowing syscalls, you can always use prctl() to turn off
> task isolation, and then you can do your syscalls, and prctl() it back
> on again.  prctl() to disable task isolation always has to work :-)

Perfect!

> Or, if we want to make it easy to do debugging, and as a result maybe
> also support the plausible mode where task-isolation tasks make
> occasional syscalls, we could say that the _ALLOW_EXCEPTIONS flag
> above implies syscalls as well, and support that mode.  Perhaps that
> makes the most sense...

I fear that _ALLOW_EXCEPTIONS is too wide for a special case if all we
want it to allow debugging.

The most granular way to express custom isolation would be to use BPF.
Not sure we want to go that far though.


> I'll spin it as a new patch series and you can take a look.

Ok. Ideally it would be nice to respin a simple version (strict mode)
on top of which we can later iterate.

Thanks.

  reply	other threads:[~2016-07-05 14:41 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-04 19:34 [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 01/13] vmstat: provide a function to quiet down the diff processing Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 02/13] vmstat: add vmstat_idle function Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 03/13] lru_add_drain_all: factor out lru_add_drain_needed Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 04/13] task_isolation: add initial support Chris Metcalf
2016-01-19 15:42   ` Frederic Weisbecker
2016-01-19 20:45     ` Chris Metcalf
2016-01-28  0:28       ` Frederic Weisbecker
2016-01-29 18:18         ` Chris Metcalf
2016-01-30 21:11           ` Frederic Weisbecker
2016-02-11 19:24             ` Chris Metcalf
2016-03-04 12:56               ` Frederic Weisbecker
2016-03-09 19:39                 ` Chris Metcalf
2016-04-08 13:56                   ` Frederic Weisbecker
2016-04-08 16:34                     ` Chris Metcalf
2016-04-12 18:41                       ` Chris Metcalf
2016-04-22 13:16                       ` Frederic Weisbecker
2016-04-25 20:36                         ` Chris Metcalf
2016-05-26  1:07                       ` Frederic Weisbecker
2016-06-03 19:32                         ` Chris Metcalf
2016-06-29 15:18                           ` Frederic Weisbecker
2016-07-01 20:59                             ` Chris Metcalf
2016-07-05 14:41                               ` Frederic Weisbecker [this message]
2016-07-05 17:47                                 ` Christoph Lameter
2016-01-04 19:34 ` [PATCH v9 05/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 06/13] task_isolation: add debug boot flag Chris Metcalf
2016-01-04 22:52   ` Steven Rostedt
2016-01-04 23:42     ` Chris Metcalf
2016-01-05 13:42       ` Steven Rostedt
2016-01-04 19:34 ` [PATCH v9 07/13] arch/x86: enable task isolation functionality Chris Metcalf
2016-01-04 21:02   ` [PATCH v9bis " Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2016-01-04 20:33   ` Mark Rutland
2016-01-04 21:01     ` Chris Metcalf
2016-01-05 17:21       ` Mark Rutland
2016-01-05 17:33         ` [PATCH 1/2] arm64: entry: remove pointless SPSR mode check Mark Rutland
2016-01-06 12:15           ` Catalin Marinas
2016-01-05 17:33         ` [PATCH 2/2] arm64: factor work_pending state machine to C Mark Rutland
2016-01-05 18:53           ` Chris Metcalf
2016-01-06 12:30           ` Catalin Marinas
2016-01-06 12:47             ` Mark Rutland
2016-01-06 13:43           ` Mark Rutland
2016-01-06 14:17             ` Catalin Marinas
2016-01-04 22:31     ` [PATCH v9 08/13] arch/arm64: adopt prepare_exit_to_usermode() model from x86 Andy Lutomirski
2016-01-05 18:01       ` Mark Rutland
2016-01-04 19:34 ` [PATCH v9 09/13] arch/arm64: enable task isolation functionality Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 10/13] arch/tile: adopt prepare_exit_to_usermode() model from x86 Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 11/13] arch/tile: move user_exit() to early kernel entry sequence Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 12/13] arch/tile: enable task isolation functionality Chris Metcalf
2016-01-04 19:34 ` [PATCH v9 13/13] arm, tile: turn off timer tick for oneshot_stopped state Chris Metcalf
2016-01-11 21:15 ` [PATCH v9 00/13] support "task_isolation" mode for nohz_full Chris Metcalf
2016-01-12 10:07   ` Will Deacon
2016-01-12 17:49     ` Chris Metcalf
2016-01-13 10:44       ` Ingo Molnar
2016-01-13 21:19         ` Chris Metcalf
2016-01-20 13:27           ` Mark Rutland
2016-01-12 10:53   ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160705144107.GC5332@lerouge \
    --to=fweisbec@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@linux.com \
    --cc=cmetcalf@mellanox.com \
    --cc=giladb@ezchip.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=viresh.kumar@linaro.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).