From: Marcelo Tosatti <mtosatti@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Lameter <cl@gentwo.de>,
linux-kernel@vger.kernel.org, Nitesh Lal <nilal@redhat.com>,
Nicolas Saenz Julienne <nsaenzju@redhat.com>,
Frederic Weisbecker <frederic@kernel.org>,
Juri Lelli <juri.lelli@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Alex Belits <abelits@belits.com>, Peter Xu <peterx@redhat.com>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Oscar Shiang <oscar0225@livemail.tw>,
linux-rdma@vger.kernel.org
Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
Date: Wed, 4 May 2022 15:56:51 -0300
Message-ID: <YnLMc5X8MZElk0NT@fuller.cnet>
In-Reply-To: <87h765juyk.ffs@tglx>
On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> On Tue, May 03 2022 at 15:57, Marcelo Tosatti wrote:
> > On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
> >> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
> >> provided by this patch instead of the NOHZ_FULL.
> >>
> >> What kind of metric could I be using to show the difference in idleness of
> >> the quality of the cpu isolation?
> >
> > Interruption length and frequencies:
> >
> > -------|xxxxx|---------------|xxx|---------
> >          5us                  3us
> >
> > which is what should be reported by oslat ?
>
> How is oslat helpful there? That's running artificial workload benchmarks
> which are not necessarily representing the actual
> idle->interrupt->idle... timing sequence of the real world usecase.
OK, so what is happening today in production on some telco installations
(considering the virtualized RAN usecase) is this:
1) Basic testing: Verify the hardware, software and its configuration
(cpu isolation parameters etc) are able to achieve the desired maximum
interruption length/frequencies through a synthetic benchmark like
cyclictest and oslat, for a duration that is considered sufficient.
One might also use the actual application in a synthetic configuration
(for example FlexRAN with synthetic data).
2) From the above, assume real world usecase is able to achieve the
desired maximum interruption length/frequency.
Of course, that is sub-optimal. The rt-trace-bcc.py script instruments
certain functions in the kernel (say the smp_call_function family),
allowing one to check whether certain interruptions have happened
in production. Example:
$ sudo ./rt-trace-bcc.py -c 36-39
[There can be some warnings dumped; we can ignore them]
Enabled hook point: process_one_work
Enabled hook point: __queue_work
Enabled hook point: __queue_delayed_work
Enabled hook point: generic_exec_single
Enabled hook point: smp_call_function_many_cond
Enabled hook point: irq_work_queue
Enabled hook point: irq_work_queue_on
TIME(s) COMM CPU PID MSG
0.009599293 rcuc/8 8 75 irq_work_queue_on (target=36, func=nohz_full_kick_func)
0.009603039 rcuc/8 8 75 irq_work_queue_on (target=37, func=nohz_full_kick_func)
0.009604047 rcuc/8 8 75 irq_work_queue_on (target=38, func=nohz_full_kick_func)
0.009604848 rcuc/8 8 75 irq_work_queue_on (target=39, func=nohz_full_kick_func)
0.103600589 rcuc/8 8 75 irq_work_queue_on (target=36, func=nohz_full_kick_func)
...
Currently it does not record the length of each interruption, but it
could (or you can do that from its output).
Note however that the sum of the interruptions is not the entire
overhead caused by the interruptions: there might also be cachelines
thrown away, which are only going to be "counted" when the latency
sensitive app executes.
So you could say the overhead is _at least_ the sum of the
interruptions, plus cache effects that are unaccounted for.
Also note that for idle->interrupt->idle type scenarios (the vRAN
usecase above currently does not idle at all, but there is interest
from the field for that to happen for power saving reasons), you'd
also add the cost of returning from idle.
> > Inheritance is an attempt to support unmodified binaries like so:
> >
> > 1) configure task isolation parameters (eg sync per-CPU vmstat to global
> > stats on system call returns).
> > 2) enable inheritance (so that task isolation configuration and
> > activation states are copied across to child processes).
> > 3) enable task isolation.
> > 4) execv(binary, params)
>
> What for? If an application has isolation requirements, then the
> specific requirements are part of the application design and not of some
> arbitrary wrapper.
To be able to configure and activate task isolation for an unmodified
binary, which seems a useful feature. However, I have no problem with
not supporting unmodified binaries (the applications would then have
to be changed).
There are 3 types of application arrangements:
==================
Userspace support
==================
Task isolation is divided into two main steps: configuration and activation.
Each step can be performed by an external tool or the latency sensitive
application itself. util-linux contains the "chisol" tool for this
purpose.
This results in three combinations:
1. Both configuration and activation performed by the
latency sensitive application.
Allows fine grained control of what task isolation
features are enabled and when (see samples section below).
2. Only activation can be performed by the latency sensitive app
(and configuration performed by chisol).
This allows the admin/user to control task isolation parameters,
and applications have to be modified only once.
3. Configuration and activation performed by an external tool.
This allows unmodified applications to take advantage of
task isolation. Activation is performed by the "-a" option
of chisol.
---
Some features might not be supportable (or have awkward behavior) on a
given combination. For example, if a feature such as "warn if
sched_out/sched_in is ever performed if task isolation is
configured/activated" is enabled, then you'll get those warnings
for combination 3 (which is the case of unmodified binaries above).
> Inheritance is an orthogonal problem and there is no reason to have this
> initially.
No problem, will drop it.
> Can we please focus on the initial problem of
> providing a sensible isolation mechanism with well defined semantics?
Case 2, however, was implicitly suggested by you (or at least that is
how I understood it):
"Summary: The problem to be solved cannot be restricted to
self_defined_important_task(OWN_WORLD);
Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.
Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanisms provided by the kernel.
Emphasis on "mechanisms provided by the kernel", aka. user API.
Just in case, I hope that I don't have to explain what level of scrutiny
and thought this requires."
The idea, as I understood it, was that certain task isolation features
(or their parameters) might have to be changed at runtime (which depends
on the task isolation features themselves, and the plan is to create
an extensible interface). So for case 2, all you'd have to do is
modify the application only once and allow the admin to configure
the features. From the documentation:
This is a snippet of code to activate task isolation if
it has been previously configured (by chisol for example)::
        #include <stdio.h>
        #include <sys/prctl.h>
        #include <linux/types.h>
        #ifdef PR_ISOL_CFG_GET
                unsigned long long fmask;
                int ret;
                /* Read the feature mask configured earlier (e.g. by chisol). */
                ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
                if (ret != -1 && fmask != 0) {
                        /* Activate exactly the features that were configured. */
                        ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
                        if (ret == -1) {
                                perror("prctl PR_ISOL_ACTIVATE_SET");
                                return ret;
                        }
                }
        #endif
This seemed pretty useful to me (which is possible if the features
being discussed do not require further modifications on the part of the
application). For example, a new task isolation feature can be enabled
without having to modify the application.
Again, maybe that was misunderstood (and I'm OK with dropping this
and forcing both configuration and activation to be performed
inside the app), no problem.
> >> Special handling when the scheduler
> >> switches a task? If tasks are being switched that requires them to be low
> >> latency and undisturbed then something went very very wrong with the
> >> system configuration and the only thing I would suggest is to issue some
> >> kernel warning that this is not the way one should configure the system.
> >
> > Trying to provide mechanisms, not policy?
>
> This preemption notifier is not a mechanism, it's simply mindless
> hackery as I told you already.
Sure, if there is another way of checking "if per-CPU vmstats require
syncing" that is cheap (which it seems you suggested in the other email),
I can drop the preempt notifiers.