All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Tim Chen <tim.c.chen@linux.intel.com>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org, Nitesh Lal <nilal@redhat.com>,
	Nicolas Saenz Julienne <nsaenzju@redhat.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Christoph Lameter <cl@linux.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Alex Belits <abelits@belits.com>, Peter Xu <peterx@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Oscar Shiang <oscar0225@livemail.tw>
Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync
Date: Wed, 4 May 2022 17:08:31 -0300	[thread overview]
Message-ID: <YnLdP3Hq1d69NSzz@fuller.cnet> (raw)
In-Reply-To: <f6217c9398e252f026db54851080a6f73f3f4da6.camel@linux.intel.com>

On Wed, May 04, 2022 at 10:01:47AM -0700, Tim Chen wrote:
> On Tue, 2022-03-15 at 12:31 -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> > 
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> > 
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> > 
> > To fix this, add task isolation prctl interface to quiesce
> > deferred actions when returning to userspace.
> > 
> > The patchset is based on ideas and code from the
> > task isolation patchset from Alex Belits:
> > https://lwn.net/Articles/816298/
> > 
> > Please refer to Documentation/userspace-api/task_isolation.rst
> > (patch 1) for details. Its attached at the end of this message
> 
> Patch 1 doesn't seem to be the documentation patch but rather is
> in patch 4.

Hi Tim,

Yes, one might consider that awkward (but that order made sense when 
writing the patches).

Is there a different order that makes more sense?

> > 
> > Task isolation prctl interface
> > ******************************
> > 
> > Certain types of applications benefit from running uninterrupted by
> > background OS activities. Realtime systems and high-bandwidth
> > networking applications with user-space drivers can fall into the
> > category.
> > 
> > To create an OS noise free environment for the application, this
> > interface allows userspace to inform the kernel the start and end of
> > the latency sensitive application section (with configurable system
> > behaviour for that section).
> > 
> > Note: the prctl interface is independent of nohz_full=.
> > 
> > The prctl options are:
> > 
> >    * PR_ISOL_FEAT_GET: Retrieve supported features.
> > 
> >    * PR_ISOL_CFG_GET: Retrieve task isolation configuration.
> > 
> >    * PR_ISOL_CFG_SET: Set task isolation configuration.
> > 
> >    * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
> > 
> >    * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
> > 
> > Summary of terms:
> > 
> > * feature:
> > 
> >      A distinct attribute or aspect of task isolation. Examples of
> >      features could be logging, new operating modes (eg: syscalls
> >      disallowed), userspace notifications, etc. The only feature
> >      currently available is quiescing.
> > 
> > * configuration:
> > 
> >      A specific choice from a given set of possible choices that
> >      dictate how the particular feature in question should behave.
> > 
> > * activation state:
> > 
> >      The activation state (whether active/inactive) of the task
> >      isolation features (features must be configured before being
> >      activated).
> > 
> > Inheritance of the isolation parameters and state, across fork(2) and
> > clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
> > 
> > At a high-level, task isolation is divided in two steps:
> > 
> > 1. Configuration.
> > 
> > 2. Activation.
> > 
> > Section "Userspace support" describes how to use task isolation.
> > 
> > In terms of the interface, the sequence of steps to activate task
> > isolation are:
> > 
> > 1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
> > 
> > 2. Configure task isolation features
> >    (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
> > 
> > 3. Activate or deactivate task isolation features
> >    (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
> > 
> > This interface is based on ideas and code from the task isolation
> > patchset from Alex Belits: https://lwn.net/Articles/816298/
> > 
> > Note: if the need arises to configure an individual quiesce feature
> > with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
> > PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
> > features per syscall currently).
> > 
> > 
> > Feature description
> > ===================
> > 
> >    * "ISOL_F_QUIESCE"
> > 
> >    This feature allows quiescing selected kernel activities on return
> >    from system calls.
> > 
> > 
> > Interface description
> > =====================
> > 
> > **PR_ISOL_FEAT**:
> > 
> >    Returns the supported features and feature capabilities, as a
> >    bitmask:
> > 
> >       prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> > 
> >    The 'feat' argument specifies whether to return supported features
> >    (if zero), or feature capabilities (if not zero). Possible values
> >    for 'feat' are:
> > 
> >    * "0":
> > 
> >         Return the bitmask of supported features, in the location
> >         pointed  to  by  "(int *)arg3". The buffer should allow space
> >         for 8 bytes.
> > 
> >    * "ISOL_F_QUIESCE":
> > 
> >         Return a structure containing which kernel activities are
> >         supported for quiescing, in the location pointed to by "(int
> >         *)arg3":
> > 
> >            struct task_isol_quiesce_extensions {
> >                    __u64 flags;
> >                    __u64 supported_quiesce_bits;
> >                    __u64 pad[6];
> >            };
> > 
> >         Where:
> > 
> >         *flags*: Additional flags (should be zero).
> > 
> >         *supported_quiesce_bits*: Bitmask indicating
> >            which features are supported for quiescing.
> > 
> >         *pad*: Additional space for future enhancements.
> > 
> >    Features and its capabilities are defined at
> >    include/uapi/linux/task_isolation.h.
> > 
> > **PR_ISOL_CFG_GET**:
> > 
> >    Retrieve task isolation configuration. The general format is:
> > 
> >       prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
> > 
> >    The 'what' argument specifies what to configure. Possible values
> >    are:
> > 
> >    * "I_CFG_FEAT":
> > 
> >         Return configuration of task isolation features. The 'arg3'
> >         argument specifies whether to return configured features (if
> >         zero), or individual feature configuration (if not zero), as
> >         follows.
> > 
> >         * "0":
> > 
> >              Return the bitmask of configured features, in the
> >              location pointed  to  by  "(int *)arg4". The buffer
> >              should allow space for 8 bytes.
> > 
> >         * "ISOL_F_QUIESCE":
> > 
> >              If arg4 is QUIESCE_CONTROL, return the control structure
> >              for quiescing of background kernel activities, in the
> >              location pointed to by "(int *)arg5":
> > 
> >                 struct task_isol_quiesce_control {
> >                        __u64 flags;
> >                        __u64 quiesce_mask;
> >                        __u64 quiesce_oneshot_mask;
> >                        __u64 pad[5];
> >                 };
> > 
> >              See PR_ISOL_CFG_SET description for meaning of fields.
> > 
> >    * "I_CFG_INHERIT":
> > 
> >         Retrieve inheritance configuration across fork/clone.
> > 
> >         Return the structure which configures inheritance across
> >         fork/clone, in the location pointed to by "(int *)arg4":
> > 
> >            struct task_isol_inherit_control {
> >                    __u8    inherit_mask;
> >                    __u8    pad[7];
> >            };
> > 
> >         See PR_ISOL_CFG_SET description for meaning of fields.
> > 
> > **PR_ISOL_CFG_SET**:
> > 
> >    Set task isolation configuration. The general format is:
> > 
> >       prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> > 
> >    The 'what' argument specifies what to configure. Possible values
> >    are:
> > 
> >    * "I_CFG_FEAT":
> > 
> >         Set configuration of task isolation features. 'arg3' specifies
> >         the feature. Possible values are:
> > 
> >         * "ISOL_F_QUIESCE":
> 
> Is it really necessary for such fine grain control for which kernel
> activity to quiesce?  
> 
> For most user, all they care about is their
> task is not disturbed by kernel activities and not be bothered about
> setting which particular activities to quiesce.  And in your patches there
> is only ISOL_F_QUIESCE_VMSTATS and nothing else.  I think you could
> probably skip the QUIESCE control for now and add it when there's really
> a true need for fine grain control.  This will make the interface simpler
> for user applications.
> 
> Tim

Yes, this could be done (for example can maintain the current scheme,
add a way to query all supported features, enable them at chisol).
Then the application only has to be modified with:

This is a snippet of code to activate task isolation if
it has been previously configured (by chisol for example)::

        #include <sys/prctl.h>
        #include <linux/types.h>

        #ifdef PR_ISOL_CFG_GET
        unsigned long long fmask;

        ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
        if (ret != -1 && fmask != 0) {
                ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
                if (ret == -1) {
                        perror("prctl PR_ISOL_ACTIVATE_SET");
                        return ret;
                }
        }
        #endif

Or not modified at all (well not initially since support for executing
unmodified binaries will be dropped).









      reply	other threads:[~2022-05-04 20:09 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-15 15:31 [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 01/13] s390: add support for TIF_TASK_ISOL Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 02/13] x86: " Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 03/13] add basic task isolation prctl interface Marcelo Tosatti
2022-04-25 22:23   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 04/13] add prctl task isolation prctl docs and samples Marcelo Tosatti
2022-04-26  0:15   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 05/13] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2022-04-25 23:06   ` Thomas Gleixner
2022-04-27  6:56   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 06/13] procfs: add per-pid task isolation state Marcelo Tosatti
2022-04-25 23:27   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 07/13] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2022-03-17 14:51   ` Frederic Weisbecker
2022-04-27  8:03   ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 08/13] task isolation: enable return to userspace processing Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 09/13] task isolation: add preempt notifier to sync per-CPU vmstat dirty info to thread info Marcelo Tosatti
2022-03-16  2:41   ` Oscar Shiang
2022-04-27  7:11   ` Thomas Gleixner
2022-04-27 12:09     ` Thomas Gleixner
2022-05-04 16:32       ` Marcelo Tosatti
2022-05-04 17:39         ` Thomas Gleixner
2022-03-15 15:31 ` [patch v12 10/13] KVM: x86: process isolation work from VM-entry code path Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 11/13] mm: vmstat: move need_update Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 12/13] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
2022-04-27  7:23   ` Thomas Gleixner
2022-05-03 19:17     ` Marcelo Tosatti
2022-03-15 15:31 ` [patch v12 13/13] task isolation: only TIF_TASK_ISOL if task isolation is enabled Marcelo Tosatti
2022-04-27  7:45   ` Thomas Gleixner
2022-05-03 19:12     ` Marcelo Tosatti
2022-05-04 13:03       ` Thomas Gleixner
2022-03-17 15:08 ` [patch v12 00/13] extensible prctl task isolation interface and vmstat sync Frederic Weisbecker
2022-04-25 16:29   ` Marcelo Tosatti
2022-04-25 21:12     ` Thomas Gleixner
2022-05-03 18:57       ` Marcelo Tosatti
2022-04-27  9:19 ` Christoph Lameter
2022-05-03 18:57   ` Marcelo Tosatti
2022-05-04 13:20     ` Thomas Gleixner
2022-05-04 18:56       ` Marcelo Tosatti
2022-05-04 20:15         ` Thomas Gleixner
2022-05-05 16:52           ` Marcelo Tosatti
2022-06-01 16:14             ` Marcelo Tosatti
2022-05-04 17:01 ` Tim Chen
2022-05-04 20:08   ` Marcelo Tosatti [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YnLdP3Hq1d69NSzz@fuller.cnet \
    --to=mtosatti@redhat.com \
    --cc=abelits@belits.com \
    --cc=bristot@redhat.com \
    --cc=cl@linux.com \
    --cc=frederic@kernel.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nilal@redhat.com \
    --cc=nsaenzju@redhat.com \
    --cc=oscar0225@livemail.tw \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.