Re: [patch v4 1/8] add basic task isolation prctl interface

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, Nitesh Lal <nilal@redhat.com>,
	Nicolas Saenz Julienne <nsaenzju@redhat.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Christoph Lameter <cl@linux.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Alex Belits <abelits@belits.com>, Peter Xu <peterx@redhat.com>
Subject: Re: [patch v4 1/8] add basic task isolation prctl interface
Date: Wed, 13 Oct 2021 07:56:37 -0300	[thread overview]
Message-ID: <20211013105637.GA88322@fuller.cnet> (raw)
In-Reply-To: <YWWIHkoAdTkzU0TP@hirez.programming.kicks-ass.net>

[-- Attachment #1: Type: text/plain, Size: 5840 bytes --]

Hi Peter,

On Tue, Oct 12, 2021 at 03:05:34PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing 
> > latency sensitive code (where interruptions are undesired).
> > 
> > Interface is described by task_isolation.rst (added by
> > next patch).
> 
> That does not absolve you from actually writing a changelog here.
> Life is too short to try and read rst shit.

The rst is concise and contains all necessary information.

Changelog is on the patch header (I would appreciate reviews of
the interface itself, not sure why the changelog is important).

The rst compiled in PDF form is attached. Its only 6 pages long, it
described the interface (please if you think of any improvement 
to that document, and not only the interface).

Also check the examples on the second patch.

"===============================
Task isolation prctl interface
===============================

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth networking
applications with user-space drivers can fall into the category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel the start and
end of the latency sensitive application section (with configurable
system behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

        - PR_ISOL_FEAT_GET: Retrieve supported features.
        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
        - PR_ISOL_CFG_SET: Set task isolation configuration.
        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state."

> What is the envisioned usage of these isolating prctl() thingies,

To inform the kernel that an application is entering/exiting 
task isolated mode, where the kernel can behave in different ways
(it depends on what commands you implement using the interface).

At the moment, only a single action is performed: to sync any pending
per-CPU vm statistics that might wake up a CPU's vmstat_worker 
thread.

The point of the interface, as requested, is that it is extensible 
allowing for new features to be implemented.

For example, one could implement blocking of certain kernel activities
which require interruption to userspace.

> including the kill-me-on-any-interruption thing, vs the inherently racy
> nature of some of the don't disturb me stuff.

The kill-me-on-any-interruption thing is not included in this patchset
(maybe it can be implemented using this interface, if desired, but we
do not see the need for such feature at the moment).

The inherently racy nature of some of the don't disturb me stuff:

> 
> Also, see:
> 
>   https://lkml.kernel.org/r/20210929152429.186930629@infradead.org

As you can see from the below pseudocode, we were thinking of queueing
the (invalidate icache or TLB flush) in case app is in userspace,
to perform on return to kernel space, but the approach in your patch might be
superior (will take sometime to parse that thread...).

> Suppose:
> 
> 	CPU0					CPU1
> 
> 	sys_prctl()
> 	<kernel entry>
> 	  // marks task 'important'
> 						text_poke_sync()
> 						  // checks CPU0, not userspace, queues IPI
> 	<kernel exit>
> 
> 	$important userspace			  arch_send_call_function_ipi_mask()
> 	<IPI>
> 	  // finds task is 'important' and
> 	  // can't take interrupts
> 	  sigkill()
> 
> *Whoopsie*
> 
> 
> Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> to userspace, it'll get killed on interruption. Catch-22.

We have been using a BPF tool for logging and monitoring of
interruptions:
https://github.com/xzpeter/rt-trace-bpf

But there is no such thing as 

         // finds task is 'important' and
         // can't take interrupts
         sigkill()

On this patchset.

We have been discussing something like this to avoid TLBs / invalidate
i-cache IPIs, but 

 #define CPU_REQ_FLUSH_TLB       0x1     /*      __flush_tlb_all()       */ 
 #define CPU_REQ_INV_ICACHE      0x2     /*      sync_core()             */ 

 #define IN_USER_MODE            (0x1 << 16) 

 /* when CPU is in kernel mode, should IPI rather than queuing the 
    request on per-cpu state variable */ 
 #define IN_KERNEL_MODE                (0) 

 Then on entry/exit would have to add a conditional branch: 

 Exit to userspace: 
 ----------------- 

       cpu = smp_processor_id(); 

       if (isolation_enabled(cpu)) { 
               atomic_or(IN_USER_MODE, &percpudata->user_kernel_state); 
       } 

 Kernel entry: 
 ------------- 

       cpu = smp_processor_id(); 

       if (isolation_enabled(cpu)) { 
               reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE); 
               if (reqs & CPU_REQ_FLUSH_TLB) 
                       flush_tlb_all(); 
               if (reqs & CPU_REQ_INV_ICACHE) 
                       invalidate_icache(); 
       } 

 Request side: 
 ------------- 

       int targetcpu; 

       do { 
               struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu); 

               old_state = pcpudata->user_kernel_state; 

               /* in kernel mode ? */ 
               if (!(old_state & IN_USER_MODE)) { 
                       smp_call_function_single(request_fn, targetcpu, 1); 
                       break; 
               } 
               new_state = remote_state | CPU_REQ_FLUSH_TLB; // (or CPU_REQ_INV_ICACHE)
       } while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state); 

Thanks for your comments!

[-- Attachment #2: task_isolation.pdf --]
[-- Type: application/pdf, Size: 90947 bytes --]

next prev parent reply	other threads:[~2021-10-13 11:37 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-07 19:23 [patch v4 0/8] extensible prctl task isolation interface and vmstat sync Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 1/8] add basic task isolation prctl interface Marcelo Tosatti
2021-10-12 13:05   ` Peter Zijlstra
2021-10-13 10:56     ` Marcelo Tosatti [this message]
2021-10-13 11:37       ` Marcelo Tosatti
2021-10-13 15:01       ` Peter Zijlstra
2021-10-13 16:06         ` Marcelo Tosatti
2021-10-14 13:02           ` Marcelo Tosatti
2021-10-19 15:07         ` Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 2/8] add prctl task isolation prctl docs and samples Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 3/8] task isolation: sync vmstats on return to userspace Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 4/8] procfs: add per-pid task isolation state Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 5/8] task isolation: sync vmstats conditional on changes Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 6/8] KVM: x86: call isolation prepare from VM-entry code path Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 7/8] mm: vmstat: move need_update Marcelo Tosatti
2021-10-07 19:23 ` [patch v4 8/8] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211013105637.GA88322@fuller.cnet \
    --to=mtosatti@redhat.com \
    --cc=abelits@belits.com \
    --cc=cl@linux.com \
    --cc=frederic@kernel.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nilal@redhat.com \
    --cc=nsaenzju@redhat.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox