Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)

From: Frederic Weisbecker <fweisbec@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org, Kees Cook <keescook@chromium.org>,
	Will Drewry <wad@chromium.org>,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@linux-mips.org, linux-arch@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Alexei Starovoitov <ast@plumgrid.com>,
	hpa@zytor.com
Subject: Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
Date: Thu, 31 Jul 2014 02:30:37 +0200
Message-ID: <20140731003034.GA32078@localhost.localdomain>
In-Reply-To: <20140730174630.GA30862@redhat.com>

On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> On 07/30, Frederic Weisbecker wrote:
> >
> > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> >
> > >
> > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > Then this flag will be propagated by copy_process().
> >
> > Right, that would be much better. Good catch! context tracking is enabled from
> > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> 
> actually init 1 task, but this doesn't matter.

Are you sure? It does matter, because that would invalidate everything I understood
about init/main.c :) I was convinced that the very first kernel init task is PID 0, which
then forks in rest_init() to launch the userspace init with PID 1. After that, init/0
becomes the idle task of the boot CPU.

> 
> > I still think we need a for_each_process_thread() set as well though because some
> > kernel threads may well have been created at this stage already.
> 
> Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().

Couldn't there be tasks other than the usermodehelper ones at this stage? Like workqueues
or random kernel threads?
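
To make the concern concrete, here is a toy userspace model, not kernel code. It
assumes, as stated above, that a child inherits its parent's flags at fork time
(the copy_process() behaviour). Every name in it (task_model, fork_model,
TIF_NOHZ_MODEL, set_nohz_on_all, count_missing_nohz) is hypothetical and exists
only for illustration. It shows why setting the flag on the init task alone would
miss any task forked before that point, and why a sweep over all existing tasks
would catch them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model of flag inheritance. NOT kernel code. */
#define MAX_TASKS 8
#define TIF_NOHZ_MODEL (1u << 0)   /* hypothetical stand-in for TIF_NOHZ */

struct task_model {
    int pid;
    unsigned int flags;
};

static struct task_model tasks[MAX_TASKS];
static size_t nr_tasks;

/* Models copy_process(): the child starts with a copy of the parent's flags. */
static struct task_model *fork_model(const struct task_model *parent, int pid)
{
    if (nr_tasks >= MAX_TASKS)
        return NULL;

    struct task_model *child = &tasks[nr_tasks++];
    child->pid = pid;
    child->flags = parent ? parent->flags : 0;
    return child;
}

/* Models a for_each_process_thread() sweep that sets the flag everywhere. */
static void set_nohz_on_all(void)
{
    for (size_t i = 0; i < nr_tasks; i++)
        tasks[i].flags |= TIF_NOHZ_MODEL;
}

/* Counts the tasks that would still escape context tracking. */
static size_t count_missing_nohz(void)
{
    size_t missing = 0;

    for (size_t i = 0; i < nr_tasks; i++)
        if (!(tasks[i].flags & TIF_NOHZ_MODEL))
            missing++;
    return missing;
}
```

In this model, a task forked from init/0 before the flag is set keeps a clear flag
until set_nohz_on_all() runs, while a task forked afterwards inherits it for free.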

> 
> > > Or I am totally confused? (quite possible).
> > >
> > > > So here is a scenario where this is a problem: a task runs on CPU 0, passes the context
> > > > tracking call before returning from a syscall to userspace, and gets an interrupt. The
> > > > interrupt preempts the task and it moves to CPU 1. So it returns from preempt_schedule_irq()
> > > > after which it is going to resume to userspace.
> > > >
> > > > In this scenario, if context tracking is only enabled on CPU 1, we have no way to know that
> > > > the task is resuming to userspace, because we passed through the context tracking probe
> > > > already and it was ignored on CPU 0.
> > >
> > > Thanks. But I still can't understand... So if we only track CPU 1, then in this
> > > case context_tracking.state == IN_USER on CPU 0, but it can be IN_USER or IN_KERNEL
> > > on CPU 1.
> >
> > I'm not sure I understand your question.
> 
> Probably because it was stupid. Seriously, I still have no idea what this code
> actually does.
> 
> > Context tracking is either enabled everywhere or
> > nowhere.
> >
> > I need to say though that there is a per CPU context tracking state named context_tracking.active.
> > It's confusing because it suggests that context tracking is active per CPU. Actually it's tracked
> > everywhere when globally enabled, but active determines if we call the RCU and vtime callbacks or
> > not.
> >
> > So only nohz full CPUs have context_tracking.active set because only these need to call the RCU
> > and vtime callbacks. Other CPUs still do the context tracking but they won't call rcu and vtime
> > functions.
> 
> I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> really make a difference, afaics.
> 
> Lets assume that context tracking is only enabled on CPU 1. To simplify,
> assume that we have a single usermode task T which sleeps in kernel mode.
> 
> So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> 
> T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> context_tracking[0].state = IN_USER but otherwise does nothing else, this
> CPU is not tracked and .active is false.
> 
> Right after local_irq_restore() this task can migrate to CPU_1 and finish
> its ret-to-usermode path. But since it had already passed user_enter() we
> do not change context_tracking[1].state and do not play with rcu/vtime.
> (unless this task hits SCHEDULE_USER in asm).
> 
> The same for user_exit() of course.

So indeed, if context tracking is enabled on CPU 1 and not on CPU 0, we risk
a situation where CPU 1 has the wrong context tracking state.

But a global TIF_NOHZ should enforce context tracking everywhere. It also means
less context switch overhead.
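
For reference, the migration scenario quoted above can be sketched as a toy
userspace model. This is not kernel code: the per-CPU array, user_enter_model()
and run_scenario() are hypothetical names, and the "callbacks" counter is only a
stand-in for the RCU and vtime calls. The model covers just the case where CPU 0
is not tracked and CPU 1 is. It does not model SCHEDULE_USER or any fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of the per-CPU context tracking state. NOT kernel code. */
enum ctx_state { IN_KERNEL = 0, IN_USER };

struct ctx_tracking_model {
    bool active;            /* stand-in for context_tracking.active */
    enum ctx_state state;   /* stand-in for context_tracking.state */
    int callbacks;          /* times the RCU/vtime stand-in ran */
};

#define NR_CPUS_MODEL 2

/* CPU 0 is not tracked, CPU 1 is. Both start in kernel mode. */
static struct ctx_tracking_model ct[NR_CPUS_MODEL] = {
    { .active = false, .state = IN_KERNEL, .callbacks = 0 },
    { .active = true,  .state = IN_KERNEL, .callbacks = 0 },
};

/* Models user_enter(): always record the state, call back only if active. */
static void user_enter_model(int cpu)
{
    if (ct[cpu].active)
        ct[cpu].callbacks++;
    ct[cpu].state = IN_USER;
}

/*
 * Task T passes the probe on CPU 0, then gets preempted and migrated
 * before it actually reaches userspace. Returns the CPU on which T
 * finally runs in user mode.
 */
static int run_scenario(void)
{
    int cpu = 0;

    user_enter_model(cpu);  /* probe runs on the untracked CPU 0 */
    cpu = 1;                /* migration, no second probe on CPU 1 */
    return cpu;
}
```

After run_scenario(), the task runs in user mode on CPU 1 while ct[1].state is
still IN_KERNEL and no callback has run there, which is the stale state described
above.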

> 
> Oleg.
>
