Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)


From: Oleg Nesterov <oleg@redhat.com>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org, Kees Cook <keescook@chromium.org>,
	Will Drewry <wad@chromium.org>,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@linux-mips.org, linux-arch@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Alexei Starovoitov <ast@plumgrid.com>,
	hpa@zytor.com
Subject: Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
Date: Thu, 31 Jul 2014 18:03:53 +0200
Message-ID: <20140731160353.GA14772@redhat.com>
In-Reply-To: <20140731003034.GA32078@localhost.localdomain>

On 07/31, Frederic Weisbecker wrote:
>
> On Wed, Jul 30, 2014 at 07:46:30PM +0200, Oleg Nesterov wrote:
> > On 07/30, Frederic Weisbecker wrote:
> > >
> > > On Tue, Jul 29, 2014 at 07:54:14PM +0200, Oleg Nesterov wrote:
> > >
> > > >
> > > > Looks like, we can kill context_tracking_task_switch() and simply change the
> > > > "__init" callers of context_tracking_cpu_set() to do set_thread_flag(TIF_NOHZ) ?
> > > > Then this flag will be propagated by copy_process().
> > >
> > > Right, that would be much better. Good catch! context tracking is enabled from
> > > tick_nohz_init(). This is the init 0 task so the flag should be propagated from there.
> >
> > actually init 1 task, but this doesn't matter.
>
> Are you sure? It does matter because that would invalidate everything I understood
> about init/main.c :)

Sorry for the confusion ;)

> I was convinced that the very first kernel init task is PID 0 then
> it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> idle task of the boot CPU.

Yes sure. But context_tracking_cpu_set() is called by the init task with PID 1,
not by "swapper". And we do not care about idle threads at all.

> > > I still think we need a for_each_process_thread() set as well though because some
> > > kernel threads may well have been created at this stage already.
> >
> > Yes... Or we can add set_thread_flag(TIF_NOHZ) into ____call_usermodehelper().
>
> Couldn't there be some other tasks than usermodehelper stuffs at this stage? Like workqueues
> or random kernel threads?

Sure, but we do not care. A kernel thread can never return to user space, so it
must never call user_enter/exit().

> > I meant that in the scenario you described above the "global" TIF_NOHZ doesn't
> > really make a difference, afaics.
> >
> > Lets assume that context tracking is only enabled on CPU 1. To simplify,
> > assume that we have a single usermode task T which sleeps in kernel mode.
> >
> > So context_tracking[0].state == context_tracking[1].state == IN_KERNEL.
> >
> > T wakes up on CPU_0, returns to user space, calls user_enter(). This sets
> > context_tracking[0].state = IN_USER but otherwise does nothing else, this
> > CPU is not tracked and .active is false.
> >
> > Right after local_irq_restore() this task can migrate to CPU_1 and finish
> > its ret-to-usermode path. But since it had already passed user_enter() we
> > do not change context_tracking[1].state and do not play with rcu/vtime.
> > (unless this task hits SCHEDULE_USER in asm).
> >
> > The same for user_exit() of course.
>
> So indeed if context tracking is enabled on CPU 1 and not in CPU 0, we risk
> such situation where CPU 1 has wrong context tracking.

OK. To simplify, let's discuss user_enter() only. So, it is actually a nop on
CPU_0, and CPU_1 can miss it anyway.
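
The scenario above can be replayed with a small userspace model. This is only a sketch, not kernel code: sim_user_enter() mirrors the logic of user_enter() as described in this thread, while sim_reset(), sim_migration_race() and the rcu_vtime_hooks[] counter are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 2

enum ctx_state { IN_KERNEL = 0, IN_USER = 1 };

/* Simplified per-CPU context tracking state (model only). */
struct context_tracking {
	bool active;		/* true if this CPU is tracked */
	enum ctx_state state;
};

struct context_tracking context_tracking[NR_CPUS];
int rcu_vtime_hooks[NR_CPUS];	/* hypothetical: counts rcu/vtime work done */

/* Every CPU starts IN_KERNEL and untracked. */
void sim_reset(void)
{
	memset(context_tracking, 0, sizeof(context_tracking));
	memset(rcu_vtime_hooks, 0, sizeof(rcu_vtime_hooks));
}

/* Model of user_enter() running on @cpu. */
void sim_user_enter(int cpu)
{
	struct context_tracking *ct = &context_tracking[cpu];

	if (ct->state != IN_USER) {
		if (ct->active)
			rcu_vtime_hooks[cpu]++;	/* rcu/vtime only on tracked CPUs */
		ct->state = IN_USER;
	}
}

/*
 * T calls user_enter() on untracked CPU 0, then migrates to tracked CPU 1
 * and finishes its return to user mode there.
 * Returns true if CPU 1 never noticed the transition.
 */
bool sim_migration_race(void)
{
	int cpu = 0;

	sim_reset();
	context_tracking[1].active = true;	/* only CPU 1 is tracked */

	sim_user_enter(cpu);	/* only flips context_tracking[0].state */
	cpu = 1;		/* migration after local_irq_restore() */

	/* T is past user_enter(), so nothing more happens on CPU 1. */
	return context_tracking[cpu].state == IN_KERNEL &&
	       rcu_vtime_hooks[cpu] == 0;
}
```

In this model sim_migration_race() returns true: T runs in user mode on CPU 1 while context_tracking[1].state still says IN_KERNEL, and no rcu/vtime work was done on either CPU.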

> But global TIF_NOHZ should enforce context tracking everywhere.

And this is what I can't understand. Let's return to my initial question: why
can't we change __context_tracking_task_switch() to

	void __context_tracking_task_switch(struct task_struct *prev,
					    struct task_struct *next)
	{
		if (context_tracking_cpu_is_enabled())
			set_tsk_thread_flag(next, TIF_NOHZ);
		else
			clear_tsk_thread_flag(next, TIF_NOHZ);
	}

? How can the global TIF_NOHZ help?

OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
reaches user_enter(). But this case is very unlikely, and it certainly can't
explain why we penalize the untracked CPUs?
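
To put a rough number on "penalize", here is a toy userspace model comparing the two schemes. It is a sketch under stated assumptions: only CPU 1 is tracked, a set TIF_NOHZ always means the slow path on return to user mode, and struct task, sim_task_switch(), sim_count_slow_paths() and sim_demo_slow_paths() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

/* Minimal stand-in for a task; only the flag matters here. */
struct task {
	bool tif_nohz;
};

/* Only CPU 1 is tracked, as in the scenario discussed in this thread. */
const bool cpu_is_tracked[NR_CPUS] = { false, true };

/* Model of the per-switch variant: the flag follows the CPU we run on. */
void sim_task_switch(struct task *next, int cpu)
{
	next->tif_nohz = cpu_is_tracked[cpu];
}

/*
 * The task is scheduled on each CPU in @cpus in turn and returns to user
 * mode once per entry. Count how many of those returns take the slow path.
 */
int sim_count_slow_paths(bool global_flag, const int *cpus, int n)
{
	struct task t = { .tif_nohz = global_flag };
	int slow = 0;

	for (int i = 0; i < n; i++) {
		if (!global_flag)
			sim_task_switch(&t, cpus[i]);
		if (t.tif_nohz)
			slow++;
	}
	return slow;
}

/* Fixed example schedule: three returns on CPU 0, one on CPU 1. */
int sim_demo_slow_paths(bool global_flag)
{
	const int cpus[] = { 0, 0, 1, 0 };

	return sim_count_slow_paths(global_flag, cpus, 4);
}
```

With the global flag all four returns take the slow path. With the per-switch flag only the one on CPU 1 does. The model deliberately ignores the migration window mentioned above, which is unlikely either way.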

> And also it's
> less context switch overhead.

Why???

I think I have a blind spot here. Help!

And of course I can't understand exception_enter/exit(). Not to mention that
(afaics) "prev_ctx == IN_USER" in exception_exit() can be a false positive even
if we forget that the caller can migrate in between. Just because, once again,
a tracked CPU can miss user_exit().

So, why not

	static inline void exception_enter(void)
	{
		user_exit();
	}

	static inline void exception_exit(struct pt_regs *regs)
	{
		if (user_mode(regs))
			user_enter();
	}

?
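
The false positive can be shown with a userspace model of a single tracked CPU. This is a sketch, not the kernel implementation. The "cur" variants assume that exception_enter() saves the per-CPU state as prev_ctx and exception_exit() tests it, which is how the code is described above. The "new" variants model the proposal, with user_mode(regs) replaced by a plain bool. All sim_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx_state { IN_KERNEL = 0, IN_USER = 1 };

/* State of one tracked CPU (model only). */
enum ctx_state cpu_state;
int user_enter_calls;

void sim_user_exit(void)
{
	cpu_state = IN_KERNEL;
}

void sim_user_enter(void)
{
	cpu_state = IN_USER;
	user_enter_calls++;
}

/* Assumed current scheme: remember the per-CPU state across the exception. */
enum ctx_state sim_exception_enter_cur(void)
{
	enum ctx_state prev_ctx = cpu_state;

	sim_user_exit();
	return prev_ctx;
}

void sim_exception_exit_cur(enum ctx_state prev_ctx)
{
	if (prev_ctx == IN_USER)
		sim_user_enter();
}

/* Proposed scheme: decide from the interrupted context instead. */
void sim_exception_enter_new(void)
{
	sim_user_exit();
}

void sim_exception_exit_new(bool from_user_mode)
{
	if (from_user_mode)	/* stands in for user_mode(regs) */
		sim_user_enter();
}

/*
 * An exception taken in kernel mode on a CPU whose state is a stale
 * IN_USER because it missed user_exit().
 * Returns how many times user_enter() ran on the way out.
 */
int sim_kernel_exception(bool use_proposal)
{
	cpu_state = IN_USER;	/* stale: this CPU missed user_exit() */
	user_enter_calls = 0;

	if (use_proposal) {
		sim_exception_enter_new();
		sim_exception_exit_new(false);	/* came from kernel mode */
	} else {
		sim_exception_exit_cur(sim_exception_enter_cur());
	}
	return user_enter_calls;
}
```

In this model the prev_ctx variant calls user_enter() once although the exception returns to kernel mode, and the CPU is left IN_USER again. The user_mode(regs) variant calls it zero times and leaves the CPU IN_KERNEL.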

Oleg.

