Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Frederic Weisbecker <fweisbec@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org, Kees Cook <keescook@chromium.org>,
	Will Drewry <wad@chromium.org>,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@linux-mips.org, linux-arch@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Alexei Starovoitov <ast@plumgrid.com>,
	hpa@zytor.com
Subject: Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
Date: Thu, 31 Jul 2014 20:47:31 +0200	[thread overview]
Message-ID: <20140731184729.GA12296@localhost.localdomain> (raw)
In-Reply-To: <20140731181230.GA18695@redhat.com>

On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Frederic Weisbecker wrote:
> > > >
> > > > I was convinced that the very first kernel init task is PID 0 then
> > > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > > idle task of the boot CPU.
> > >
> > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > by "swapper".
> >
> > Are you sure? It's called from start_kernel() which is init/0.
> 
> But do_initcalls() is called by kernel_init(), this is the init process which is
> going to exec /sbin/init later.
> 
> But this doesn't really matter,

Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.

> 
> > Maybe we should just enable it everywhere.
> 
> Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> of start_kernel(). The question is, I still can't understand why do we want to
> have the global TIF_NOHZ.

Because then the flags is inherited in forks. It's better than inheriting it on
context switch due to context switch being called much more often than fork.

> 
> > > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > > CPU_0, and CPU_1 can miss it anyway.
> > >
> > > > But global TIF_NOHZ should enforce context tracking everywhere.
> > >
> > > And this is what I can't understand. Lets return to my initial question, why
> > > we can't change __context_tracking_task_switch()
> > >
> > > 	void __context_tracking_task_switch(struct task_struct *prev,
> > > 					    struct task_struct *next)
> > > 	{
> > > 		if (context_tracking_cpu_is_enabled())
> > > 			set_tsk_thread_flag(next, TIF_NOHZ);
> > > 		else
> > > 			clear_tsk_thread_flag(next, TIF_NOHZ);
> > > 	}
> > >
> > > ?
> >
> > Well we can change it to global TIF_NOHZ
> >
> > > How can the global TIF_NOHZ help?
> >
> > It avoids that flag swap on task_switch.
> 
> Ah, you probably meant that we can kill context_tracking_task_switch() as
> we discussed.

Yeah.

> 
> But I meant another thing, TIF_NOHZ is already global because it is always
> set after context_tracking_cpu_set().
> 
> Performance-wise, this set/clear code above can be better because it avoids
> the slow paths on the untracked CPU's.

But all CPUs are tracked when context tracking is enabled. So there is no such
per CPU granularity.

> But let's ignore this, the question is
> why the change above is not correct? Or why it can make the things worse?

Which change above?

> 
> 
> > > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > > comes to user_enter(). But this case is very unlikely, certainly this can't
> > > explain why do we penalize the untracked CPU's ?
> >
> > It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
> > to userspace.
> 
> And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
> the necessary user_enter().

No, user_enter() is necessary on CPU 0 so that CPU 1 sees that it is in userspace
from context tracking POV.

> 
> > It's unlikely but possible. I actually observed that very easily on early testing.
> 
> Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
> is wrong?

Which change?

> 
> > And it's a big problem because then the CPU runs in userspace, possibly for a long while
> > in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> > for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> > userspace.
> >
> > It looks like a harmless race but it can have big consequences.
> 
> I see. Again, does the global TIF_NOHZ really help?

Yes, to remove the context switch overhead. But it doesn't change anything on those races.

> > Because calling context_switch_task_switch() on every context switch is costly.
> 
> See above. This depends, but forcing the slow paths on all CPU's can be more
> costly.

Yeah but I don't have a safe solution that avoids that.

> 
> > > And of course I can't understand exception_enter/exit(). Not to mention that
> > > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > > if we forget that the caller can migrate in between. Just because, once again,
> > > a tracked CPU can miss user_exit().
> >
> > You lost me on this. How can a tracked CPU miss user_exit()?
> 
> I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
> CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
> space leaving this cpu in IN_KERNEL state?

Yeah, so? :)

I'm pretty sure there is a small but important element here that makes us misunderstanding
what each others says. It's like we aren't taking about the same thing :)

> > That's how I implemented it first. But then I changed it the way it is now:
> > 6c1e0256fad84a843d915414e4b5973b7443d48d
> > ("context_tracking: Restore correct previous context state on exception exit")
> >
> > This is again due to the shift between hard and soft userspace boundaries.
> > user_mode(regs) checks hard boundaries only.
> >
> > Lets get back to our beloved example:
> >
> >           CPU 0                                  CPU 1
> >           ---------------------------------------------
> >
> >           returning from syscall {
> >                user_enter();
> >                exception {
> >                     exception_enter()
> >                     PREEMPT!
> >                     ----------------------->
> >                                                  //resume exception
> >                                                    exception_exit();
> >                                                    return to userspace
> 
> OK, thanks, so in this case we miss user_enter().
> 
> But again, we can miss it anyway if preemptions comes before "exception" ?
> if CPU 1 was in IN_KERNEL state.

No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.

WARNING: multiple messages have this Message-ID (diff)

From: fweisbec@gmail.com (Frederic Weisbecker)
To: linux-arm-kernel@lists.infradead.org
Subject: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
Date: Thu, 31 Jul 2014 20:47:31 +0200	[thread overview]
Message-ID: <20140731184729.GA12296@localhost.localdomain> (raw)
In-Reply-To: <20140731181230.GA18695@redhat.com>

On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Frederic Weisbecker wrote:
> > > >
> > > > I was convinced that the very first kernel init task is PID 0 then
> > > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > > idle task of the boot CPU.
> > >
> > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > by "swapper".
> >
> > Are you sure? It's called from start_kernel() which is init/0.
> 
> But do_initcalls() is called by kernel_init(), this is the init process which is
> going to exec /sbin/init later.
> 
> But this doesn't really matter,

Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.

> 
> > Maybe we should just enable it everywhere.
> 
> Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> of start_kernel(). The question is, I still can't understand why do we want to
> have the global TIF_NOHZ.

Because then the flags is inherited in forks. It's better than inheriting it on
context switch due to context switch being called much more often than fork.

> 
> > > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > > CPU_0, and CPU_1 can miss it anyway.
> > >
> > > > But global TIF_NOHZ should enforce context tracking everywhere.
> > >
> > > And this is what I can't understand. Lets return to my initial question, why
> > > we can't change __context_tracking_task_switch()
> > >
> > > 	void __context_tracking_task_switch(struct task_struct *prev,
> > > 					    struct task_struct *next)
> > > 	{
> > > 		if (context_tracking_cpu_is_enabled())
> > > 			set_tsk_thread_flag(next, TIF_NOHZ);
> > > 		else
> > > 			clear_tsk_thread_flag(next, TIF_NOHZ);
> > > 	}
> > >
> > > ?
> >
> > Well we can change it to global TIF_NOHZ
> >
> > > How can the global TIF_NOHZ help?
> >
> > It avoids that flag swap on task_switch.
> 
> Ah, you probably meant that we can kill context_tracking_task_switch() as
> we discussed.

Yeah.

> 
> But I meant another thing, TIF_NOHZ is already global because it is always
> set after context_tracking_cpu_set().
> 
> Performance-wise, this set/clear code above can be better because it avoids
> the slow paths on the untracked CPU's.

But all CPUs are tracked when context tracking is enabled. So there is no such
per CPU granularity.

> But let's ignore this, the question is
> why the change above is not correct? Or why it can make the things worse?

Which change above?

> 
> 
> > > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > > comes to user_enter(). But this case is very unlikely, certainly this can't
> > > explain why do we penalize the untracked CPU's ?
> >
> > It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
> > to userspace.
> 
> And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
> the necessary user_enter().

No, user_enter() is necessary on CPU 0 so that CPU 1 sees that it is in userspace
from context tracking POV.

> 
> > It's unlikely but possible. I actually observed that very easily on early testing.
> 
> Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
> is wrong?

Which change?

> 
> > And it's a big problem because then the CPU runs in userspace, possibly for a long while
> > in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> > for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> > userspace.
> >
> > It looks like a harmless race but it can have big consequences.
> 
> I see. Again, does the global TIF_NOHZ really help?

Yes, to remove the context switch overhead. But it doesn't change anything on those races.

> > Because calling context_switch_task_switch() on every context switch is costly.
> 
> See above. This depends, but forcing the slow paths on all CPU's can be more
> costly.

Yeah but I don't have a safe solution that avoids that.

> 
> > > And of course I can't understand exception_enter/exit(). Not to mention that
> > > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > > if we forget that the caller can migrate in between. Just because, once again,
> > > a tracked CPU can miss user_exit().
> >
> > You lost me on this. How can a tracked CPU miss user_exit()?
> 
> I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
> CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
> space leaving this cpu in IN_KERNEL state?

Yeah, so? :)

I'm pretty sure there is a small but important element here that makes us misunderstanding
what each others says. It's like we aren't taking about the same thing :)

> > That's how I implemented it first. But then I changed it the way it is now:
> > 6c1e0256fad84a843d915414e4b5973b7443d48d
> > ("context_tracking: Restore correct previous context state on exception exit")
> >
> > This is again due to the shift between hard and soft userspace boundaries.
> > user_mode(regs) checks hard boundaries only.
> >
> > Lets get back to our beloved example:
> >
> >           CPU 0                                  CPU 1
> >           ---------------------------------------------
> >
> >           returning from syscall {
> >                user_enter();
> >                exception {
> >                     exception_enter()
> >                     PREEMPT!
> >                     ----------------------->
> >                                                  //resume exception
> >                                                    exception_exit();
> >                                                    return to userspace
> 
> OK, thanks, so in this case we miss user_enter().
> 
> But again, we can miss it anyway if preemptions comes before "exception" ?
> if CPU 1 was in IN_KERNEL state.

No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.

next prev parent reply	other threads:[~2014-07-31 18:47 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-22  1:49 [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes Andy Lutomirski
2014-07-22  1:49 ` Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 1/8] seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing Andy Lutomirski
2014-07-22  1:49   ` [PATCH v3 1/8] seccomp, x86, arm, mips, s390: " Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 2/8] seccomp: Refactor the filter callback and the API Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 3/8] seccomp: Allow arch code to provide seccomp_data Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:49 ` [PATCH v3 4/8] seccomp: Document two-phase seccomp and arch-provided seccomp_data Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:49   ` Andy Lutomirski
2014-07-22  1:53 ` [PATCH v3 5/8] x86,x32,audit: Fix x32's AUDIT_ARCH wrt audit Andy Lutomirski
2014-07-22  1:53   ` Andy Lutomirski
2014-07-22  1:53 ` [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases Andy Lutomirski
2014-07-22  1:53   ` Andy Lutomirski
2014-07-28 17:37   ` Oleg Nesterov
2014-07-28 17:37     ` Oleg Nesterov
2014-07-28 18:58     ` TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases) Oleg Nesterov
2014-07-28 18:58       ` Oleg Nesterov
2014-07-28 19:22       ` Frederic Weisbecker
2014-07-28 19:22         ` Frederic Weisbecker
2014-07-29 17:54         ` Oleg Nesterov
2014-07-29 17:54           ` Oleg Nesterov
2014-07-30 16:35           ` Frederic Weisbecker
2014-07-30 16:35             ` Frederic Weisbecker
2014-07-30 17:46             ` Oleg Nesterov
2014-07-30 17:46               ` Oleg Nesterov
2014-07-31  0:30               ` Frederic Weisbecker
2014-07-31  0:30                 ` Frederic Weisbecker
2014-07-31 16:03                 ` Oleg Nesterov
2014-07-31 16:03                   ` Oleg Nesterov
2014-07-31 17:13                   ` Frederic Weisbecker
2014-07-31 17:13                     ` Frederic Weisbecker
2014-07-31 18:12                     ` Oleg Nesterov
2014-07-31 18:12                       ` Oleg Nesterov
2014-07-31 18:47                       ` Frederic Weisbecker [this message]
2014-07-31 18:47                         ` Frederic Weisbecker
2014-07-31 18:50                         ` Frederic Weisbecker
2014-07-31 18:50                           ` Frederic Weisbecker
2014-07-31 19:05                           ` Oleg Nesterov
2014-07-31 19:05                             ` Oleg Nesterov
2014-08-02 17:30                         ` Oleg Nesterov
2014-08-02 17:30                           ` Oleg Nesterov
2014-08-04 12:02                           ` Paul E. McKenney
2014-08-04 12:02                             ` Paul E. McKenney
2014-08-04 12:02                             ` Paul E. McKenney
2014-07-28 20:23     ` [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases Andy Lutomirski
2014-07-28 20:23       ` Andy Lutomirski
2014-07-29 16:54       ` Oleg Nesterov
2014-07-29 16:54         ` Oleg Nesterov
2014-07-29 17:01         ` Andy Lutomirski
2014-07-29 17:01           ` Andy Lutomirski
2014-07-29 17:31           ` Oleg Nesterov
2014-07-29 17:31             ` Oleg Nesterov
2014-07-29 17:55             ` Andy Lutomirski
2014-07-29 17:55               ` Andy Lutomirski
2014-07-29 18:16               ` Oleg Nesterov
2014-07-29 18:16                 ` Oleg Nesterov
2014-07-29 18:22                 ` Andy Lutomirski
2014-07-29 18:22                   ` Andy Lutomirski
2014-07-29 18:44                   ` Oleg Nesterov
2014-07-29 18:44                     ` Oleg Nesterov
2014-07-22  1:53 ` [PATCH v3 7/8] x86_64,entry: Treat regs->ax the same in fastpath and slowpath syscalls Andy Lutomirski
2014-07-22  1:53   ` [PATCH v3 7/8] x86_64, entry: " Andy Lutomirski
2014-07-22  1:53 ` [PATCH v3 8/8] x86_64,entry: Use split-phase syscall_trace_enter for 64-bit syscalls Andy Lutomirski
2014-07-22  1:53   ` [PATCH v3 8/8] x86_64, entry: " Andy Lutomirski
2014-07-22 19:37 ` [PATCH v3 0/8] Two-phase seccomp and x86 tracing changes Kees Cook
2014-07-22 19:37   ` Kees Cook
2014-07-23 19:20   ` Andy Lutomirski
2014-07-23 19:20     ` Andy Lutomirski
2014-07-28 17:59     ` H. Peter Anvin
2014-07-28 17:59       ` H. Peter Anvin
2014-07-28 23:29       ` Kees Cook
2014-07-28 23:29         ` Kees Cook
2014-07-28 23:34         ` H. Peter Anvin
2014-07-28 23:34           ` H. Peter Anvin
2014-07-28 23:42           ` Kees Cook
2014-07-28 23:42             ` Kees Cook
2014-07-28 23:45             ` H. Peter Anvin
2014-07-28 23:45               ` H. Peter Anvin
2014-07-28 23:54               ` Kees Cook
2014-07-28 23:54                 ` Kees Cook

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140731184729.GA12296@localhost.localdomain \
    --to=fweisbec@gmail.com \
    --cc=ast@plumgrid.com \
    --cc=hpa@zytor.com \
    --cc=keescook@chromium.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@linux-mips.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=wad@chromium.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.