public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>,
	Jason Baron <jbaron@redhat.com>,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	laijs@cn.fujitsu.com, rostedt@goodmis.org, peterz@infradead.org,
	jiayingz@google.com, mbligh@google.com, lizf@cn.fujitsu.com,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>
Subject: Re: [PATCH 08/12] add trace events for each syscall entry/exit
Date: Tue, 25 Aug 2009 14:31:19 -0400	[thread overview]
Message-ID: <20090825183119.GC2448@Krystal> (raw)
In-Reply-To: <20090825173107.GJ6114@nowhere>

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Tue, Aug 25, 2009 at 06:59:14PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > > * Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> > > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > > > 
> > > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > > >    tracing code resets it.
> > > > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > > > 
> > > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > > >    kernel threads.
> > > > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > > >    trace points:
> > > > > >       - The kernel thread is started as ususal (do_fork()),
> > > > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > > > >       - the ret_from_fork() function is triggered and starts
> > > > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > > > 
> > > > > 
> > > > > 
> > > > > I wonder if there is any way to identify such situation...?
> > > > For the second case, it might be an option to avoid setting the
> > > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > > > 
> > > > Kernel threads have task_struct->mm set to NULL.
> > > > (Thanks to Heiko for that hint ;-)
> > > > 
> > > > The idea is then to check the mm field in syscall_regfunc() and
> > > > set the flag accordingly.
> > > > 
> > > > However, I think the patch is an optional add-on becase checking
> > > > the syscall number is still required for case 1).
> > > > 
> > > > ---
> > > >  kernel/tracepoint.c |    4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > --- a/kernel/tracepoint.c
> > > > +++ b/kernel/tracepoint.c
> > > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > > >  	if (!sys_tracepoint_refcount) {
> > > >  		read_lock_irqsave(&tasklist_lock, flags);
> > > >  		do_each_thread(g, t) {
> > > > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > > +			/* Skip kernel threads. */
> > > > +			if (t->mm)
> > > > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > 
> > > Uh ? kernel threads can invoke a system call. There are rare places
> > > where kernel code actually invoke system calls. I don't see why we
> > > should not deal with them.
> > 
> > 
> > 
> > Yeah they do, but they don't use the sysenter path, they call the
> > syscall helpers directly, such as do_fork() or things like that.
> > 
> > The syscall tracepoints are set in the sysenter/sysexit path, then
> > it's no use to trace the kernel threads, it doesn't have any effect,
> > except random results in case of fork() calls, because we take
> > the ret_from_fork() path that also ends up to trace_sys_exit()
> > if the TIF_SYSCALL_TRACEPOINT thing is set, leading to such
> > asymetric tracing.
> > 
> > Kernel threads use syscalls toward wrappers such as create_thread().
> > So instead, statically defined tracepoints in create_thread() and such
> > other syscall wrappers for kernel threads seem more valuable, hmm?
> > 
> >  
> > > Moreover, the problem you face is more general: if we set the
> > > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > > system call, x86_64 will cause the syscall exit to execute by re-reading
> > > the thread flags and run a syscall trace exit.
> > 
> > 
> > Well, I don't think that's the problem. The issue here, if I understand
> > correctly, is that kernel threads don't take the sysenter path, then never hit
> > the trace_sys_enter() call. And usually they won't ever hit any
> > trace_sys_exit() calls except in the fork() case, because we take
> > the ret_from_fork() path, which lead to syscall exit tracing due
> > to the TIF flags set.
> > 
> > At this stage, the syscall number is supposed to be stored in orig_eax,
> > but because the kernel thread hasn't called fork() through a syscall and
> > has called do_fork() directly, the regs values have nothing that look
> > like syscall parameters.
> 
> 
> 
> I mean, I don't know how look like orig_eax at this stage.
> 
> Looking at arch/x86/kernel/process_32.c:copy_thread()
> 
> childregs = task_pt_regs(p);
> *childregs = *regs;
> childregs->ax = 0;
> childregs->sp = sp;
> 
> p->thread.sp = (unsigned long) childregs;
> p->thread.sp0 = (unsigned long) (childregs+1);
> 
> p->thread.ip = (unsigned long) ret_from_fork;
> 
> 
> sp will be the struct pt_regs * passed to syscall_trace_leave()
> later.
> 
> ax has the result of the fork syscall -> 0 for the child.
> What about orig_eax which has the syscall nr? It depends on the pareent
> and I don't know what it has at this stage.
> 
> I haven't seen crashes in x86 with kernel threads tracing, may be because
> orig_eax is set to a valid syscall nr (may be even fork nr).
> 
> Perhaps it's not the case in s390 ?
> 
> Anyway, tracing kernel threads syscalls only gives us the fork return,
> so it's something me may want to drop and trace higher level kernel
> thread syscall wrappers instead.
> 
> Moreover every kernel threads is created through a kthreadd fork if
> I'm not wrong, then it wouldn't be an accurate thing for us to trace the
> fork calls in kernel thread. Tracing higher level kernel thread managment
> sounds more interesting, we would then know who really created the thread,
> etc...
> 

(Well, I do not have time currently to look into the gory details
(sorry), but let's try to take a step back from the problem.)

The design proposal for this kthread behavior wrt syscalls is based on a
very specific and current kernel behavior, that may happen to change and
that I have actually seen proven incorrect. For instance, some
proprietary Linux driver does very odd things with system calls within
kernel threads, like invoking them with int 0x80.

Yes, this is odd, but do we really want to tie the tracer that much to
the actual OS implementation specificities ?

That sounds like a recipe for endless breakages and missing bits of
instrumentation.

So my advice would be: if we want to trace the syscall entry/exit paths,
let's trace them for the _whole_ system, and find ways to make it work
for corner-cases rather than finding clever ways to diminish
instrumentation coverage.

Given the ret from fork example happens to be the first event fired
after the thread is created, we should be able to deal with this problem
by initializing the thread structure used by syscall exit tracing to an
initial "ret from fork" value.

Mathieu

> 
> > 
> > I guess we don't need to take the sys_enter tracing path to have a sane
> > orig_eax in the sys_exit tracing path (for non kernel threads).
> > Though I'm not sure about that, I should check to be sure.
> >
> > > We could simply initialize the "saved system calls id" number to
> > > something like -1, so that if we happen to return from a syscall that
> > > did not get its id recorded at syscall entry, we know it because it's
> > > not initialized.
> > > 
> > > We would need to carefully put back the -1 value after clearing the
> > > thread flag when we stop tracing too (while still holding a mutex).
> > > 
> > > Mathieu
> > > 
> > > >  		} while_each_thread(g, t);
> > > >  		read_unlock_irqrestore(&tasklist_lock, flags);
> > > >  	}
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

  reply	other threads:[~2009-08-25 18:31 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
2009-08-10 20:52 ` [PATCH 01/12] map syscall name to number Jason Baron
2009-08-10 20:52 ` [PATCH 02/12] call arch_init_ftrace_syscalls at boot Jason Baron
2009-08-10 20:52 ` [PATCH 03/12] add DECLARE_TRACE_WITH_CALLBACK() macro Jason Baron
2009-08-10 20:52 ` [PATCH 04/12] add syscall tracepoints Jason Baron
2009-08-10 20:52 ` [PATCH 05/12] update FTRACE_SYSCALL_MAX Jason Baron
2009-08-11 11:00   ` Frederic Weisbecker
2009-08-11 19:39     ` Matt Fleming
2009-08-24 13:41     ` Paul Mundt
2009-08-24 14:06       ` Jason Baron
2009-08-24 14:15         ` Paul Mundt
2009-08-24 14:34           ` Frederic Weisbecker
2009-08-24 14:37             ` Paul Mundt
2009-08-24 14:42           ` Jason Baron
2009-08-24 14:50             ` Paul Mundt
2009-08-24 18:34               ` Ingo Molnar
2009-08-10 20:52 ` [PATCH 06/12] trace_event - raw_init bailout Jason Baron
2009-08-10 20:52 ` [PATCH 07/12] add ftrace_event_call void * 'data' field Jason Baron
2009-08-11 10:09   ` Frederic Weisbecker
2009-08-17 22:19     ` Steven Rostedt
2009-08-17 23:09       ` Frederic Weisbecker
2009-08-18  0:06         ` Steven Rostedt
2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
2009-08-11 10:50   ` Frederic Weisbecker
2009-08-11 11:45     ` Ingo Molnar
2009-08-11 12:01       ` Frederic Weisbecker
2009-08-25 12:50   ` Hendrik Brueckner
2009-08-25 14:15     ` Frederic Weisbecker
2009-08-25 16:02       ` Hendrik Brueckner
2009-08-25 16:20         ` Mathieu Desnoyers
2009-08-25 16:59           ` Frederic Weisbecker
2009-08-25 17:31             ` Frederic Weisbecker
2009-08-25 18:31               ` Mathieu Desnoyers [this message]
2009-08-25 19:42                 ` Frederic Weisbecker
2009-08-25 19:51                   ` Mathieu Desnoyers
2009-08-26  0:19                     ` Frederic Weisbecker
2009-08-26  0:42                       ` Mathieu Desnoyers
2009-08-26  7:28                         ` Ingo Molnar
2009-08-26 17:11                           ` Mathieu Desnoyers
2009-08-26  6:48                   ` Peter Zijlstra
2009-08-25 22:04                 ` Martin Schwidefsky
2009-08-26  7:38                   ` Heiko Carstens
2009-08-26 12:32                     ` Frederic Weisbecker
2009-08-26  6:21                 ` Peter Zijlstra
2009-08-26 17:08                   ` Mathieu Desnoyers
2009-08-26 18:41                     ` Christoph Hellwig
2009-08-26 18:42                       ` Christoph Hellwig
2009-08-26 19:01                         ` Mathieu Desnoyers
2009-08-26  7:10                 ` Peter Zijlstra
2009-08-26 17:10                   ` Mathieu Desnoyers
2009-08-26 17:24                   ` H. Peter Anvin
2009-08-25 17:04           ` Jason Baron
2009-08-25 18:15             ` Mathieu Desnoyers
2009-08-26 12:35         ` Frederic Weisbecker
2009-08-26 12:59           ` Heiko Carstens
2009-08-26 13:30             ` Frederic Weisbecker
2009-08-26 13:48               ` Steven Rostedt
2009-08-26 13:53                 ` Frederic Weisbecker
2009-08-26 14:44                   ` Steven Rostedt
2009-08-26 13:56                 ` Peter Zijlstra
2009-08-26 14:41                   ` Steven Rostedt
2009-08-26 14:10               ` Heiko Carstens
2009-08-26 14:27                 ` Frederic Weisbecker
2009-08-26 14:43                   ` Steven Rostedt
2009-08-26 16:14                     ` Frederic Weisbecker
2009-08-26 14:43                 ` Steven Rostedt
2009-08-26 14:41           ` Hendrik Brueckner
2009-08-28 12:28         ` [tip:tracing/core] tracing: Don't trace kernel thread syscalls tip-bot for Hendrik Brueckner
2009-08-25 21:40     ` [PATCH 08/12] add trace events for each syscall entry/exit Frederic Weisbecker
2009-08-25 22:09       ` Frederic Weisbecker
2009-08-26  7:47         ` Heiko Carstens
2009-08-28 12:27     ` [tip:tracing/core] tracing: Check invalid syscall nr while tracing syscalls tip-bot for Hendrik Brueckner
2009-08-10 20:52 ` [PATCH 09/12] add support traceopint ids Jason Baron
2009-08-11 11:28   ` Frederic Weisbecker
2009-08-10 20:53 ` [PATCH 10/12] add perf counter support Jason Baron
2009-08-11 12:12   ` Frederic Weisbecker
2009-08-11 12:17     ` Ingo Molnar
2009-08-11 12:25       ` Frederic Weisbecker
2009-08-10 20:53 ` [PATCH 11/12] add more namespace area to 'perf list' output Jason Baron
2009-08-10 20:53 ` [PATCH 12/12] convert x86_64 mmap and uname to use DEFINE_SYSCALL Jason Baron
2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
2009-08-25 13:52   ` Frederic Weisbecker
2009-08-25 14:39     ` Heiko Carstens
2009-08-25 19:52       ` Frederic Weisbecker
2009-08-25 15:38     ` Hendrik Brueckner
2009-08-26 16:53   ` Frederic Weisbecker
2009-08-27  7:27     ` [PATCH]: tracing: s390 arch updates for tracing syscalls Hendrik Brueckner
2009-08-28 12:27   ` [tip:tracing/core] tracing: Add syscall tracepoints - s390 arch update tip-bot for Hendrik Brueckner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090825183119.GC2448@Krystal \
    --to=mathieu.desnoyers@polymtl.ca \
    --cc=brueckner@linux.vnet.ibm.com \
    --cc=fweisbec@gmail.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=jbaron@redhat.com \
    --cc=jiayingz@google.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=mbligh@google.com \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=schwidefsky@de.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox