public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Frederic Weisbecker <fweisbec@gmail.com>,
	Hendrik Brueckner <brueckner@linux.vnet.ibm.com>,
	Jason Baron <jbaron@redhat.com>,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	laijs@cn.fujitsu.com, rostedt@goodmis.org, peterz@infradead.org,
	jiayingz@google.com, mbligh@google.com, lizf@cn.fujitsu.com,
	Heiko Carstens <heiko.carstens@de.ibm.com>
Subject: Re: [PATCH 08/12] add trace events for each syscall entry/exit
Date: Wed, 26 Aug 2009 00:04:26 +0200	[thread overview]
Message-ID: <20090826000426.0332cbf4@skybase> (raw)
In-Reply-To: <20090825183119.GC2448@Krystal>

On Tue, 25 Aug 2009 14:31:19 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Tue, Aug 25, 2009 at 06:59:14PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > > > * Hendrik Brueckner (brueckner@linux.vnet.ibm.com) wrote:
> > > > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > > > > 
> > > > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > > > >    tracing code resets it.
> > > > > > >    (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > > > > 
> > > > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > > > >    (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > > > >    kernel threads.
> > > > > > >    However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > > > >    trace points:
> > > > > > >       - The kernel thread is started as ususal (do_fork()),
> > > > > > >       - tracing code sets TIF_SYSCALL_FTRACE,
> > > > > > >       - the ret_from_fork() function is triggered and starts
> > > > > > > 	ftrace_syscall_exit() with an invalid syscall number.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > I wonder if there is any way to identify such situation...?
> > > > > For the second case, it might be an option to avoid setting the
> > > > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > > > > 
> > > > > Kernel threads have task_struct->mm set to NULL.
> > > > > (Thanks to Heiko for that hint ;-)
> > > > > 
> > > > > The idea is then to check the mm field in syscall_regfunc() and
> > > > > set the flag accordingly.
> > > > > 
> > > > > However, I think the patch is an optional add-on becase checking
> > > > > the syscall number is still required for case 1).
> > > > > 
> > > > > ---
> > > > >  kernel/tracepoint.c |    4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > --- a/kernel/tracepoint.c
> > > > > +++ b/kernel/tracepoint.c
> > > > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > > > >  	if (!sys_tracepoint_refcount) {
> > > > >  		read_lock_irqsave(&tasklist_lock, flags);
> > > > >  		do_each_thread(g, t) {
> > > > > -			set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > > > +			/* Skip kernel threads. */
> > > > > +			if (t->mm)
> > > > > +				set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > > 
> > > > Uh ? kernel threads can invoke a system call. There are rare places
> > > > where kernel code actually invoke system calls. I don't see why we
> > > > should not deal with them.
> > > 
> > > 
> > > 
> > > Yeah they do, but they don't use the sysenter path, they call the
> > > syscall helpers directly, such as do_fork() or things like that.
> > > 
> > > The syscall tracepoints are set in the sysenter/sysexit path, then
> > > it's no use to trace the kernel threads, it doesn't have any effect,
> > > except random results in case of fork() calls, because we take
> > > the ret_from_fork() path that also ends up to trace_sys_exit()
> > > if the TIF_SYSCALL_TRACEPOINT thing is set, leading to such
> > > asymetric tracing.
> > > 
> > > Kernel threads use syscalls toward wrappers such as create_thread().
> > > So instead, statically defined tracepoints in create_thread() and such
> > > other syscall wrappers for kernel threads seem more valuable, hmm?
> > > 
> > >  
> > > > Moreover, the problem you face is more general: if we set the
> > > > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > > > system call, x86_64 will cause the syscall exit to execute by re-reading
> > > > the thread flags and run a syscall trace exit.
> > > 
> > > 
> > > Well, I don't think that's the problem. The issue here, if I understand
> > > correctly, is that kernel threads don't take the sysenter path, then never hit
> > > the trace_sys_enter() call. And usually they won't ever hit any
> > > trace_sys_exit() calls except in the fork() case, because we take
> > > the ret_from_fork() path, which lead to syscall exit tracing due
> > > to the TIF flags set.
> > > 
> > > At this stage, the syscall number is supposed to be stored in orig_eax,
> > > but because the kernel thread hasn't called fork() through a syscall and
> > > has called do_fork() directly, the regs values have nothing that look
> > > like syscall parameters.
> > 
> > 
> > 
> > I mean, I don't know how look like orig_eax at this stage.
> > 
> > Looking at arch/x86/kernel/process_32.c:copy_thread()
> > 
> > childregs = task_pt_regs(p);
> > *childregs = *regs;
> > childregs->ax = 0;
> > childregs->sp = sp;
> > 
> > p->thread.sp = (unsigned long) childregs;
> > p->thread.sp0 = (unsigned long) (childregs+1);
> > 
> > p->thread.ip = (unsigned long) ret_from_fork;
> > 
> > 
> > sp will be the struct pt_regs * passed to syscall_trace_leave()
> > later.
> > 
> > ax has the result of the fork syscall -> 0 for the child.
> > What about orig_eax which has the syscall nr? It depends on the pareent
> > and I don't know what it has at this stage.
> > 
> > I haven't seen crashes in x86 with kernel threads tracing, may be because
> > orig_eax is set to a valid syscall nr (may be even fork nr).
> > 
> > Perhaps it's not the case in s390 ?
> > 
> > Anyway, tracing kernel threads syscalls only gives us the fork return,
> > so it's something me may want to drop and trace higher level kernel
> > thread syscall wrappers instead.
> > 
> > Moreover every kernel threads is created through a kthreadd fork if
> > I'm not wrong, then it wouldn't be an accurate thing for us to trace the
> > fork calls in kernel thread. Tracing higher level kernel thread managment
> > sounds more interesting, we would then know who really created the thread,
> > etc...
> > 
> 
> (Well, I do not have time currently to look into the gory details
> (sorry), but let's try to take a step back from the problem.)
> 
> The design proposal for this kthread behavior wrt syscalls is based on a
> very specific and current kernel behavior, that may happen to change and
> that I have actually seen proven incorrect. For instance, some
> proprietary Linux driver does very odd things with system calls within
> kernel threads, like invoking them with int 0x80.

On s390 it is not allowed to execute the system call instruction svc
from kernel code to execute a system call function. You need to call
the system call function by name. The why is hidden in the critical
section cleanup in entry.S. There is a good reason why the inline
assemblies to execute an inline system call have been removed from
the kernel code.
 
> Yes, this is odd, but do we really want to tie the tracer that much to
> the actual OS implementation specificities ?
> 
> That sounds like a recipe for endless breakages and missing bits of
> instrumentation.
> 
> So my advice would be: if we want to trace the syscall entry/exit paths,
> let's trace them for the _whole_ system, and find ways to make it work
> for corner-cases rather than finding clever ways to diminish
> instrumentation coverage.

I guess that the real reason for the crash is hidden in the initialization
of the pt_regs structure of the kernel thread.

> Given the ret from fork example happens to be the first event fired
> after the thread is created, we should be able to deal with this problem
> by initializing the thread structure used by syscall exit tracing to an
> initial "ret from fork" value.

That is my best guess as well.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


  parent reply	other threads:[~2009-08-25 22:04 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-10 20:52 [PATCH 00/12] add syscall tracepoints V3 Jason Baron
2009-08-10 20:52 ` [PATCH 01/12] map syscall name to number Jason Baron
2009-08-10 20:52 ` [PATCH 02/12] call arch_init_ftrace_syscalls at boot Jason Baron
2009-08-10 20:52 ` [PATCH 03/12] add DECLARE_TRACE_WITH_CALLBACK() macro Jason Baron
2009-08-10 20:52 ` [PATCH 04/12] add syscall tracepoints Jason Baron
2009-08-10 20:52 ` [PATCH 05/12] update FTRACE_SYSCALL_MAX Jason Baron
2009-08-11 11:00   ` Frederic Weisbecker
2009-08-11 19:39     ` Matt Fleming
2009-08-24 13:41     ` Paul Mundt
2009-08-24 14:06       ` Jason Baron
2009-08-24 14:15         ` Paul Mundt
2009-08-24 14:34           ` Frederic Weisbecker
2009-08-24 14:37             ` Paul Mundt
2009-08-24 14:42           ` Jason Baron
2009-08-24 14:50             ` Paul Mundt
2009-08-24 18:34               ` Ingo Molnar
2009-08-10 20:52 ` [PATCH 06/12] trace_event - raw_init bailout Jason Baron
2009-08-10 20:52 ` [PATCH 07/12] add ftrace_event_call void * 'data' field Jason Baron
2009-08-11 10:09   ` Frederic Weisbecker
2009-08-17 22:19     ` Steven Rostedt
2009-08-17 23:09       ` Frederic Weisbecker
2009-08-18  0:06         ` Steven Rostedt
2009-08-10 20:52 ` [PATCH 08/12] add trace events for each syscall entry/exit Jason Baron
2009-08-11 10:50   ` Frederic Weisbecker
2009-08-11 11:45     ` Ingo Molnar
2009-08-11 12:01       ` Frederic Weisbecker
2009-08-25 12:50   ` Hendrik Brueckner
2009-08-25 14:15     ` Frederic Weisbecker
2009-08-25 16:02       ` Hendrik Brueckner
2009-08-25 16:20         ` Mathieu Desnoyers
2009-08-25 16:59           ` Frederic Weisbecker
2009-08-25 17:31             ` Frederic Weisbecker
2009-08-25 18:31               ` Mathieu Desnoyers
2009-08-25 19:42                 ` Frederic Weisbecker
2009-08-25 19:51                   ` Mathieu Desnoyers
2009-08-26  0:19                     ` Frederic Weisbecker
2009-08-26  0:42                       ` Mathieu Desnoyers
2009-08-26  7:28                         ` Ingo Molnar
2009-08-26 17:11                           ` Mathieu Desnoyers
2009-08-26  6:48                   ` Peter Zijlstra
2009-08-25 22:04                 ` Martin Schwidefsky [this message]
2009-08-26  7:38                   ` Heiko Carstens
2009-08-26 12:32                     ` Frederic Weisbecker
2009-08-26  6:21                 ` Peter Zijlstra
2009-08-26 17:08                   ` Mathieu Desnoyers
2009-08-26 18:41                     ` Christoph Hellwig
2009-08-26 18:42                       ` Christoph Hellwig
2009-08-26 19:01                         ` Mathieu Desnoyers
2009-08-26  7:10                 ` Peter Zijlstra
2009-08-26 17:10                   ` Mathieu Desnoyers
2009-08-26 17:24                   ` H. Peter Anvin
2009-08-25 17:04           ` Jason Baron
2009-08-25 18:15             ` Mathieu Desnoyers
2009-08-26 12:35         ` Frederic Weisbecker
2009-08-26 12:59           ` Heiko Carstens
2009-08-26 13:30             ` Frederic Weisbecker
2009-08-26 13:48               ` Steven Rostedt
2009-08-26 13:53                 ` Frederic Weisbecker
2009-08-26 14:44                   ` Steven Rostedt
2009-08-26 13:56                 ` Peter Zijlstra
2009-08-26 14:41                   ` Steven Rostedt
2009-08-26 14:10               ` Heiko Carstens
2009-08-26 14:27                 ` Frederic Weisbecker
2009-08-26 14:43                   ` Steven Rostedt
2009-08-26 16:14                     ` Frederic Weisbecker
2009-08-26 14:43                 ` Steven Rostedt
2009-08-26 14:41           ` Hendrik Brueckner
2009-08-28 12:28         ` [tip:tracing/core] tracing: Don't trace kernel thread syscalls tip-bot for Hendrik Brueckner
2009-08-25 21:40     ` [PATCH 08/12] add trace events for each syscall entry/exit Frederic Weisbecker
2009-08-25 22:09       ` Frederic Weisbecker
2009-08-26  7:47         ` Heiko Carstens
2009-08-28 12:27     ` [tip:tracing/core] tracing: Check invalid syscall nr while tracing syscalls tip-bot for Hendrik Brueckner
2009-08-10 20:52 ` [PATCH 09/12] add support traceopint ids Jason Baron
2009-08-11 11:28   ` Frederic Weisbecker
2009-08-10 20:53 ` [PATCH 10/12] add perf counter support Jason Baron
2009-08-11 12:12   ` Frederic Weisbecker
2009-08-11 12:17     ` Ingo Molnar
2009-08-11 12:25       ` Frederic Weisbecker
2009-08-10 20:53 ` [PATCH 11/12] add more namespace area to 'perf list' output Jason Baron
2009-08-10 20:53 ` [PATCH 12/12] convert x86_64 mmap and uname to use DEFINE_SYSCALL Jason Baron
2009-08-25 12:31 ` [PATCH 00/12] add syscall tracepoints V3 - s390 arch update Hendrik Brueckner
2009-08-25 13:52   ` Frederic Weisbecker
2009-08-25 14:39     ` Heiko Carstens
2009-08-25 19:52       ` Frederic Weisbecker
2009-08-25 15:38     ` Hendrik Brueckner
2009-08-26 16:53   ` Frederic Weisbecker
2009-08-27  7:27     ` [PATCH]: tracing: s390 arch updates for tracing syscalls Hendrik Brueckner
2009-08-28 12:27   ` [tip:tracing/core] tracing: Add syscall tracepoints - s390 arch update tip-bot for Hendrik Brueckner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090826000426.0332cbf4@skybase \
    --to=schwidefsky@de.ibm.com \
    --cc=brueckner@linux.vnet.ibm.com \
    --cc=fweisbec@gmail.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=jbaron@redhat.com \
    --cc=jiayingz@google.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=mathieu.desnoyers@polymtl.ca \
    --cc=mbligh@google.com \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox