System call instrumentation

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* System call instrumentation
@ 2008-05-04 13:48 Mathieu Desnoyers
  2008-05-05  6:55 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-05-04 13:48 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler

Hi Ingo,

I looked at the system call instrumentation present in LTTng lately. I
tried different solutions, e.g. hooking a kernel-wide syscall trace in
do_syscall_trace, but it appears that I ended up re-doing another
syscall table, which consists of specialized functions which extracts
the string and data structure parameters from user-space. Since code
duplication is not exactly wanted, I think that the original approach
taken in my patchset, which is to instrument the kernel code at the
sys_* level (e.g. sys_open), which is the earliest level where the
parameter information is made available to the kernel, is still the best
way to go.

I would still identify the execution mode changes in the same way I do
currently, which is by instrumenting do_syscall_trace, just to know as
soon as possible when the mode has changed from user-space to
kernel-space so we can do time accounting more accurately. I already
have the patchset which adds the KERNEL_TRACE thread flag to every
architectures. It's tested in assembly in the same way SYSCALL_TRACE is
tested, but is activated globally by iterating on all the threads.

So, the currently proposed scheme for a system call would be (for the
open() example)

shown as : 
kernel stack
  trace: event name (parameters)

do_syscall_trace()
  trace: kernel_arch_syscall_entry (syscall id, instruction pointer)

do_sys_open()
  trace: fs_open (fd, filename)

do_syscall_trace()
  kernel_arch_syscall_exit (return value)

If we take this open() example, filename is ready only in do_sys_open,
which is called by sys_open and sys_openat. So the logical
instrumentation site for this would really be do_sys_open(). The
information about which system call has been done is made available in
the kernel_arch_syscall_entry event. It is not present anymore at the
do_sys_open level because this execution path can be called from more
than one syscall.

What do you think of this approach ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-04 13:48 System call instrumentation Mathieu Desnoyers
@ 2008-05-05  6:55 ` Ingo Molnar
  2008-05-05 10:59   ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2008-05-05  6:55 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler

* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Hi Ingo,
> 
> I looked at the system call instrumentation present in LTTng lately. I 
> tried different solutions, e.g. hooking a kernel-wide syscall trace in 
> do_syscall_trace, but it appears that I ended up re-doing another 
> syscall table, which consists of specialized functions which extracts 
> the string and data structure parameters from user-space. Since code 
> duplication is not exactly wanted, I think that the original approach 
> taken in my patchset, which is to instrument the kernel code at the 
> sys_* level (e.g. sys_open), which is the earliest level where the 
> parameter information is made available to the kernel, is still the 
> best way to go.

hm, i'm not sure about this. I've implemented system call tracing in -rt 
[embedded in the latency tracer] and it only needed changes in entry.S, 
not in every system call site. Now, granted, that tracer was simpler 
than what LTTng tries to do, but do we _really_ need more complexity? A 
trace point that simply expresses:

   sys_call_event(int sysno, long param1, long param2, long param3,
                             long param4, long param5, long param6);

would be more than enough in 99% of the cases. If you want to extract 
the strings from the system call table, to make the filtering of these 
syscall events easier, do it during build time (by for example modifying 
the __SYSCALL() macros in unistd.h), instead of a parallel syscall 
table.

OTOH, as long as it's just one simple nonintrusive line per syscall, 
it's not a big deal - as long as it only traces the parameters as they 
come from the syscall ABI - we wont change them anyway. I.e. hide the 
ugly string portion, just turn them into a simple:

  trace_sys_getpid();
  trace_sys_time(tloc);
  trace_sys_gettimeofday(tz, tv);

(although even such a solution would still need a general policy level 
buy-in i guess - we dont want half of the syscalls traced, the other 
half objected to by maintainers. So it's either all or nothing.)

and the question also arises: why not do this on a debuginfo level? This 
information can be extracted from the debuginfo. We could change 
'asmlinkage' to 'syscall_linkage' to make it clear which functions are 
syscalls.

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-05  6:55 ` Ingo Molnar
@ 2008-05-05 10:59   ` Mathieu Desnoyers
  2008-05-05 11:10     ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-05-05 10:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Hi Ingo,
> > 
> > I looked at the system call instrumentation present in LTTng lately. I 
> > tried different solutions, e.g. hooking a kernel-wide syscall trace in 
> > do_syscall_trace, but it appears that I ended up re-doing another 
> > syscall table, which consists of specialized functions which extracts 
> > the string and data structure parameters from user-space. Since code 
> > duplication is not exactly wanted, I think that the original approach 
> > taken in my patchset, which is to instrument the kernel code at the 
> > sys_* level (e.g. sys_open), which is the earliest level where the 
> > parameter information is made available to the kernel, is still the 
> > best way to go.
> 
> hm, i'm not sure about this. I've implemented system call tracing in -rt 
> [embedded in the latency tracer] and it only needed changes in entry.S, 
> not in every system call site. Now, granted, that tracer was simpler 
> than what LTTng tries to do, but do we _really_ need more complexity? A 
> trace point that simply expresses:
> 
>    sys_call_event(int sysno, long param1, long param2, long param3,
>                              long param4, long param5, long param6);
> 

That would work for all system calls that doesn't have parameters like
"const char __user *filename".

> would be more than enough in 99% of the cases.

Hrm, not sure about that :

grep asmlinkage include/linux/syscalls.h|wc -l
321

grep -e asmlinkage.*__user include/linux/syscalls.h | wc -l
162

So about 50% of system calls have to get pointers from userspace.


> If you want to extract 
> the strings from the system call table, to make the filtering of these 
> syscall events easier, do it during build time (by for example modifying 
> the __SYSCALL() macros in unistd.h), instead of a parallel syscall 
> table.
> 

The goal of my second syscall table was to call callbacks to extract the
__user parameters, nothing else, really. I currently extract the system
call function names by iterating on the real system call table and by
using kallsyms (it could be turned into a built-time extraction since
the system call table is static to the kernel core).

> OTOH, as long as it's just one simple nonintrusive line per syscall, 
> it's not a big deal - as long as it only traces the parameters as they 
> come from the syscall ABI - we wont change them anyway. I.e. hide the 
> ugly string portion, just turn them into a simple:
> 
>   trace_sys_getpid();
>   trace_sys_time(tloc);
>   trace_sys_gettimeofday(tz, tv);
> 
> (although even such a solution would still need a general policy level 
> buy-in i guess - we dont want half of the syscalls traced, the other 
> half objected to by maintainers. So it's either all or nothing.)
> 

Yup. That's ok with me.


> and the question also arises: why not do this on a debuginfo level? This 
> information can be extracted from the debuginfo. We could change 
> 'asmlinkage' to 'syscall_linkage' to make it clear which functions are 
> syscalls.

Given we already have the syscall table, I am not sure we would need any
supplementary debuginfo support. Plus, the syscall table gives us the
mapping between the syscall number and the symbol of the system call
function, which is exactly what we need. The only thing I dislike in my
current approach is that I depend on kallsyms; it would be good to have
the system call symbols without depending on it.

Mathieu

> 
> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-05 10:59   ` Mathieu Desnoyers
@ 2008-05-05 11:10     ` Ingo Molnar
  2008-05-05 11:30       ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2008-05-05 11:10 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> > hm, i'm not sure about this. I've implemented system call tracing in 
> > -rt [embedded in the latency tracer] and it only needed changes in 
> > entry.S, not in every system call site. Now, granted, that tracer 
> > was simpler than what LTTng tries to do, but do we _really_ need 
> > more complexity? A trace point that simply expresses:
> > 
> >    sys_call_event(int sysno, long param1, long param2, long param3,
> >                              long param4, long param5, long param6);
> > 
> 
> That would work for all system calls that doesn't have parameters like 
> "const char __user *filename".

what's the problem with them? Do you want to follow those parameters?

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-05 11:10     ` Ingo Molnar
@ 2008-05-05 11:30       ` Mathieu Desnoyers
  2008-05-05 12:28         ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-05-05 11:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > > hm, i'm not sure about this. I've implemented system call tracing in 
> > > -rt [embedded in the latency tracer] and it only needed changes in 
> > > entry.S, not in every system call site. Now, granted, that tracer 
> > > was simpler than what LTTng tries to do, but do we _really_ need 
> > > more complexity? A trace point that simply expresses:
> > > 
> > >    sys_call_event(int sysno, long param1, long param2, long param3,
> > >                              long param4, long param5, long param6);
> > > 
> > 
> > That would work for all system calls that doesn't have parameters like 
> > "const char __user *filename".
> 
> what's the problem with them? Do you want to follow those parameters?
> 
> 	Ingo

Ideally, I'd like to have this kind of high-level information :

event name : kernel syscall
syscall name : open
arg1 (%s) : "somefile"    <-----
arg2 (%d) : flags
arg3 (%d) : mode

However, "somefile" has to be read from userspace. With the protection
involved, it would cause a performance impact to read it a second time
rather than tracing the string once it's been copied to kernel-space.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-05 11:30       ` Mathieu Desnoyers
@ 2008-05-05 12:28         ` Ingo Molnar
  2008-05-06 20:52           ` Masami Hiramatsu
  2008-05-20  3:44           ` Mathieu Desnoyers
  0 siblings, 2 replies; 10+ messages in thread
From: Ingo Molnar @ 2008-05-05 12:28 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Ideally, I'd like to have this kind of high-level information :
> 
> event name : kernel syscall
> syscall name : open
> arg1 (%s) : "somefile"    <-----
> arg2 (%d) : flags
> arg3 (%d) : mode
> 
> However, "somefile" has to be read from userspace. With the protection 
> involved, it would cause a performance impact to read it a second time 
> rather than tracing the string once it's been copied to kernel-space.

performance is a secondary issue here, and copies are fast anyway _if_ 
someone wants to trace a syscall. (because the first copy brings the 
cacheline into the cache, subsequent copies are almost for free compared 
to the first copy)

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-05 12:28         ` Ingo Molnar
@ 2008-05-06 20:52           ` Masami Hiramatsu
  2008-05-20  3:44           ` Mathieu Desnoyers
  1 sibling, 0 replies; 10+ messages in thread
From: Masami Hiramatsu @ 2008-05-06 20:52 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler

Hi Ingo and Mathieu,

Ingo Molnar wrote:
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
>> Ideally, I'd like to have this kind of high-level information :
>>
>> event name : kernel syscall
>> syscall name : open
>> arg1 (%s) : "somefile"    <-----
>> arg2 (%d) : flags
>> arg3 (%d) : mode
>>
>> However, "somefile" has to be read from userspace. With the protection 
>> involved, it would cause a performance impact to read it a second time 
>> rather than tracing the string once it's been copied to kernel-space.
> 
> performance is a secondary issue here, and copies are fast anyway _if_ 
> someone wants to trace a syscall. (because the first copy brings the 
> cacheline into the cache, subsequent copies are almost for free compared 
> to the first copy)

I think that the code duplication is also an issue.
If we introduce functions which copy userspace strings same as
original syscall functions, we have to maintain similar codes.
So, I think Mathieu's approach (separating syscall parameters from
syscall entering event) is enough reasonable.

BTW, it seems that using KERNEL_TRACE per thread flag might be a bit
tricky to trace all threads. Out of curiosity:-), why did not you use a
global percpu flag to do it?

Thanks,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-05 12:28         ` Ingo Molnar
  2008-05-06 20:52           ` Masami Hiramatsu
@ 2008-05-20  3:44           ` Mathieu Desnoyers
  2008-05-20 14:18             ` Arjan van de Ven
  1 sibling, 1 reply; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-05-20  3:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Ideally, I'd like to have this kind of high-level information :
> > 
> > event name : kernel syscall
> > syscall name : open
> > arg1 (%s) : "somefile"    <-----
> > arg2 (%d) : flags
> > arg3 (%d) : mode
> > 
> > However, "somefile" has to be read from userspace. With the protection 
> > involved, it would cause a performance impact to read it a second time 
> > rather than tracing the string once it's been copied to kernel-space.
> 
> performance is a secondary issue here, and copies are fast anyway _if_ 
> someone wants to trace a syscall. (because the first copy brings the 
> cacheline into the cache, subsequent copies are almost for free compared 
> to the first copy)
> 
> 	Ingo

Hrm, a quick benchmark on my pentium 4 comparing a normal open() system
call executed in a loop to a modified open() syscall which executes the
lines added in the following patch adds 450 cycles to each open() system
call. I added a putname/getname on purpose to see the cost of a second
userspace copy and it's not exactly free.

The normal getname correctly nested, re-using the string previously
copied, should not suffer from that kind of performance hit. Also, given
that the string would be copied only once from userspace, it would
eliminate race scenarios where multithreaded applications could change
the string underneath, so the kernel would trace a different string than
the one being really used for the system call.

However, strings are not the only userspace arguments passed to system
calls. For all these other arguments, performance could be an issue as
well as racy user-level data modification which would let the kernel
trace a different paramenter than the one being used in the system call.

For those two reasons, I think extracting these parameters could be
faster/cleaner/safer if done in the system call function, where the
parameters are already copied in kernel space.

Mathieu

Index: linux-2.6-lttng/fs/open.c
===================================================================
--- linux-2.6-lttng.orig/fs/open.c	2008-05-19 22:51:16.000000000 -0400
+++ linux-2.6-lttng/fs/open.c	2008-05-19 23:11:07.000000000 -0400
@@ -1043,6 +1043,8 @@ long do_sys_open(int dfd, const char __u
 	int fd = PTR_ERR(tmp);

 	if (!IS_ERR(tmp)) {
+		putname(tmp);
+		tmp = getname(filename);
 		fd = get_unused_fd_flags(flags);
 		if (fd >= 0) {
 			struct file *f = do_filp_open(dfd, tmp, flags, mode);

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-20  3:44           ` Mathieu Desnoyers
@ 2008-05-20 14:18             ` Arjan van de Ven
  2008-05-22 12:47               ` Mathieu Desnoyers
  0 siblings, 1 reply; 10+ messages in thread
From: Arjan van de Ven @ 2008-05-20 14:18 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Ingo Molnar, linux-kernel, systemtap, Frank Ch. Eigler

On Mon, 19 May 2008 23:44:53 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > Ideally, I'd like to have this kind of high-level information :
> > > 
> > > event name : kernel syscall
> > > syscall name : open
> > > arg1 (%s) : "somefile"    <-----
> > > arg2 (%d) : flags
> > > arg3 (%d) : mode
> > > 
> > > However, "somefile" has to be read from userspace. With the
> > > protection involved, it would cause a performance impact to read
> > > it a second time rather than tracing the string once it's been
> > > copied to kernel-space.

the audit subsystem already does all of this... why not use that??
(And it goes through great lengths to do it securely)

> > 

> Hrm, a quick benchmark on my pentium 4 comparing a normal open()
> system call executed in a loop to a modified open() syscall which
> executes the lines added in the following patch adds 450 cycles to
> each open() system call. I added a putname/getname on purpose to see
> the cost of a second userspace copy and it's not exactly free.

copying twice does mean that if the user wants, he can cheat you. He
can, in another thread, change the string under you. So say you're
doing this for anti-virus purposes, he can make you scan one file and
open another.


The audit subsystem was carefully designed to avoid this trap... how
about using that?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: System call instrumentation
  2008-05-20 14:18             ` Arjan van de Ven
@ 2008-05-22 12:47               ` Mathieu Desnoyers
  0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Desnoyers @ 2008-05-22 12:47 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Ingo Molnar, linux-kernel, systemtap, Frank Ch. Eigler

* Arjan van de Ven (arjan@infradead.org) wrote:
> On Mon, 19 May 2008 23:44:53 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Ingo Molnar (mingo@elte.hu) wrote:
> > > 
> > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > 
> > > > Ideally, I'd like to have this kind of high-level information :
> > > > 
> > > > event name : kernel syscall
> > > > syscall name : open
> > > > arg1 (%s) : "somefile"    <-----
> > > > arg2 (%d) : flags
> > > > arg3 (%d) : mode
> > > > 
> > > > However, "somefile" has to be read from userspace. With the
> > > > protection involved, it would cause a performance impact to read
> > > > it a second time rather than tracing the string once it's been
> > > > copied to kernel-space.
> 
> the audit subsystem already does all of this... why not use that??
> (And it goes through great lengths to do it securely)
> 
> > > 
> 
> > Hrm, a quick benchmark on my pentium 4 comparing a normal open()
> > system call executed in a loop to a modified open() syscall which
> > executes the lines added in the following patch adds 450 cycles to
> > each open() system call. I added a putname/getname on purpose to see
> > the cost of a second userspace copy and it's not exactly free.
> 
> copying twice does mean that if the user wants, he can cheat you. He
> can, in another thread, change the string under you. So say you're
> doing this for anti-virus purposes, he can make you scan one file and
> open another.
> 
> 
> The audit subsystem was carefully designed to avoid this trap... how
> about using that?

Hrm, given tracing will have to grab __user * parameters passed to
various system calls, not limited to strings, the getname/putname
infrastructure would need to be expanded a lot. I doubt it's worth
adding such complexity (copy to temporary memory buffers and reference
counting) in those system calls to support kernel-wide tracing.

On the other hand, adding a marker in the traced function, at a code
location where the data copied into the kernel is accessible, won't add
such complexity and will help to keep good locality of reference (the
stack is meant to be a good cache-hot memory region). Because a dormant
marker does not have a significant performance hit (actually, my
benchmarks shows a small acceleration of the overall system, probably
due to cache line code layout modifications), I think it's legitimate to
add this kind of instrumentation in the existing kernel system call
functions.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-05-22 12:47 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-04 13:48 System call instrumentation Mathieu Desnoyers
2008-05-05  6:55 ` Ingo Molnar
2008-05-05 10:59   ` Mathieu Desnoyers
2008-05-05 11:10     ` Ingo Molnar
2008-05-05 11:30       ` Mathieu Desnoyers
2008-05-05 12:28         ` Ingo Molnar
2008-05-06 20:52           ` Masami Hiramatsu
2008-05-20  3:44           ` Mathieu Desnoyers
2008-05-20 14:18             ` Arjan van de Ven
2008-05-22 12:47               ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox