* System call instrumentation @ 2008-05-04 13:48 Mathieu Desnoyers 2008-05-05 6:55 ` Ingo Molnar 0 siblings, 1 reply; 10+ messages in thread From: Mathieu Desnoyers @ 2008-05-04 13:48 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler Hi Ingo, I looked at the system call instrumentation present in LTTng lately. I tried different solutions, e.g. hooking a kernel-wide syscall trace in do_syscall_trace, but it appears that I ended up re-doing another syscall table, which consists of specialized functions which extracts the string and data structure parameters from user-space. Since code duplication is not exactly wanted, I think that the original approach taken in my patchset, which is to instrument the kernel code at the sys_* level (e.g. sys_open), which is the earliest level where the parameter information is made available to the kernel, is still the best way to go. I would still identify the execution mode changes in the same way I do currently, which is by instrumenting do_syscall_trace, just to know as soon as possible when the mode has changed from user-space to kernel-space so we can do time accounting more accurately. I already have the patchset which adds the KERNEL_TRACE thread flag to every architectures. It's tested in assembly in the same way SYSCALL_TRACE is tested, but is activated globally by iterating on all the threads. So, the currently proposed scheme for a system call would be (for the open() example) shown as : kernel stack trace: event name (parameters) do_syscall_trace() trace: kernel_arch_syscall_entry (syscall id, instruction pointer) do_sys_open() trace: fs_open (fd, filename) do_syscall_trace() kernel_arch_syscall_exit (return value) If we take this open() example, filename is ready only in do_sys_open, which is called by sys_open and sys_openat. So the logical instrumentation site for this would really be do_sys_open(). The information about which system call has been done is made available in the kernel_arch_syscall_entry event. It is not present anymore at the do_sys_open level because this execution path can be called from more than one syscall. What do you think of this approach ? Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-04 13:48 System call instrumentation Mathieu Desnoyers @ 2008-05-05 6:55 ` Ingo Molnar 2008-05-05 10:59 ` Mathieu Desnoyers 0 siblings, 1 reply; 10+ messages in thread From: Ingo Molnar @ 2008-05-05 6:55 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > Hi Ingo, > > I looked at the system call instrumentation present in LTTng lately. I > tried different solutions, e.g. hooking a kernel-wide syscall trace in > do_syscall_trace, but it appears that I ended up re-doing another > syscall table, which consists of specialized functions which extracts > the string and data structure parameters from user-space. Since code > duplication is not exactly wanted, I think that the original approach > taken in my patchset, which is to instrument the kernel code at the > sys_* level (e.g. sys_open), which is the earliest level where the > parameter information is made available to the kernel, is still the > best way to go. hm, i'm not sure about this. I've implemented system call tracing in -rt [embedded in the latency tracer] and it only needed changes in entry.S, not in every system call site. Now, granted, that tracer was simpler than what LTTng tries to do, but do we _really_ need more complexity? A trace point that simply expresses: sys_call_event(int sysno, long param1, long param2, long param3, long param4, long param5, long param6); would be more than enough in 99% of the cases. If you want to extract the strings from the system call table, to make the filtering of these syscall events easier, do it during build time (by for example modifying the __SYSCALL() macros in unistd.h), instead of a parallel syscall table. OTOH, as long as it's just one simple nonintrusive line per syscall, it's not a big deal - as long as it only traces the parameters as they come from the syscall ABI - we wont change them anyway. I.e. hide the ugly string portion, just turn them into a simple: trace_sys_getpid(); trace_sys_time(tloc); trace_sys_gettimeofday(tz, tv); (although even such a solution would still need a general policy level buy-in i guess - we dont want half of the syscalls traced, the other half objected to by maintainers. So it's either all or nothing.) and the question also arises: why not do this on a debuginfo level? This information can be extracted from the debuginfo. We could change 'asmlinkage' to 'syscall_linkage' to make it clear which functions are syscalls. Ingo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-05 6:55 ` Ingo Molnar @ 2008-05-05 10:59 ` Mathieu Desnoyers 2008-05-05 11:10 ` Ingo Molnar 0 siblings, 1 reply; 10+ messages in thread From: Mathieu Desnoyers @ 2008-05-05 10:59 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler * Ingo Molnar (mingo@elte.hu) wrote: > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > > Hi Ingo, > > > > I looked at the system call instrumentation present in LTTng lately. I > > tried different solutions, e.g. hooking a kernel-wide syscall trace in > > do_syscall_trace, but it appears that I ended up re-doing another > > syscall table, which consists of specialized functions which extracts > > the string and data structure parameters from user-space. Since code > > duplication is not exactly wanted, I think that the original approach > > taken in my patchset, which is to instrument the kernel code at the > > sys_* level (e.g. sys_open), which is the earliest level where the > > parameter information is made available to the kernel, is still the > > best way to go. > > hm, i'm not sure about this. I've implemented system call tracing in -rt > [embedded in the latency tracer] and it only needed changes in entry.S, > not in every system call site. Now, granted, that tracer was simpler > than what LTTng tries to do, but do we _really_ need more complexity? A > trace point that simply expresses: > > sys_call_event(int sysno, long param1, long param2, long param3, > long param4, long param5, long param6); > That would work for all system calls that doesn't have parameters like "const char __user *filename". > would be more than enough in 99% of the cases. Hrm, not sure about that : grep asmlinkage include/linux/syscalls.h|wc -l 321 grep -e asmlinkage.*__user include/linux/syscalls.h | wc -l 162 So about 50% of system calls have to get pointers from userspace. > If you want to extract > the strings from the system call table, to make the filtering of these > syscall events easier, do it during build time (by for example modifying > the __SYSCALL() macros in unistd.h), instead of a parallel syscall > table. > The goal of my second syscall table was to call callbacks to extract the __user parameters, nothing else, really. I currently extract the system call function names by iterating on the real system call table and by using kallsyms (it could be turned into a built-time extraction since the system call table is static to the kernel core). > OTOH, as long as it's just one simple nonintrusive line per syscall, > it's not a big deal - as long as it only traces the parameters as they > come from the syscall ABI - we wont change them anyway. I.e. hide the > ugly string portion, just turn them into a simple: > > trace_sys_getpid(); > trace_sys_time(tloc); > trace_sys_gettimeofday(tz, tv); > > (although even such a solution would still need a general policy level > buy-in i guess - we dont want half of the syscalls traced, the other > half objected to by maintainers. So it's either all or nothing.) > Yup. That's ok with me. > and the question also arises: why not do this on a debuginfo level? This > information can be extracted from the debuginfo. We could change > 'asmlinkage' to 'syscall_linkage' to make it clear which functions are > syscalls. Given we already have the syscall table, I am not sure we would need any supplementary debuginfo support. Plus, the syscall table gives us the mapping between the syscall number and the symbol of the system call function, which is exactly what we need. The only thing I dislike in my current approach is that I depend on kallsyms; it would be good to have the system call symbols without depending on it. Mathieu > > Ingo -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-05 10:59 ` Mathieu Desnoyers @ 2008-05-05 11:10 ` Ingo Molnar 2008-05-05 11:30 ` Mathieu Desnoyers 0 siblings, 1 reply; 10+ messages in thread From: Ingo Molnar @ 2008-05-05 11:10 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > hm, i'm not sure about this. I've implemented system call tracing in > > -rt [embedded in the latency tracer] and it only needed changes in > > entry.S, not in every system call site. Now, granted, that tracer > > was simpler than what LTTng tries to do, but do we _really_ need > > more complexity? A trace point that simply expresses: > > > > sys_call_event(int sysno, long param1, long param2, long param3, > > long param4, long param5, long param6); > > > > That would work for all system calls that doesn't have parameters like > "const char __user *filename". what's the problem with them? Do you want to follow those parameters? Ingo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-05 11:10 ` Ingo Molnar @ 2008-05-05 11:30 ` Mathieu Desnoyers 2008-05-05 12:28 ` Ingo Molnar 0 siblings, 1 reply; 10+ messages in thread From: Mathieu Desnoyers @ 2008-05-05 11:30 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler * Ingo Molnar (mingo@elte.hu) wrote: > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > > > hm, i'm not sure about this. I've implemented system call tracing in > > > -rt [embedded in the latency tracer] and it only needed changes in > > > entry.S, not in every system call site. Now, granted, that tracer > > > was simpler than what LTTng tries to do, but do we _really_ need > > > more complexity? A trace point that simply expresses: > > > > > > sys_call_event(int sysno, long param1, long param2, long param3, > > > long param4, long param5, long param6); > > > > > > > That would work for all system calls that doesn't have parameters like > > "const char __user *filename". > > what's the problem with them? Do you want to follow those parameters? > > Ingo Ideally, I'd like to have this kind of high-level information : event name : kernel syscall syscall name : open arg1 (%s) : "somefile" <----- arg2 (%d) : flags arg3 (%d) : mode However, "somefile" has to be read from userspace. With the protection involved, it would cause a performance impact to read it a second time rather than tracing the string once it's been copied to kernel-space. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-05 11:30 ` Mathieu Desnoyers @ 2008-05-05 12:28 ` Ingo Molnar 2008-05-06 20:52 ` Masami Hiramatsu 2008-05-20 3:44 ` Mathieu Desnoyers 0 siblings, 2 replies; 10+ messages in thread From: Ingo Molnar @ 2008-05-05 12:28 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > Ideally, I'd like to have this kind of high-level information : > > event name : kernel syscall > syscall name : open > arg1 (%s) : "somefile" <----- > arg2 (%d) : flags > arg3 (%d) : mode > > However, "somefile" has to be read from userspace. With the protection > involved, it would cause a performance impact to read it a second time > rather than tracing the string once it's been copied to kernel-space. performance is a secondary issue here, and copies are fast anyway _if_ someone wants to trace a syscall. (because the first copy brings the cacheline into the cache, subsequent copies are almost for free compared to the first copy) Ingo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-05 12:28 ` Ingo Molnar @ 2008-05-06 20:52 ` Masami Hiramatsu 2008-05-20 3:44 ` Mathieu Desnoyers 1 sibling, 0 replies; 10+ messages in thread From: Masami Hiramatsu @ 2008-05-06 20:52 UTC (permalink / raw) To: Ingo Molnar, Mathieu Desnoyers; +Cc: linux-kernel, systemtap, Frank Ch. Eigler Hi Ingo and Mathieu, Ingo Molnar wrote: > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > >> Ideally, I'd like to have this kind of high-level information : >> >> event name : kernel syscall >> syscall name : open >> arg1 (%s) : "somefile" <----- >> arg2 (%d) : flags >> arg3 (%d) : mode >> >> However, "somefile" has to be read from userspace. With the protection >> involved, it would cause a performance impact to read it a second time >> rather than tracing the string once it's been copied to kernel-space. > > performance is a secondary issue here, and copies are fast anyway _if_ > someone wants to trace a syscall. (because the first copy brings the > cacheline into the cache, subsequent copies are almost for free compared > to the first copy) I think that the code duplication is also an issue. If we introduce functions which copy userspace strings same as original syscall functions, we have to maintain similar codes. So, I think Mathieu's approach (separating syscall parameters from syscall entering event) is enough reasonable. BTW, it seems that using KERNEL_TRACE per thread flag might be a bit tricky to trace all threads. Out of curiosity:-), why did not you use a global percpu flag to do it? Thanks, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-05 12:28 ` Ingo Molnar 2008-05-06 20:52 ` Masami Hiramatsu @ 2008-05-20 3:44 ` Mathieu Desnoyers 2008-05-20 14:18 ` Arjan van de Ven 1 sibling, 1 reply; 10+ messages in thread From: Mathieu Desnoyers @ 2008-05-20 3:44 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel, systemtap, Frank Ch. Eigler * Ingo Molnar (mingo@elte.hu) wrote: > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > > Ideally, I'd like to have this kind of high-level information : > > > > event name : kernel syscall > > syscall name : open > > arg1 (%s) : "somefile" <----- > > arg2 (%d) : flags > > arg3 (%d) : mode > > > > However, "somefile" has to be read from userspace. With the protection > > involved, it would cause a performance impact to read it a second time > > rather than tracing the string once it's been copied to kernel-space. > > performance is a secondary issue here, and copies are fast anyway _if_ > someone wants to trace a syscall. (because the first copy brings the > cacheline into the cache, subsequent copies are almost for free compared > to the first copy) > > Ingo Hrm, a quick benchmark on my pentium 4 comparing a normal open() system call executed in a loop to a modified open() syscall which executes the lines added in the following patch adds 450 cycles to each open() system call. I added a putname/getname on purpose to see the cost of a second userspace copy and it's not exactly free. The normal getname correctly nested, re-using the string previously copied, should not suffer from that kind of performance hit. Also, given that the string would be copied only once from userspace, it would eliminate race scenarios where multithreaded applications could change the string underneath, so the kernel would trace a different string than the one being really used for the system call. However, strings are not the only userspace arguments passed to system calls. For all these other arguments, performance could be an issue as well as racy user-level data modification which would let the kernel trace a different paramenter than the one being used in the system call. For those two reasons, I think extracting these parameters could be faster/cleaner/safer if done in the system call function, where the parameters are already copied in kernel space. Mathieu Index: linux-2.6-lttng/fs/open.c =================================================================== --- linux-2.6-lttng.orig/fs/open.c 2008-05-19 22:51:16.000000000 -0400 +++ linux-2.6-lttng/fs/open.c 2008-05-19 23:11:07.000000000 -0400 @@ -1043,6 +1043,8 @@ long do_sys_open(int dfd, const char __u int fd = PTR_ERR(tmp); if (!IS_ERR(tmp)) { + putname(tmp); + tmp = getname(filename); fd = get_unused_fd_flags(flags); if (fd >= 0) { struct file *f = do_filp_open(dfd, tmp, flags, mode); -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-20 3:44 ` Mathieu Desnoyers @ 2008-05-20 14:18 ` Arjan van de Ven 2008-05-22 12:47 ` Mathieu Desnoyers 0 siblings, 1 reply; 10+ messages in thread From: Arjan van de Ven @ 2008-05-20 14:18 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: Ingo Molnar, linux-kernel, systemtap, Frank Ch. Eigler On Mon, 19 May 2008 23:44:53 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > * Ingo Molnar (mingo@elte.hu) wrote: > > > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > > > > Ideally, I'd like to have this kind of high-level information : > > > > > > event name : kernel syscall > > > syscall name : open > > > arg1 (%s) : "somefile" <----- > > > arg2 (%d) : flags > > > arg3 (%d) : mode > > > > > > However, "somefile" has to be read from userspace. With the > > > protection involved, it would cause a performance impact to read > > > it a second time rather than tracing the string once it's been > > > copied to kernel-space. the audit subsystem already does all of this... why not use that?? (And it goes through great lengths to do it securely) > > > Hrm, a quick benchmark on my pentium 4 comparing a normal open() > system call executed in a loop to a modified open() syscall which > executes the lines added in the following patch adds 450 cycles to > each open() system call. I added a putname/getname on purpose to see > the cost of a second userspace copy and it's not exactly free. copying twice does mean that if the user wants, he can cheat you. He can, in another thread, change the string under you. So say you're doing this for anti-virus purposes, he can make you scan one file and open another. The audit subsystem was carefully designed to avoid this trap... how about using that? ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: System call instrumentation 2008-05-20 14:18 ` Arjan van de Ven @ 2008-05-22 12:47 ` Mathieu Desnoyers 0 siblings, 0 replies; 10+ messages in thread From: Mathieu Desnoyers @ 2008-05-22 12:47 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Ingo Molnar, linux-kernel, systemtap, Frank Ch. Eigler * Arjan van de Ven (arjan@infradead.org) wrote: > On Mon, 19 May 2008 23:44:53 -0400 > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > > * Ingo Molnar (mingo@elte.hu) wrote: > > > > > > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote: > > > > > > > Ideally, I'd like to have this kind of high-level information : > > > > > > > > event name : kernel syscall > > > > syscall name : open > > > > arg1 (%s) : "somefile" <----- > > > > arg2 (%d) : flags > > > > arg3 (%d) : mode > > > > > > > > However, "somefile" has to be read from userspace. With the > > > > protection involved, it would cause a performance impact to read > > > > it a second time rather than tracing the string once it's been > > > > copied to kernel-space. > > the audit subsystem already does all of this... why not use that?? > (And it goes through great lengths to do it securely) > > > > > > > Hrm, a quick benchmark on my pentium 4 comparing a normal open() > > system call executed in a loop to a modified open() syscall which > > executes the lines added in the following patch adds 450 cycles to > > each open() system call. I added a putname/getname on purpose to see > > the cost of a second userspace copy and it's not exactly free. > > copying twice does mean that if the user wants, he can cheat you. He > can, in another thread, change the string under you. So say you're > doing this for anti-virus purposes, he can make you scan one file and > open another. > > > The audit subsystem was carefully designed to avoid this trap... how > about using that? Hrm, given tracing will have to grab __user * parameters passed to various system calls, not limited to strings, the getname/putname infrastructure would need to be expanded a lot. I doubt it's worth adding such complexity (copy to temporary memory buffers and reference counting) in those system calls to support kernel-wide tracing. On the other hand, adding a marker in the traced function, at a code location where the data copied into the kernel is accessible, won't add such complexity and will help to keep good locality of reference (the stack is meant to be a good cache-hot memory region). Because a dormant marker does not have a significant performance hit (actually, my benchmarks shows a small acceleration of the overall system, probably due to cache line code layout modifications), I think it's legitimate to add this kind of instrumentation in the existing kernel system call functions. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2008-05-22 12:47 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-05-04 13:48 System call instrumentation Mathieu Desnoyers 2008-05-05 6:55 ` Ingo Molnar 2008-05-05 10:59 ` Mathieu Desnoyers 2008-05-05 11:10 ` Ingo Molnar 2008-05-05 11:30 ` Mathieu Desnoyers 2008-05-05 12:28 ` Ingo Molnar 2008-05-06 20:52 ` Masami Hiramatsu 2008-05-20 3:44 ` Mathieu Desnoyers 2008-05-20 14:18 ` Arjan van de Ven 2008-05-22 12:47 ` Mathieu Desnoyers
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox