* [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules @ 2010-09-03 13:12 Mathieu Desnoyers 2010-09-03 14:47 ` Andi Kleen 0 siblings, 1 reply; 5+ messages in thread From: Mathieu Desnoyers @ 2010-09-03 13:12 UTC (permalink / raw) To: ltt-dev Cc: linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen

Hi everyone,

Here is some news that should please Linux distributions which have been overwhelmed by the size of the LTTng patchset. I have extracted the LTTng tracer patches from the LTTng kernel tree and repackaged them into a new "lttng-modules" package. There is still a dependency on the LTTng kernel tree at the moment, but the objective is to gradually reduce the size of this 5-year-long mainline fork.

The goal of this re-packaging is to make life easier for LTTng users. Some distributions have been shipping the LTTng tree for years (Wind River, MontaVista, STLinux). I am still planning to contribute some LTTng pieces to mainline, but this restructuring will let me focus on the parts that really need to go into mainline without making my users suffer any longer.

The development of this package is done in the following tree: git://lttng.org/lttng-modules.git It will follow a linear development workflow (no more rebases). The kernel LTTng tree (on which the LTTng tracer module depends) will still be rebased on Linux mainline.

This is a first step towards cleaning up the LTTng kernel tree. The following steps will be to refactor the LTTng tree patches, remove part of the instrumentation from the kernel tree, migrate to TRACE_EVENT(), and migrate to the generic ring buffer library.

This change is effective as of LTTng 0.227 for kernel 2.6.35.4. This matches the new lttng-modules package version 0.16.
LTTng, the Linux Trace Toolkit Next Generation, is a project that aims to produce a highly efficient full system tracing solution. It is composed of several components to allow tracing of the kernel and of userspace, trace viewing and analysis, and trace streaming.

Project website: http://lttng.org
Download link: http://lttng.org/content/download (please refer to the LTTng Manual for installation instructions)

Enjoy!

Mathieu

--
Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-03 13:12 [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules Mathieu Desnoyers @ 2010-09-03 14:47 ` Andi Kleen 2010-09-06 17:29 ` Mathieu Desnoyers 0 siblings, 1 reply; 5+ messages in thread From: Andi Kleen @ 2010-09-03 14:47 UTC (permalink / raw) To: Mathieu Desnoyers Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

On Fri, 3 Sep 2010 09:12:13 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Here is some news that should please Linux distributions which have been
> overwhelmed by the size of the LTTng patchset. I have extracted the
> LTTng tracer patches from the LTTng kernel tree and repackaged them
> into a new "lttng-modules" package. There is still a dependency on
> the LTTng kernel tree at the moment, but the objective is to
> gradually reduce the size of this 5-year-long mainline fork.

Efforts to get rid of forks are always good.

Could you perhaps elaborate a bit on what changes you need in mainline (ideally separated into "essential" and "nice to have") and how big the leftover patches are?

Or rather, how difficult would it be to simply run the LTT userland on top of the tracing code that is in mainline, even at the loss of some functionality?

Thanks,

-Andi
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-03 14:47 ` Andi Kleen @ 2010-09-06 17:29 ` Mathieu Desnoyers 2010-09-07 7:22 ` Andi Kleen 0 siblings, 1 reply; 5+ messages in thread From: Mathieu Desnoyers @ 2010-09-06 17:29 UTC (permalink / raw) To: Andi Kleen Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

* Andi Kleen (andi@firstfloor.org) wrote:
> On Fri, 3 Sep 2010 09:12:13 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > Here is some news that should please Linux distributions which have been
> > overwhelmed by the size of the LTTng patchset. I have extracted the
> > LTTng tracer patches from the LTTng kernel tree and repackaged them
> > into a new "lttng-modules" package. There is still a dependency on
> > the LTTng kernel tree at the moment, but the objective is to
> > gradually reduce the size of this 5-year-long mainline fork.
>
> Efforts to get rid of forks are always good.
>
> Could you perhaps elaborate a bit on what changes you need
> in mainline (ideally separated into "essential" and "nice to
> have") and how big the leftover patches are?

Sure. Here are the details of the pieces that still have to be kept in the LTTng tree:

* Essential:

- Adding interfaces to dynamic kprobes and tracepoints to list the currently available instrumentation, as well as notifiers to let LTTng know about events appearing while tracing runs (e.g. module loaded, new dynamic probe added).
- Export the splice_to_pipe symbol (and probably some more I do not recall at the moment).
- Add the ability to read the module list coherently in multiple reads when racing with module load/unload.
- Either add the ability to fault in NMI handlers, or add a call to vmalloc_sync_all() each time a module is loaded, or export vmalloc_sync_all() to GPL modules so they can fault in the memory after using vmalloc but before the memory is used by the tracer.

These essential patches are very small.

* Nice to have:

- Support for the LTTng statedump, which saves the initial kernel state into the trace at trace start:
  - EXPORT_SYMBOL_GPL() for tasklist_lock, irq_desc, ...
  - Add per-arch iterators to dump the list of system calls and the IDT into the trace.
- Generic ring buffer library.
- Generic alignment API.
- Mark the atomic notifier call chain "notrace".
- CPU idle notifiers (for trace streaming with deferrable timers).
- Poll wait exclusive (to address the thundering herd problem in poll()).
- prio_heap.c: new remove_maximum(), replace() and cherrypick().
- Inline memcpy().
- Trace clock:
  - Faster trace clock implementation.
  - Export the faster trace clock to userspace for UST through a vDSO.
- Jump based on asm goto, which will minimize the impact of disabled tracepoints (the patchset is being proposed by Jason Baron).
- Kernel OOPS "lttng_nesting" level printout.

These "nice to have" patches are a bit larger.

The other patches in the LTTng tree can either wait or are planned for deprecation. The instrumentation patches can be considered for mainlining later on. Replacing the "Kernel Markers" infrastructure still used in LTTng with TRACE_EVENT() will shorten the LTTng tree considerably.

> Or rather, how difficult would it be to simply run
> the LTT userland on top of the tracing code that is
> in mainline, even at the loss of some functionality?

Moving the LTT userland to these tools would require a large rewrite of the code that reads the trace format, and still, my users' needs would not be fulfilled.
By moving the LTT userland code to the less efficient schemes used in Perf and Ftrace, my users would simply lose in terms of performance, features, and information accuracy. The problem here is that the tracing tools in mainline are not suitable for my users' needs. Amongst them, Perf lacks the flight recorder mode, is painfully slow, and generates huge amounts of useless event header data. Ftrace's event header size is slightly better than Perf's, but its handling of time-stamps with respect to concurrency can lead users to wrong results in terms of irq and softirq handler duration. LTTng event headers, IMHO, are by far superior to those of Perf and Ftrace, and this is what lets LTTng have a lower overhead while keeping less complex, yet more generic, event headers, and provide more accurate information.

I am currently working on a trace format converter to/from a common trace format (for which I'm at round 3 of the requirements RFC document: http://lkml.indiana.edu/hypermail/linux/kernel/1009.0/00116.html), so at least we can start sharing the userland tools. However, given the missing pieces in Perf and Ftrace, LTTng still fulfills a pressing user need that's overlooked by the other tracers.

Thanks,

Mathieu

>
> Thanks,
>
> -Andi

--
Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-06 17:29 ` Mathieu Desnoyers @ 2010-09-07 7:22 ` Andi Kleen 2010-09-07 19:12 ` Mathieu Desnoyers 0 siblings, 1 reply; 5+ messages in thread From: Andi Kleen @ 2010-09-07 7:22 UTC (permalink / raw) To: Mathieu Desnoyers Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

On Mon, 6 Sep 2010 13:29:20 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

Mathieu,

One experience I have from closely looking at other long-term forked patchkits is that over time they tend to accumulate stuff that is not really needed, as well as various things which are very easy to integrate. It sounds like you have some candidates like this here.

> - Adding interfaces to dynamic kprobes and tracepoints to list the
> currently available instrumentation, as well as notifiers to let LTTng
> know about events appearing while tracing runs (e.g. module loaded,
> new dynamic probe added).

That sounds trivial.

> - Export the splice_to_pipe symbol (and probably some more I do not
> recall at the moment).

Ditto.

> - Add the ability to read the module list coherently in multiple reads
> when racing with module load/unload.

Can't you just take the module_mutex?

> - Either add the ability to fault in NMI handlers, or add a call to
> vmalloc_sync_all() each time a module is loaded, or export
> vmalloc_sync_all() to GPL modules so they can fault in the
> memory after using vmalloc but before the memory is used by
> the tracer.

I thought Linus had fixed that in the page fault handler?

It's a generic problem hit by other code, so it needs to be fixed in mainline in any case.

> - CPU idle notifiers (for trace streaming with deferrable
> timers).

x86 has them already, otherwise i7300_idle et al. wouldn't work.
What do you need that they don't do?

> - Poll wait exclusive (to address the thundering herd problem in
> poll()).

How does that work? Wouldn't that break poll semantics? If not, it sounds like a general improvement.

I assume epoll already does it?

> - prio_heap.c: new remove_maximum(), replace() and cherrypick().
> - Inline memcpy().

What's that? gcc does inline memcpy.

> - Trace clock:
> - Faster trace clock implementation.

What's the problem here? If it's faster it should be integrated.

I know that the old sched_clock did some horrible things that could be improved.

> - Export the faster trace clock to userspace for UST through a vDSO.

A new vDSO? This should be just a register_posix_clock and some glue in x86 vdso/. Makes sense to have, although I would prefer a per-thread clock to a per-CPU clock, I think. But per-CPU should be fine too.

> - Jump based on asm goto, which will minimize the impact of disabled
> tracepoints (the patchset is being proposed by Jason Baron).

I think that is in progress.

BTW I'm still hoping that the old "self modifying booleans" patchkit will make it back at some point. I liked it as a general facility.

> - Kernel OOPS "lttng_nesting" level printout.

This sounds very optional.

> Ftrace's event header size is slightly better than
> Perf's, but its handling of time-stamps with respect to concurrency can
> lead users to wrong results in terms of irq and softirq handler
> duration.

What is the problem with ftrace time stamps? They should all just be per CPU?

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-07 7:22 ` Andi Kleen @ 2010-09-07 19:12 ` Mathieu Desnoyers 0 siblings, 0 replies; 5+ messages in thread From: Mathieu Desnoyers @ 2010-09-07 19:12 UTC (permalink / raw) To: Andi Kleen Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

* Andi Kleen (andi@firstfloor.org) wrote:
> On Mon, 6 Sep 2010 13:29:20 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> Mathieu,
>
> One experience I have from closely looking at other long-term
> forked patchkits is that over time they tend to accumulate
> stuff that is not really needed, as well as various things which
> are very easy to integrate. It sounds like you have some
> candidates like this here.
>
> > - Adding interfaces to dynamic kprobes and tracepoints to list the
> > currently available instrumentation, as well as notifiers to let LTTng
> > know about events appearing while tracing runs (e.g. module loaded,
> > new dynamic probe added).
>
> That sounds trivial.
>
> > - Export the splice_to_pipe symbol (and probably some more I do not
> > recall at the moment).
>
> Ditto.
>
> > - Add the ability to read the module list coherently in multiple reads
> > when racing with module load/unload.
>
> Can't you just take the module_mutex?

I just looked over my code again and remembered that LTTng doesn't use read() to list the available markers anymore. It shows them in a debugfs directory tree instead, which makes the problem caused by chunk-wise read() of seqfiles go away. So by making sure the LTTng tracepoint table mutex always nests inside the module_mutex, I should be OK. Now that module_mutex is exported to modules, it should work fine.
module load
  lock module_mutex (1)
  execute module notifiers
    call lttng callback
      lock lttng tracepoint mutex (2)

trace session creation
  lock module_mutex (1)
  lock lttng tracepoint mutex (2)
  iterate on core kernel tracepoints
  iterate on each module's tracepoints
  unlock lttng tracepoint mutex
  unlock module_mutex

> > - Either add the ability to fault in NMI handlers, or add a call to
> > vmalloc_sync_all() each time a module is loaded, or export
> > vmalloc_sync_all() to GPL modules so they can fault in the
> > memory after using vmalloc but before the memory is used by
> > the tracer.
>
> I thought Linus had fixed that in the page fault handler?
>
> It's a generic problem hit by other code, so it needs to be fixed
> in mainline in any case.

Yes, Linus and I exchanged a few emails about this, and I think it led to an interesting solution (we still had to take care of MCE by using another stack rather than reserving space at the beginning of the NMI stack). I had to postpone working more on this issue, since I have to follow the priorities of the companies for whom I'm contracting. I plan to work more on this as soon as I can push the more pressing stuff into mainline.

> > - CPU idle notifiers (for trace streaming with deferrable
> > timers).
>
> x86 has them already, otherwise i7300_idle et al. wouldn't work.
> What do you need that they don't do?

The problem is that this is only implemented for x86, and LTTng is arch-agnostic. So I need these CPU idle notifiers on all architectures, and at least a dummy notifier chain that lets LTTng build on architectures other than x86. We can complement this with a HAVE_CPU_IDLE_NOTIFIER build define so the arch-agnostic code can choose whether to use deferrable timers.

> > - Poll wait exclusive (to address the thundering herd problem in
> > poll()).
>
> How does that work?
I have the LTTng poll file operation call a new:

  poll_wait_set_exclusive(wait);

which makes sure that when we have multiple threads waiting on the same file descriptor (which represents a ring buffer), only one of the threads is woken up.

> Wouldn't that break poll semantics?

The way I currently do it, yes, but I think we could do better. Basically, what I need is for a poll wakeup to trigger an exclusive synchronous wakeup, and then re-check the wakeup condition. AFAIU, the usual poll semantics seem to be that all poll()/epoll() waiters should be notified of state changes on all examined file descriptors. But whether we should do the wakeup first, wait for the woken-up thread to run (possibly consuming the data), and only then check if we must continue going through the wakeup chain is left as a grey zone (ref. http://www.opengroup.org/onlinepubs/009695399/functions/poll.html).

> If not, it sounds like a general improvement.
>
> I assume epoll already does it?

Nope, if I believe epoll(7):

  Q2  Can two epoll instances wait for the same file descriptor? If so,
      are events reported to both epoll file descriptors?
  A2  Yes, and events would be reported to both. However, careful
      programming may be needed to do this correctly.

> > - prio_heap.c: new remove_maximum(), replace() and cherrypick().
> > - Inline memcpy().
>
> What's that? gcc does inline memcpy.

My understanding (and the result of some testing) is that gcc provides an inline memcpy when the length to copy is known statically (and smaller than or equal to 64 bytes). However, as soon as the length is determined dynamically, we go for the function call in the Linux kernel. The problem with tracing is that the typical data copy is (a) a small payload (b) whose size is unknown statically. So paying the price of an extra function call for each field hurts performance a lot.

> > - Trace clock:
> > - Faster trace clock implementation.
>
> What's the problem here? If it's faster it should be integrated.
>
> I know that the old sched_clock did some horrible things
> that could be improved.

I think the problem lies in several aspects. Let's have a look at all three trace clock flavors provided:

"local: CPU-local trace clock"
  * Uses sched_clock(), which provides completely inaccurate synchronization across CPUs.

"medium: scalable global clock with some jitter" (which seems to be the compromise between high overhead and precision)
  * It can have a ~1 jiffy jitter between the CPUs. This is just too much for low-level tracing.

"global: globally monotonic, serialized clock"
  * Disables interrupts (slow).
  * Takes a spin lock (awfully slow, non-scalable).
  * Uses a sequence lock, and deals with the NMI deadlock risk by providing inaccurate clock values to NMI handlers.

In addition, the whole trace clock infrastructure in the Linux kernel is based on callbacks, somehow assuming that function calls cost nothing, which is wrong. Also, the mainline trace clock returns the time in nanoseconds, which adds useless calculations on the fast path.

The LTTng trace clock does the following:

1) Uses an RCU read-side instead of a sequence lock to deal with concurrency in an NMI-safe fashion. RCU is faster than the sequence lock too, because it does not require a memory barrier.

2) Returns the time in an arbitrary unit, which is scaled to ns by exporting the unit-to-ns ratio once at trace start. For architectures with a constant, synchronized TSC, this means that reading the time is a simple rdtsc.

3) Implements the trace_clock() read function as a static inline, because it is typically embedded in the tracing function, so we save a function call.

4) (Done on ARM OMAP3, not x86 yet.) Handles dynamic frequency scaling affecting the cycle counter speed by hooking into CPU frequency notifiers and power management notifiers, which resynchronize on an external clock or simply change the current frequency and last time-to-cycles snapshot. Units returned are nanoseconds.
For x86, we would also need to hook into the idle loop to monitor the "hlt" instruction. The goal here is to ensure that the returned time is within a controlled max delta from other CPUs. All this management is done with RCU synchronization.

> > - Export the faster trace clock to userspace for UST through a vDSO.
>
> A new vDSO? This should be just a register_posix_clock and some
> glue in x86 vdso/. Makes sense to have, although I would prefer
> a per-thread clock to a per-CPU clock, I think. But per-CPU
> should be fine too.

I'm not sure I follow. In LTTng, we don't really care about having per-CPU time. We want a fast and precise-enough global clock. Extending the POSIX clocks possibly makes sense, although it would be important that the fast path is not much more than a "rdtsc" instruction and a few tests.

> > - Jump based on asm goto, which will minimize the impact of disabled
> > tracepoints (the patchset is being proposed by Jason Baron).
>
> I think that is in progress.

Yes, it is.

> BTW I'm still hoping that the old "self modifying booleans"
> patchkit will make it back at some point. I liked it as a general
> facility.

I still have it in the LTTng tree, although I stopped actively pushing it when the static jump patching project started, because that project has all I need for tracing. We can revisit the "Immediate Values" patches once I get the most important tracing pieces in.

> > - Kernel OOPS "lttng_nesting" level printout.
>
> This sounds very optional.

Sure. That's more for my own peace of mind when I see an OOPS coming from my users: it tells me whether the problem came from the Linux kernel per se or from my tracer. It's been a while since I've seen my tracer cause any OOPS though.

> > Ftrace's event header size is slightly better than
> > Perf's, but its handling of time-stamps with respect to concurrency can
> > lead users to wrong results in terms of irq and softirq handler
> > duration.
>
> What is the problem with ftrace time stamps?
> They should all just be per CPU?

Ftrace uses both per-CPU (the "local" flavor) and "global" timestamps (it can choose between the two depending on the tracing scope). So already it's either fast and inaccurate, or globally accurate and painfully slow/non-scalable.

But the problem goes further: in the Ftrace ring buffer implementation, all nested commits have zero time deltas, so in the following execution sequence:

trace_some_event()
  reserve ring buffer space and read trace clock (time = 10000000 ns)
  start to copy data into the buffer
  - nested interrupt comes in
    trace_some_other_event()
      reserve ring buffer space and read trace clock (time = 10000000 ns)
      copy data into ring buffer
      commit event
    execute softirq
      trace_yet_another_event()
        reserve ring buffer space and read trace clock (time = 10000000 ns)
        copy data into ring buffer
        commit event
    iret
  finish copying data
  commit event

we end up having all events nested over the tracing code (including interrupt handlers and softirqs) with a duration of exactly 0 ns. This is not really convenient when we try to figure out things like the average duration of an interrupt handler, the longest/shortest handler duration, etc.

Thanks,

Mathieu

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.

--
Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com