* [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules @ 2010-09-03 13:12 Mathieu Desnoyers 2010-09-03 14:47 ` Andi Kleen 0 siblings, 1 reply; 5+ messages in thread From: Mathieu Desnoyers @ 2010-09-03 13:12 UTC (permalink / raw) To: ltt-dev Cc: linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen

Hi everyone,

Here is some news that should please Linux distributions which have been overwhelmed by the size of the LTTng patchset. I have extracted the LTTng tracer patches from the LTTng kernel tree and repackaged them into a new "lttng-modules" package. There is still a dependency on the LTTng kernel tree at the moment, but the objective is to gradually reduce the size of this 5-year-long mainline fork.

The goal of this re-packaging is to make life easier for LTTng users. Some distributions have been shipping the LTTng tree for years (Wind River, MontaVista, STLinux). I am still planning to contribute some LTTng pieces to mainline, but this restructuring will let me focus on the parts that really need to go into mainline without making my users suffer any longer.

The development of this package is done in the following tree: git://lttng.org/lttng-modules.git It will follow a linear development workflow (no more rebases). The kernel LTTng tree (on which the LTTng tracer module depends) will still be rebased on Linux mainline.

This is a first step towards cleaning up the LTTng kernel tree. The following steps will be to refactor the LTTng tree patches, remove part of the instrumentation from the kernel tree, migrate to TRACE_EVENT(), and migrate to the generic ring buffer library.

This change is effective as of LTTng 0.227 for kernel 2.6.35.4. This matches the new lttng-modules package version 0.16.
LTTng, the Linux Trace Toolkit Next Generation, is a project that aims to produce a highly efficient full system tracing solution. It is composed of several components to allow tracing of the kernel and of userspace, trace viewing and analysis, and trace streaming.

Project website: http://lttng.org
Download link: http://lttng.org/content/download (please refer to the LTTng Manual for installation instructions)

Enjoy!

Mathieu

--
Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-03 13:12 [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules Mathieu Desnoyers @ 2010-09-03 14:47 ` Andi Kleen 2010-09-06 17:29 ` Mathieu Desnoyers 0 siblings, 1 reply; 5+ messages in thread From: Andi Kleen @ 2010-09-03 14:47 UTC (permalink / raw) To: Mathieu Desnoyers Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

On Fri, 3 Sep 2010 09:12:13 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Here is some news that should please Linux distributions which have been
> overwhelmed by the size of the LTTng patchset. I have extracted the
> LTTng tracer patches from the LTTng kernel tree and repackaged them
> into a new "lttng-modules" package. There is still a dependency on
> the LTTng kernel tree at the moment, but the objective is to
> gradually reduce the size of this 5-year-long mainline fork.

Efforts to get rid of forks are always good.

Could you perhaps elaborate a bit on what changes you need in mainline (ideally separated into "essential" and "nice to have") and how big the leftover patches are?

Or rather, how difficult would it be to simply run the LTT userland on top of the tracing code that is in mainline, even at the loss of some functionality?

Thanks,

-Andi
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-03 14:47 ` Andi Kleen @ 2010-09-06 17:29 ` Mathieu Desnoyers 2010-09-07 7:22 ` Andi Kleen 0 siblings, 1 reply; 5+ messages in thread From: Mathieu Desnoyers @ 2010-09-06 17:29 UTC (permalink / raw) To: Andi Kleen Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

* Andi Kleen (andi@firstfloor.org) wrote:
> On Fri, 3 Sep 2010 09:12:13 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > Here is some news that should please Linux distributions which have been
> > overwhelmed by the size of the LTTng patchset. I have extracted the
> > LTTng tracer patches from the LTTng kernel tree and repackaged them
> > into a new "lttng-modules" package. There is still a dependency on
> > the LTTng kernel tree at the moment, but the objective is to
> > gradually reduce the size of this 5-year-long mainline fork.
>
> Efforts to get rid of forks are always good.
>
> Could you perhaps elaborate a bit on what changes you need
> in mainline (ideally separated into "essential" and "nice to
> have") and how big the leftover patches are?

Sure. Here are the details of the pieces that still have to be kept in the LTTng tree:

* Essential:

- Adding interfaces to dynamic kprobes and tracepoints to list the currently available instrumentation, as well as notifiers to let LTTng know about events appearing while tracing runs (e.g. module loaded, new dynamic probe added).
- Export the splice_to_pipe symbol (and probably some more I do not recall at the moment).
- Add the ability to read the module list coherently in multiple reads when racing with module load/unload.
- Either add the ability to fault in NMI handlers, or add a call to vmalloc_sync_all() each time a module is loaded, or export vmalloc_sync_all() to GPL modules so they can fault in the memory after using vmalloc but before the memory is used by the tracer.

These essential patches are very small.

* Nice to have:

- Support for the LTTng statedump, which saves the initial kernel state into the trace at trace start:
  - EXPORT_SYMBOL_GPL() for tasklist_lock, irq_desc, ...
  - Add per-arch iterators to dump the list of system calls and the IDT into the trace.
- Generic ring buffer library.
- Generic alignment API.
- Mark the atomic notifier call chain "notrace".
- CPU idle notifiers (for trace streaming with deferrable timers).
- Poll wait exclusive (to address the thundering herd problem in poll()).
- prio_heap.c: new remove_maximum(), replace() and cherrypick().
- Inline memcpy().
- Trace clock:
  - Faster trace clock implementation.
  - Export the faster trace clock to userspace for UST through a vDSO.
- Jump based on asm goto, which will minimize the impact of disabled tracepoints (the patchset is being proposed by Jason Baron).
- Kernel OOPS "lttng_nesting" level printout.

These "nice to have" patches are a bit larger.

The other patches in the LTTng tree can either wait or are planned for deprecation. The instrumentation patches can be considered for mainlining later on. Replacing the "Kernel Markers" infrastructure still used in LTTng with TRACE_EVENT() will shorten the LTTng tree considerably.

> Or rather, how difficult would it be to simply run
> the LTT userland on top of the tracing code that is
> in mainline, even at the loss of some functionality?

Moving the LTT userland to these tools would require a large rewrite of the code that reads the trace format, and still, my users' needs would not be fulfilled.
By moving the LTT userland code to the less efficient schemes used in Perf and Ftrace, my users would simply lose in terms of performance, features, and information accuracy. The problem here is that the tracing tools in mainline are not suitable for my users' needs. Amongst them, Perf lacks the flight recorder mode, is painfully slow, and generates huge amounts of useless event header data. Ftrace's event header size is slightly better than Perf's, but its handling of time-stamps with respect to concurrency can lead users to wrong results in terms of irq and softirq handler duration. LTTng event headers, IMHO, are by far superior to those of Perf and Ftrace, and this is what lets LTTng have a lower overhead while keeping less complex, yet more generic, event headers, and provide more accurate information.

I am currently working on a trace format converter to/from a common trace format (for which I'm at round 3 of the requirements RFC document: http://lkml.indiana.edu/hypermail/linux/kernel/1009.0/00116.html), so at least we can start sharing the userland tools. However, given the missing pieces in Perf and Ftrace, LTTng still fulfills a pressing user need that's overlooked by the other tracers.

Thanks,

Mathieu

>
> Thanks,
>
> -Andi

--
Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-06 17:29 ` Mathieu Desnoyers @ 2010-09-07 7:22 ` Andi Kleen 2010-09-07 19:12 ` Mathieu Desnoyers 0 siblings, 1 reply; 5+ messages in thread From: Andi Kleen @ 2010-09-07 7:22 UTC (permalink / raw) To: Mathieu Desnoyers Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

On Mon, 6 Sep 2010 13:29:20 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

Mathieu,

One experience I have from closely looking at other long-term forked patchkits is that over time they tend to accumulate stuff that is not really needed, as well as various things which are very easy to integrate. It sounds like you have some candidates like this here.

> - Adding interfaces to dynamic kprobes and tracepoints to list the
> currently available instrumentation, as well as notifiers to let LTTng
> know about events appearing while tracing runs (e.g. module loaded,
> new dynamic probe added).

That sounds trivial.

> - Export the splice_to_pipe symbol (and probably some more I do not
> recall at the moment).

Ditto.

> - Add the ability to read the module list coherently in multiple reads
> when racing with module load/unload.

Can't you just take the module_mutex?

> - Either add the ability to fault in NMI handlers, or add a call to
> vmalloc_sync_all() each time a module is loaded, or export
> vmalloc_sync_all() to GPL modules so they can fault in the
> memory after using vmalloc but before the memory is used by
> the tracer.

I thought Linus had fixed that in the page fault handler?

It's a generic problem hit by other code, so it needs to be fixed in mainline in any case.

> - CPU idle notifiers (for trace streaming with deferrable
> timers).

x86 has them already, otherwise i7300_idle et al. wouldn't work.
What do you need that they don't do?

> - Poll wait exclusive (to address the thundering herd problem in
> poll()).

How does that work? Wouldn't that break poll semantics? If not, it sounds like a general improvement.

I assume epoll already does it?

> - prio_heap.c: new remove_maximum(), replace() and cherrypick().
> - Inline memcpy().

What's that? gcc does inline memcpy.

> - Trace clock:
> - Faster trace clock implementation.

What's the problem here? If it's faster it should be integrated.

I know that the old sched_clock did some horrible things that could be improved.

> - Export the faster trace clock to userspace for UST through a vDSO.

A new vDSO? This should be just a register_posix_clock and some glue in x86 vdso/. Makes sense to have, although I would prefer a per-thread clock to a per-CPU clock, I think. But per-CPU should be fine too.

> - Jump based on asm goto, which will minimize the impact of disabled
> tracepoints (the patchset is being proposed by Jason Baron).

I think that is in progress.

BTW I'm still hoping that the old "self modifying booleans" patchkit will make it back at some point. I liked it as a general facility.

> - Kernel OOPS "lttng_nesting" level printout.

This sounds very optional.

> Ftrace's event header size is slightly better than
> Perf's, but its handling of time-stamps with respect to concurrency can
> lead users to wrong results in terms of irq and softirq handler
> duration.

What is the problem with ftrace time stamps? They should all just be per CPU?

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [ANNOUNCEMENT] LTTng tracer re-packaged as stand-alone modules 2010-09-07 7:22 ` Andi Kleen @ 2010-09-07 19:12 ` Mathieu Desnoyers 0 siblings, 0 replies; 5+ messages in thread From: Mathieu Desnoyers @ 2010-09-07 19:12 UTC (permalink / raw) To: Andi Kleen Cc: ltt-dev, linux-kernel, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

* Andi Kleen (andi@firstfloor.org) wrote:
> On Mon, 6 Sep 2010 13:29:20 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> Mathieu,
>
> One experience I have from closely looking at other long-term
> forked patchkits is that over time they tend to accumulate
> stuff that is not really needed, as well as various things which
> are very easy to integrate. It sounds like you have some
> candidates like this here.
>
> > - Adding interfaces to dynamic kprobes and tracepoints to list the
> > currently available instrumentation, as well as notifiers to let LTTng
> > know about events appearing while tracing runs (e.g. module loaded,
> > new dynamic probe added).
>
> That sounds trivial.
>
> > - Export the splice_to_pipe symbol (and probably some more I do not
> > recall at the moment).
>
> Ditto.
>
> > - Add the ability to read the module list coherently in multiple reads
> > when racing with module load/unload.
>
> Can't you just take the module_mutex?

I just looked over my code again and remembered that LTTng doesn't use read() to list the available markers anymore. It shows them in a debugfs directory tree instead, which makes the problem caused by chunk-wise read() of seqfiles go away. So by making sure the LTTng tracepoint table mutex always nests inside the module_mutex, I should be OK. Now that module_mutex is exported to modules, it should work fine.
module load
  lock module_mutex (1)
  execute module notifiers
    call lttng callback
      lock lttng tracepoint mutex (2)

trace session creation
  lock module_mutex (1)
  lock lttng tracepoint mutex (2)
  iterate on core kernel tracepoints
  iterate on each module's tracepoints
  unlock lttng tracepoint mutex
  unlock module_mutex

> > - Either add the ability to fault in NMI handlers, or add a call to
> > vmalloc_sync_all() each time a module is loaded, or export
> > vmalloc_sync_all() to GPL modules so they can fault in the
> > memory after using vmalloc but before the memory is used by
> > the tracer.
>
> I thought Linus had fixed that in the page fault handler?
>
> It's a generic problem hit by other code, so it needs to be fixed
> in mainline in any case.

Yes, Linus and I exchanged a few emails about this, and I think it led to an interesting solution (we still had to take care of MCE by using another stack rather than reserving space at the beginning of the NMI stack). I had to postpone working more on this issue, since I have to follow the priorities of the companies for whom I'm contracting. I plan to work more on this as soon as I can push the more pressing stuff into mainline.

> > - CPU idle notifiers (for trace streaming with deferrable
> > timers).
>
> x86 has them already, otherwise i7300_idle et al. wouldn't work.
> What do you need that they don't do?

The problem is that this is only implemented for x86, and LTTng is arch-agnostic. So I need these CPU idle notifiers on all architectures, and at least a dummy notifier chain that lets LTTng build on architectures other than x86. We can complement this with a HAVE_CPU_IDLE_NOTIFIER build define so the arch-agnostic code can choose whether to use deferrable timers.

> > - Poll wait exclusive (to address the thundering herd problem in
> > poll()).
>
> How does that work?
I have the LTTng poll file operation call a new:

  poll_wait_set_exclusive(wait);

which makes sure that when we have multiple threads waiting on the same file descriptor (which represents a ring buffer), only one of the threads is woken up.

> Wouldn't that break poll semantics?

The way I currently do it, yes, but I think we could do better. Basically, what I need is for a poll wakeup to trigger an exclusive synchronous wakeup, and then re-check the wakeup condition. AFAIU, the usual poll semantics seem to be that all poll()/epoll() waiters should be notified of state changes on all examined file descriptors. But whether we should do the wakeup first, wait for the woken-up thread to run (possibly consuming the data), and only then check if we must continue going through the wakeup chain is left as a grey zone (ref. http://www.opengroup.org/onlinepubs/009695399/functions/poll.html).

> If not, it sounds like a general improvement.
>
> I assume epoll already does it?

Nope, if I believe epoll(7):

  Q2  Can two epoll instances wait for the same file descriptor? If so,
      are events reported to both epoll file descriptors?
  A2  Yes, and events would be reported to both. However, careful
      programming may be needed to do this correctly.

> > - prio_heap.c: new remove_maximum(), replace() and cherrypick().
> > - Inline memcpy().
>
> What's that? gcc does inline memcpy.

My understanding (and the result of some testing) is that gcc provides an inline memcpy when the length to copy is known statically (and smaller than or equal to 64 bytes). However, as soon as the length is determined dynamically, we go for the function call in the Linux kernel. The problem with tracing is that the typical data copy is (a) a small payload (b) whose size is unknown statically. So paying the price of an extra function call for each field hurts performance a lot.

> > - Trace clock:
> > - Faster trace clock implementation.
>
> What's the problem here? If it's faster it should be integrated.
>
> I know that the old sched_clock did some horrible things
> that could be improved.

I think the problem lies in several aspects. Let's have a look at all three trace clock flavors provided:

"local: CPU-local trace clock"
  * Uses sched_clock(), which provides completely inaccurate synchronization across CPUs.

"medium: scalable global clock with some jitter" (which seems to be the compromise between high overhead and precision)
  * It can have a ~1 jiffy jitter between the CPUs. This is just too much for low-level tracing.

"global: globally monotonic, serialized clock"
  * Disables interrupts (slow).
  * Takes a spin lock (awfully slow, non-scalable).
  * Uses a sequence lock, and deals with the NMI deadlock risk by providing inaccurate clock values to NMI handlers.

In addition, the whole trace clock infrastructure in the Linux kernel is based on callbacks, somehow assuming that function calls cost nothing, which is wrong. Also, the mainline trace clock returns the time in nanoseconds, which adds useless calculations on the fast path.

The LTTng trace clock does the following:

1) Uses an RCU read-side instead of a sequence lock to deal with concurrency in an NMI-safe fashion. RCU is faster than the sequence lock too, because it does not require a memory barrier.

2) Returns the time in an arbitrary unit, which is scaled to ns by exporting the unit-to-ns ratio once at trace start. For architectures with a constant, synchronized TSC, this means that reading the time is a simple rdtsc.

3) Implements the trace_clock() read function as a static inline, because it is typically embedded in the tracing function, so we save a function call.

4) (Done on ARM OMAP3, not x86 yet.) Handles dynamic frequency scaling affecting the cycle counter speed by hooking into CPU frequency notifiers and power management notifiers, which resynchronize on an external clock or simply change the current frequency and last time-to-cycles snapshot. Units returned are nanoseconds.
For x86, we would also need to hook into the idle loop to monitor the "hlt" instruction. The goal here is to ensure that the returned time is within a controlled max delta from other CPUs. All this management is done with RCU synchronization.

> > - Export the faster trace clock to userspace for UST through a vDSO.
>
> A new vDSO? This should be just a register_posix_clock and some
> glue in x86 vdso/. Makes sense to have, although I would prefer
> a per-thread clock to a per-CPU clock, I think. But per-CPU
> should be fine too.

I'm not sure I follow. In LTTng, we don't really care about having per-CPU time. We want a fast and precise-enough global clock. Extending the POSIX clocks possibly makes sense, although it would be important that the fast path is not much more than a "rdtsc" instruction and a few tests.

> > - Jump based on asm goto, which will minimize the impact of disabled
> > tracepoints (the patchset is being proposed by Jason Baron).
>
> I think that is in progress.

Yes, it is.

> BTW I'm still hoping that the old "self modifying booleans"
> patchkit will make it back at some point. I liked it as a general
> facility.

I still have it in the LTTng tree, although I stopped actively pushing it when the static jump patching project started, because that project has all I need for tracing. We can revisit the "Immediate Values" patches once I get the most important tracing pieces in.

> > - Kernel OOPS "lttng_nesting" level printout.
>
> This sounds very optional.

Sure. That's more for my own peace of mind when I see an OOPS coming from my users: it tells me whether the problem came from the Linux kernel per se or from my tracer. It's been a while since I've seen my tracer cause any OOPS though.

> > Ftrace's event header size is slightly better than
> > Perf's, but its handling of time-stamps with respect to concurrency can
> > lead users to wrong results in terms of irq and softirq handler
> > duration.
>
> What is the problem with ftrace time stamps?
> They should all just be per CPU?

Ftrace uses both per-CPU (the "local" flavor) and "global" timestamps (it can choose between the two depending on the tracing scope). So already it's either fast and inaccurate, or globally accurate and painfully slow/non-scalable.

But the problem goes further: in the Ftrace ring buffer implementation, all nested commits have zero time deltas, so in the following execution sequence:

trace_some_event()
  reserve ring buffer space and read trace clock (time = 10000000 ns)
  start to copy data into the buffer
  - nested interrupt comes in
    trace_some_other_event()
      reserve ring buffer space and read trace clock (time = 10000000 ns)
      copy data into ring buffer
      commit event
    execute softirq
      trace_yet_another_event()
        reserve ring buffer space and read trace clock (time = 10000000 ns)
        copy data into ring buffer
        commit event
    iret
  finish copying data
  commit event

we end up having all events nested over the tracing code (including interrupt handlers and softirqs) with a duration of exactly 0 ns. This is not really convenient when we try to figure out things like the average duration of an interrupt handler, the longest/shortest handler duration, etc.

Thanks,

Mathieu

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.

--
Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com