[RFC] Unified Ring Buffer (Next Generation)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] Unified Ring Buffer (Next Generation)
@ 2010-05-19 17:51 Steven Rostedt
  2010-05-19 18:10 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-19 17:51 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen

More than a year and a half ago (September 2008), at Linux Plumbers, we
had a meeting with several kernel developers to come up with a unified
ring buffer. A generic ring buffer in the kernel that any subsystem
could use. After coming up with a set of requirements, I worked on
implementing it. One of the requirements was to start off simple and
work to become a more complete buffering system.

I posted a set of patches to LKML and several developers (including
Linus) got involved in the design of the ring buffer:

Here's the thread that started the development:

http://lkml.org/lkml/2008/9/19/275

And the ring buffer we ended with here:

http://lkml.org/lkml/2008/9/29/155

And a nice article in LWN about it as well:

http://lwn.net/Articles/300992/

This ring buffer replaced ftrace's ring buffer, as well as oprofile's
ring buffer, and other utilities in the kernel moved over to interacting
with ftrace directly. Although, the ring buffer was a separate entity
from ftrace and it was not required to use ftrace to use the ring
buffer.

The design of the ring buffer was focused more towards in kernel users
and for use with the splice() system call. It did not (and still does
not) support a mmap interface.

In December of 2008 a new utility was created called "perf". At the time
it was a performance counter. In September of 2009, it was converted
over into performance events.

At the time, the unified ring buffer was still not lockless, so it could
lose events in NMI context.

Peter Zijlstra, took a look at the unified ring buffer and found that it
did not suite his needs. He needed a reliable ring buffer in NMI context
as well as something that can mmap to userspace.

At that time, I was working on other aspects of the kernel and did not
have the time to help him come up with something that he could use.
Having to get work done, Peter implemented his own ring buffer for use
with perf.

I do not blame Peter for this, since any developer (including myself)
would have done the same.

Unfortunately, we are now back with more than one ring buffer in the
kernel. What's worse, neither of them can perform all the features
needed. This is putting a bit of stress on the users of these tools, not
to mention the stress on the developers as well.

In June of 2009, I finally made the ring buffer lockless:

http://lkml.org/lkml/2009/6/10/381

Again, LWN wrote up a nice article about this as well:

http://lwn.net/Articles/340400/

But it was too late, and still did not support mmap. Perf was already
dependent on its own ring buffer, and now we are back to where we were
before the unified ring buffer existed.

This email is about finding a solution to the problem. If we can once
again create a generic ring buffer that handles all requirements, then
we can also merge the functionality of ftrace into perf, and lower the
duplication of code within the kernel.

This time around, I'm asking Mathieu Desnoyers to come to the plate, and
see if he can handle the task.

I'm hoping that this email will start a thread that gets everyone into
agreement and produces something that will make everyone happy.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt
@ 2010-05-19 18:10 ` Andi Kleen
  2010-05-19 18:44   ` Steven Rostedt
  2010-05-19 18:47   ` Mathieu Desnoyers
  2010-05-19 19:05 ` Mathieu Desnoyers
  2010-05-20  9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar
  2 siblings, 2 replies; 24+ messages in thread
From: Andi Kleen @ 2010-05-19 18:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen

On Wed, May 19, 2010 at 01:51:54PM -0400, Steven Rostedt wrote:

Hi Steven,

> More than a year and a half ago (September 2008), at Linux Plumbers, we
> had a meeting with several kernel developers to come up with a unified
> ring buffer. A generic ring buffer in the kernel that any subsystem
> could use. After coming up with a set of requirements, I worked on

If we take a step back.

Why do you want a single ring buffer for everyone?

A lot more low profile subsystems subsystems simply use kfifo
(which is also actively developed by Stefanie). In fact there
are far more users of it than of your ring buffer. And it's 
really quite simple and easy to use. And it works fine for them.

I don't think it's that great a goal to have a single ring buffer
for all possible ring buffer needs. After all the requirements
are quite different.

Some want a simple ring buffer with minimal overhead
and simple interface, others need a mmaped one or have other special 
requirements. One size doesn't fit all. 

It's also not that we're talking about gigantic amounts of code
in all cases where there is a pressing need to unify.

If perf's current ring buffer works for it why not keep using it? 

One problem I always had with your version was that it's quite 
bloated frankly, especially in terms of code size, but 
also in its data structures and in the interface complexity.

For debugging kernels etc. with tracing that's not that big an issue, but
I think it's a problem for "non debugging" use. After all Linux
still has the goal to be at least configurable as a low footprint operating
system.

Part of the reason for its big code size seems to be that 
it tries to support everyone's requirements, which unsurprisingly
leads to some bloat both in implementation and interface.

Also to be honest it's so clever now that at least I have
a hard time understanding it, and I personally prefer code
that I can understand over too clever code. After all if there
is a bug in there and you need to be more clever than the programmer
to debug it, how would that be done? 

Perhaps a better goal would be to have a smaller simpler more
maintainable buffer for ftrace and let the other users their own?

Just my 0.05 cent.

-Andi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 18:10 ` Andi Kleen
@ 2010-05-19 18:44   ` Steven Rostedt
  2010-05-19 19:25     ` Andi Kleen
  2010-05-19 18:47   ` Mathieu Desnoyers
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2010-05-19 18:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro

On Wed, 2010-05-19 at 20:10 +0200, Andi Kleen wrote:
> On Wed, May 19, 2010 at 01:51:54PM -0400, Steven Rostedt wrote:
> 
> Hi Steven,
> 
> > More than a year and a half ago (September 2008), at Linux Plumbers, we
> > had a meeting with several kernel developers to come up with a unified
> > ring buffer. A generic ring buffer in the kernel that any subsystem
> > could use. After coming up with a set of requirements, I worked on
> 
> If we take a step back.
> 
> Why do you want a single ring buffer for everyone?

Actually, I would like to have one ring buffer interface, not one ring
buffer per say.

> 
> A lot more low profile subsystems subsystems simply use kfifo
> (which is also actively developed by Stefanie). In fact there
> are far more users of it than of your ring buffer. And it's 
> really quite simple and easy to use. And it works fine for them.
> 
> I don't think it's that great a goal to have a single ring buffer
> for all possible ring buffer needs. After all the requirements
> are quite different.
> 
> Some want a simple ring buffer with minimal overhead
> and simple interface, others need a mmaped one or have other special 
> requirements. One size doesn't fit all. 

Exactly! One size does not fit all. I've discussed this off list with
Maitheu, and we agree that different requirements call for different
types of ring buffers. Depending on mmapping, splice, single writer (no
need for per CPU buffers if there's only ever one writer), multiple
writers, etc.

What would be nice to have is a single interface where the same event
can be recorded using the same interface.

> 
> It's also not that we're talking about gigantic amounts of code
> in all cases where there is a pressing need to unify.
> 
> If perf's current ring buffer works for it why not keep using it? 

Because it does not work for the functionality of ftrace. And there's a
driving need to try to unify the two.

> 
> One problem I always had with your version was that it's quite 
> bloated frankly, especially in terms of code size, but 
> also in its data structures and in the interface complexity.

These are valid concerns, and even Mathieu has mentioned this. One way
is to develop a layered approach. Where the lowest layer is very simple
and small, but without much features. The higher layers have more
features but are built on top of the lower ones.

> 
> For debugging kernels etc. with tracing that's not that big an issue, but
> I think it's a problem for "non debugging" use. After all Linux
> still has the goal to be at least configurable as a low footprint operating
> system.

Agreed.

> 
> Part of the reason for its big code size seems to be that 
> it tries to support everyone's requirements, which unsurprisingly
> leads to some bloat both in implementation and interface.

OK, as I mentioned, what about a layered approach? Where one level
builds off of another? This may actually solve some of the bloat. That
is, try to keep the more complex users using the same code, but have
that on top of code that is simple for the users that have a restrictive
environment.

> 
> Also to be honest it's so clever now that at least I have
> a hard time understanding it, and I personally prefer code
> that I can understand over too clever code. After all if there
> is a bug in there and you need to be more clever than the programmer
> to debug it, how would that be done? 

Thanks for the complement ;-)

Yes, I've heard this complaint, and it is also why I'm asking Mathieu to
do the work and I'm staying away (except for helping with review and
requirements). I like the "clever" solutions.

> 
> Perhaps a better goal would be to have a smaller simpler more
> maintainable buffer for ftrace and let the other users their own?

Perhaps, but the goal is actually to handle all the features of perf and
ftrace in a generic fashion.

You also bring up a point that I try very hard to get across. When
people think of a ring buffer, they think of the ones that they created
in CS101, not realizing that when you are dealing with production
systems, handling the requirements makes the buffering much more
complex.

Perhaps the answer is we are fine with more than one ring buffer. I'm OK
with that too. But Linus did push this agenda back in the 2008 Kernel
Summit. I'm just continuing it.

> 
> Just my 0.05 cent.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 18:44   ` Steven Rostedt
@ 2010-05-19 19:25     ` Andi Kleen
  2010-05-19 19:38       ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2010-05-19 19:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Ingo Molnar, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro

On Wed, May 19, 2010 at 02:44:18PM -0400, Steven Rostedt wrote:
> > > More than a year and a half ago (September 2008), at Linux Plumbers, we
> > > had a meeting with several kernel developers to come up with a unified
> > > ring buffer. A generic ring buffer in the kernel that any subsystem
> > > could use. After coming up with a set of requirements, I worked on
> > 
> > If we take a step back.
> > 
> > Why do you want a single ring buffer for everyone?
> 
> Actually, I would like to have one ring buffer interface, not one ring
> buffer per say.

Why does every user in the kernel need the same interface? 

A user with simple requirements would prefer a simple and fast interface,
an user with complex requirement a complex requirements a complex one.
But they're not the same.

Or do you just want the same interface between ftrace and perf?

> What would be nice to have is a single interface where the same event
> can be recorded using the same interface.

I'm not fully sure what advantage that would have. Would you like
to exchange different ring buffers for existing users frequently?

> OK, as I mentioned, what about a layered approach? Where one level
> builds off of another? This may actually solve some of the bloat. That
> is, try to keep the more complex users using the same code, but have
> that on top of code that is simple for the users that have a restrictive
> environment.

Some of the complexity is in the underlying implementation isn't it?
e.g. single continuous buffer versus multiple ones. 

That cannot really be addressed by layers, unless you go to 
virtual call tables (which would seem a bad idea to me for such a hot
path) 

> > 
> > Perhaps a better goal would be to have a smaller simpler more
> > maintainable buffer for ftrace and let the other users their own?
> 
> Perhaps, but the goal is actually to handle all the features of perf and
> ftrace in a generic fashion.

Ok so your goal is to only cover perf and ftrace and let all 
the other users be alone? 

> 
> You also bring up a point that I try very hard to get across. When
> people think of a ring buffer, they think of the ones that they created
> in CS101, not realizing that when you are dealing with production
> systems, handling the requirements makes the buffering much more
> complex.

For most users the simple ones are totally adequate though.

> Perhaps the answer is we are fine with more than one ring buffer. I'm OK

That's what we already have in fact: kfifo, ftrace, perf, a couple
of private ones all over.

I think a couple of private ones could be converted to kfifo
(I did some of that for my own code) 

But I don't really see how you can marry the simple kfifo world
to the highend bells'n'whistles ftrace world. And even in the highend
world there might be space for multiple specialized ones.

On the other hand if the highend world had less overhead
then at least some of the "higher end" kfifo users could start 
considering to use it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 19:25     ` Andi Kleen
@ 2010-05-19 19:38       ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-19 19:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro

On Wed, 2010-05-19 at 21:25 +0200, Andi Kleen wrote:

> > Perhaps, but the goal is actually to handle all the features of perf and
> > ftrace in a generic fashion.
> 
> Ok so your goal is to only cover perf and ftrace and let all 
> the other users be alone? 
> 

Yes, my goal is to unify perf and ftrace, but that's because that is
what I am working most on.

> > 
> > You also bring up a point that I try very hard to get across. When
> > people think of a ring buffer, they think of the ones that they created
> > in CS101, not realizing that when you are dealing with production
> > systems, handling the requirements makes the buffering much more
> > complex.
> 
> For most users the simple ones are totally adequate though.
> 
> > Perhaps the answer is we are fine with more than one ring buffer. I'm OK
> 
> That's what we already have in fact: kfifo, ftrace, perf, a couple
> of private ones all over.
> 
> I think a couple of private ones could be converted to kfifo
> (I did some of that for my own code) 
> 
> But I don't really see how you can marry the simple kfifo world
> to the highend bells'n'whistles ftrace world. And even in the highend
> world there might be space for multiple specialized ones.
> 
> On the other hand if the highend world had less overhead
> then at least some of the "higher end" kfifo users could start 
> considering to use it.

If Mathieu can pull off the requirements that can handle all of ftrace
and perf in a light weight manner, then great. But if the end result is
still to much for these "private" users, then we should have a
"ring_buffer_lite", and perhaps even base it off of a kfifo itself.

Note, it would be nice to have a single location that (a) ring buffer(s)
reside at. This way if a developer has a need for a ring buffer then
they can have a single location to look for one.

Maybe the ring buffer code should go in lib/ringbuffers/ instead of
kernel/ringbuffer/.

-- Steve



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 18:10 ` Andi Kleen
  2010-05-19 18:44   ` Steven Rostedt
@ 2010-05-19 18:47   ` Mathieu Desnoyers
  2010-05-20  6:41     ` Andi Kleen
  1 sibling, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 18:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro

* Andi Kleen (andi@firstfloor.org) wrote:
> On Wed, May 19, 2010 at 01:51:54PM -0400, Steven Rostedt wrote:
> 

Hi Andi,

> Hi Steven,
> 
> > More than a year and a half ago (September 2008), at Linux Plumbers, we
> > had a meeting with several kernel developers to come up with a unified
> > ring buffer. A generic ring buffer in the kernel that any subsystem
> > could use. After coming up with a set of requirements, I worked on
> 
> If we take a step back.
> 
> Why do you want a single ring buffer for everyone?
> 
> A lot more low profile subsystems subsystems simply use kfifo
> (which is also actively developed by Stefanie). In fact there
> are far more users of it than of your ring buffer. And it's 
> really quite simple and easy to use. And it works fine for them.
> 
> I don't think it's that great a goal to have a single ring buffer
> for all possible ring buffer needs. After all the requirements
> are quite different.

One requirement seems to be shared for most tracing heavy users: it has to be
blazingly fast. This is a requirement that is commonly overlooked by the "very
simple" implementations.

> 
> Some want a simple ring buffer with minimal overhead
> and simple interface, others need a mmaped one or have other special 
> requirements. One size doesn't fit all. 

The plan here is to create a ring buffer that supports per-buffer instance
"flags" that specify what must be supported: e.g. either splice() or mmap(),
global vs per-cpu buffers, etc.

I already have much of the code that can easily support all these use-cases with
only minor modifications. So I don't see why we should duplicate the code.

> 
> It's also not that we're talking about gigantic amounts of code
> in all cases where there is a pressing need to unify.

For efficient flavors of ring buffer, it's not that much code, that's true, but
the complexity level/LOC is rather higher than for standard code. So it's good
if we can focus our review efforts on a single ring buffer.

> If perf's current ring buffer works for it why not keep using it? 

I'll let the perf developers answer this one.

> One problem I always had with your version was that it's quite 
> bloated frankly, especially in terms of code size, but 
> also in its data structures and in the interface complexity.

I share your concern about the level of complexity of Steven's lockless ring
buffer implementation. Even though Lai and I are listed as "reviewed-by" in the
ring buffer code header, I remember that neither him nor myself were
particularly comfortable with the way complexity has been handled.

The new implementation I propose lessens the complexity level, presents clear
abstractions to deal with that complexity, and comes with a formal proof of
correctness, all of which I think is really very important to give a good level
of insurance that the ring buffer works as expected.

> 
> For debugging kernels etc. with tracing that's not that big an issue, but
> I think it's a problem for "non debugging" use. After all Linux
> still has the goal to be at least configurable as a low footprint operating
> system.

My implementation, at the moment, has 50% less lines of code and is 25% smaller
in object size than the current ring buffer.

But all in all, I think users needing _something_ to perform system-wide tracing
shout a lot louder than users who need to save a few bytes. So let's try to get
something good in first, while keeping an eye on the object size, and if it
happens to be too large for some users, then they can always implement a
slower and less efficient ring_buffer_tiny.c if they feel like it.

The main concern for the ring buffer code is more about the number of cache
lines it grabs from the system when it is active, and about being blazingly
fast. We must keep the number of active cache lines very low, sometimes at the
expense of overall code size (e.g., having separate fast/slow paths).

> 
> Part of the reason for its big code size seems to be that 
> it tries to support everyone's requirements, which unsurprisingly
> leads to some bloat both in implementation and interface.

At the very least Steven's code did not meet the perf requirements nor the LTTng
requirements. As far as the "next gen" version is concerned, I plan to allow
myself to sometimes say "no" (with proper counter-arguments of course) when
people come up with ideas that will either needlessly complexify the ring buffer
code or won't take into account the "special" very heavy usage nature of the
ring buffer.

I really believe in code review and comments. I just don't feel I must always
incorporate all the modifications that are requested without proper
justification. It's just the approach I have, and it seems to have led to good
results so far if we take Tracepoints as an example.

> 
> Also to be honest it's so clever now that at least I have
> a hard time understanding it, and I personally prefer code
> that I can understand over too clever code. After all if there
> is a bug in there and you need to be more clever than the programmer
> to debug it, how would that be done? 

I totally agree with you. This is in good part why I spent a large part of 2009
writing papers explaining my ring buffer, doing Promela models and formal proofs
of correctness. I think after all that work, the abstractions I will use will be
much easier to grap by anyone willing to do a bit of reading.

> 
> Perhaps a better goal would be to have a smaller simpler more
> maintainable buffer for ftrace and let the other users their own?

The advantage of having both ftrace and perf as ring buffer users is that they
can both give me feedback in the development, and this increases chances that
the API will be generic, clean and well documented.

Thanks,

Mathieu

> 
> Just my 0.05 cent.
> 
> -Andi
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 18:47   ` Mathieu Desnoyers
@ 2010-05-20  6:41     ` Andi Kleen
  2010-05-20 13:48       ` Mathieu Desnoyers
  0 siblings, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2010-05-20  6:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro

> The plan here is to create a ring buffer that supports per-buffer instance
> "flags" that specify what must be supported: e.g. either splice() or mmap(),
> global vs per-cpu buffers, etc.

And you plan to test all those flags in the hot path? 

> The new implementation I propose lessens the complexity level, presents clear
> abstractions to deal with that complexity, and comes with a formal proof of
> correctness, all of which I think is really very important to give a good level
> of insurance that the ring buffer works as expected.

Any simplifcation for the ftrace buffer would be a good thing IMHO.

> > For debugging kernels etc. with tracing that's not that big an issue, but
> > I think it's a problem for "non debugging" use. After all Linux
> > still has the goal to be at least configurable as a low footprint operating
> > system.
> 
> My implementation, at the moment, has 50% less lines of code and is 25% smaller
> in object size than the current ring buffer.

Good.

> 
> But all in all, I think users needing _something_ to perform system-wide tracing
> shout a lot louder than users who need to save a few bytes. So let's try to get
> something good in first, while keeping an eye on the object size, and if it
> happens to be too large for some users, then they can always implement a
> slower and less efficient ring_buffer_tiny.c if they feel like it.

They don't need to, they already have kfifo.

> I totally agree with you. This is in good part why I spent a large part of 2009
> writing papers explaining my ring buffer, doing Promela models and formal proofs
> of correctness. I think after all that work, the abstractions I will use will be
> much easier to grap by anyone willing to do a bit of reading.

Writing papers is not a replacement for simple maintainable code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-20  6:41     ` Andi Kleen
@ 2010-05-20 13:48       ` Mathieu Desnoyers
  0 siblings, 0 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2010-05-20 13:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro

* Andi Kleen (andi@firstfloor.org) wrote:
> > The plan here is to create a ring buffer that supports per-buffer instance
> > "flags" that specify what must be supported: e.g. either splice() or mmap(),
> > global vs per-cpu buffers, etc.
> 
> And you plan to test all those flags in the hot path? 

No, I plan to make the hot path inline in the caller, so the flags can
statically select the proper behavior at compile-time. We can test flags
dynamically on the slow-path to save space.

[...]

> > 
> > But all in all, I think users needing _something_ to perform system-wide tracing
> > shout a lot louder than users who need to save a few bytes. So let's try to get
> > something good in first, while keeping an eye on the object size, and if it
> > happens to be too large for some users, then they can always implement a
> > slower and less efficient ring_buffer_tiny.c if they feel like it.
> 
> They don't need to, they already have kfifo.
> 

Right.

> > I totally agree with you. This is in good part why I spent a large part of 2009
> > writing papers explaining my ring buffer, doing Promela models and formal proofs
> > of correctness. I think after all that work, the abstractions I will use will be
> > much easier to grap by anyone willing to do a bit of reading.
> 
> Writing papers is not a replacement for simple maintainable code.

True. I rather mean "in addition to simple maintainable code".

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Unified Ring Buffer (Next Generation)
  2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt
  2010-05-19 18:10 ` Andi Kleen
@ 2010-05-19 19:05 ` Mathieu Desnoyers
  2010-05-20  9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar
  2 siblings, 0 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 19:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen

* Steven Rostedt (rostedt@goodmis.org) wrote:
[...]
> Unfortunately, we are now back with more than one ring buffer in the
> kernel. What's worse, neither of them can perform all the features
> needed. This is putting a bit of stress on the users of these tools, not
> to mention the stress on the developers as well.
[...]
> This email is about finding a solution to the problem. If we can once
> again create a generic ring buffer that handles all requirements, then
> we can also merge the functionality of ftrace into perf, and lower the
> duplication of code within the kernel.
> 
> 
> This time around, I'm asking Mathieu Desnoyers to come to the plate, and
> see if he can handle the task.
> 
> I'm hoping that this email will start a thread that gets everyone into
> agreement and produces something that will make everyone happy.

I also hope this will get us to a state of concensus.

For the records: I'm willing to handle that task. I plan to use the

  git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-ringbuffer.git

git tree to publish my work (and to submit occasional rounds of patches to LKML
too)

The current state of that tree is that it includes only the bare-bone ring
buffer core, without the filesystem operations, only for per-cpu buffers, and
still needs a bit of cleanup. It's currently in kernel/trace/, but, as we
discussed, it's probably better if I move it to kernel/ringbuffer/, because it
does not have to be tied specifically to tracers. I'm doing it with tracing in
mind, but it could be reused as generic ring buffer in the future.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFD] Future tracing/instrumentation directions
  2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt
  2010-05-19 18:10 ` Andi Kleen
  2010-05-19 19:05 ` Mathieu Desnoyers
@ 2010-05-20  9:31 ` Ingo Molnar
  2010-05-20 10:07   ` Frederic Weisbecker
                     ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: Ingo Molnar @ 2010-05-20  9:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

* Steven Rostedt <rostedt@goodmis.org> wrote:

> More than a year and a half ago (September 2008), at 
> Linux Plumbers, we had a meeting with several kernel 
> developers to come up with a unified ring buffer. A 
> generic ring buffer in the kernel that any subsystem 
> could use. After coming up with a set of requirements, I 
> worked on implementing it. One of the requirements was 
> to start off simple and work to become a more complete 
> buffering system.
>
> [...]

The thing is, in tracing land and more broadly in 
instrumentation land we have _much_ more earthly problems 
these days:

 - Lets face it, performance of the ring-buffer sucks, in 
   a big way. I've recently benchmarked it and it takes 
   hundreds of instructions to trace a single event. 
   Puh-lease ...

 - It has grown a lot of slack. It's complex and hard to
   read.

 - Over the last year or so the majority of bleeding-edge
   tracing developers have gradually migrated over to perf 
   and 'perf trace' / 'perf probe' in particular. As far 
   as the past two merge windows go they are 
   out-developing the old ftrace UIs by a ratio of 4:1.

   So this angle is becoming a prime thing to improve and
   users and developers are hurting from the ftrace/perf
   duality.

 - [ While it's still a long way off, if this trend continues
     we eventually might even be able to get rid of the 
     /debug/tracing/ temporary debug API and get rid of 
     the ugly in-kernel pretty-printing bits. This is 
     good: it may make Andrew very happy for a change ;-)

     The main detail here to be careful of is that lots of
     people are fond of the simplicity of the 
     /debug/tracing/ debug UI, so when we replace it we 
     want to do it by keeping that simple workflow (or 
     best by making it even simpler). I have a few ideas 
     how to do this.

     There's also the detail that in some cases we want to
     print events in the kernel in a human readable way: 
     for example EDAC/MCE and other critical events,
     trace-on-oops, etc. This too can be solved. ]

Regarding performance and complexity, which is our main 
worry atm, fortunately there's work going on in that 
direction - please see PeterZ's recent string of patches 
on lkml:

  4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
  a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
  ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
  fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t
  6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg()

And it may sound harsh but at this stage i'm personally 
not at all interested in big design talk. This isnt rocket 
science, we have developers and users and we know what 
they are doing and we know what we need to do: we need to 
improve our crap and we need to reduce complexity. Less is 
more.

So i'd like to see iterative, useful action first, and i 
am somewhat sceptical about yet another grand tracing 
design trying to match 100 requirements.

Steve, Mathieu, if you are interested please help out 
Peter with the performance and simplification work. The 
last thing we need is yet another replace-everything 
event.

If we really want to create a new ring-buffer abstraction 
i'd suggest we start with Peter's, it has a quite sane 
design and stayed simple and flexible - if then it could 
be factored out a bit.

Here are more bits of what i see as the 'action' going 
forward, in no particular order:

1) Push the /debug/tracing/events/ event description
   into sysfs, as per this thread on lkml:

     [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs

     http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e

   I.e. one more step towards integrating ftrace into perf.

2) Use 1) to unify the perf events and the ftrace
   ring-buffer. This, as things are standing is
   best done by factoring out Peter's ring-buffer
   in kernel/perf_event.c. It's properly abstracted
   and it _far_ simpler than kernel/tracing/ring_buffer.c,
   which has become a monstrosity.

   (but i'm open to other simplifications as well)

3) Add the function-tracer and function-graph tracer
   as an event and integrate it into perf.

   This will live-test the efficiency of the unification
   and brings over the last big ftrace plugin to perf.

4) Gradually convert/port/migrate all the remaining 
   plugins over as well. We need to do this very gently 
   because there are users - but stop piling new 
   functionality on to the old ftrace side. This usually 
   involves:

    - Conversion of an explicit tracing callback to
      TRACE_EVENT (for example in the case of mmiotrace),
      while keeping all tool functionality.

    - Migrate any 'special' ftrace feature to perf 
      capabilities so that it's available via the 
      syscall interface as well. (for example 
      'latency maximum tracking' is something that we 
      probably want to do with kernel-side help - we 
      probably dont want to implement it via tracing 
      everything all the time and finding the maximum 
      based on terabytes of data.)

   (And there are other complications here too, but you 
    get the idea.)

All in one, i think we can reuse more than 50% of all 
current ftrace code (possibly up to 70-80%) - and we are 
already reusing bits of it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20  9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar
@ 2010-05-20 10:07   ` Frederic Weisbecker
  2010-05-20 11:07     ` Thomas Gleixner
  2010-05-20 11:36     ` Ingo Molnar
  2010-05-20 14:38   ` Steven Rostedt
  2010-05-21  9:24   ` Li Zefan
  2 siblings, 2 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2010-05-20 10:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote:
>  - [ While it's still a long way off, if this trend continues
>      we eventually might even be able to get rid of the 
>      /debug/tracing/ temporary debug API and get rid of 
>      the ugly in-kernel pretty-printing bits. This is 
>      good: it may make Andrew very happy for a change ;-)
> 
>      The main detail here to be careful of is that lots of
>      people are fond of the simplicity of the 
>      /debug/tracing/ debug UI, so when we replace it we 
>      want to do it by keeping that simple workflow (or 
>      best by making it even simpler). I have a few ideas 
>      how to do this.



How? We can emulate the /debug/tracing result with something
like perf trace, still that won't replace the immediate
availability of the result of any trace, which makes it
valuable for any simplest workflows.




> Regarding performance and complexity, which is our main 
> worry atm, fortunately there's work going on in that 
> direction - please see PeterZ's recent string of patches 
> on lkml:
> 
>   4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
>   a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
>   ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
>   fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t



I would like to highlight the following commit too that _totally_
changes the requirements of our next common ring buffer, whatever it is: 

    c792061: perf: Disallow mmap() on per-task inherited events

Now we only need to care about local contention, we have removed the
support for buffers that contend across cpus in a single process.

Do I understand it right?



> 3) Add the function-tracer and function-graph tracer
>    as an event and integrate it into perf.
> 
>    This will live-test the efficiency of the unification
>    and brings over the last big ftrace plugin to perf.


I may start to take care of this soon.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 10:07   ` Frederic Weisbecker
@ 2010-05-20 11:07     ` Thomas Gleixner
  2010-05-20 11:42       ` Ingo Molnar
                         ` (2 more replies)
  2010-05-20 11:36     ` Ingo Molnar
  1 sibling, 3 replies; 24+ messages in thread
From: Thomas Gleixner @ 2010-05-20 11:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov,
	Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter

On Thu, 20 May 2010, Frederic Weisbecker wrote:
> On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote:
> >  - [ While it's still a long way off, if this trend continues
> >      we eventually might even be able to get rid of the 
> >      /debug/tracing/ temporary debug API and get rid of 
> >      the ugly in-kernel pretty-printing bits. This is 
> >      good: it may make Andrew very happy for a change ;-)
> > 
> >      The main detail here to be careful of is that lots of
> >      people are fond of the simplicity of the 
> >      /debug/tracing/ debug UI, so when we replace it we 
> >      want to do it by keeping that simple workflow (or 
> >      best by making it even simpler). I have a few ideas 
> >      how to do this.
> 
> How? We can emulate the /debug/tracing result with something
> like perf trace, still that won't replace the immediate
> availability of the result of any trace, which makes it
> valuable for any simplest workflows.

I'm a bit torn about this. I really like the availability of the ascii
interface, but if we can come up with a very basic trace binary tool
which can be built for deep embedded w/o requiring the world and some
more libs to be available, then I might give up my resistance. Ideally
it should be done so it can be easily integrated into busybox.

I don't care whether I do

echo 1 >/debug/..../XXX/enable
cat /debug/tracing/trace

or

perfmini trace enable XXX
perfmini trace dump

as long as the tool is built in a way that it does not need updates
when we add trace points or other functionality to the kernel.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 11:07     ` Thomas Gleixner
@ 2010-05-20 11:42       ` Ingo Molnar
  2010-05-20 11:48         ` Ingo Molnar
  2010-05-20 12:15       ` Frederic Weisbecker
  2010-05-20 12:19       ` Theodore Tso
  2 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2010-05-20 11:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Steven Rostedt, LKML, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, 20 May 2010, Frederic Weisbecker wrote:
> > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote:
> > >  - [ While it's still a long way off, if this trend continues
> > >      we eventually might even be able to get rid of the 
> > >      /debug/tracing/ temporary debug API and get rid of 
> > >      the ugly in-kernel pretty-printing bits. This is 
> > >      good: it may make Andrew very happy for a change ;-)
> > > 
> > >      The main detail here to be careful of is that lots of
> > >      people are fond of the simplicity of the 
> > >      /debug/tracing/ debug UI, so when we replace it we 
> > >      want to do it by keeping that simple workflow (or 
> > >      best by making it even simpler). I have a few ideas 
> > >      how to do this.
> > 
> > How? We can emulate the /debug/tracing result with something
> > like perf trace, still that won't replace the immediate
> > availability of the result of any trace, which makes it
> > valuable for any simplest workflows.
> 
> I'm a bit torn about this. I really like the availability of the ascii
> interface, but if we can come up with a very basic trace binary tool
> which can be built for deep embedded w/o requiring the world and some
> more libs to be available, then I might give up my resistance. Ideally
> it should be done so it can be easily integrated into busybox.
> 
> I don't care whether I do
> 
> echo 1 >/debug/..../XXX/enable
> cat /debug/tracing/trace
> 
> or
> 
> perfmini trace enable XXX
> perfmini trace dump

My suggestion for that flow is even shorter:

  trace --enable XXX
  trace

Plus:

  trace --list
  trace --enable
  trace --disable

> as long as the tool is built in a way that it does not 
> need updates when we add trace points or other 
> functionality to the kernel.

Yeah, most definitely. The sysfs event_source class will 
ensure that whatever (new) events are available propagate 
through the tool and are available to it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 11:42       ` Ingo Molnar
@ 2010-05-20 11:48         ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2010-05-20 11:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Steven Rostedt, LKML, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter


* Ingo Molnar <mingo@elte.hu> wrote:

> > as long as the tool is built in a way that it does not 
> > need updates when we add trace points or other 
> > functionality to the kernel.
> 
> Yeah, most definitely. The sysfs event_source class will 
> ensure that whatever (new) events are available 
> propagate through the tool and are available to it.

Note that this approach isnt just for tools/perf/, it will 
be useful for other perf events based tools as well: 
SysProf and for the libpfm based apps.

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 11:07     ` Thomas Gleixner
  2010-05-20 11:42       ` Ingo Molnar
@ 2010-05-20 12:15       ` Frederic Weisbecker
  2010-05-20 12:19       ` Theodore Tso
  2 siblings, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2010-05-20 12:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov,
	Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter

On Thu, May 20, 2010 at 01:07:57PM +0200, Thomas Gleixner wrote:
> On Thu, 20 May 2010, Frederic Weisbecker wrote:
> > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote:
> > >  - [ While it's still a long way off, if this trend continues
> > >      we eventually might even be able to get rid of the 
> > >      /debug/tracing/ temporary debug API and get rid of 
> > >      the ugly in-kernel pretty-printing bits. This is 
> > >      good: it may make Andrew very happy for a change ;-)
> > > 
> > >      The main detail here to be careful of is that lots of
> > >      people are fond of the simplicity of the 
> > >      /debug/tracing/ debug UI, so when we replace it we 
> > >      want to do it by keeping that simple workflow (or 
> > >      best by making it even simpler). I have a few ideas 
> > >      how to do this.
> > 
> > How? We can emulate the /debug/tracing result with something
> > like perf trace, still that won't replace the immediate
> > availability of the result of any trace, which makes it
> > valuable for any simplest workflows.
> 
> I'm a bit torn about this. I really like the availability of the ascii
> interface, but if we can come up with a very basic trace binary tool
> which can be built for deep embedded w/o requiring the world and some
> more libs to be available, then I might give up my resistance. Ideally
> it should be done so it can be easily integrated into busybox.
> 
> I don't care whether I do
> 
> echo 1 >/debug/..../XXX/enable
> cat /debug/tracing/trace
> 
> or
> 
> perfmini trace enable XXX
> perfmini trace dump
> 
> as long as the tool is built in a way that it does not need updates
> when we add trace points or other functionality to the kernel.
> 
> Thanks,
> 
> 	tglx


You have a (much) deeper experience than me about the embedded world.
And this was the area I worried most about, especially because dropping
the preformatted output means we add a random dependency.

If you think people can be fine a minimal tool that does this, then ok,
as far as the given tool is really easy to get and to build.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 11:07     ` Thomas Gleixner
  2010-05-20 11:42       ` Ingo Molnar
  2010-05-20 12:15       ` Frederic Weisbecker
@ 2010-05-20 12:19       ` Theodore Tso
  2 siblings, 0 replies; 24+ messages in thread
From: Theodore Tso @ 2010-05-20 12:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Ingo Molnar, Steven Rostedt, LKML,
	Linus Torvalds, Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter


On May 20, 2010, at 7:07 AM, Thomas Gleixner wrote:
> 
> I don't care whether I do
> 
> echo 1 >/debug/..../XXX/enable
> cat /debug/tracing/trace
> 
> or
> 
> perfmini trace enable XXX
> perfmini trace dump
> 
> as long as the tool is built in a way that it does not need updates
> when we add trace points or other functionality to the kernel.

I won't care that much assuming that there is a bash_completion script so I can type "perf mini trace enable ext4_da_writep<TAB>" and the right thing happens....

The point is there are a number of advantages of the current system that go beyond just guaranteed availability of the binary->ascii translation tool.  Command competion, shell globbing to find interesting events you want to enable, etc., are also part of the picture

-- Ted


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 10:07   ` Frederic Weisbecker
  2010-05-20 11:07     ` Thomas Gleixner
@ 2010-05-20 11:36     ` Ingo Molnar
  2010-05-20 13:06       ` Frederic Weisbecker
  1 sibling, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2010-05-20 11:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote:
> >  - [ While it's still a long way off, if this trend continues
> >      we eventually might even be able to get rid of the 
> >      /debug/tracing/ temporary debug API and get rid of 
> >      the ugly in-kernel pretty-printing bits. This is 
> >      good: it may make Andrew very happy for a change ;-)
> > 
> >      The main detail here to be careful of is that lots of
> >      people are fond of the simplicity of the 
> >      /debug/tracing/ debug UI, so when we replace it we 
> >      want to do it by keeping that simple workflow (or 
> >      best by making it even simpler). I have a few ideas 
> >      how to do this.
> 
> How? We can emulate the /debug/tracing result with 
> something like perf trace, still that won't replace the 
> immediate availability of the result of any trace, which 
> makes it valuable for any simplest workflows.

Firstly, one thing is sure: until there's no full 
replacement we obviously dont want to phase out 
/sys/kernel/debug/tracing. This was more of a 'our future' 
email (as i see it), the process that will lead to solve 
some of our more strategic problems in tracing land.

Regarding the issues you raised, there are several 
solutions that dont need /sys/kernel/debug/tracing but 
still support the very useful and usable 'immediate 
tracing' workflow that ftrace prototyped. We can have a 
combination of several things:

 - Have a simple 'ftrace' command aliased to perf trace.

   Means less typing, and it also allows a much more 
   finegrained tracing workflow: per user and per task/job 
   workflows, instead of the global/exclusive tracing mode 
   that /sys/kernel/debug/tracing. There would be ready 
   equivalents:

     ftrace --available-tracers
     ftrace --current-tracer
     ftrace --start
     ftrace --stop

     ... etc ...

 - Immediate availability of a trace: persistent events
   combined with roundrobin ('flight-recorder') recording
   would solve this.

   If events are active then if type 'ftrace' you get the 
   current trace. Simple to scrip and simple to use - no 
   need to have access to /sys/kernel/debug/tracing, also 
   can evidently be turned into a per user facility, 
   supports multiple plugins active at once, etc.

 - For lightweight embedded tracing there are two separate
   solutions that would work:

    - we already have perf 'minimal builds' (when certain 
      libraries are not available), we could push that 
      concept some more to create a 'lightweight' command
      that embedded systems can run just fine.

    - extend our proxy recording and proxy execution/analysis
      concepts. We can already run a perf trace recording 
      through a pipe and netcat, and we have perf archive
      and cross-arch analysis code.

      If there's interest, then this could be made more 
      convenient and the functionality around this could 
      be collected into a handy proxy tool:

        ftrace --proxy smallbox --start
        ftrace --proxy smallbox --stop
        ftrace --proxy smallbox           # prints trace

        ... etc ...

      Thus the recording is done on the small box, while 
      all the analysis (and even all the commands) is 
      typed/executed on the bigger box.

So there are lots of possibilities - and there are other 
options as well.

Does this address your worries?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 11:36     ` Ingo Molnar
@ 2010-05-20 13:06       ` Frederic Weisbecker
  0 siblings, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2010-05-20 13:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

On Thu, May 20, 2010 at 01:36:15PM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote:
> > >  - [ While it's still a long way off, if this trend continues
> > >      we eventually might even be able to get rid of the 
> > >      /debug/tracing/ temporary debug API and get rid of 
> > >      the ugly in-kernel pretty-printing bits. This is 
> > >      good: it may make Andrew very happy for a change ;-)
> > > 
> > >      The main detail here to be careful of is that lots of
> > >      people are fond of the simplicity of the 
> > >      /debug/tracing/ debug UI, so when we replace it we 
> > >      want to do it by keeping that simple workflow (or 
> > >      best by making it even simpler). I have a few ideas 
> > >      how to do this.
> > 
> > How? We can emulate the /debug/tracing result with 
> > something like perf trace, still that won't replace the 
> > immediate availability of the result of any trace, which 
> > makes it valuable for any simplest workflows.
> 
> Firstly, one thing is sure: until there's no full 
> replacement we obviously dont want to phase out 
> /sys/kernel/debug/tracing. This was more of a 'our future' 
> email (as i see it), the process that will lead to solve 
> some of our more strategic problems in tracing land.



Yeah.



 
> Regarding the issues you raised, there are several 
> solutions that dont need /sys/kernel/debug/tracing but 
> still support the very useful and usable 'immediate 
> tracing' workflow that ftrace prototyped. We can have a 
> combination of several things:
> 
>  - Have a simple 'ftrace' command aliased to perf trace.
> 
>    Means less typing, and it also allows a much more 
>    finegrained tracing workflow: per user and per task/job 
>    workflows, instead of the global/exclusive tracing mode 
>    that /sys/kernel/debug/tracing. There would be ready 
>    equivalents:
> 
>      ftrace --available-tracers
>      ftrace --current-tracer
>      ftrace --start
>      ftrace --stop
> 
>      ... etc ...



We should probably more talk about events there, but yeah
I get your point.




> 
>  - Immediate availability of a trace: persistent events
>    combined with roundrobin ('flight-recorder') recording
>    would solve this.
> 
>    If events are active then if type 'ftrace' you get the 
>    current trace. Simple to scrip and simple to use - no 
>    need to have access to /sys/kernel/debug/tracing, also 
>    can evidently be turned into a per user facility, 
>    supports multiple plugins active at once, etc.



Yep.




> 
>  - For lightweight embedded tracing there are two separate
>    solutions that would work:
> 
>     - we already have perf 'minimal builds' (when certain 
>       libraries are not available), we could push that 
>       concept some more to create a 'lightweight' command
>       that embedded systems can run just fine.



Yeah that need to be one, very simple, dedicated to tracing.
With the very minimal requirements about dependencies.




>     - extend our proxy recording and proxy execution/analysis
>       concepts. We can already run a perf trace recording 
>       through a pipe and netcat, and we have perf archive
>       and cross-arch analysis code.
> 
>       If there's interest, then this could be made more 
>       convenient and the functionality around this could 
>       be collected into a handy proxy tool:
> 
>         ftrace --proxy smallbox --start
>         ftrace --proxy smallbox --stop
>         ftrace --proxy smallbox           # prints trace
> 
>         ... etc ...
> 
>       Thus the recording is done on the small box, while 
>       all the analysis (and even all the commands) is 
>       typed/executed on the bigger box.



You mean a client/server control? Controlling the recorder
with remote commands?


> So there are lots of possibilities - and there are other 
> options as well.
> 
> Does this address your worries?


My worries were more about bringing dependencies to read or launch
traces than the possibility to produce a very easy to use tool, even
easier and more powerful than the debugfs interface.

To sum up, my worries were (and sort of still are a bit): how painful
would that be to set up such a tool on a hard embedded box, compared
to the directly available stream we have.

But Thomas opinion makes me reconsider somehow the situation.

Once we have such tool and it becomes truly easy and intuitive,
people may migrate to it. At this point, it's possible the situation
have evolved enough that it will become obvious to deprecate the
debugfs interface.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20  9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar
  2010-05-20 10:07   ` Frederic Weisbecker
@ 2010-05-20 14:38   ` Steven Rostedt
  2010-05-20 15:04     ` Steven Rostedt
                       ` (2 more replies)
  2010-05-21  9:24   ` Li Zefan
  2 siblings, 3 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-20 14:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > More than a year and a half ago (September 2008), at 
> > Linux Plumbers, we had a meeting with several kernel 
> > developers to come up with a unified ring buffer. A 
> > generic ring buffer in the kernel that any subsystem 
> > could use. After coming up with a set of requirements, I 
> > worked on implementing it. One of the requirements was 
> > to start off simple and work to become a more complete 
> > buffering system.
> >
> > [...]
> 
> The thing is, in tracing land and more broadly in 
> instrumentation land we have _much_ more earthly problems 
> these days:
> 
>  - Lets face it, performance of the ring-buffer sucks, in 
>    a big way. I've recently benchmarked it and it takes 
>    hundreds of instructions to trace a single event. 
>    Puh-lease ...

"Puh-lease"?

Hmm, what were the timings?  I just ran the ring_buffer_benchmark
module. It's in mainline, so anyone could do the same. Remember to
disable KERNEL_DEBUG because things like lockdep can have a significant
impact on it.

Note, instructions themselves are not that bad, but what those
instructions do count a lot more. I avoid all lock functions and memory
barriers which cause the code to be a little bigger. But a single
cmpxchg can take much more time than a lot of other instructions put
together.

On this machine:

model name	: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz

It's a 2x2 (4 cores)

With running:

# modprobe ring_buffer_benchmark producer_fifo=10
# sleep 30
# cat /debug/tracing/trace

           <...>-2636  [003]   761.969850: ring_buffer_producer_thread: Starting ring buffer hammer
           <...>-2636  [003]   771.910121: ring_buffer_producer_thread: End ring buffer hammer
           <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19
           <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
           <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Time:     9965321 (usecs)
           <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Overruns: 0
           <...>-2636  [003]   771.910174: ring_buffer_producer_thread: Read:     76580325  (by pages)
           <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Entries:  75
           <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Total:    76580400
           <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Missed:   0
           <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Hit:      76580400
           <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Entries per millisec: 7684
           <...>-2636  [003]   771.910176: ring_buffer_producer_thread: 130 ns per entry
           <...>-2636  [003]   771.910177: ring_buffer_producer_thread: Sleeping for 10 secs

           <...>-2636  [003]   781.884686: ring_buffer_producer_thread: Starting ring buffer hammer
           <...>-2636  [003]   790.862392: ring_buffer_producer_thread: End ring buffer hammer
           <...>-2636  [003]   790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19
           <...>-2636  [003]   790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
           <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Time:     9000321 (usecs)
           <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Overruns: 26581200
           <...>-2636  [003]   790.897402: ring_buffer_producer_thread: Read:     6767200  (by events)
           <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Entries:  0
           <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Total:    33348400
           <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Missed:   0
           <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Hit:      33348400
           <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Entries per millisec: 3705
           <...>-2636  [003]   790.897405: ring_buffer_producer_thread: 269 ns per entry

While having the reader read by pages (splice mode) a write takes 130ns
per entry. Not that bad.

But when the reader is reading each event, (I'm assuming we get some
cache line bouncing here, it is still lockless), a write takes 269ns.

Note, the reader is an aggressive reader. That is, it goes into a strong
loop and just reads as fast as it can. Which would cause cache line
bouncing.

When I disable the reader and just measure what a write takes:

# rmmod ring_buffer_benchmark
# modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10
# sleep 30
# cat /debug/tracing/trace

           <...>-2649  [003]  1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer
           <...>-2649  [003]  1151.950840: ring_buffer_producer_thread: End ring buffer hammer
           <...>-2649  [003]  1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
           <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Time:     9994238 (usecs)
           <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Overruns: 90635670
           <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Read:     (reader disabled)
           <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Entries:  62480
           <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Total:    90698150
           <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Missed:   0
           <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Hit:      90698150
           <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075
           <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: 110 ns per entry
           <...>-2649  [003]  1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs

It takes just a 110ns.

Lets now use perf to see what is happening:

# perf record -a -c10000 -f
(wait 30 secs)
Ctrl-C

# perf report

no symbols found in /usr/sbin/irqbalance, maybe install a debug package?
# Samples: 1995496
#
# Overhead          Command                                                     Shared Object  Symbol
# ........  ...............  ................................................................  ......
#
    23.80%      rb_producer  [kernel.kallsyms]                                                 [k] native_sched_clock
    14.83%      rb_producer  [kernel.kallsyms]                                                 [k] rb_reserve_next_event
    13.20%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_unlock_commit
    12.84%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_lock_reserve
     9.27%      rb_producer  [kernel.kallsyms]                                                 [k] rb_end_commit
     9.19%      rb_producer  [kernel.kallsyms]                                                 [k] __rb_reserve_next
     5.24%      rb_producer  [kernel.kallsyms]                                                 [k] trace_clock_local
     3.11%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_event_data
     2.88%      rb_producer  [ring_buffer_benchmark]                                           [k] ring_buffer_benchmark_init
     2.53%      rb_producer  [kernel.kallsyms]                                                 [k] trace_recursive_unlock
     0.39%      rb_producer  [kernel.kallsyms]                                                 [k] do_gettimeofday
     0.34%      rb_producer  [kernel.kallsyms]                                                 [k] read_tsc
     0.18%      rb_producer  [kernel.kallsyms]                                                 [k] getnstimeofday
     0.14%          swapper  [kernel.kallsyms]                                                 [k] mwait_idle

Now that the sched_clock is %23.8 percent, it doesn't look as bad.

rb_reserve_next_event is 14.83, ring_buffer_unlock_commit is 13.20,
ring_buffer_lock_reserve is 12.84, rb_end_commit is 9.27,
__rb_reserve_next is 8.18, trace_clock_local is 5.24 (which only
disables preemption and calls sched_clock), ring_buffer_event_data is
3.11 which only returns a pointer on a structure (once a event).

ring_buffer_benchmark_init is the benchmark code itself, and
trace_recursive_unlock is the code you asked me to add to check for
cases where the tracer could have recursed itself.

14.83 + 13.20 + 9.27 + 8.18 + 5.24 + 3.11 = 53.83%

Considering that the system is basically idle except for this benchmark,
that makes sense to have it that high. We need to compare it. The
sched_clock() takes up 1/3 of the entire trace event.

> 
>  - It has grown a lot of slack. It's complex and hard to
>    read.

I admit, that it got over designed, and I'm all for replacing it. We
were all involved in the initial design, and I posted several versions
of RFC when I created it. But I admit, it became complex, and now that
we know what we want, we can take a step back and redesign a new one
from scratch.

But until we do so, I'm continuing to support the current one, and
adding small updates when needed.

> 
>  - Over the last year or so the majority of bleeding-edge
>    tracing developers have gradually migrated over to perf 
>    and 'perf trace' / 'perf probe' in particular. As far 
>    as the past two merge windows go they are 
>    out-developing the old ftrace UIs by a ratio of 4:1.

Of course a new project will be in a high state of flux. I'm quite
satisfied with the interface with ftrace and did not feel that it needed
changing.

> 
>    So this angle is becoming a prime thing to improve and
>    users and developers are hurting from the ftrace/perf
>    duality.

I agree.

> 
>  - [ While it's still a long way off, if this trend continues
>      we eventually might even be able to get rid of the 
>      /debug/tracing/ temporary debug API and get rid of 
>      the ugly in-kernel pretty-printing bits. This is 
>      good: it may make Andrew very happy for a change ;-)

So how do you plan on getting rid of the in-kernel pretty-printing bits
and keeping the trace_dump_on_oops?  There's a strong dependency between
those two.

> 
>      The main detail here to be careful of is that lots of
>      people are fond of the simplicity of the 
>      /debug/tracing/ debug UI, so when we replace it we 
>      want to do it by keeping that simple workflow (or 
>      best by making it even simpler). I have a few ideas 
>      how to do this.

Note, I still would not want to remove the /debug interface. It is a
debug interface anyway. Having both a syscall access the functionality
of the code and having a debug interface access it very trivial. This
could also be a way we can debug the tool too.

We can officially support only the tool side, but why throw away the
baby with the bath water?

> 
>      There's also the detail that in some cases we want to
>      print events in the kernel in a human readable way: 
>      for example EDAC/MCE and other critical events,
>      trace-on-oops, etc. This too can be solved. ]

But if the code for trace-on-oops is there, then the pretty print code
is also there.

> 
> Regarding performance and complexity, which is our main 
> worry atm, fortunately there's work going on in that 
> direction - please see PeterZ's recent string of patches 
> on lkml:
> 
>   4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
>   a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
>   ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
>   fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t
>   6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg()

Yes, I saw these, and it helps a little, but still not everything is
there.

> 
> And it may sound harsh but at this stage i'm personally 
> not at all interested in big design talk.

Rewriting ftrace on top of perf is a big design change.

>  This isnt rocket 
> science, we have developers and users and we know what 
> they are doing and we know what we need to do: we need to 
> improve our crap and we need to reduce complexity. Less is 
> more.
> 
> So i'd like to see iterative, useful action first, and i 
> am somewhat sceptical about yet another grand tracing 
> design trying to match 100 requirements.

Here's a few things that I find required:

The ability to use splice. You keep telling me this is not important,
but splice is magnitudes faster than mmaping into userspace and then
copying into a file. If the splice removes the extra copy as we are
working on, it will get even faster. It also makes writing over the
network much easier. Which trace-cmd does, and it took me half a day to
make it do so.

We also need a way to easily read the buffer from the kernel. This is
for trace-on-oops.

Overwrite mode must also be supported. I know that Peter thinks its
silly to do this and have it write to a file, but I find it helpful to
do so. But fine, I can live without overwrite to a file, if no one else
cares. But overwrite mode itself must work for trace-on-oops.

We need a way to start and stop tracing from the kernel and userspace.
Again, for trace-on-oops. When a bug is hit, we need a way to stop it.
We also must have a way to easily start it again from userspace.

A simple on/off switch. This is something that you and Thomas hammered
into me which was the creation of tracing_on.

The above can be done, but lets not just hack the changes in. Lets
really look at a proper design.

> 
> Steve, Mathieu, if you are interested please help out 
> Peter with the performance and simplification work. The 
> last thing we need is yet another replace-everything 
> event.

But that is what you are asking to happen to ftrace.
"replace-everything".

> 
> If we really want to create a new ring-buffer abstraction 
> i'd suggest we start with Peter's, it has a quite sane 
> design and stayed simple and flexible - if then it could 
> be factored out a bit.

I'm sorry, but it is very coupled with perf. It is simple and flexible
with what perf does today, but not what ftrace does today, or to make
perf do what ftrace does.

> 
> Here are more bits of what i see as the 'action' going 
> forward, in no particular order:
> 
> 1) Push the /debug/tracing/events/ event description
>    into sysfs, as per this thread on lkml:
> 
>      [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs
> 
>      http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e
> 
>    I.e. one more step towards integrating ftrace into perf.

We could do something like this, but it is not as trivial as it may
seem.

> 
> 2) Use 1) to unify the perf events and the ftrace
>    ring-buffer. This, as things are standing is
>    best done by factoring out Peter's ring-buffer
>    in kernel/perf_event.c. It's properly abstracted
>    and it _far_ simpler than kernel/tracing/ring_buffer.c,
>    which has become a monstrosity.

Sure, my code has become quite complex, I'll admit that. I'm for
refactoring. I really don't care what code is used as long as I still
have the requirements that make ftrace useful for me and several others.

And yes, any ring buffer needs to be factored out to a defined
interface. The current ring buffer in perf is very coupled with the perf
design. The event code written to perf in include/trace/ftrace.h is
still quite a hack.

> 
>    (but i'm open to other simplifications as well)
> 
> 3) Add the function-tracer and function-graph tracer
>    as an event and integrate it into perf.

Here's something that I think perf needs to do. It needs to become a bit
more modulized, and not absorb everything into perf instead of having
things use just a part of perf. Like the breakpoint code requires full
perf to be done, and x86 has PERF_EVENTS permanently on. Which is really
a PITA.

As I designed ftrace from day one, I tried to make it modular. Thus, the
function trace (kernel/trace/ftrace.c) has always been a separate entity
from ftrace itself. Heck, LTTng uses it. It would be actually trivial to
implement the function tracer in perf today. Just register a function
you want called at every function event and start running.

I need to abstract out the "what function to call bit" but that's about
it.

> 
>    This will live-test the efficiency of the unification
>    and brings over the last big ftrace plugin to perf.
> 
> 4) Gradually convert/port/migrate all the remaining 
>    plugins over as well. We need to do this very gently 
>    because there are users - but stop piling new 
>    functionality on to the old ftrace side. This usually 
>    involves:

I have not been piling functionality into the ftrace side, as you stated
at the beginning of the email, that the development has been quite slow.
Yes, I've been still working on things, but I would say it is a far cry
from "piling".

I did add some simple features, like the "lost events" to the ring
buffer, but that code can be looked at for ideas for the future. It does
not make merging any more difficult than it already is.

Fine, I'll stop some of the code I was going to work on (multiple
buffers for ftrace, event triggers, configuring events to stop the
tracer (like for different kinds of BUG() and WARN_ON()), etc).

> 
>     - Conversion of an explicit tracing callback to
>       TRACE_EVENT (for example in the case of mmiotrace),
>       while keeping all tool functionality.

Not sure what this is.

> 
>     - Migrate any 'special' ftrace feature to perf 
>       capabilities so that it's available via the 
>       syscall interface as well. (for example 
>       'latency maximum tracking' is something that we 
>       probably want to do with kernel-side help - we 
>       probably dont want to implement it via tracing 
>       everything all the time and finding the maximum 
>       based on terabytes of data.)
> 
>    (And there are other complications here too, but you 
>     get the idea.)

Yes, the latency tracers are special, and I treat them special in
ftrace. They can be redesigned to keep their current functionality. That
needed to be done anyway.

> 
> All in one, i think we can reuse more than 50% of all 
> current ftrace code (possibly up to 70-80%) - and we are 
> already reusing bits of it.

Ingo, as we have had this conversation, I don't know if you realized
that I agreed to merge ftrace to perf. There's just some functionality
about ftrace I really want to keep, which does include accessing
via /debugfs, pretty-print (which is needed by trace-on-oops) and
splice.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 14:38   ` Steven Rostedt
@ 2010-05-20 15:04     ` Steven Rostedt
  2010-05-20 20:12     ` Steven Rostedt
  2010-05-21 17:49     ` Ingo Molnar
  2 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-20 15:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

On Thu, 2010-05-20 at 10:38 -0400, Steven Rostedt wrote:
> n this machine:
> 
> model name	: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz
> 
> It's a 2x2 (4 cores)
> 

Oh, and this was done with 2.6.34 freshly checked out.

-- Steve

> With running:
> 
> # modprobe ring_buffer_benchmark producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
> 




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 14:38   ` Steven Rostedt
  2010-05-20 15:04     ` Steven Rostedt
@ 2010-05-20 20:12     ` Steven Rostedt
  2010-05-21 17:49     ` Ingo Molnar
  2 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-20 20:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

On Thu, 2010-05-20 at 10:38 -0400, Steven Rostedt wrote:
> On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote:

> # modprobe ring_buffer_benchmark producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
> 
> 
>            <...>-2636  [003]   761.969850: ring_buffer_producer_thread: Starting ring buffer hammer
>            <...>-2636  [003]   771.910121: ring_buffer_producer_thread: End ring buffer hammer
>            <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19
>            <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
>            <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Time:     9965321 (usecs)
>            <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Overruns: 0
>            <...>-2636  [003]   771.910174: ring_buffer_producer_thread: Read:     76580325  (by pages)
>            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Entries:  75
>            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Total:    76580400
>            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Missed:   0
>            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Hit:      76580400
>            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Entries per millisec: 7684
>            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: 130 ns per entry
>            <...>-2636  [003]   771.910177: ring_buffer_producer_thread: Sleeping for 10 secs
> 
>            <...>-2636  [003]   781.884686: ring_buffer_producer_thread: Starting ring buffer hammer
>            <...>-2636  [003]   790.862392: ring_buffer_producer_thread: End ring buffer hammer
>            <...>-2636  [003]   790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19
>            <...>-2636  [003]   790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
>            <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Time:     9000321 (usecs)
>            <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Overruns: 26581200
>            <...>-2636  [003]   790.897402: ring_buffer_producer_thread: Read:     6767200  (by events)
>            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Entries:  0
>            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Total:    33348400
>            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Missed:   0
>            <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Hit:      33348400
>            <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Entries per millisec: 3705
>            <...>-2636  [003]   790.897405: ring_buffer_producer_thread: 269 ns per entry
> 
> While having the reader read by pages (splice mode) a write takes 130ns
> per entry. Not that bad.
> 
> But when the reader is reading each event, (I'm assuming we get some
> cache line bouncing here, it is still lockless), a write takes 269ns.
> 
> Note, the reader is an aggressive reader. That is, it goes into a strong
> loop and just reads as fast as it can. Which would cause cache line
> bouncing.
> 
> When I disable the reader and just measure what a write takes:
> 
> # rmmod ring_buffer_benchmark
> # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
> 
>            <...>-2649  [003]  1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer
>            <...>-2649  [003]  1151.950840: ring_buffer_producer_thread: End ring buffer hammer
>            <...>-2649  [003]  1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
>            <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Time:     9994238 (usecs)
>            <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Overruns: 90635670
>            <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Read:     (reader disabled)
>            <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Entries:  62480
>            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Total:    90698150
>            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Missed:   0
>            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Hit:      90698150
>            <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075
>            <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: 110 ns per entry
>            <...>-2649  [003]  1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs
> 
> It takes just a 110ns.

I checked out latest tip/perf/core with commit:

commit dfacc4d6c98b89609250269f518c1f54c30454ef
Merge: f869097 85cb68b
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu May 20 14:38:55 2010 +0200

    Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core

This seems to include the latest optimizations from Peter.

I decided to modify the trace/kernel/ring_buffer_benchmark.c to
benchmark the perf ring buffer. I spent an hour or so trying to get it
to work but since the perf ring buffer is so coupled with perf, I gave
up in trying to test it directly. It requires a reader, and it has no
good way to set up a buffer and read it from the kernel.

So instead, I figured I would test perf and ftrace with TRACE_EVENT().

Here's what I did, and my changes are at the bottom of this email as a
patch too.

I created a function within the kernel proper (since the benchmark is a
module) called ring_buffer_benchmark_caller(). All this does is call:

trace_rb_benchmark(raw_smp_processor_id());

Where trace_rb_benchmark is created with TRACE_EVENT() (as shown in the
patch below).

I modified the ring buffer benchmark to take a "trace_event" parameter
where instead of testing the ring buffer itself, it would just time the
calling of "ring_buffer_benchmark_caller()", which includes the trace
event.

This is nice way to see the difference between perf and ftrace since
this is the code that they both use.

Note, I noticed that the ring buffer slowed down since 2.6.34. Probably
to do with the "lost events" recording that I added. But it went from
110ns to 153ns in the normal benchmark (not this one).

Anyway, the benchmark would pause until the tracepoint was activated, as
the tracepoint would enable rb_start_bench. When the tracepoint is
called the first time, the benchmark would begin.

Here's what I did:

# modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1
# perf record -e rb_bench:rb_benchmark -c1 -a sleep 30
# cat /debug/tracing/trace

Here's the results:

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-22836 [001]  2804.849264: : wait to start
           <...>-22836 [001]  2816.419909: : starting
           <...>-22836 [001]  2816.419911: : Starting ring buffer hammer
           <...>-22836 [002]  2826.056105: : End ring buffer hammer
           <...>-22836 [002]  2826.056108: : Running Producer at SCHED_FIFO 10
           <...>-22836 [002]  2826.056109: : Time:     9660458 (usecs)
           <...>-22836 [002]  2826.056110: : Overruns: 0
           <...>-22836 [002]  2826.056110: : Read:     (reader disabled)
           <...>-22836 [002]  2826.056111: : Entries:  0
           <...>-22836 [002]  2826.056111: : Total:    0
           <...>-22836 [002]  2826.056112: : Missed:   0
           <...>-22836 [002]  2826.056112: : Hit:      14637150
           <...>-22836 [002]  2826.056113: : Entries per millisec: 1515
           <...>-22836 [002]  2826.056113: : 660 ns per entry
           <...>-22836 [002]  2826.056114: : Sleeping for 10 secs
           <...>-22836 [002]  2836.030464: : wait to start
           <...>-22836 [002]  2836.030465: : starting
           <...>-22836 [002]  2836.030465: : Starting ring buffer hammer
           <...>-22836 [001]  2845.008379: : End ring buffer hammer
           <...>-22836 [001]  2845.008383: : Running Producer at SCHED_FIFO 10
           <...>-22836 [001]  2845.008384: : Time:     9000518 (usecs)
           <...>-22836 [001]  2845.008384: : Overruns: 0
           <...>-22836 [001]  2845.008384: : Read:     (reader disabled)
           <...>-22836 [001]  2845.008385: : Entries:  0
           <...>-22836 [001]  2845.008385: : Total:    0
           <...>-22836 [001]  2845.008386: : Missed:   0
           <...>-22836 [001]  2845.008387: : Hit:      13642200
           <...>-22836 [001]  2845.008387: : Entries per millisec: 1515
           <...>-22836 [001]  2845.008388: : 660 ns per entry

Writing an event in perf took 660ns each.


To do the same for ftrace, I first had to set a filter. This is because
the events for ftrace write to the same place the ring buffer benchmark
is writing its output, and I would have to search through thousands of
called events to see the output. But since filtering is actually
expensive, it should still give us a basic idea.

# modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1
# echo mydata == 10 > /debug/tracing/events/rb_bench/rb_benchmark/filter
# echo 1 > /debug/tracing/events/rb_bench/rb_benchmark/enable
# cat /debug/tracing/trace

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
     rb_producer-22849 [000]  3015.236994: ring_buffer_producer_thread: wait to start
     rb_producer-22849 [000]  3032.193995: ring_buffer_producer_thread: starting
     rb_producer-22849 [000]  3032.193996: ring_buffer_producer_thread: Starting ring buffer hammer
     rb_producer-22849 [000]  3041.513072: ring_buffer_producer_thread: End ring buffer hammer
     rb_producer-22849 [000]  3041.513073: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
     rb_producer-22849 [000]  3041.513074: ring_buffer_producer_thread: Time:     9342579 (usecs)
     rb_producer-22849 [000]  3041.513074: ring_buffer_producer_thread: Overruns: 0
     rb_producer-22849 [000]  3041.513075: ring_buffer_producer_thread: Read:     (reader disabled)
     rb_producer-22849 [000]  3041.513075: ring_buffer_producer_thread: Entries:  0
     rb_producer-22849 [000]  3041.513076: ring_buffer_producer_thread: Total:    0
     rb_producer-22849 [000]  3041.513076: ring_buffer_producer_thread: Missed:   0
     rb_producer-22849 [000]  3041.513077: ring_buffer_producer_thread: Hit:      51588000
     rb_producer-22849 [000]  3041.513077: ring_buffer_producer_thread: Entries per millisec: 5522
     rb_producer-22849 [000]  3041.513078: ring_buffer_producer_thread: 181 ns per entry
     rb_producer-22849 [000]  3041.513078: ring_buffer_producer_thread: Sleeping for 10 secs
     rb_producer-22849 [000]  3051.487356: ring_buffer_producer_thread: wait to start
     rb_producer-22849 [000]  3051.487356: ring_buffer_producer_thread: starting
     rb_producer-22849 [000]  3051.487357: ring_buffer_producer_thread: Starting ring buffer hammer
     rb_producer-22849 [000]  3060.465270: ring_buffer_producer_thread: End ring buffer hammer
     rb_producer-22849 [000]  3060.465271: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
     rb_producer-22849 [000]  3060.465272: ring_buffer_producer_thread: Time:     9000562 (usecs)
     rb_producer-22849 [000]  3060.465272: ring_buffer_producer_thread: Overruns: 0
     rb_producer-22849 [000]  3060.465272: ring_buffer_producer_thread: Read:     (reader disabled)
     rb_producer-22849 [000]  3060.465273: ring_buffer_producer_thread: Entries:  0
     rb_producer-22849 [000]  3060.465273: ring_buffer_producer_thread: Total:    0
     rb_producer-22849 [000]  3060.465274: ring_buffer_producer_thread: Missed:   0
     rb_producer-22849 [000]  3060.465274: ring_buffer_producer_thread: Hit:      49675400
     rb_producer-22849 [000]  3060.465275: ring_buffer_producer_thread: Entries per millisec: 5519
     rb_producer-22849 [000]  3060.465275: ring_buffer_producer_thread: 181 ns per entry

Note, enabling an event in ftrace enables the comm line record, which
shows the "rb_producer".


Here we see each event took 181ns as called by TRACE_EVENT() and then
discarded.

But since I'm discarding the events this may be giving ftrace an unfair
advantage. Thus, I modified the ring buffer benchmark code as follows:

#undef trace_printk
#define trace_printk(fmt, args...) printk(fmt, ##args)

This makes the ring buffer benchmark write to the console instead of the
ring buffer, and lets me run without filtering.

# modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1
# echo 0 > /debug/tracing/events/rb_bench/rb_benchmark/filter
# echo 1 > /debug/tracing/events/rb_bench/rb_benchmark/enable

I got this:

wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972749 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      57079200
Entries per millisec: 5723
174 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972786 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      57086350
Entries per millisec: 5724
174 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972772 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      57058250
Entries per millisec: 5721
174 ns per entry
Sleeping for 10 secs

The non-filtering case (the one that actually records) was 174ns per event.


Here's the patch:

-- Steve

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index ffb1a5b..3b559d3 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -21,10 +21,12 @@ endif
 #
 obj-y += trace_clock.o
 
+CFLAGS_ring_buffer_bm_event.o += -I$(src)
+
 obj-$(CONFIG_FUNCTION_TRACER) += libftrace.o
 obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o
-
+obj-y += ring_buffer_bm_event.o
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_TRACING) += trace_output.o
 obj-$(CONFIG_TRACING) += trace_stat.o
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 302f8a6..9d5cdb1 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -4,12 +4,23 @@
  * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>
  */
 #include <linux/ring_buffer.h>
+#include <linux/ftrace_event.h>
 #include <linux/completion.h>
 #include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/time.h>
 #include <asm/local.h>
 
+#include "ring_buffer_bm_event.h"
+
+/* Write the strings directly into the buffers. */
+#undef trace_printk
+#define trace_printk(fmt, args...)					\
+do {									\
+	__trace_printk_check_format(fmt, ##args);			\
+	__trace_printk(_THIS_IP_, fmt, ##args);				\
+} while (0)
+
 struct rb_page {
 	u64		ts;
 	local_t		commit;
@@ -32,6 +43,10 @@ static struct task_struct *producer;
 static struct task_struct *consumer;
 static unsigned long read;
 
+static int trace_event;
+module_param(trace_event, uint, 0644);
+MODULE_PARM_DESC(trace_event, "record a trace event");
+
 static int disable_reader;
 module_param(disable_reader, uint, 0644);
 MODULE_PARM_DESC(disable_reader, "only run producer");
@@ -211,6 +226,50 @@ static void ring_buffer_consumer(void)
 	complete(&read_done);
 }
 
+static inline int ring_buffer_write_trace_event(void)
+{
+	ring_buffer_benchmark_caller();
+	return 0;
+}
+
+static inline int ring_buffer_write_rb_event(void)
+{
+	struct ring_buffer_event *event;
+	int *entry;
+
+	event = ring_buffer_lock_reserve(buffer, 10);
+	if (!event)
+		return -1;
+
+	entry = ring_buffer_event_data(event);
+	*entry = smp_processor_id();
+	ring_buffer_unlock_commit(buffer, event);
+
+	return 0;
+}
+
+static inline int ring_buffer_write_event(void)
+{
+	if (trace_event)
+		return ring_buffer_write_trace_event();
+	else
+		return ring_buffer_write_rb_event();
+}
+
+static int wait_to_start(void)
+{
+	while (!rb_start_bench) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop())
+			break;
+		schedule_timeout(HZ/10);
+		ring_buffer_benchmark_caller();
+	}
+	__set_current_state(TASK_RUNNING);
+
+	return kthread_should_stop();
+}
+
 static void ring_buffer_producer(void)
 {
 	struct timeval start_tv;
@@ -223,6 +282,13 @@ static void ring_buffer_producer(void)
 	unsigned long avg;
 	int cnt = 0;
 
+	if (trace_event) {
+		trace_printk("wait to start\n");
+		if (wait_to_start())
+			return;
+		trace_printk("starting\n");
+	}
+
 	/*
 	 * Hammer the buffer for 10 secs (this may
 	 * make the system stall)
@@ -230,20 +296,13 @@ static void ring_buffer_producer(void)
 	trace_printk("Starting ring buffer hammer\n");
 	do_gettimeofday(&start_tv);
 	do {
-		struct ring_buffer_event *event;
-		int *entry;
 		int i;
 
 		for (i = 0; i < write_iteration; i++) {
-			event = ring_buffer_lock_reserve(buffer, 10);
-			if (!event) {
+			if (ring_buffer_write_event() < 0)
 				missed++;
-			} else {
+			else
 				hit++;
-				entry = ring_buffer_event_data(event);
-				*entry = smp_processor_id();
-				ring_buffer_unlock_commit(buffer, event);
-			}
 		}
 		do_gettimeofday(&end_tv);
 
@@ -425,6 +484,12 @@ static int __init ring_buffer_benchmark_init(void)
 	if (!buffer)
 		return -ENOMEM;
 
+	/* The reader is userspace */
+	if (trace_event) {
+		disable_reader = 1;
+		rb_start_bench = 0;
+	}
+
 	if (!disable_reader) {
 		consumer = kthread_create(ring_buffer_consumer_thread,
 					  NULL, "rb_consumer");
diff --git a/kernel/trace/ring_buffer_bm_event.c b/kernel/trace/ring_buffer_bm_event.c
new file mode 100644
index 0000000..862e80c
--- /dev/null
+++ b/kernel/trace/ring_buffer_bm_event.c
@@ -0,0 +1,13 @@
+#include <linux/module.h>
+
+#define CREATE_TRACE_POINTS
+#include "ring_buffer_bm_event.h"
+
+int rb_start_bench;
+EXPORT_SYMBOL(rb_start_bench);
+
+void ring_buffer_benchmark_caller(void)
+{
+	trace_rb_benchmark(raw_smp_processor_id());
+}
+EXPORT_SYMBOL(ring_buffer_benchmark_caller);
diff --git a/kernel/trace/ring_buffer_bm_event.h b/kernel/trace/ring_buffer_bm_event.h
new file mode 100644
index 0000000..a167c6d
--- /dev/null
+++ b/kernel/trace/ring_buffer_bm_event.h
@@ -0,0 +1,38 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rb_bench
+
+#if !defined(_TRACE_RB_BENCHMARK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RB_BENCHMARK_H
+
+#include <linux/tracepoint.h>
+
+extern int rb_start_bench;
+void ring_buffer_benchmark_caller(void);
+
+TRACE_EVENT(rb_benchmark,
+
+	TP_PROTO(int mydata),
+
+	TP_ARGS(mydata),
+
+	TP_STRUCT__entry(
+		__field(	int,	mydata			)
+	),
+
+	TP_fast_assign(
+		rb_start_bench = 1;
+		__entry->mydata = mydata;
+	),
+
+	TP_printk("data=%d", __entry->mydata)
+);
+
+#endif /* _TRACE_RB_BENCHMARK_H */
+
+#undef TRACE_INCLUDE_FILE
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE ring_buffer_bm_event
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>







^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20 14:38   ` Steven Rostedt
  2010-05-20 15:04     ` Steven Rostedt
  2010-05-20 20:12     ` Steven Rostedt
@ 2010-05-21 17:49     ` Ingo Molnar
  2010-05-24 20:13       ` Steven Rostedt
  2 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2010-05-21 17:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > More than a year and a half ago (September 2008), at 
> > > Linux Plumbers, we had a meeting with several kernel 
> > > developers to come up with a unified ring buffer. A 
> > > generic ring buffer in the kernel that any subsystem 
> > > could use. After coming up with a set of requirements, I 
> > > worked on implementing it. One of the requirements was 
> > > to start off simple and work to become a more complete 
> > > buffering system.
> > >
> > > [...]
> > 
> > The thing is, in tracing land and more broadly in 
> > instrumentation land we have _much_ more earthly problems 
> > these days:
> > 
> >  - Lets face it, performance of the ring-buffer sucks, in 
> >    a big way. I've recently benchmarked it and it takes 
> >    hundreds of instructions to trace a single event. 
> >    Puh-lease ...
> 
> "Puh-lease"?

Yeah, we suck :-/

> Hmm, what were the timings?  I just ran the 
> ring_buffer_benchmark module. It's in mainline, so 
> anyone could do the same. [...]

i tested it with perf stat and got 700+ instructions for 
the sched:sched_wakeup tracepoint.

That's without extracting the trace - that would add 
another significant chunk of overhead.

Btw., it's not just ftrace, on the perf side i see big 
tracing overhead as well, it's even worse than this 
because we layer it on each other and because perf does a 
few stupid things (that need fixing).

> [...] Remember to disable KERNEL_DEBUG because things 
> like lockdep can have a significant impact on it.

I had lockdep off.

> Note, instructions themselves are not that bad, but what 
> those instructions do count a lot more. I avoid all lock 
> functions and memory barriers which cause the code to be 
> a little bigger. But a single cmpxchg can take much more 
> time than a lot of other instructions put together.

It's bad cycle-wise as well and obviously running through 
700 instructions is bad no matter what.

> On this machine:
> 
> model name	: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz
> 
> It's a 2x2 (4 cores)
> 
> With running:
> 
> # modprobe ring_buffer_benchmark producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
> 
> 
>            <...>-2636  [003]   761.969850: ring_buffer_producer_thread: Starting ring buffer hammer
>            <...>-2636  [003]   771.910121: ring_buffer_producer_thread: End ring buffer hammer
>            <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19
>            <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
>            <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Time:     9965321 (usecs)
>            <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Overruns: 0
>            <...>-2636  [003]   771.910174: ring_buffer_producer_thread: Read:     76580325  (by pages)
>            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Entries:  75
>            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Total:    76580400
>            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Missed:   0
>            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Hit:      76580400
>            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Entries per millisec: 7684
>            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: 130 ns per entry
>            <...>-2636  [003]   771.910177: ring_buffer_producer_thread: Sleeping for 10 secs
> 
>            <...>-2636  [003]   781.884686: ring_buffer_producer_thread: Starting ring buffer hammer
>            <...>-2636  [003]   790.862392: ring_buffer_producer_thread: End ring buffer hammer
>            <...>-2636  [003]   790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19
>            <...>-2636  [003]   790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
>            <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Time:     9000321 (usecs)
>            <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Overruns: 26581200
>            <...>-2636  [003]   790.897402: ring_buffer_producer_thread: Read:     6767200  (by events)
>            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Entries:  0
>            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Total:    33348400
>            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Missed:   0
>            <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Hit:      33348400
>            <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Entries per millisec: 3705
>            <...>-2636  [003]   790.897405: ring_buffer_producer_thread: 269 ns per entry
> 
> While having the reader read by pages (splice mode) a 
> write takes 130ns per entry. Not that bad.

I tried a simple loop of:

int main(void)
{
        int i;

        for (i = 0; i < 1000000; i++)
                getppid();

        return 0;
}

with kernel/timer.c's getppid syscall having this:

#undef CREATE_TRACE_POINTS
#include <trace/events/sched.h>

SYSCALL_DEFINE0(getppid)
...
	trace_sched_wakeup(current, 0);

and then used:

   echo 1 > /debug/tracing/events/sched/sched_wakeup/enable
   echo sched_switch > /debug/tracing/current_tracer

   perf stat --repeat 10 -e instructions -e cycles ./getppid-1m

with the tracepoint off i got:

 # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m

 Performance counter stats for './getppid-1m' (10 runs):

         89.700.970  instructions             #      0,614 IPC     ( +-   0,146% )
        146.204.088  cycles                     ( +-   1,160% )

        0,046142868  seconds time elapsed   ( +-   1,200% )


with it on i got:

 # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m

 Performance counter stats for './getppid-1m' (10 runs):

      1.241.321.592  instructions             #      1,617 IPC     ( +-   0,054% )
        767.890.685  cycles                     ( +-   1,298% )

        0,241048084  seconds time elapsed   ( +-   1,288% )


i.e. from 89 instructions to 1241 instructions. From 146 
cycles to 767 cycles.

And that's just the recording - doesnt count the 
recovering of the trace. (which can argued to be a 
slowpath in many tracing workflows when buffers are big 
enough.)

Nehalem box. Can send the config.

> But when the reader is reading each event, (I'm assuming 
> we get some cache line bouncing here, it is still 
> lockless), a write takes 269ns.

That would be like 800 cycles? So it's roughly the same 
ballpark figure as mine - we suck. (this is a 3GHz box)

> Note, the reader is an aggressive reader. That is, it 
> goes into a strong loop and just reads as fast as it 
> can. Which would cause cache line bouncing.
> 
> When I disable the reader and just measure what a write 
> takes:
> 
> # rmmod ring_buffer_benchmark
> # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10
> # sleep 30
> # cat /debug/tracing/trace
> 
>            <...>-2649  [003]  1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer
>            <...>-2649  [003]  1151.950840: ring_buffer_producer_thread: End ring buffer hammer
>            <...>-2649  [003]  1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
>            <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Time:     9994238 (usecs)
>            <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Overruns: 90635670
>            <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Read:     (reader disabled)
>            <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Entries:  62480
>            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Total:    90698150
>            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Missed:   0
>            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Hit:      90698150
>            <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075
>            <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: 110 ns per entry
>            <...>-2649  [003]  1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs
> 
> It takes just a 110ns.

I dont seem to be able to get such low numbers - but even 
~250 instructions are pretty expensive.

> Lets now use perf to see what is happening:
> 
> # perf record -a -c10000 -f
> (wait 30 secs)
> Ctrl-C
> 
> # perf report
> 
> no symbols found in /usr/sbin/irqbalance, maybe install a debug package?
> # Samples: 1995496
> #
> # Overhead          Command                                                     Shared Object  Symbol
> # ........  ...............  ................................................................  ......
> #
>     23.80%      rb_producer  [kernel.kallsyms]                                                 [k] native_sched_clock
>     14.83%      rb_producer  [kernel.kallsyms]                                                 [k] rb_reserve_next_event
>     13.20%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_unlock_commit
>     12.84%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_lock_reserve
>      9.27%      rb_producer  [kernel.kallsyms]                                                 [k] rb_end_commit
>      9.19%      rb_producer  [kernel.kallsyms]                                                 [k] __rb_reserve_next
>      5.24%      rb_producer  [kernel.kallsyms]                                                 [k] trace_clock_local
>      3.11%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_event_data
>      2.88%      rb_producer  [ring_buffer_benchmark]                                           [k] ring_buffer_benchmark_init
>      2.53%      rb_producer  [kernel.kallsyms]                                                 [k] trace_recursive_unlock
>      0.39%      rb_producer  [kernel.kallsyms]                                                 [k] do_gettimeofday
>      0.34%      rb_producer  [kernel.kallsyms]                                                 [k] read_tsc
>      0.18%      rb_producer  [kernel.kallsyms]                                                 [k] getnstimeofday
>      0.14%          swapper  [kernel.kallsyms]                                                 [k] mwait_idle
> 
> 
> Now that the sched_clock is %23.8 percent, it doesn't 
> look as bad.
> 
> rb_reserve_next_event is 14.83, 
> ring_buffer_unlock_commit is 13.20, 
> ring_buffer_lock_reserve is 12.84, rb_end_commit is 
> 9.27, __rb_reserve_next is 8.18, trace_clock_local is 
> 5.24 (which only disables preemption and calls 
> sched_clock), ring_buffer_event_data is 3.11 which only 
> returns a pointer on a structure (once a event).
> 
> ring_buffer_benchmark_init is the benchmark code itself, 
> and trace_recursive_unlock is the code you asked me to 
> add to check for cases where the tracer could have 
> recursed itself.
> 
> 14.83 + 13.20 + 9.27 + 8.18 + 5.24 + 3.11 = 53.83%
> 
> Considering that the system is basically idle except for 
> this benchmark, that makes sense to have it that high. 
> We need to compare it. The sched_clock() takes up 1/3 of 
> the entire trace event.

Should be better on my box, which has a usable TSC and my 
sched_clock() is in essence an RDTSC.


> >  - It has grown a lot of slack. It's complex and hard to
> >    read.
> 
> I admit, that it got over designed, and I'm all for 
> replacing it. We were all involved in the initial 
> design, [...]

Yes, by implication i definitely suck too, no argument 
about that :-)

> [...] and I posted several versions of RFC when I 
> created it. But I admit, it became complex, and now that 
> we know what we want, we can take a step back and 
> redesign a new one from scratch.
> 
> But until we do so, I'm continuing to support the 
> current one, and adding small updates when needed.
> 
> > 
> >  - Over the last year or so the majority of bleeding-edge
> >    tracing developers have gradually migrated over to perf 
> >    and 'perf trace' / 'perf probe' in particular. As far 
> >    as the past two merge windows go they are 
> >    out-developing the old ftrace UIs by a ratio of 4:1.
> 
> Of course a new project will be in a high state of flux. 
> I'm quite satisfied with the interface with ftrace and 
> did not feel that it needed changing.
> 
> > 
> >    So this angle is becoming a prime thing to improve and
> >    users and developers are hurting from the ftrace/perf
> >    duality.
> 
> I agree.
> 
> > 
> >  - [ While it's still a long way off, if this trend continues
> >      we eventually might even be able to get rid of the 
> >      /debug/tracing/ temporary debug API and get rid of 
> >      the ugly in-kernel pretty-printing bits. This is 
> >      good: it may make Andrew very happy for a change ;-)
> 
> So how do you plan on getting rid of the in-kernel 
> pretty-printing bits and keeping the trace_dump_on_oops?  
> There's a strong dependency between those two.

One approach would be to add a library that can turn the 
specification into a pretty-print output - say in 
tools/perf/lib/printlib/ (or lib/printlib/), and co-build 
it both in perf and in the kernel.

That way we get most of the pretty-printing functionality 
at oops (or critical event) time, but have it in a 
manageable form.

> >      The main detail here to be careful of is that lots of
> >      people are fond of the simplicity of the 
> >      /debug/tracing/ debug UI, so when we replace it we 
> >      want to do it by keeping that simple workflow (or 
> >      best by making it even simpler). I have a few ideas 
> >      how to do this.
> 
> Note, I still would not want to remove the /debug 
> interface. It is a debug interface anyway. Having both a 
> syscall access the functionality of the code and having 
> a debug interface access it very trivial. This could 
> also be a way we can debug the tool too.
> 
> We can officially support only the tool side, but why 
> throw away the baby with the bath water?

As long as we declare it to go away the moment we have 
something equivalent or better on the 'perf ftrace' / 
'ftrace' tool side, sure.

I worry about what Andrew mentioned a few months (years?) 
ago: unintentional ABI tool bindings to /debug/tracing, 
just due to us offering it.

We want to gravitate towards offering one superb solution 
for all the tracing goodies we've built and are going to 
build.


> >      There's also the detail that in some cases we want to
> >      print events in the kernel in a human readable way: 
> >      for example EDAC/MCE and other critical events,
> >      trace-on-oops, etc. This too can be solved. ]
> 
> But if the code for trace-on-oops is there, then the 
> pretty print code is also there.

Not if we share it with the user-space tool (in a generic 
way) and drive it from the user-space side, not the kernel 
side.

The executable instructions will still be there in the 
kernel, for those relatively rare cases that need it (lets 
face it, the usage proportion between ordinary tracing and 
dump-on-oops is probably 1:100 or worse).

> > Regarding performance and complexity, which is our 
> > main worry atm, fortunately there's work going on in 
> > that direction - please see PeterZ's recent string of 
> > patches on lkml:
> > 
> >   4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
> >   a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
> >   ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
> >   fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t
> >   6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg()
> 
> Yes, I saw these, and it helps a little, but still not everything is
> there.

Most definitely. We still suck :-)

> > And it may sound harsh but at this stage i'm 
> > personally not at all interested in big design talk.
> 
> Rewriting ftrace on top of perf is a big design change.

Yeah, and a pretty rewarding one.

> > This isnt rocket science, we have developers and users 
> > and we know what they are doing and we know what we 
> > need to do: we need to improve our crap and we need to 
> > reduce complexity. Less is more.
> > 
> > So i'd like to see iterative, useful action first, and 
> > i am somewhat sceptical about yet another grand 
> > tracing design trying to match 100 requirements.
> 
> Here's a few things that I find required:
> 
> The ability to use splice. You keep telling me this is 
> not important, but splice is magnitudes faster than 
> mmaping into userspace and then copying into a file. 
> [...]

That claim i find hard to believe, have you got the 
numbers for that?

It's faster for sure if implemented correctly, but i'd be 
surprised if it made more than 20% of a difference to the 
total tracing overhead.

> [...] If the splice removes the extra copy as we are 
> working on, it will get even faster. It also makes 
> writing over the network much easier. Which trace-cmd 
> does, and it took me half a day to make it do so.
> 
> We also need a way to easily read the buffer from the 
> kernel. This is for trace-on-oops.
> 
> Overwrite mode must also be supported. I know that Peter 
> thinks its silly to do this and have it write to a file, 
> but I find it helpful to do so. But fine, I can live 
> without overwrite to a file, if no one else cares. But 
> overwrite mode itself must work for trace-on-oops.

We can do overwrite mode, it doesnt look like a central 
issue.

I'm sure we'll find a dozen other things missing as well. 
We have to start looking.

> We need a way to start and stop tracing from the kernel 
> and userspace. Again, for trace-on-oops. When a bug is 
> hit, we need a way to stop it. We also must have a way 
> to easily start it again from userspace.
> 
> A simple on/off switch. This is something that you and 
> Thomas hammered into me which was the creation of 
> tracing_on.
> 
> The above can be done, but lets not just hack the 
> changes in. Lets really look at a proper design.

This kind of capability would be met by persistent events 
attached to struct user.

We have to start looking and doing.

> > Steve, Mathieu, if you are interested please help out 
> > Peter with the performance and simplification work. 
> > The last thing we need is yet another 
> > replace-everything event.
> 
> But that is what you are asking to happen to ftrace. 
> "replace-everything".

We've got two ring-buffer implementations right now, one 
has to be replaced.

It doesnt mean we should disrupt _two_ implementations and 
put in a third one, unless there are strong technical 
reasons for doing so.

> > If we really want to create a new ring-buffer 
> > abstraction i'd suggest we start with Peter's, it has 
> > a quite sane design and stayed simple and flexible - 
> > if then it could be factored out a bit.
> 
> I'm sorry, but it is very coupled with perf. [...]

Then decouple it.

> [...] It is simple and flexible with what perf does 
> today, but not what ftrace does today, or to make perf 
> do what ftrace does.

Why? Neither overwrite mode, nor splice support, nor 
persistent buffering looks hard at all.

> > Here are more bits of what i see as the 'action' going 
> > forward, in no particular order:
> > 
> > 1) Push the /debug/tracing/events/ event description
> >    into sysfs, as per this thread on lkml:
> > 
> >      [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs
> > 
> >      http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e
> > 
> >    I.e. one more step towards integrating ftrace into perf.
> 
> We could do something like this, but it is not as 
> trivial as it may seem.

Never is.

> > 2) Use 1) to unify the perf events and the ftrace
> >    ring-buffer. This, as things are standing is
> >    best done by factoring out Peter's ring-buffer
> >    in kernel/perf_event.c. It's properly abstracted
> >    and it _far_ simpler than kernel/tracing/ring_buffer.c,
> >    which has become a monstrosity.
> 
> Sure, my code has become quite complex, I'll admit that. 
> I'm for refactoring. I really don't care what code is 
> used as long as I still have the requirements that make 
> ftrace useful for me and several others.
> 
> And yes, any ring buffer needs to be factored out to a 
> defined interface. The current ring buffer in perf is 
> very coupled with the perf design. The event code 
> written to perf in include/trace/ftrace.h is still quite 
> a hack.

Then lets make it less of a hack.

> >    (but i'm open to other simplifications as well)
> > 
> > 3) Add the function-tracer and function-graph tracer
> >    as an event and integrate it into perf.
> 
> Here's something that I think perf needs to do. It needs 
> to become a bit more modulized, and not absorb 
> everything into perf instead of having things use just a 
> part of perf. Like the breakpoint code requires full 
> perf to be done, and x86 has PERF_EVENTS permanently on. 
> Which is really a PITA.

the Kconfig rule should be fixed so that PERF_EVENTS can 
be disabled on EMBEDDED.

On non-embedded, why is it a PITA?

> As I designed ftrace from day one, I tried to make it 
> modular. Thus, the function trace 
> (kernel/trace/ftrace.c) has always been a separate 
> entity from ftrace itself. Heck, LTTng uses it. It would 
> be actually trivial to implement the function tracer in 
> perf today. Just register a function you want called at 
> every function event and start running.

Here i disagree: the last thing Linux needs is various 
fragmented pieces of instrumentation.

We try hard with perf to unify. It takes time but has to 
happen.

> I need to abstract out the "what function to call bit" 
> but that's about it.
> 
> > 
> >    This will live-test the efficiency of the unification
> >    and brings over the last big ftrace plugin to perf.
> > 
> > 4) Gradually convert/port/migrate all the remaining 
> >    plugins over as well. We need to do this very gently 
> >    because there are users - but stop piling new 
> >    functionality on to the old ftrace side. This usually 
> >    involves:
> 
> I have not been piling functionality into the ftrace 
> side, as you stated at the beginning of the email, that 
> the development has been quite slow. Yes, I've been 
> still working on things, but I would say it is a far cry 
> from "piling".

Sorry, that was a poor choice of a word. We did manage to 
pile quite a bit of crap into the tracing hotpath though.

> I did add some simple features, like the "lost events" 
> to the ring buffer, but that code can be looked at for 
> ideas for the future. It does not make merging any more 
> difficult than it already is.
> 
> Fine, I'll stop some of the code I was going to work on 
> (multiple buffers for ftrace, event triggers, 
> configuring events to stop the tracer (like for 
> different kinds of BUG() and WARN_ON()), etc).


I think those are excellent features that seem to fit 
rather well into the persisent buffering concept, what do 
you think?

Via the 'ftrace' tool we could as a first step extend 
tracing sessions to per user. We could also bring the 
start/stop/dont-worry-about-where-the-trace-is workflow to 
perf trace that way.

Hm?

> >     - Conversion of an explicit tracing callback to
> >       TRACE_EVENT (for example in the case of mmiotrace),
> >       while keeping all tool functionality.
> 
> Not sure what this is.

We have a lot of explicit ad-hoc callbacks, hooks, 
notifiers, function pointers in the kernel, which feed 
some sort of event logging mechanism. These should be 
unified while keeping their current functionality - but 
exposing the callback as an event source as well. This 
enriches total pool of events.

For example the mmiotrace tracepoints.

> >     - Migrate any 'special' ftrace feature to perf 
> >       capabilities so that it's available via the 
> >       syscall interface as well. (for example 
> >       'latency maximum tracking' is something that we 
> >       probably want to do with kernel-side help - we 
> >       probably dont want to implement it via tracing 
> >       everything all the time and finding the maximum 
> >       based on terabytes of data.)
> > 
> >    (And there are other complications here too, but you 
> >     get the idea.)
> 
> Yes, the latency tracers are special, and I treat them 
> special in ftrace. They can be redesigned to keep their 
> current functionality. That needed to be done anyway.

Yeah.

> > All in one, i think we can reuse more than 50% of all 
> > current ftrace code (possibly up to 70-80%) - and we 
> > are already reusing bits of it.
> 
> Ingo, as we have had this conversation, I don't know if 
> you realized that I agreed to merge ftrace to perf. 
> There's just some functionality about ftrace I really 
> want to keep, which does include accessing via /debugfs, 
> pretty-print (which is needed by trace-on-oops) and 
> splice.

Ok, good :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-21 17:49     ` Ingo Molnar
@ 2010-05-24 20:13       ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-24 20:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
	Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake,
	Robert Richter

On Fri, 2010-05-21 at 19:49 +0200, Ingo Molnar wrote: 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 

> > > The thing is, in tracing land and more broadly in 
> > > instrumentation land we have _much_ more earthly problems 
> > > these days:
> > > 
> > >  - Lets face it, performance of the ring-buffer sucks, in 
> > >    a big way. I've recently benchmarked it and it takes 
> > >    hundreds of instructions to trace a single event. 
> > >    Puh-lease ...
> > 
> > "Puh-lease"?
> 
> Yeah, we suck :-/

But the question is, do we want to suck more?

> 
> > Hmm, what were the timings?  I just ran the 
> > ring_buffer_benchmark module. It's in mainline, so 
> > anyone could do the same. [...]
> 
> i tested it with perf stat and got 700+ instructions for 
> the sched:sched_wakeup tracepoint.
> 
> That's without extracting the trace - that would add 
> another significant chunk of overhead.
> 
> Btw., it's not just ftrace, on the perf side i see big 
> tracing overhead as well, it's even worse than this 
> because we layer it on each other and because perf does a 
> few stupid things (that need fixing).

But this has nothing to do with the ring buffer. It has to do with the
way the trace event itself is set up. There's little room for
improvement if we want to keep things automated. Otherwise, all trace
events will need to be written by us, or make all writers of
TRACE_EVENT() into tracing experts.

The beauty of TRACE_EVENT() was to make it simple enough that anyone
could add one without needing to know the details of how it appears.

The are also flexible enough to have in kernel users attach directly to
them.


[coming back up after reading below ]

Did you test only the event 
"echo 1 > /debug/tracing/event/sched/sched_wakeup" or did you only do it
in conjunction with the sched_switch tracer?

> 
> > [...] Remember to disable KERNEL_DEBUG because things 
> > like lockdep can have a significant impact on it.
> 
> I had lockdep off.
> 
> > Note, instructions themselves are not that bad, but what 
> > those instructions do count a lot more. I avoid all lock 
> > functions and memory barriers which cause the code to be 
> > a little bigger. But a single cmpxchg can take much more 
> > time than a lot of other instructions put together.
> 
> It's bad cycle-wise as well and obviously running through 
> 700 instructions is bad no matter what.
> 
> > On this machine:
> > 
> > model name	: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz
> > 
> > It's a 2x2 (4 cores)
> > 
> > With running:
> > 
> > # modprobe ring_buffer_benchmark producer_fifo=10
> > # sleep 30
> > # cat /debug/tracing/trace
> > 
> > 
> >            <...>-2636  [003]   761.969850: ring_buffer_producer_thread: Starting ring buffer hammer
> >            <...>-2636  [003]   771.910121: ring_buffer_producer_thread: End ring buffer hammer
> >            <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19
> >            <...>-2636  [003]   771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
> >            <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Time:     9965321 (usecs)
> >            <...>-2636  [003]   771.910173: ring_buffer_producer_thread: Overruns: 0
> >            <...>-2636  [003]   771.910174: ring_buffer_producer_thread: Read:     76580325  (by pages)
> >            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Entries:  75
> >            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Total:    76580400
> >            <...>-2636  [003]   771.910175: ring_buffer_producer_thread: Missed:   0
> >            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Hit:      76580400
> >            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: Entries per millisec: 7684
> >            <...>-2636  [003]   771.910176: ring_buffer_producer_thread: 130 ns per entry
> >            <...>-2636  [003]   771.910177: ring_buffer_producer_thread: Sleeping for 10 secs
> > 
> >            <...>-2636  [003]   781.884686: ring_buffer_producer_thread: Starting ring buffer hammer
> >            <...>-2636  [003]   790.862392: ring_buffer_producer_thread: End ring buffer hammer
> >            <...>-2636  [003]   790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19
> >            <...>-2636  [003]   790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
> >            <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Time:     9000321 (usecs)
> >            <...>-2636  [003]   790.897401: ring_buffer_producer_thread: Overruns: 26581200
> >            <...>-2636  [003]   790.897402: ring_buffer_producer_thread: Read:     6767200  (by events)
> >            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Entries:  0
> >            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Total:    33348400
> >            <...>-2636  [003]   790.897403: ring_buffer_producer_thread: Missed:   0
> >            <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Hit:      33348400
> >            <...>-2636  [003]   790.897404: ring_buffer_producer_thread: Entries per millisec: 3705
> >            <...>-2636  [003]   790.897405: ring_buffer_producer_thread: 269 ns per entry
> > 
> > While having the reader read by pages (splice mode) a 
> > write takes 130ns per entry. Not that bad.
> 
> I tried a simple loop of:
> 
> int main(void)
> {
>         int i;
> 
>         for (i = 0; i < 1000000; i++)
>                 getppid();
> 
>         return 0;
> }
> 
> with kernel/timer.c's getppid syscall having this:
> 
> #undef CREATE_TRACE_POINTS
> #include <trace/events/sched.h>
> 
> SYSCALL_DEFINE0(getppid)
> ...
> 	trace_sched_wakeup(current, 0);
> 
> and then used:
> 
>    echo 1 > /debug/tracing/events/sched/sched_wakeup/enable
>    echo sched_switch > /debug/tracing/current_tracer

This is not fair, you just enabled two probes, not just one. There's the
sched_wakeup generic event "/debug/tracing/events/sched/sched_wakeup"
and the sched_switch tracer that adds its own probe. Thus, the
trace_sched_wakeup() calls two probes each time.

Have you seen the code for the probe of sched_switch tracer's
trace_sched_wakeup?  What you did will cause the getppid to call both
the generic TRACE_EVENT() probe as well as the sched_switch attached
probe.

Here is the sched_switch sched_wakeup probe:


static void
probe_sched_wakeup(struct rq *__rq, struct task_struct *wakee, int success)
{
	struct trace_array_cpu *data;
	unsigned long flags;
	int cpu, pc;

	if (unlikely(!sched_ref))
		return;

	tracing_record_cmdline(current);

	if (!tracer_enabled || sched_stopped)
		return;

	pc = preempt_count();
	local_irq_save(flags);
	cpu = raw_smp_processor_id();
	data = ctx_trace->data[cpu];

	if (likely(!atomic_read(&data->disabled)))
		tracing_sched_wakeup_trace(ctx_trace, wakee, current,
					   flags, pc);

	local_irq_restore(flags);
}


It also disables interrupts, but lets look at some of the functions it
calls:

void tracing_record_cmdline(struct task_struct *tsk)
{
	if (atomic_read(&trace_record_cmdline_disabled) || !tracer_enabled ||
	    !tracing_is_on())
		return;

	trace_save_cmdline(tsk);
}

static void trace_save_cmdline(struct task_struct *tsk)
{
	unsigned pid, idx;

	if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT))
		return;

	/*
	 * It's not the end of the world if we don't get
	 * the lock, but we also don't want to spin
	 * nor do we want to disable interrupts,
	 * so if we miss here, then better luck next time.
	 */
	if (!arch_spin_trylock(&trace_cmdline_lock))
		return;

	idx = map_pid_to_cmdline[tsk->pid];
	if (idx == NO_CMDLINE_MAP) {
		idx = (cmdline_idx + 1) % SAVED_CMDLINES;

		/*
		 * Check whether the cmdline buffer at idx has a pid
		 * mapped. We are going to overwrite that entry so we
		 * need to clear the map_pid_to_cmdline. Otherwise we
		 * would read the new comm for the old pid.
		 */
		pid = map_cmdline_to_pid[idx];
		if (pid != NO_CMDLINE_MAP)
			map_pid_to_cmdline[pid] = NO_CMDLINE_MAP;

		map_cmdline_to_pid[idx] = tsk->pid;
		map_pid_to_cmdline[tsk->pid] = idx;

		cmdline_idx = idx;
	}

	memcpy(&saved_cmdlines[idx], tsk->comm, TASK_COMM_LEN);

	arch_spin_unlock(&trace_cmdline_lock);
}

The above is grabbing a spinlock!


tracing_sched_wakeup_trace(struct trace_array *tr,
			   struct task_struct *wakee,
			   struct task_struct *curr,
			   unsigned long flags, int pc)
{
	struct ftrace_event_call *call = &event_wakeup;
	struct ring_buffer_event *event;
	struct ctx_switch_entry *entry;
	struct ring_buffer *buffer = tr->buffer;

	event = trace_buffer_lock_reserve(buffer, TRACE_WAKE,
					  sizeof(*entry), flags, pc);
	if (!event)
		return;
	entry	= ring_buffer_event_data(event);
	entry->prev_pid			= curr->pid;
	entry->prev_prio		= curr->prio;
	entry->prev_state		= curr->state;
	entry->next_pid			= wakee->pid;
	entry->next_prio		= wakee->prio;
	entry->next_state		= wakee->state;
	entry->next_cpu			= task_cpu(wakee);

	if (!filter_check_discard(call, entry, buffer, event))
		ring_buffer_unlock_commit(buffer, event);
	ftrace_trace_stack(tr->buffer, flags, 6, pc);
	ftrace_trace_userstack(tr->buffer, flags, pc);
}


This has little to do with the ring buffer, or the way ftrace does
events. It can certainly be made faster, but wakeups are rather rare
(compared to other events that are traced).




> 
>    perf stat --repeat 10 -e instructions -e cycles ./getppid-1m
> 
> with the tracepoint off i got:
> 
>  # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m
> 
>  Performance counter stats for './getppid-1m' (10 runs):
> 
>          89.700.970  instructions             #      0,614 IPC     ( +-   0,146% )
>         146.204.088  cycles                     ( +-   1,160% )
> 
>         0,046142868  seconds time elapsed   ( +-   1,200% )
> 
> 
> with it on i got:
> 
>  # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m
> 
>  Performance counter stats for './getppid-1m' (10 runs):
> 
>       1.241.321.592  instructions             #      1,617 IPC     ( +-   0,054% )
>         767.890.685  cycles                     ( +-   1,298% )
> 
>         0,241048084  seconds time elapsed   ( +-   1,288% )
> 
> 
> i.e. from 89 instructions to 1241 instructions. From 146 
> cycles to 767 cycles.

But you enabled sched_switch too, right? That would cause two probes to
execute, and one is a lot bigger than the other.

> 
> And that's just the recording - doesnt count the 
> recovering of the trace. (which can argued to be a 
> slowpath in many tracing workflows when buffers are big 
> enough.)

But still, you inflated the numbers by enabling the sched_switch tracer.

> 
> Nehalem box. Can send the config.
> 
> > But when the reader is reading each event, (I'm assuming 
> > we get some cache line bouncing here, it is still 
> > lockless), a write takes 269ns.
> 
> That would be like 800 cycles? So it's roughly the same 
> ballpark figure as mine - we suck. (this is a 3GHz box)

That was with an aggressive reader, which even though it is lockless,
the reader is still dirtying the writer's cache lines.

> 
> > Note, the reader is an aggressive reader. That is, it 
> > goes into a strong loop and just reads as fast as it 
> > can. Which would cause cache line bouncing.
> > 
> > When I disable the reader and just measure what a write 
> > takes:
> > 
> > # rmmod ring_buffer_benchmark
> > # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10
> > # sleep 30
> > # cat /debug/tracing/trace
> > 
> >            <...>-2649  [003]  1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer
> >            <...>-2649  [003]  1151.950840: ring_buffer_producer_thread: End ring buffer hammer
> >            <...>-2649  [003]  1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10
> >            <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Time:     9994238 (usecs)
> >            <...>-2649  [003]  1151.950842: ring_buffer_producer_thread: Overruns: 90635670
> >            <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Read:     (reader disabled)
> >            <...>-2649  [003]  1151.950843: ring_buffer_producer_thread: Entries:  62480
> >            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Total:    90698150
> >            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Missed:   0
> >            <...>-2649  [003]  1151.950844: ring_buffer_producer_thread: Hit:      90698150
> >            <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075
> >            <...>-2649  [003]  1151.950845: ring_buffer_producer_thread: 110 ns per entry
> >            <...>-2649  [003]  1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs
> > 
> > It takes just a 110ns.
> 
> I dont seem to be able to get such low numbers

Well, don't run the sched_switch tracer ;-)

> - but even 
> ~250 instructions are pretty expensive.


That bench mark only tested the ring buffer by allocating its own ring
buffer. It did not use the ftrace infrastructure. I tried to do the same
with perf but gave up because of the limitations of the perf ring
buffer.


> 
> > Lets now use perf to see what is happening:
> > 
> > # perf record -a -c10000 -f
> > (wait 30 secs)
> > Ctrl-C
> > 
> > # perf report
> > 
> > no symbols found in /usr/sbin/irqbalance, maybe install a debug package?
> > # Samples: 1995496
> > #
> > # Overhead          Command                                                     Shared Object  Symbol
> > # ........  ...............  ................................................................  ......
> > #
> >     23.80%      rb_producer  [kernel.kallsyms]                                                 [k] native_sched_clock
> >     14.83%      rb_producer  [kernel.kallsyms]                                                 [k] rb_reserve_next_event
> >     13.20%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_unlock_commit
> >     12.84%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_lock_reserve
> >      9.27%      rb_producer  [kernel.kallsyms]                                                 [k] rb_end_commit
> >      9.19%      rb_producer  [kernel.kallsyms]                                                 [k] __rb_reserve_next
> >      5.24%      rb_producer  [kernel.kallsyms]                                                 [k] trace_clock_local
> >      3.11%      rb_producer  [kernel.kallsyms]                                                 [k] ring_buffer_event_data
> >      2.88%      rb_producer  [ring_buffer_benchmark]                                           [k] ring_buffer_benchmark_init
> >      2.53%      rb_producer  [kernel.kallsyms]                                                 [k] trace_recursive_unlock
> >      0.39%      rb_producer  [kernel.kallsyms]                                                 [k] do_gettimeofday
> >      0.34%      rb_producer  [kernel.kallsyms]                                                 [k] read_tsc
> >      0.18%      rb_producer  [kernel.kallsyms]                                                 [k] getnstimeofday
> >      0.14%          swapper  [kernel.kallsyms]                                                 [k] mwait_idle
> > 
> > 
> > Now that the sched_clock is %23.8 percent, it doesn't 
> > look as bad.
> > 
> > rb_reserve_next_event is 14.83, 
> > ring_buffer_unlock_commit is 13.20, 
> > ring_buffer_lock_reserve is 12.84, rb_end_commit is 
> > 9.27, __rb_reserve_next is 8.18, trace_clock_local is 
> > 5.24 (which only disables preemption and calls 
> > sched_clock), ring_buffer_event_data is 3.11 which only 
> > returns a pointer on a structure (once a event).
> > 
> > ring_buffer_benchmark_init is the benchmark code itself, 
> > and trace_recursive_unlock is the code you asked me to 
> > add to check for cases where the tracer could have 
> > recursed itself.
> > 
> > 14.83 + 13.20 + 9.27 + 8.18 + 5.24 + 3.11 = 53.83%
> > 
> > Considering that the system is basically idle except for 
> > this benchmark, that makes sense to have it that high. 
> > We need to compare it. The sched_clock() takes up 1/3 of 
> > the entire trace event.
> 
> Should be better on my box, which has a usable TSC and my 
> sched_clock() is in essence an RDTSC.

Hmm, this box does have an unstable TSC, but doesn't sched_clock still
do rdtsc?

> 
> 
> > >  - It has grown a lot of slack. It's complex and hard to
> > >    read.
> > 
> > I admit, that it got over designed, and I'm all for 
> > replacing it. We were all involved in the initial 
> > design, [...]
> 
> Yes, by implication i definitely suck too, no argument 
> about that :-)
> 
> > [...] and I posted several versions of RFC when I 
> > created it. But I admit, it became complex, and now that 
> > we know what we want, we can take a step back and 
> > redesign a new one from scratch.
> > 
> > But until we do so, I'm continuing to support the 
> > current one, and adding small updates when needed.
> > 
> > > 
> > >  - Over the last year or so the majority of bleeding-edge
> > >    tracing developers have gradually migrated over to perf 
> > >    and 'perf trace' / 'perf probe' in particular. As far 
> > >    as the past two merge windows go they are 
> > >    out-developing the old ftrace UIs by a ratio of 4:1.
> > 
> > Of course a new project will be in a high state of flux. 
> > I'm quite satisfied with the interface with ftrace and 
> > did not feel that it needed changing.
> > 
> > > 
> > >    So this angle is becoming a prime thing to improve and
> > >    users and developers are hurting from the ftrace/perf
> > >    duality.
> > 
> > I agree.
> > 
> > > 
> > >  - [ While it's still a long way off, if this trend continues
> > >      we eventually might even be able to get rid of the 
> > >      /debug/tracing/ temporary debug API and get rid of 
> > >      the ugly in-kernel pretty-printing bits. This is 
> > >      good: it may make Andrew very happy for a change ;-)
> > 
> > So how do you plan on getting rid of the in-kernel 
> > pretty-printing bits and keeping the trace_dump_on_oops?  
> > There's a strong dependency between those two.
> 
> One approach would be to add a library that can turn the 
> specification into a pretty-print output - say in 
> tools/perf/lib/printlib/ (or lib/printlib/), and co-build 
> it both in perf and in the kernel.

Not sure how this would work. But wouldn't this be a nightmare for
package developers? What about other tools that use the perf interface.
You allow them right?

An ideal world, the kernel should have an awesome tracing infrastructure
that several tools could use.

> 
> That way we get most of the pretty-printing functionality 
> at oops (or critical event) time, but have it in a 
> manageable form.
> 
> > >      The main detail here to be careful of is that lots of
> > >      people are fond of the simplicity of the 
> > >      /debug/tracing/ debug UI, so when we replace it we 
> > >      want to do it by keeping that simple workflow (or 
> > >      best by making it even simpler). I have a few ideas 
> > >      how to do this.
> > 
> > Note, I still would not want to remove the /debug 
> > interface. It is a debug interface anyway. Having both a 
> > syscall access the functionality of the code and having 
> > a debug interface access it very trivial. This could 
> > also be a way we can debug the tool too.
> > 
> > We can officially support only the tool side, but why 
> > throw away the baby with the bath water?
> 
> As long as we declare it to go away the moment we have 
> something equivalent or better on the 'perf ftrace' / 
> 'ftrace' tool side, sure.
> 
> I worry about what Andrew mentioned a few months (years?) 
> ago: unintentional ABI tool bindings to /debug/tracing, 
> just due to us offering it.

Well, it is in the /debug fs. Depending on something called "debug" is
always a risk.

> 
> We want to gravitate towards offering one superb solution 
> for all the tracing goodies we've built and are going to 
> build.

Right, and lets make it so any utility could bind to it and perform
well.


> 
> 
> > >      There's also the detail that in some cases we want to
> > >      print events in the kernel in a human readable way: 
> > >      for example EDAC/MCE and other critical events,
> > >      trace-on-oops, etc. This too can be solved. ]
> > 
> > But if the code for trace-on-oops is there, then the 
> > pretty print code is also there.
> 
> Not if we share it with the user-space tool (in a generic 
> way) and drive it from the user-space side, not the kernel 
> side.
> 
> The executable instructions will still be there in the 
> kernel, for those relatively rare cases that need it (lets 
> face it, the usage proportion between ordinary tracing and 
> dump-on-oops is probably 1:100 or worse).


But there should be a system wide tracing, that would most likely be
used within the kernel. This would require a global buffer and a way to
write to (and read from) in the kernel.


> 
> > > Regarding performance and complexity, which is our 
> > > main worry atm, fortunately there's work going on in 
> > > that direction - please see PeterZ's recent string of 
> > > patches on lkml:
> > > 
> > >   4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
> > >   a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
> > >   ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
> > >   fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t
> > >   6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg()
> > 
> > Yes, I saw these, and it helps a little, but still not everything is
> > there.
> 
> Most definitely. We still suck :-)
> 
> > > And it may sound harsh but at this stage i'm 
> > > personally not at all interested in big design talk.
> > 
> > Rewriting ftrace on top of perf is a big design change.
> 
> Yeah, and a pretty rewarding one.
> 
> > > This isnt rocket science, we have developers and users 
> > > and we know what they are doing and we know what we 
> > > need to do: we need to improve our crap and we need to 
> > > reduce complexity. Less is more.
> > > 
> > > So i'd like to see iterative, useful action first, and 
> > > i am somewhat sceptical about yet another grand 
> > > tracing design trying to match 100 requirements.
> > 
> > Here's a few things that I find required:
> > 
> > The ability to use splice. You keep telling me this is 
> > not important, but splice is magnitudes faster than 
> > mmaping into userspace and then copying into a file. 
> > [...]
> 
> That claim i find hard to believe, have you got the 
> numbers for that?

Well, it's hard to compare since perf does only mmap, and trace-cmd does
only splice. But I can compare the two, although the way things are done
differently, which affect the results.

In my email I showed how I tested both perf and ftrace, by using the one
common denominator of both: TRACE_EVENT(). I modified the
ring_buffer_benchmark to test the time it takes to call a simple event.
I showed what I did in the previous email. This takes a look at how both
ftrace and perf handles events.

I just tested both perf and trace-cmd against 2.6.34 and latest
tip/perf/core. Here's the results I have:

On vanilla 2.6.34:

[root@ixf9 linux-trace.git]# perf record -c1 -e rb_bench:rb_benchmark  -a sleep 60
[ perf record: Woken up 62 times to write data ]
[ perf record: Captured and wrote 1031.015 MB perf.data (~45045722 samples) ]

>From console:
-------------
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9603958 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      11041600
Entries per millisec: 1149
870 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9973544 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      11518850
Entries per millisec: 1155
865 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9973545 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      10900500
Entries per millisec: 1093
914 ns per entry
Sleeping for 10 secs

[root@ixf9 linux-trace.git]# perf trace | grep rb_producer | wc -l
16495123


[root@ixf9 linux-trace.git]# trace-cmd record -b 512 -s10 -e rb_bench:rb_benchmark sleep 60
disable all
enable rb_bench:rb_benchmark
path = /debug/tracing/events/rb_bench/rb_benchmark/enable
offset=163000
offset=163000
offset=47a5e000
offset=47a5e000
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0

CPU: 1
entries: 0
overrun: 102010404
commit overrun: 0

CPU: 2
entries: 0
overrun: 0
commit overrun: 0

CPU: 3
entries: 0
overrun: 0
commit overrun: 0



>From console:
-------------
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9162808 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      49703550
Entries per millisec: 5424
184 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9973570 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      54054950
Entries per millisec: 5420
184 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9973501 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      53908300
Entries per millisec: 5405
185 ns per entry
Sleeping for 10 secs

[root@ixf9 linux-trace.git]# trace-cmd report | grep rb_producer | wc -l
59795170


[root@ixf9 linux-trace.git]# ls -l perf.data trace.dat 
-rw-------  1 root root 1082104985 2010-05-24 12:05 perf.data
-rw-r--r--. 1 root root 1202053120 2010-05-24 12:10 trace.dat


On latest tip/perf/core:

commit d67f088e084755bdceb4f15bc6e05e309db1eea7
Author: Arnaldo Carvalho de Melo <acme@redhat.com>
Date:   Sun May 23 22:36:51 2010 -0300

    perf report: Support multiple events on the TUI


[root@ixf9 linux-trace.git]# perf record -c1 -e rb_bench:rb_benchmark  -a sleep 60
[ perf record: Woken up 394 times to write data ]
[ perf record: Captured and wrote 1093.414 MB perf.data (~47771989 samples) ]


>From console:
-------------
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9514214 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      19149150
Entries per millisec: 2012
497 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972824 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      20701150
Entries per millisec: 2075
481 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972802 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      20365250
Entries per millisec: 2042
489 ns per entry
Sleeping for 10 secs

[root@ixf9 linux-trace.git]# perf trace | grep rb_producer | wc -l
17694461


[root@ixf9 linux-trace.git]# trace-cmd record -b 512 -s10 -e rb_bench:rb_benchmark sleep 60
disable all
enable rb_bench:rb_benchmark
path = /debug/tracing/events/rb_bench/rb_benchmark/enable
offset=15d000
offset=15d000
offset=343b2000
offset=343b2000
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0

CPU: 1
entries: 0
overrun: 63560076
commit overrun: 0

CPU: 2
entries: 0
overrun: 0
commit overrun: 0

CPU: 3
entries: 0
overrun: 31949460
commit overrun: 0


>From console:
-------------
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9461863 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      50454750
Entries per millisec: 5332
187 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972793 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      54203850
Entries per millisec: 5435
183 ns per entry
Sleeping for 10 secs
wait to start
starting
Starting ring buffer hammer
End ring buffer hammer
Running Producer at SCHED_FIFO 10
Time:     9972746 (usecs)
Overruns: 0
Read:     (reader disabled)
Entries:  0
Total:    0
Missed:   0
Hit:      53295900
Entries per millisec: 5344
187 ns per entry
Sleeping for 10 secs


[root@ixf9 linux-trace.git]# trace-cmd report | grep rb_producer | wc -l
65393428


[root@ixf9 linux-trace.git]# ls -l perf.data trace.dat 
-rw-------  1 root root 1147509728 2010-05-24 12:56 perf.data
-rw-r--r--. 1 root root 1314439168 2010-05-24 13:01 trace.dat



In vanilla 2.6.34 we have:

perf:  865ns - 914ns per event
ftrace: 184ns - 185ns per event

events recorded to file:

perf: 16,495,123 events
trace-cmd: 59,795,170 events

file size:

perf: 1,082,104,985 bytes
ftrace: 1,202,053,120 bytes

which gives us:

perf: ~65 bytes per event
ftrace: ~20 bytes per event



That was vanilla, now lets look at the results with all the latest
optimizations. This has improved things in perf:


In lastest tip/perf/core we have:

perf:  497ns - 489ns per event
ftrace: 187ns per event

events recorded to file:

perf: 17,694,461 events
trace-cmd: 65,393,428 events

file size:

perf: 1,147,509,728 bytes
trace-cmd: 1,314,439,168 bytes

which gives us:

perf: ~64 bytes per event
ftrace: ~20 bytes per event


Looks like Peter's lastest optimizations have cut the time to do a perf
trace event almost in half. But it is still 2 1/2 times longer to do a
perf event than a ftrace event.


But I noticed that even though perf is recording events faster, the
number of events that were saved to file hardly changed at all. Which
means the reader is still a bottle neck for both cases.

I use to think that ftrace had bloated events, but perf triples the size
of each event. Perhaps this is also part of the reason perf events are
so slow.


> 
> It's faster for sure if implemented correctly, but i'd be 
> surprised if it made more than 20% of a difference to the 
> total tracing overhead.

It's hard to compare since there's so many variables. trace-cmd recorded
~13% more file than perf. But unless we can compare straight mmap vs
splice, I still take this with a grain of salt.


> 
> > [...] If the splice removes the extra copy as we are 
> > working on, it will get even faster. It also makes 
> > writing over the network much easier. Which trace-cmd 
> > does, and it took me half a day to make it do so.
> > 
> > We also need a way to easily read the buffer from the 
> > kernel. This is for trace-on-oops.
> > 
> > Overwrite mode must also be supported. I know that Peter 
> > thinks its silly to do this and have it write to a file, 
> > but I find it helpful to do so. But fine, I can live 
> > without overwrite to a file, if no one else cares. But 
> > overwrite mode itself must work for trace-on-oops.
> 
> We can do overwrite mode, it doesnt look like a central 
> issue.
> 
> I'm sure we'll find a dozen other things missing as well. 
> We have to start looking.
> 
> > We need a way to start and stop tracing from the kernel 
> > and userspace. Again, for trace-on-oops. When a bug is 
> > hit, we need a way to stop it. We also must have a way 
> > to easily start it again from userspace.
> > 
> > A simple on/off switch. This is something that you and 
> > Thomas hammered into me which was the creation of 
> > tracing_on.
> > 
> > The above can be done, but lets not just hack the 
> > changes in. Lets really look at a proper design.
> 
> This kind of capability would be met by persistent events 
> attached to struct user.
> 
> We have to start looking and doing.
> 
> > > Steve, Mathieu, if you are interested please help out 
> > > Peter with the performance and simplification work. 
> > > The last thing we need is yet another 
> > > replace-everything event.
> > 
> > But that is what you are asking to happen to ftrace. 
> > "replace-everything".
> 
> We've got two ring-buffer implementations right now, one 
> has to be replaced.

And you seem to be asking to replace the more efficient one.

I admit that my ring buffer can be replaced due to its complexity, but
perf needs a major face lift in its buffering scheme to keep up as a
competitive tracer.

> 
> It doesnt mean we should disrupt _two_ implementations and 
> put in a third one, unless there are strong technical 
> reasons for doing so.

But we have to disrupt perf anyway to make it come close to what ftrace
does today.

Since neither of the buffers seem to satisfy the other's requirements
then perhaps that's the best thing to do. Replace both of them.

I have yet to see that perf's ring buffer is any better than ftrace's.
On the contrary, the performance looks worse. Each time we add a
feature, it gets more complex, and the final result of the perf ring
buffer may end up being slower and more complex than the ftrace one.

I can't see hacking in changes to the perf ring buffer trying to make it
work. Perhaps the best thing to do is redesign it, start from scratch.
I'm willing to let the ftrace ring buffer be redesigned, why can't we do
that to perf as well, and merge the two utilities as we do it?

Lets start with a good design instead of just patching the current
design trying to make it work. That is what the O(1) scheduler suffered
from. Too many, "let's try this" hacks to fix inefficiencies. Replacing
it with a new design was the proper approach.


> 
> > > If we really want to create a new ring-buffer 
> > > abstraction i'd suggest we start with Peter's, it has 
> > > a quite sane design and stayed simple and flexible - 
> > > if then it could be factored out a bit.
> > 
> > I'm sorry, but it is very coupled with perf. [...]
> 
> Then decouple it.

I don't understand the coupling between perf and its ring buffer enough
to do it. I just tried to make it work with the ring_buffer_benchmark
and it became much more involved than I would like. It would require
Peter and/or Frederic to do this work.


> 
> > [...] It is simple and flexible with what perf does 
> > today, but not what ftrace does today, or to make perf 
> > do what ftrace does.
> 
> Why? Neither overwrite mode, nor splice support, nor 
> persistent buffering looks hard at all.

I tried to make it work by itself in the ring buffer benchmark. But it I
gave up because it is so coupled with perf's user interface.

1) I had no idea how to create a separate buffer. 

2) I had no idea how to read the buffer from the kernel.

I'm sure I can figure these out if I had the time, but I don't.


> 
> > > Here are more bits of what i see as the 'action' going 
> > > forward, in no particular order:
> > > 
> > > 1) Push the /debug/tracing/events/ event description
> > >    into sysfs, as per this thread on lkml:
> > > 
> > >      [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs
> > > 
> > >      http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e
> > > 
> > >    I.e. one more step towards integrating ftrace into perf.
> > 
> > We could do something like this, but it is not as 
> > trivial as it may seem.
> 
> Never is.

Right, and moving perf to splice/overwrite/etc, is not going to be as
easy as you believe it will.

> 
> > > 2) Use 1) to unify the perf events and the ftrace
> > >    ring-buffer. This, as things are standing is
> > >    best done by factoring out Peter's ring-buffer
> > >    in kernel/perf_event.c. It's properly abstracted
> > >    and it _far_ simpler than kernel/tracing/ring_buffer.c,
> > >    which has become a monstrosity.
> > 
> > Sure, my code has become quite complex, I'll admit that. 
> > I'm for refactoring. I really don't care what code is 
> > used as long as I still have the requirements that make 
> > ftrace useful for me and several others.
> > 
> > And yes, any ring buffer needs to be factored out to a 
> > defined interface. The current ring buffer in perf is 
> > very coupled with the perf design. The event code 
> > written to perf in include/trace/ftrace.h is still quite 
> > a hack.
> 
> Then lets make it less of a hack.

Agreed, lets redesign it :-)

> 
> > >    (but i'm open to other simplifications as well)
> > > 
> > > 3) Add the function-tracer and function-graph tracer
> > >    as an event and integrate it into perf.
> > 
> > Here's something that I think perf needs to do. It needs 
> > to become a bit more modulized, and not absorb 
> > everything into perf instead of having things use just a 
> > part of perf. Like the breakpoint code requires full 
> > perf to be done, and x86 has PERF_EVENTS permanently on. 
> > Which is really a PITA.
> 
> the Kconfig rule should be fixed so that PERF_EVENTS can 
> be disabled on EMBEDDED.
> 
> On non-embedded, why is it a PITA?

It was a PITA for me since I was not able to test !PERF_EVENTS generic
changes on x86.

> 
> > As I designed ftrace from day one, I tried to make it 
> > modular. Thus, the function trace 
> > (kernel/trace/ftrace.c) has always been a separate 
> > entity from ftrace itself. Heck, LTTng uses it. It would 
> > be actually trivial to implement the function tracer in 
> > perf today. Just register a function you want called at 
> > every function event and start running.
> 
> Here i disagree: the last thing Linux needs is various 
> fragmented pieces of instrumentation.
> 
> We try hard with perf to unify. It takes time but has to 
> happen.

I'm all for making it unified, it is that I do not want to have to learn
every aspect of perf just to change part of it. Or perhaps there's
another subsystem where it could use a part of perf nicely. This is
difficult to do with its design today.

> 
> > I need to abstract out the "what function to call bit" 
> > but that's about it.
> > 
> > > 
> > >    This will live-test the efficiency of the unification
> > >    and brings over the last big ftrace plugin to perf.
> > > 
> > > 4) Gradually convert/port/migrate all the remaining 
> > >    plugins over as well. We need to do this very gently 
> > >    because there are users - but stop piling new 
> > >    functionality on to the old ftrace side. This usually 
> > >    involves:
> > 
> > I have not been piling functionality into the ftrace 
> > side, as you stated at the beginning of the email, that 
> > the development has been quite slow. Yes, I've been 
> > still working on things, but I would say it is a far cry 
> > from "piling".
> 
> Sorry, that was a poor choice of a word. We did manage to 
> pile quite a bit of crap into the tracing hotpath though.

I was looking at the hot path of the ring buffer, and I think it was a
major design mistake in embedding the timestamp into the ring buffer
level. That should have been done in a higher level. By removing that
requirement, it would have made the ring buffer code hot path a lot
nicer.

We all failed in that design mistake. (I'm as guilty as anyone)

> 
> > I did add some simple features, like the "lost events" 
> > to the ring buffer, but that code can be looked at for 
> > ideas for the future. It does not make merging any more 
> > difficult than it already is.
> > 
> > Fine, I'll stop some of the code I was going to work on 
> > (multiple buffers for ftrace, event triggers, 
> > configuring events to stop the tracer (like for 
> > different kinds of BUG() and WARN_ON()), etc).
> 
> 
> I think those are excellent features that seem to fit 
> rather well into the persisent buffering concept, what do 
> you think?
> 
> Via the 'ftrace' tool we could as a first step extend 
> tracing sessions to per user. We could also bring the 
> start/stop/dont-worry-about-where-the-trace-is workflow to 
> perf trace that way.
> 
> Hm?
> 
> > >     - Conversion of an explicit tracing callback to
> > >       TRACE_EVENT (for example in the case of mmiotrace),
> > >       while keeping all tool functionality.
> > 
> > Not sure what this is.
> 
> We have a lot of explicit ad-hoc callbacks, hooks, 
> notifiers, function pointers in the kernel, which feed 
> some sort of event logging mechanism. These should be 
> unified while keeping their current functionality - but 
> exposing the callback as an event source as well. This 
> enriches total pool of events.
> 
> For example the mmiotrace tracepoints.

I'm not against any of this. Just need to take some time in finding a
proper design to this.

One thing we have to consider is looking at the use of the tracing
infrastructure in the kernel outside of kernel developers. At the Linux
Collaboration Summit, several people (non kernel developers) wanted to
have some amount of tracing happening in a production system. To have
this happen, the overhead must be at least configured to a minimum.

Especially if we want a real competitor to DTrace.

-- Steve

> 
> > >     - Migrate any 'special' ftrace feature to perf 
> > >       capabilities so that it's available via the 
> > >       syscall interface as well. (for example 
> > >       'latency maximum tracking' is something that we 
> > >       probably want to do with kernel-side help - we 
> > >       probably dont want to implement it via tracing 
> > >       everything all the time and finding the maximum 
> > >       based on terabytes of data.)
> > > 
> > >    (And there are other complications here too, but you 
> > >     get the idea.)
> > 
> > Yes, the latency tracers are special, and I treat them 
> > special in ftrace. They can be redesigned to keep their 
> > current functionality. That needed to be done anyway.
> 
> Yeah.
> 
> > > All in one, i think we can reuse more than 50% of all 
> > > current ftrace code (possibly up to 70-80%) - and we 
> > > are already reusing bits of it.
> > 
> > Ingo, as we have had this conversation, I don't know if 
> > you realized that I agreed to merge ftrace to perf. 
> > There's just some functionality about ftrace I really 
> > want to keep, which does include accessing via /debugfs, 
> > pretty-print (which is needed by trace-on-oops) and 
> > splice.
> 
> Ok, good :-)
> 
> Thanks,
> 
> 	Ingo




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFD] Future tracing/instrumentation directions
  2010-05-20  9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar
  2010-05-20 10:07   ` Frederic Weisbecker
  2010-05-20 14:38   ` Steven Rostedt
@ 2010-05-21  9:24   ` Li Zefan
  2 siblings, 0 replies; 24+ messages in thread
From: Li Zefan @ 2010-05-21  9:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Mathieu Desnoyers, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu,
	Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras,
	Hitoshi Mitake, Robert Richter

Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> More than a year and a half ago (September 2008), at 
>> Linux Plumbers, we had a meeting with several kernel 
>> developers to come up with a unified ring buffer. A 
>> generic ring buffer in the kernel that any subsystem 
>> could use. After coming up with a set of requirements, I 
>> worked on implementing it. One of the requirements was 
>> to start off simple and work to become a more complete 
>> buffering system.
>>
>> [...]
> 
> The thing is, in tracing land and more broadly in 
> instrumentation land we have _much_ more earthly problems 
> these days:
> 
>  - Lets face it, performance of the ring-buffer sucks, in 
>    a big way. I've recently benchmarked it and it takes 
>    hundreds of instructions to trace a single event. 
>    Puh-lease ...
> 

We ran some benchmarks with all the trace events enabled except
lock and kmem events, and the results showed the overhead was
quite small and acceptable.

But that was 2.6.31, we didn't benchmark newer kernels which have
more tracepoints.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2010-05-24 20:13 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt
2010-05-19 18:10 ` Andi Kleen
2010-05-19 18:44   ` Steven Rostedt
2010-05-19 19:25     ` Andi Kleen
2010-05-19 19:38       ` Steven Rostedt
2010-05-19 18:47   ` Mathieu Desnoyers
2010-05-20  6:41     ` Andi Kleen
2010-05-20 13:48       ` Mathieu Desnoyers
2010-05-19 19:05 ` Mathieu Desnoyers
2010-05-20  9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar
2010-05-20 10:07   ` Frederic Weisbecker
2010-05-20 11:07     ` Thomas Gleixner
2010-05-20 11:42       ` Ingo Molnar
2010-05-20 11:48         ` Ingo Molnar
2010-05-20 12:15       ` Frederic Weisbecker
2010-05-20 12:19       ` Theodore Tso
2010-05-20 11:36     ` Ingo Molnar
2010-05-20 13:06       ` Frederic Weisbecker
2010-05-20 14:38   ` Steven Rostedt
2010-05-20 15:04     ` Steven Rostedt
2010-05-20 20:12     ` Steven Rostedt
2010-05-21 17:49     ` Ingo Molnar
2010-05-24 20:13       ` Steven Rostedt
2010-05-21  9:24   ` Li Zefan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).