* [RFC] Unified Ring Buffer (Next Generation)
@ 2010-05-19 17:51 Steven Rostedt
2010-05-19 18:10 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-05-19 17:51 UTC (permalink / raw)
To: LKML
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
KOSAKI Motohiro, Andi Kleen
More than a year and a half ago (September 2008), at Linux Plumbers, we
had a meeting with several kernel developers to come up with a unified
ring buffer. A generic ring buffer in the kernel that any subsystem
could use. After coming up with a set of requirements, I worked on
implementing it. One of the requirements was to start off simple and
work to become a more complete buffering system.
I posted a set of patches to LKML and several developers (including
Linus) got involved in the design of the ring buffer:
Here's the thread that started the development:
http://lkml.org/lkml/2008/9/19/275
And the ring buffer we ended with here:
http://lkml.org/lkml/2008/9/29/155
And a nice article in LWN about it as well:
http://lwn.net/Articles/300992/
This ring buffer replaced ftrace's ring buffer, as well as oprofile's
ring buffer, and other utilities in the kernel moved over to interacting
with ftrace directly. Although, the ring buffer was a separate entity
from ftrace and it was not required to use ftrace to use the ring
buffer.
The design of the ring buffer was focused more towards in kernel users
and for use with the splice() system call. It did not (and still does
not) support a mmap interface.
In December of 2008 a new utility was created called "perf". At the time
it was a performance counter. In September of 2009, it was converted
over into performance events.
At the time, the unified ring buffer was still not lockless, so it could
lose events in NMI context.
Peter Zijlstra, took a look at the unified ring buffer and found that it
did not suite his needs. He needed a reliable ring buffer in NMI context
as well as something that can mmap to userspace.
At that time, I was working on other aspects of the kernel and did not
have the time to help him come up with something that he could use.
Having to get work done, Peter implemented his own ring buffer for use
with perf.
I do not blame Peter for this, since any developer (including myself)
would have done the same.
Unfortunately, we are now back with more than one ring buffer in the
kernel. What's worse, neither of them can perform all the features
needed. This is putting a bit of stress on the users of these tools, not
to mention the stress on the developers as well.
In June of 2009, I finally made the ring buffer lockless:
http://lkml.org/lkml/2009/6/10/381
Again, LWN wrote up a nice article about this as well:
http://lwn.net/Articles/340400/
But it was too late, and still did not support mmap. Perf was already
dependent on its own ring buffer, and now we are back to where we were
before the unified ring buffer existed.
This email is about finding a solution to the problem. If we can once
again create a generic ring buffer that handles all requirements, then
we can also merge the functionality of ftrace into perf, and lower the
duplication of code within the kernel.
This time around, I'm asking Mathieu Desnoyers to come to the plate, and
see if he can handle the task.
I'm hoping that this email will start a thread that gets everyone into
agreement and produces something that will make everyone happy.
-- Steve
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt @ 2010-05-19 18:10 ` Andi Kleen 2010-05-19 18:44 ` Steven Rostedt 2010-05-19 18:47 ` Mathieu Desnoyers 2010-05-19 19:05 ` Mathieu Desnoyers 2010-05-20 9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar 2 siblings, 2 replies; 24+ messages in thread From: Andi Kleen @ 2010-05-19 18:10 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen On Wed, May 19, 2010 at 01:51:54PM -0400, Steven Rostedt wrote: Hi Steven, > More than a year and a half ago (September 2008), at Linux Plumbers, we > had a meeting with several kernel developers to come up with a unified > ring buffer. A generic ring buffer in the kernel that any subsystem > could use. After coming up with a set of requirements, I worked on If we take a step back. Why do you want a single ring buffer for everyone? A lot more low profile subsystems subsystems simply use kfifo (which is also actively developed by Stefanie). In fact there are far more users of it than of your ring buffer. And it's really quite simple and easy to use. And it works fine for them. I don't think it's that great a goal to have a single ring buffer for all possible ring buffer needs. After all the requirements are quite different. Some want a simple ring buffer with minimal overhead and simple interface, others need a mmaped one or have other special requirements. One size doesn't fit all. It's also not that we're talking about gigantic amounts of code in all cases where there is a pressing need to unify. If perf's current ring buffer works for it why not keep using it? One problem I always had with your version was that it's quite bloated frankly, especially in terms of code size, but also in its data structures and in the interface complexity. For debugging kernels etc. with tracing that's not that big an issue, but I think it's a problem for "non debugging" use. After all Linux still has the goal to be at least configurable as a low footprint operating system. Part of the reason for its big code size seems to be that it tries to support everyone's requirements, which unsurprisingly leads to some bloat both in implementation and interface. Also to be honest it's so clever now that at least I have a hard time understanding it, and I personally prefer code that I can understand over too clever code. After all if there is a bug in there and you need to be more clever than the programmer to debug it, how would that be done? Perhaps a better goal would be to have a smaller simpler more maintainable buffer for ftrace and let the other users their own? Just my 0.05 cent. -Andi ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 18:10 ` Andi Kleen @ 2010-05-19 18:44 ` Steven Rostedt 2010-05-19 19:25 ` Andi Kleen 2010-05-19 18:47 ` Mathieu Desnoyers 1 sibling, 1 reply; 24+ messages in thread From: Steven Rostedt @ 2010-05-19 18:44 UTC (permalink / raw) To: Andi Kleen Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro On Wed, 2010-05-19 at 20:10 +0200, Andi Kleen wrote: > On Wed, May 19, 2010 at 01:51:54PM -0400, Steven Rostedt wrote: > > Hi Steven, > > > More than a year and a half ago (September 2008), at Linux Plumbers, we > > had a meeting with several kernel developers to come up with a unified > > ring buffer. A generic ring buffer in the kernel that any subsystem > > could use. After coming up with a set of requirements, I worked on > > If we take a step back. > > Why do you want a single ring buffer for everyone? Actually, I would like to have one ring buffer interface, not one ring buffer per say. > > A lot more low profile subsystems subsystems simply use kfifo > (which is also actively developed by Stefanie). In fact there > are far more users of it than of your ring buffer. And it's > really quite simple and easy to use. And it works fine for them. > > I don't think it's that great a goal to have a single ring buffer > for all possible ring buffer needs. After all the requirements > are quite different. > > Some want a simple ring buffer with minimal overhead > and simple interface, others need a mmaped one or have other special > requirements. One size doesn't fit all. Exactly! One size does not fit all. I've discussed this off list with Maitheu, and we agree that different requirements call for different types of ring buffers. Depending on mmapping, splice, single writer (no need for per CPU buffers if there's only ever one writer), multiple writers, etc. What would be nice to have is a single interface where the same event can be recorded using the same interface. > > It's also not that we're talking about gigantic amounts of code > in all cases where there is a pressing need to unify. > > If perf's current ring buffer works for it why not keep using it? Because it does not work for the functionality of ftrace. And there's a driving need to try to unify the two. > > One problem I always had with your version was that it's quite > bloated frankly, especially in terms of code size, but > also in its data structures and in the interface complexity. These are valid concerns, and even Mathieu has mentioned this. One way is to develop a layered approach. Where the lowest layer is very simple and small, but without much features. The higher layers have more features but are built on top of the lower ones. > > For debugging kernels etc. with tracing that's not that big an issue, but > I think it's a problem for "non debugging" use. After all Linux > still has the goal to be at least configurable as a low footprint operating > system. Agreed. > > Part of the reason for its big code size seems to be that > it tries to support everyone's requirements, which unsurprisingly > leads to some bloat both in implementation and interface. OK, as I mentioned, what about a layered approach? Where one level builds off of another? This may actually solve some of the bloat. That is, try to keep the more complex users using the same code, but have that on top of code that is simple for the users that have a restrictive environment. > > Also to be honest it's so clever now that at least I have > a hard time understanding it, and I personally prefer code > that I can understand over too clever code. After all if there > is a bug in there and you need to be more clever than the programmer > to debug it, how would that be done? Thanks for the complement ;-) Yes, I've heard this complaint, and it is also why I'm asking Mathieu to do the work and I'm staying away (except for helping with review and requirements). I like the "clever" solutions. > > Perhaps a better goal would be to have a smaller simpler more > maintainable buffer for ftrace and let the other users their own? Perhaps, but the goal is actually to handle all the features of perf and ftrace in a generic fashion. You also bring up a point that I try very hard to get across. When people think of a ring buffer, they think of the ones that they created in CS101, not realizing that when you are dealing with production systems, handling the requirements makes the buffering much more complex. Perhaps the answer is we are fine with more than one ring buffer. I'm OK with that too. But Linus did push this agenda back in the 2008 Kernel Summit. I'm just continuing it. > > Just my 0.05 cent. Thanks, -- Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 18:44 ` Steven Rostedt @ 2010-05-19 19:25 ` Andi Kleen 2010-05-19 19:38 ` Steven Rostedt 0 siblings, 1 reply; 24+ messages in thread From: Andi Kleen @ 2010-05-19 19:25 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro On Wed, May 19, 2010 at 02:44:18PM -0400, Steven Rostedt wrote: > > > More than a year and a half ago (September 2008), at Linux Plumbers, we > > > had a meeting with several kernel developers to come up with a unified > > > ring buffer. A generic ring buffer in the kernel that any subsystem > > > could use. After coming up with a set of requirements, I worked on > > > > If we take a step back. > > > > Why do you want a single ring buffer for everyone? > > Actually, I would like to have one ring buffer interface, not one ring > buffer per say. Why does every user in the kernel need the same interface? A user with simple requirements would prefer a simple and fast interface, an user with complex requirement a complex requirements a complex one. But they're not the same. Or do you just want the same interface between ftrace and perf? > What would be nice to have is a single interface where the same event > can be recorded using the same interface. I'm not fully sure what advantage that would have. Would you like to exchange different ring buffers for existing users frequently? > OK, as I mentioned, what about a layered approach? Where one level > builds off of another? This may actually solve some of the bloat. That > is, try to keep the more complex users using the same code, but have > that on top of code that is simple for the users that have a restrictive > environment. Some of the complexity is in the underlying implementation isn't it? e.g. single continuous buffer versus multiple ones. That cannot really be addressed by layers, unless you go to virtual call tables (which would seem a bad idea to me for such a hot path) > > > > Perhaps a better goal would be to have a smaller simpler more > > maintainable buffer for ftrace and let the other users their own? > > Perhaps, but the goal is actually to handle all the features of perf and > ftrace in a generic fashion. Ok so your goal is to only cover perf and ftrace and let all the other users be alone? > > You also bring up a point that I try very hard to get across. When > people think of a ring buffer, they think of the ones that they created > in CS101, not realizing that when you are dealing with production > systems, handling the requirements makes the buffering much more > complex. For most users the simple ones are totally adequate though. > Perhaps the answer is we are fine with more than one ring buffer. I'm OK That's what we already have in fact: kfifo, ftrace, perf, a couple of private ones all over. I think a couple of private ones could be converted to kfifo (I did some of that for my own code) But I don't really see how you can marry the simple kfifo world to the highend bells'n'whistles ftrace world. And even in the highend world there might be space for multiple specialized ones. On the other hand if the highend world had less overhead then at least some of the "higher end" kfifo users could start considering to use it. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 19:25 ` Andi Kleen @ 2010-05-19 19:38 ` Steven Rostedt 0 siblings, 0 replies; 24+ messages in thread From: Steven Rostedt @ 2010-05-19 19:38 UTC (permalink / raw) To: Andi Kleen Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro On Wed, 2010-05-19 at 21:25 +0200, Andi Kleen wrote: > > Perhaps, but the goal is actually to handle all the features of perf and > > ftrace in a generic fashion. > > Ok so your goal is to only cover perf and ftrace and let all > the other users be alone? > Yes, my goal is to unify perf and ftrace, but that's because that is what I am working most on. > > > > You also bring up a point that I try very hard to get across. When > > people think of a ring buffer, they think of the ones that they created > > in CS101, not realizing that when you are dealing with production > > systems, handling the requirements makes the buffering much more > > complex. > > For most users the simple ones are totally adequate though. > > > Perhaps the answer is we are fine with more than one ring buffer. I'm OK > > That's what we already have in fact: kfifo, ftrace, perf, a couple > of private ones all over. > > I think a couple of private ones could be converted to kfifo > (I did some of that for my own code) > > But I don't really see how you can marry the simple kfifo world > to the highend bells'n'whistles ftrace world. And even in the highend > world there might be space for multiple specialized ones. > > On the other hand if the highend world had less overhead > then at least some of the "higher end" kfifo users could start > considering to use it. If Mathieu can pull off the requirements that can handle all of ftrace and perf in a light weight manner, then great. But if the end result is still to much for these "private" users, then we should have a "ring_buffer_lite", and perhaps even base it off of a kfifo itself. Note, it would be nice to have a single location that (a) ring buffer(s) reside at. This way if a developer has a need for a ring buffer then they can have a single location to look for one. Maybe the ring buffer code should go in lib/ringbuffers/ instead of kernel/ringbuffer/. -- Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 18:10 ` Andi Kleen 2010-05-19 18:44 ` Steven Rostedt @ 2010-05-19 18:47 ` Mathieu Desnoyers 2010-05-20 6:41 ` Andi Kleen 1 sibling, 1 reply; 24+ messages in thread From: Mathieu Desnoyers @ 2010-05-19 18:47 UTC (permalink / raw) To: Andi Kleen Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro * Andi Kleen (andi@firstfloor.org) wrote: > On Wed, May 19, 2010 at 01:51:54PM -0400, Steven Rostedt wrote: > Hi Andi, > Hi Steven, > > > More than a year and a half ago (September 2008), at Linux Plumbers, we > > had a meeting with several kernel developers to come up with a unified > > ring buffer. A generic ring buffer in the kernel that any subsystem > > could use. After coming up with a set of requirements, I worked on > > If we take a step back. > > Why do you want a single ring buffer for everyone? > > A lot more low profile subsystems subsystems simply use kfifo > (which is also actively developed by Stefanie). In fact there > are far more users of it than of your ring buffer. And it's > really quite simple and easy to use. And it works fine for them. > > I don't think it's that great a goal to have a single ring buffer > for all possible ring buffer needs. After all the requirements > are quite different. One requirement seems to be shared for most tracing heavy users: it has to be blazingly fast. This is a requirement that is commonly overlooked by the "very simple" implementations. > > Some want a simple ring buffer with minimal overhead > and simple interface, others need a mmaped one or have other special > requirements. One size doesn't fit all. The plan here is to create a ring buffer that supports per-buffer instance "flags" that specify what must be supported: e.g. either splice() or mmap(), global vs per-cpu buffers, etc. I already have much of the code that can easily support all these use-cases with only minor modifications. So I don't see why we should duplicate the code. > > It's also not that we're talking about gigantic amounts of code > in all cases where there is a pressing need to unify. For efficient flavors of ring buffer, it's not that much code, that's true, but the complexity level/LOC is rather higher than for standard code. So it's good if we can focus our review efforts on a single ring buffer. > If perf's current ring buffer works for it why not keep using it? I'll let the perf developers answer this one. > One problem I always had with your version was that it's quite > bloated frankly, especially in terms of code size, but > also in its data structures and in the interface complexity. I share your concern about the level of complexity of Steven's lockless ring buffer implementation. Even though Lai and I are listed as "reviewed-by" in the ring buffer code header, I remember that neither him nor myself were particularly comfortable with the way complexity has been handled. The new implementation I propose lessens the complexity level, presents clear abstractions to deal with that complexity, and comes with a formal proof of correctness, all of which I think is really very important to give a good level of insurance that the ring buffer works as expected. > > For debugging kernels etc. with tracing that's not that big an issue, but > I think it's a problem for "non debugging" use. After all Linux > still has the goal to be at least configurable as a low footprint operating > system. My implementation, at the moment, has 50% less lines of code and is 25% smaller in object size than the current ring buffer. But all in all, I think users needing _something_ to perform system-wide tracing shout a lot louder than users who need to save a few bytes. So let's try to get something good in first, while keeping an eye on the object size, and if it happens to be too large for some users, then they can always implement a slower and less efficient ring_buffer_tiny.c if they feel like it. The main concern for the ring buffer code is more about the number of cache lines it grabs from the system when it is active, and about being blazingly fast. We must keep the number of active cache lines very low, sometimes at the expense of overall code size (e.g., having separate fast/slow paths). > > Part of the reason for its big code size seems to be that > it tries to support everyone's requirements, which unsurprisingly > leads to some bloat both in implementation and interface. At the very least Steven's code did not meet the perf requirements nor the LTTng requirements. As far as the "next gen" version is concerned, I plan to allow myself to sometimes say "no" (with proper counter-arguments of course) when people come up with ideas that will either needlessly complexify the ring buffer code or won't take into account the "special" very heavy usage nature of the ring buffer. I really believe in code review and comments. I just don't feel I must always incorporate all the modifications that are requested without proper justification. It's just the approach I have, and it seems to have led to good results so far if we take Tracepoints as an example. > > Also to be honest it's so clever now that at least I have > a hard time understanding it, and I personally prefer code > that I can understand over too clever code. After all if there > is a bug in there and you need to be more clever than the programmer > to debug it, how would that be done? I totally agree with you. This is in good part why I spent a large part of 2009 writing papers explaining my ring buffer, doing Promela models and formal proofs of correctness. I think after all that work, the abstractions I will use will be much easier to grap by anyone willing to do a bit of reading. > > Perhaps a better goal would be to have a smaller simpler more > maintainable buffer for ftrace and let the other users their own? The advantage of having both ftrace and perf as ring buffer users is that they can both give me feedback in the development, and this increases chances that the API will be generic, clean and well documented. Thanks, Mathieu > > Just my 0.05 cent. > > -Andi > -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 18:47 ` Mathieu Desnoyers @ 2010-05-20 6:41 ` Andi Kleen 2010-05-20 13:48 ` Mathieu Desnoyers 0 siblings, 1 reply; 24+ messages in thread From: Andi Kleen @ 2010-05-20 6:41 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro > The plan here is to create a ring buffer that supports per-buffer instance > "flags" that specify what must be supported: e.g. either splice() or mmap(), > global vs per-cpu buffers, etc. And you plan to test all those flags in the hot path? > The new implementation I propose lessens the complexity level, presents clear > abstractions to deal with that complexity, and comes with a formal proof of > correctness, all of which I think is really very important to give a good level > of insurance that the ring buffer works as expected. Any simplifcation for the ftrace buffer would be a good thing IMHO. > > For debugging kernels etc. with tracing that's not that big an issue, but > > I think it's a problem for "non debugging" use. After all Linux > > still has the goal to be at least configurable as a low footprint operating > > system. > > My implementation, at the moment, has 50% less lines of code and is 25% smaller > in object size than the current ring buffer. Good. > > But all in all, I think users needing _something_ to perform system-wide tracing > shout a lot louder than users who need to save a few bytes. So let's try to get > something good in first, while keeping an eye on the object size, and if it > happens to be too large for some users, then they can always implement a > slower and less efficient ring_buffer_tiny.c if they feel like it. They don't need to, they already have kfifo. > I totally agree with you. This is in good part why I spent a large part of 2009 > writing papers explaining my ring buffer, doing Promela models and formal proofs > of correctness. I think after all that work, the abstractions I will use will be > much easier to grap by anyone willing to do a bit of reading. Writing papers is not a replacement for simple maintainable code. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-20 6:41 ` Andi Kleen @ 2010-05-20 13:48 ` Mathieu Desnoyers 0 siblings, 0 replies; 24+ messages in thread From: Mathieu Desnoyers @ 2010-05-20 13:48 UTC (permalink / raw) To: Andi Kleen Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro * Andi Kleen (andi@firstfloor.org) wrote: > > The plan here is to create a ring buffer that supports per-buffer instance > > "flags" that specify what must be supported: e.g. either splice() or mmap(), > > global vs per-cpu buffers, etc. > > And you plan to test all those flags in the hot path? No, I plan to make the hot path inline in the caller, so the flags can statically select the proper behavior at compile-time. We can test flags dynamically on the slow-path to save space. [...] > > > > But all in all, I think users needing _something_ to perform system-wide tracing > > shout a lot louder than users who need to save a few bytes. So let's try to get > > something good in first, while keeping an eye on the object size, and if it > > happens to be too large for some users, then they can always implement a > > slower and less efficient ring_buffer_tiny.c if they feel like it. > > They don't need to, they already have kfifo. > Right. > > I totally agree with you. This is in good part why I spent a large part of 2009 > > writing papers explaining my ring buffer, doing Promela models and formal proofs > > of correctness. I think after all that work, the abstractions I will use will be > > much easier to grap by anyone willing to do a bit of reading. > > Writing papers is not a replacement for simple maintainable code. True. I rather mean "in addition to simple maintainable code". Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Unified Ring Buffer (Next Generation) 2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt 2010-05-19 18:10 ` Andi Kleen @ 2010-05-19 19:05 ` Mathieu Desnoyers 2010-05-20 9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar 2 siblings, 0 replies; 24+ messages in thread From: Mathieu Desnoyers @ 2010-05-19 19:05 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen * Steven Rostedt (rostedt@goodmis.org) wrote: [...] > Unfortunately, we are now back with more than one ring buffer in the > kernel. What's worse, neither of them can perform all the features > needed. This is putting a bit of stress on the users of these tools, not > to mention the stress on the developers as well. [...] > This email is about finding a solution to the problem. If we can once > again create a generic ring buffer that handles all requirements, then > we can also merge the functionality of ftrace into perf, and lower the > duplication of code within the kernel. > > > This time around, I'm asking Mathieu Desnoyers to come to the plate, and > see if he can handle the task. > > I'm hoping that this email will start a thread that gets everyone into > agreement and produces something that will make everyone happy. I also hope this will get us to a state of concensus. For the records: I'm willing to handle that task. I plan to use the git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-ringbuffer.git git tree to publish my work (and to submit occasional rounds of patches to LKML too) The current state of that tree is that it includes only the bare-bone ring buffer core, without the filesystem operations, only for per-cpu buffers, and still needs a bit of cleanup. It's currently in kernel/trace/, but, as we discussed, it's probably better if I move it to kernel/ringbuffer/, because it does not have to be tied specifically to tracers. I'm doing it with tracing in mind, but it could be reused as generic ring buffer in the future. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFD] Future tracing/instrumentation directions 2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt 2010-05-19 18:10 ` Andi Kleen 2010-05-19 19:05 ` Mathieu Desnoyers @ 2010-05-20 9:31 ` Ingo Molnar 2010-05-20 10:07 ` Frederic Weisbecker ` (2 more replies) 2 siblings, 3 replies; 24+ messages in thread From: Ingo Molnar @ 2010-05-20 9:31 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter * Steven Rostedt <rostedt@goodmis.org> wrote: > More than a year and a half ago (September 2008), at > Linux Plumbers, we had a meeting with several kernel > developers to come up with a unified ring buffer. A > generic ring buffer in the kernel that any subsystem > could use. After coming up with a set of requirements, I > worked on implementing it. One of the requirements was > to start off simple and work to become a more complete > buffering system. > > [...] The thing is, in tracing land and more broadly in instrumentation land we have _much_ more earthly problems these days: - Lets face it, performance of the ring-buffer sucks, in a big way. I've recently benchmarked it and it takes hundreds of instructions to trace a single event. Puh-lease ... - It has grown a lot of slack. It's complex and hard to read. - Over the last year or so the majority of bleeding-edge tracing developers have gradually migrated over to perf and 'perf trace' / 'perf probe' in particular. As far as the past two merge windows go they are out-developing the old ftrace UIs by a ratio of 4:1. So this angle is becoming a prime thing to improve and users and developers are hurting from the ftrace/perf duality. - [ While it's still a long way off, if this trend continues we eventually might even be able to get rid of the /debug/tracing/ temporary debug API and get rid of the ugly in-kernel pretty-printing bits. This is good: it may make Andrew very happy for a change ;-) The main detail here to be careful of is that lots of people are fond of the simplicity of the /debug/tracing/ debug UI, so when we replace it we want to do it by keeping that simple workflow (or best by making it even simpler). I have a few ideas how to do this. There's also the detail that in some cases we want to print events in the kernel in a human readable way: for example EDAC/MCE and other critical events, trace-on-oops, etc. This too can be solved. ] Regarding performance and complexity, which is our main worry atm, fortunately there's work going on in that direction - please see PeterZ's recent string of patches on lkml: 4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware ef60777: perf: Optimize the perf_output() path by removing IRQ-disables fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t 6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg() And it may sound harsh but at this stage i'm personally not at all interested in big design talk. This isnt rocket science, we have developers and users and we know what they are doing and we know what we need to do: we need to improve our crap and we need to reduce complexity. Less is more. So i'd like to see iterative, useful action first, and i am somewhat sceptical about yet another grand tracing design trying to match 100 requirements. Steve, Mathieu, if you are interested please help out Peter with the performance and simplification work. The last thing we need is yet another replace-everything event. If we really want to create a new ring-buffer abstraction i'd suggest we start with Peter's, it has a quite sane design and stayed simple and flexible - if then it could be factored out a bit. Here are more bits of what i see as the 'action' going forward, in no particular order: 1) Push the /debug/tracing/events/ event description into sysfs, as per this thread on lkml: [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e I.e. one more step towards integrating ftrace into perf. 2) Use 1) to unify the perf events and the ftrace ring-buffer. This, as things are standing is best done by factoring out Peter's ring-buffer in kernel/perf_event.c. It's properly abstracted and it _far_ simpler than kernel/tracing/ring_buffer.c, which has become a monstrosity. (but i'm open to other simplifications as well) 3) Add the function-tracer and function-graph tracer as an event and integrate it into perf. This will live-test the efficiency of the unification and brings over the last big ftrace plugin to perf. 4) Gradually convert/port/migrate all the remaining plugins over as well. We need to do this very gently because there are users - but stop piling new functionality on to the old ftrace side. This usually involves: - Conversion of an explicit tracing callback to TRACE_EVENT (for example in the case of mmiotrace), while keeping all tool functionality. - Migrate any 'special' ftrace feature to perf capabilities so that it's available via the syscall interface as well. (for example 'latency maximum tracking' is something that we probably want to do with kernel-side help - we probably dont want to implement it via tracing everything all the time and finding the maximum based on terabytes of data.) (And there are other complications here too, but you get the idea.) All in one, i think we can reuse more than 50% of all current ftrace code (possibly up to 70-80%) - and we are already reusing bits of it. Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar @ 2010-05-20 10:07 ` Frederic Weisbecker 2010-05-20 11:07 ` Thomas Gleixner 2010-05-20 11:36 ` Ingo Molnar 2010-05-20 14:38 ` Steven Rostedt 2010-05-21 9:24 ` Li Zefan 2 siblings, 2 replies; 24+ messages in thread From: Frederic Weisbecker @ 2010-05-20 10:07 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote: > - [ While it's still a long way off, if this trend continues > we eventually might even be able to get rid of the > /debug/tracing/ temporary debug API and get rid of > the ugly in-kernel pretty-printing bits. This is > good: it may make Andrew very happy for a change ;-) > > The main detail here to be careful of is that lots of > people are fond of the simplicity of the > /debug/tracing/ debug UI, so when we replace it we > want to do it by keeping that simple workflow (or > best by making it even simpler). I have a few ideas > how to do this. How? We can emulate the /debug/tracing result with something like perf trace, still that won't replace the immediate availability of the result of any trace, which makes it valuable for any simplest workflows. > Regarding performance and complexity, which is our main > worry atm, fortunately there's work going on in that > direction - please see PeterZ's recent string of patches > on lkml: > > 4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events > a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware > ef60777: perf: Optimize the perf_output() path by removing IRQ-disables > fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t I would like to highlight the following commit too that _totally_ changes the requirements of our next common ring buffer, whatever it is: c792061: perf: Disallow mmap() on per-task inherited events Now we only need to care about local contention, we have removed the support for buffers that contend across cpus in a single process. Do I understand it right? > 3) Add the function-tracer and function-graph tracer > as an event and integrate it into perf. > > This will live-test the efficiency of the unification > and brings over the last big ftrace plugin to perf. I may start to take care of this soon. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 10:07 ` Frederic Weisbecker @ 2010-05-20 11:07 ` Thomas Gleixner 2010-05-20 11:42 ` Ingo Molnar ` (2 more replies) 2010-05-20 11:36 ` Ingo Molnar 1 sibling, 3 replies; 24+ messages in thread From: Thomas Gleixner @ 2010-05-20 11:07 UTC (permalink / raw) To: Frederic Weisbecker Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, 20 May 2010, Frederic Weisbecker wrote: > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote: > > - [ While it's still a long way off, if this trend continues > > we eventually might even be able to get rid of the > > /debug/tracing/ temporary debug API and get rid of > > the ugly in-kernel pretty-printing bits. This is > > good: it may make Andrew very happy for a change ;-) > > > > The main detail here to be careful of is that lots of > > people are fond of the simplicity of the > > /debug/tracing/ debug UI, so when we replace it we > > want to do it by keeping that simple workflow (or > > best by making it even simpler). I have a few ideas > > how to do this. > > How? We can emulate the /debug/tracing result with something > like perf trace, still that won't replace the immediate > availability of the result of any trace, which makes it > valuable for any simplest workflows. I'm a bit torn about this. I really like the availability of the ascii interface, but if we can come up with a very basic trace binary tool which can be built for deep embedded w/o requiring the world and some more libs to be available, then I might give up my resistance. Ideally it should be done so it can be easily integrated into busybox. I don't care whether I do echo 1 >/debug/..../XXX/enable cat /debug/tracing/trace or perfmini trace enable XXX perfmini trace dump as long as the tool is built in a way that it does not need updates when we add trace points or other functionality to the kernel. Thanks, tglx ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 11:07 ` Thomas Gleixner @ 2010-05-20 11:42 ` Ingo Molnar 2010-05-20 11:48 ` Ingo Molnar 2010-05-20 12:15 ` Frederic Weisbecker 2010-05-20 12:19 ` Theodore Tso 2 siblings, 1 reply; 24+ messages in thread From: Ingo Molnar @ 2010-05-20 11:42 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter * Thomas Gleixner <tglx@linutronix.de> wrote: > On Thu, 20 May 2010, Frederic Weisbecker wrote: > > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote: > > > - [ While it's still a long way off, if this trend continues > > > we eventually might even be able to get rid of the > > > /debug/tracing/ temporary debug API and get rid of > > > the ugly in-kernel pretty-printing bits. This is > > > good: it may make Andrew very happy for a change ;-) > > > > > > The main detail here to be careful of is that lots of > > > people are fond of the simplicity of the > > > /debug/tracing/ debug UI, so when we replace it we > > > want to do it by keeping that simple workflow (or > > > best by making it even simpler). I have a few ideas > > > how to do this. > > > > How? We can emulate the /debug/tracing result with something > > like perf trace, still that won't replace the immediate > > availability of the result of any trace, which makes it > > valuable for any simplest workflows. > > I'm a bit torn about this. I really like the availability of the ascii > interface, but if we can come up with a very basic trace binary tool > which can be built for deep embedded w/o requiring the world and some > more libs to be available, then I might give up my resistance. Ideally > it should be done so it can be easily integrated into busybox. > > I don't care whether I do > > echo 1 >/debug/..../XXX/enable > cat /debug/tracing/trace > > or > > perfmini trace enable XXX > perfmini trace dump My suggestion for that flow is even shorter: trace --enable XXX trace Plus: trace --list trace --enable trace --disable > as long as the tool is built in a way that it does not > need updates when we add trace points or other > functionality to the kernel. Yeah, most definitely. The sysfs event_source class will ensure that whatever (new) events are available propagate through the tool and are available to it. Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 11:42 ` Ingo Molnar @ 2010-05-20 11:48 ` Ingo Molnar 0 siblings, 0 replies; 24+ messages in thread From: Ingo Molnar @ 2010-05-20 11:48 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter * Ingo Molnar <mingo@elte.hu> wrote: > > as long as the tool is built in a way that it does not > > need updates when we add trace points or other > > functionality to the kernel. > > Yeah, most definitely. The sysfs event_source class will > ensure that whatever (new) events are available > propagate through the tool and are available to it. Note that this approach isnt just for tools/perf/, it will be useful for other perf events based tools as well: SysProf and for the libpfm based apps. Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 11:07 ` Thomas Gleixner 2010-05-20 11:42 ` Ingo Molnar @ 2010-05-20 12:15 ` Frederic Weisbecker 2010-05-20 12:19 ` Theodore Tso 2 siblings, 0 replies; 24+ messages in thread From: Frederic Weisbecker @ 2010-05-20 12:15 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, May 20, 2010 at 01:07:57PM +0200, Thomas Gleixner wrote: > On Thu, 20 May 2010, Frederic Weisbecker wrote: > > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote: > > > - [ While it's still a long way off, if this trend continues > > > we eventually might even be able to get rid of the > > > /debug/tracing/ temporary debug API and get rid of > > > the ugly in-kernel pretty-printing bits. This is > > > good: it may make Andrew very happy for a change ;-) > > > > > > The main detail here to be careful of is that lots of > > > people are fond of the simplicity of the > > > /debug/tracing/ debug UI, so when we replace it we > > > want to do it by keeping that simple workflow (or > > > best by making it even simpler). I have a few ideas > > > how to do this. > > > > How? We can emulate the /debug/tracing result with something > > like perf trace, still that won't replace the immediate > > availability of the result of any trace, which makes it > > valuable for any simplest workflows. > > I'm a bit torn about this. I really like the availability of the ascii > interface, but if we can come up with a very basic trace binary tool > which can be built for deep embedded w/o requiring the world and some > more libs to be available, then I might give up my resistance. Ideally > it should be done so it can be easily integrated into busybox. > > I don't care whether I do > > echo 1 >/debug/..../XXX/enable > cat /debug/tracing/trace > > or > > perfmini trace enable XXX > perfmini trace dump > > as long as the tool is built in a way that it does not need updates > when we add trace points or other functionality to the kernel. > > Thanks, > > tglx You have a (much) deeper experience than me about the embedded world. And this was the area I worried most about, especially because dropping the preformatted output means we add a random dependency. If you think people can be fine a minimal tool that does this, then ok, as far as the given tool is really easy to get and to build. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 11:07 ` Thomas Gleixner 2010-05-20 11:42 ` Ingo Molnar 2010-05-20 12:15 ` Frederic Weisbecker @ 2010-05-20 12:19 ` Theodore Tso 2 siblings, 0 replies; 24+ messages in thread From: Theodore Tso @ 2010-05-20 12:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On May 20, 2010, at 7:07 AM, Thomas Gleixner wrote: > > I don't care whether I do > > echo 1 >/debug/..../XXX/enable > cat /debug/tracing/trace > > or > > perfmini trace enable XXX > perfmini trace dump > > as long as the tool is built in a way that it does not need updates > when we add trace points or other functionality to the kernel. I won't care that much assuming that there is a bash_completion script so I can type "perf mini trace enable ext4_da_writep<TAB>" and the right thing happens.... The point is there are a number of advantages of the current system that go beyond just guaranteed availability of the binary->ascii translation tool. Command competion, shell globbing to find interesting events you want to enable, etc., are also part of the picture -- Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 10:07 ` Frederic Weisbecker 2010-05-20 11:07 ` Thomas Gleixner @ 2010-05-20 11:36 ` Ingo Molnar 2010-05-20 13:06 ` Frederic Weisbecker 1 sibling, 1 reply; 24+ messages in thread From: Ingo Molnar @ 2010-05-20 11:36 UTC (permalink / raw) To: Frederic Weisbecker Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter * Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote: > > - [ While it's still a long way off, if this trend continues > > we eventually might even be able to get rid of the > > /debug/tracing/ temporary debug API and get rid of > > the ugly in-kernel pretty-printing bits. This is > > good: it may make Andrew very happy for a change ;-) > > > > The main detail here to be careful of is that lots of > > people are fond of the simplicity of the > > /debug/tracing/ debug UI, so when we replace it we > > want to do it by keeping that simple workflow (or > > best by making it even simpler). I have a few ideas > > how to do this. > > How? We can emulate the /debug/tracing result with > something like perf trace, still that won't replace the > immediate availability of the result of any trace, which > makes it valuable for any simplest workflows. Firstly, one thing is sure: until there's no full replacement we obviously dont want to phase out /sys/kernel/debug/tracing. This was more of a 'our future' email (as i see it), the process that will lead to solve some of our more strategic problems in tracing land. Regarding the issues you raised, there are several solutions that dont need /sys/kernel/debug/tracing but still support the very useful and usable 'immediate tracing' workflow that ftrace prototyped. We can have a combination of several things: - Have a simple 'ftrace' command aliased to perf trace. Means less typing, and it also allows a much more finegrained tracing workflow: per user and per task/job workflows, instead of the global/exclusive tracing mode that /sys/kernel/debug/tracing. There would be ready equivalents: ftrace --available-tracers ftrace --current-tracer ftrace --start ftrace --stop ... etc ... - Immediate availability of a trace: persistent events combined with roundrobin ('flight-recorder') recording would solve this. If events are active then if type 'ftrace' you get the current trace. Simple to scrip and simple to use - no need to have access to /sys/kernel/debug/tracing, also can evidently be turned into a per user facility, supports multiple plugins active at once, etc. - For lightweight embedded tracing there are two separate solutions that would work: - we already have perf 'minimal builds' (when certain libraries are not available), we could push that concept some more to create a 'lightweight' command that embedded systems can run just fine. - extend our proxy recording and proxy execution/analysis concepts. We can already run a perf trace recording through a pipe and netcat, and we have perf archive and cross-arch analysis code. If there's interest, then this could be made more convenient and the functionality around this could be collected into a handy proxy tool: ftrace --proxy smallbox --start ftrace --proxy smallbox --stop ftrace --proxy smallbox # prints trace ... etc ... Thus the recording is done on the small box, while all the analysis (and even all the commands) is typed/executed on the bigger box. So there are lots of possibilities - and there are other options as well. Does this address your worries? Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 11:36 ` Ingo Molnar @ 2010-05-20 13:06 ` Frederic Weisbecker 0 siblings, 0 replies; 24+ messages in thread From: Frederic Weisbecker @ 2010-05-20 13:06 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, May 20, 2010 at 01:36:15PM +0200, Ingo Molnar wrote: > > * Frederic Weisbecker <fweisbec@gmail.com> wrote: > > > On Thu, May 20, 2010 at 11:31:31AM +0200, Ingo Molnar wrote: > > > - [ While it's still a long way off, if this trend continues > > > we eventually might even be able to get rid of the > > > /debug/tracing/ temporary debug API and get rid of > > > the ugly in-kernel pretty-printing bits. This is > > > good: it may make Andrew very happy for a change ;-) > > > > > > The main detail here to be careful of is that lots of > > > people are fond of the simplicity of the > > > /debug/tracing/ debug UI, so when we replace it we > > > want to do it by keeping that simple workflow (or > > > best by making it even simpler). I have a few ideas > > > how to do this. > > > > How? We can emulate the /debug/tracing result with > > something like perf trace, still that won't replace the > > immediate availability of the result of any trace, which > > makes it valuable for any simplest workflows. > > Firstly, one thing is sure: until there's no full > replacement we obviously dont want to phase out > /sys/kernel/debug/tracing. This was more of a 'our future' > email (as i see it), the process that will lead to solve > some of our more strategic problems in tracing land. Yeah. > Regarding the issues you raised, there are several > solutions that dont need /sys/kernel/debug/tracing but > still support the very useful and usable 'immediate > tracing' workflow that ftrace prototyped. We can have a > combination of several things: > > - Have a simple 'ftrace' command aliased to perf trace. > > Means less typing, and it also allows a much more > finegrained tracing workflow: per user and per task/job > workflows, instead of the global/exclusive tracing mode > that /sys/kernel/debug/tracing. There would be ready > equivalents: > > ftrace --available-tracers > ftrace --current-tracer > ftrace --start > ftrace --stop > > ... etc ... We should probably more talk about events there, but yeah I get your point. > > - Immediate availability of a trace: persistent events > combined with roundrobin ('flight-recorder') recording > would solve this. > > If events are active then if type 'ftrace' you get the > current trace. Simple to scrip and simple to use - no > need to have access to /sys/kernel/debug/tracing, also > can evidently be turned into a per user facility, > supports multiple plugins active at once, etc. Yep. > > - For lightweight embedded tracing there are two separate > solutions that would work: > > - we already have perf 'minimal builds' (when certain > libraries are not available), we could push that > concept some more to create a 'lightweight' command > that embedded systems can run just fine. Yeah that need to be one, very simple, dedicated to tracing. With the very minimal requirements about dependencies. > - extend our proxy recording and proxy execution/analysis > concepts. We can already run a perf trace recording > through a pipe and netcat, and we have perf archive > and cross-arch analysis code. > > If there's interest, then this could be made more > convenient and the functionality around this could > be collected into a handy proxy tool: > > ftrace --proxy smallbox --start > ftrace --proxy smallbox --stop > ftrace --proxy smallbox # prints trace > > ... etc ... > > Thus the recording is done on the small box, while > all the analysis (and even all the commands) is > typed/executed on the bigger box. You mean a client/server control? Controlling the recorder with remote commands? > So there are lots of possibilities - and there are other > options as well. > > Does this address your worries? My worries were more about bringing dependencies to read or launch traces than the possibility to produce a very easy to use tool, even easier and more powerful than the debugfs interface. To sum up, my worries were (and sort of still are a bit): how painful would that be to set up such a tool on a hard embedded box, compared to the directly available stream we have. But Thomas opinion makes me reconsider somehow the situation. Once we have such tool and it becomes truly easy and intuitive, people may migrate to it. At this point, it's possible the situation have evolved enough that it will become obvious to deprecate the debugfs interface. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar 2010-05-20 10:07 ` Frederic Weisbecker @ 2010-05-20 14:38 ` Steven Rostedt 2010-05-20 15:04 ` Steven Rostedt ` (2 more replies) 2010-05-21 9:24 ` Li Zefan 2 siblings, 3 replies; 24+ messages in thread From: Steven Rostedt @ 2010-05-20 14:38 UTC (permalink / raw) To: Ingo Molnar Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote: > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > More than a year and a half ago (September 2008), at > > Linux Plumbers, we had a meeting with several kernel > > developers to come up with a unified ring buffer. A > > generic ring buffer in the kernel that any subsystem > > could use. After coming up with a set of requirements, I > > worked on implementing it. One of the requirements was > > to start off simple and work to become a more complete > > buffering system. > > > > [...] > > The thing is, in tracing land and more broadly in > instrumentation land we have _much_ more earthly problems > these days: > > - Lets face it, performance of the ring-buffer sucks, in > a big way. I've recently benchmarked it and it takes > hundreds of instructions to trace a single event. > Puh-lease ... "Puh-lease"? Hmm, what were the timings? I just ran the ring_buffer_benchmark module. It's in mainline, so anyone could do the same. Remember to disable KERNEL_DEBUG because things like lockdep can have a significant impact on it. Note, instructions themselves are not that bad, but what those instructions do count a lot more. I avoid all lock functions and memory barriers which cause the code to be a little bigger. But a single cmpxchg can take much more time than a lot of other instructions put together. On this machine: model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz It's a 2x2 (4 cores) With running: # modprobe ring_buffer_benchmark producer_fifo=10 # sleep 30 # cat /debug/tracing/trace <...>-2636 [003] 761.969850: ring_buffer_producer_thread: Starting ring buffer hammer <...>-2636 [003] 771.910121: ring_buffer_producer_thread: End ring buffer hammer <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19 <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Time: 9965321 (usecs) <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Overruns: 0 <...>-2636 [003] 771.910174: ring_buffer_producer_thread: Read: 76580325 (by pages) <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Entries: 75 <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Total: 76580400 <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Missed: 0 <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Hit: 76580400 <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Entries per millisec: 7684 <...>-2636 [003] 771.910176: ring_buffer_producer_thread: 130 ns per entry <...>-2636 [003] 771.910177: ring_buffer_producer_thread: Sleeping for 10 secs <...>-2636 [003] 781.884686: ring_buffer_producer_thread: Starting ring buffer hammer <...>-2636 [003] 790.862392: ring_buffer_producer_thread: End ring buffer hammer <...>-2636 [003] 790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19 <...>-2636 [003] 790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Time: 9000321 (usecs) <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Overruns: 26581200 <...>-2636 [003] 790.897402: ring_buffer_producer_thread: Read: 6767200 (by events) <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Entries: 0 <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Total: 33348400 <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Missed: 0 <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Hit: 33348400 <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Entries per millisec: 3705 <...>-2636 [003] 790.897405: ring_buffer_producer_thread: 269 ns per entry While having the reader read by pages (splice mode) a write takes 130ns per entry. Not that bad. But when the reader is reading each event, (I'm assuming we get some cache line bouncing here, it is still lockless), a write takes 269ns. Note, the reader is an aggressive reader. That is, it goes into a strong loop and just reads as fast as it can. Which would cause cache line bouncing. When I disable the reader and just measure what a write takes: # rmmod ring_buffer_benchmark # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10 # sleep 30 # cat /debug/tracing/trace <...>-2649 [003] 1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer <...>-2649 [003] 1151.950840: ring_buffer_producer_thread: End ring buffer hammer <...>-2649 [003] 1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Time: 9994238 (usecs) <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Overruns: 90635670 <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Read: (reader disabled) <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Entries: 62480 <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Total: 90698150 <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Missed: 0 <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Hit: 90698150 <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075 <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: 110 ns per entry <...>-2649 [003] 1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs It takes just a 110ns. Lets now use perf to see what is happening: # perf record -a -c10000 -f (wait 30 secs) Ctrl-C # perf report no symbols found in /usr/sbin/irqbalance, maybe install a debug package? # Samples: 1995496 # # Overhead Command Shared Object Symbol # ........ ............... ................................................................ ...... # 23.80% rb_producer [kernel.kallsyms] [k] native_sched_clock 14.83% rb_producer [kernel.kallsyms] [k] rb_reserve_next_event 13.20% rb_producer [kernel.kallsyms] [k] ring_buffer_unlock_commit 12.84% rb_producer [kernel.kallsyms] [k] ring_buffer_lock_reserve 9.27% rb_producer [kernel.kallsyms] [k] rb_end_commit 9.19% rb_producer [kernel.kallsyms] [k] __rb_reserve_next 5.24% rb_producer [kernel.kallsyms] [k] trace_clock_local 3.11% rb_producer [kernel.kallsyms] [k] ring_buffer_event_data 2.88% rb_producer [ring_buffer_benchmark] [k] ring_buffer_benchmark_init 2.53% rb_producer [kernel.kallsyms] [k] trace_recursive_unlock 0.39% rb_producer [kernel.kallsyms] [k] do_gettimeofday 0.34% rb_producer [kernel.kallsyms] [k] read_tsc 0.18% rb_producer [kernel.kallsyms] [k] getnstimeofday 0.14% swapper [kernel.kallsyms] [k] mwait_idle Now that the sched_clock is %23.8 percent, it doesn't look as bad. rb_reserve_next_event is 14.83, ring_buffer_unlock_commit is 13.20, ring_buffer_lock_reserve is 12.84, rb_end_commit is 9.27, __rb_reserve_next is 8.18, trace_clock_local is 5.24 (which only disables preemption and calls sched_clock), ring_buffer_event_data is 3.11 which only returns a pointer on a structure (once a event). ring_buffer_benchmark_init is the benchmark code itself, and trace_recursive_unlock is the code you asked me to add to check for cases where the tracer could have recursed itself. 14.83 + 13.20 + 9.27 + 8.18 + 5.24 + 3.11 = 53.83% Considering that the system is basically idle except for this benchmark, that makes sense to have it that high. We need to compare it. The sched_clock() takes up 1/3 of the entire trace event. > > - It has grown a lot of slack. It's complex and hard to > read. I admit, that it got over designed, and I'm all for replacing it. We were all involved in the initial design, and I posted several versions of RFC when I created it. But I admit, it became complex, and now that we know what we want, we can take a step back and redesign a new one from scratch. But until we do so, I'm continuing to support the current one, and adding small updates when needed. > > - Over the last year or so the majority of bleeding-edge > tracing developers have gradually migrated over to perf > and 'perf trace' / 'perf probe' in particular. As far > as the past two merge windows go they are > out-developing the old ftrace UIs by a ratio of 4:1. Of course a new project will be in a high state of flux. I'm quite satisfied with the interface with ftrace and did not feel that it needed changing. > > So this angle is becoming a prime thing to improve and > users and developers are hurting from the ftrace/perf > duality. I agree. > > - [ While it's still a long way off, if this trend continues > we eventually might even be able to get rid of the > /debug/tracing/ temporary debug API and get rid of > the ugly in-kernel pretty-printing bits. This is > good: it may make Andrew very happy for a change ;-) So how do you plan on getting rid of the in-kernel pretty-printing bits and keeping the trace_dump_on_oops? There's a strong dependency between those two. > > The main detail here to be careful of is that lots of > people are fond of the simplicity of the > /debug/tracing/ debug UI, so when we replace it we > want to do it by keeping that simple workflow (or > best by making it even simpler). I have a few ideas > how to do this. Note, I still would not want to remove the /debug interface. It is a debug interface anyway. Having both a syscall access the functionality of the code and having a debug interface access it very trivial. This could also be a way we can debug the tool too. We can officially support only the tool side, but why throw away the baby with the bath water? > > There's also the detail that in some cases we want to > print events in the kernel in a human readable way: > for example EDAC/MCE and other critical events, > trace-on-oops, etc. This too can be solved. ] But if the code for trace-on-oops is there, then the pretty print code is also there. > > Regarding performance and complexity, which is our main > worry atm, fortunately there's work going on in that > direction - please see PeterZ's recent string of patches > on lkml: > > 4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events > a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware > ef60777: perf: Optimize the perf_output() path by removing IRQ-disables > fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t > 6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg() Yes, I saw these, and it helps a little, but still not everything is there. > > And it may sound harsh but at this stage i'm personally > not at all interested in big design talk. Rewriting ftrace on top of perf is a big design change. > This isnt rocket > science, we have developers and users and we know what > they are doing and we know what we need to do: we need to > improve our crap and we need to reduce complexity. Less is > more. > > So i'd like to see iterative, useful action first, and i > am somewhat sceptical about yet another grand tracing > design trying to match 100 requirements. Here's a few things that I find required: The ability to use splice. You keep telling me this is not important, but splice is magnitudes faster than mmaping into userspace and then copying into a file. If the splice removes the extra copy as we are working on, it will get even faster. It also makes writing over the network much easier. Which trace-cmd does, and it took me half a day to make it do so. We also need a way to easily read the buffer from the kernel. This is for trace-on-oops. Overwrite mode must also be supported. I know that Peter thinks its silly to do this and have it write to a file, but I find it helpful to do so. But fine, I can live without overwrite to a file, if no one else cares. But overwrite mode itself must work for trace-on-oops. We need a way to start and stop tracing from the kernel and userspace. Again, for trace-on-oops. When a bug is hit, we need a way to stop it. We also must have a way to easily start it again from userspace. A simple on/off switch. This is something that you and Thomas hammered into me which was the creation of tracing_on. The above can be done, but lets not just hack the changes in. Lets really look at a proper design. > > Steve, Mathieu, if you are interested please help out > Peter with the performance and simplification work. The > last thing we need is yet another replace-everything > event. But that is what you are asking to happen to ftrace. "replace-everything". > > If we really want to create a new ring-buffer abstraction > i'd suggest we start with Peter's, it has a quite sane > design and stayed simple and flexible - if then it could > be factored out a bit. I'm sorry, but it is very coupled with perf. It is simple and flexible with what perf does today, but not what ftrace does today, or to make perf do what ftrace does. > > Here are more bits of what i see as the 'action' going > forward, in no particular order: > > 1) Push the /debug/tracing/events/ event description > into sysfs, as per this thread on lkml: > > [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs > > http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e > > I.e. one more step towards integrating ftrace into perf. We could do something like this, but it is not as trivial as it may seem. > > 2) Use 1) to unify the perf events and the ftrace > ring-buffer. This, as things are standing is > best done by factoring out Peter's ring-buffer > in kernel/perf_event.c. It's properly abstracted > and it _far_ simpler than kernel/tracing/ring_buffer.c, > which has become a monstrosity. Sure, my code has become quite complex, I'll admit that. I'm for refactoring. I really don't care what code is used as long as I still have the requirements that make ftrace useful for me and several others. And yes, any ring buffer needs to be factored out to a defined interface. The current ring buffer in perf is very coupled with the perf design. The event code written to perf in include/trace/ftrace.h is still quite a hack. > > (but i'm open to other simplifications as well) > > 3) Add the function-tracer and function-graph tracer > as an event and integrate it into perf. Here's something that I think perf needs to do. It needs to become a bit more modulized, and not absorb everything into perf instead of having things use just a part of perf. Like the breakpoint code requires full perf to be done, and x86 has PERF_EVENTS permanently on. Which is really a PITA. As I designed ftrace from day one, I tried to make it modular. Thus, the function trace (kernel/trace/ftrace.c) has always been a separate entity from ftrace itself. Heck, LTTng uses it. It would be actually trivial to implement the function tracer in perf today. Just register a function you want called at every function event and start running. I need to abstract out the "what function to call bit" but that's about it. > > This will live-test the efficiency of the unification > and brings over the last big ftrace plugin to perf. > > 4) Gradually convert/port/migrate all the remaining > plugins over as well. We need to do this very gently > because there are users - but stop piling new > functionality on to the old ftrace side. This usually > involves: I have not been piling functionality into the ftrace side, as you stated at the beginning of the email, that the development has been quite slow. Yes, I've been still working on things, but I would say it is a far cry from "piling". I did add some simple features, like the "lost events" to the ring buffer, but that code can be looked at for ideas for the future. It does not make merging any more difficult than it already is. Fine, I'll stop some of the code I was going to work on (multiple buffers for ftrace, event triggers, configuring events to stop the tracer (like for different kinds of BUG() and WARN_ON()), etc). > > - Conversion of an explicit tracing callback to > TRACE_EVENT (for example in the case of mmiotrace), > while keeping all tool functionality. Not sure what this is. > > - Migrate any 'special' ftrace feature to perf > capabilities so that it's available via the > syscall interface as well. (for example > 'latency maximum tracking' is something that we > probably want to do with kernel-side help - we > probably dont want to implement it via tracing > everything all the time and finding the maximum > based on terabytes of data.) > > (And there are other complications here too, but you > get the idea.) Yes, the latency tracers are special, and I treat them special in ftrace. They can be redesigned to keep their current functionality. That needed to be done anyway. > > All in one, i think we can reuse more than 50% of all > current ftrace code (possibly up to 70-80%) - and we are > already reusing bits of it. Ingo, as we have had this conversation, I don't know if you realized that I agreed to merge ftrace to perf. There's just some functionality about ftrace I really want to keep, which does include accessing via /debugfs, pretty-print (which is needed by trace-on-oops) and splice. -- Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 14:38 ` Steven Rostedt @ 2010-05-20 15:04 ` Steven Rostedt 2010-05-20 20:12 ` Steven Rostedt 2010-05-21 17:49 ` Ingo Molnar 2 siblings, 0 replies; 24+ messages in thread From: Steven Rostedt @ 2010-05-20 15:04 UTC (permalink / raw) To: Ingo Molnar Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, 2010-05-20 at 10:38 -0400, Steven Rostedt wrote: > n this machine: > > model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz > > It's a 2x2 (4 cores) > Oh, and this was done with 2.6.34 freshly checked out. -- Steve > With running: > > # modprobe ring_buffer_benchmark producer_fifo=10 > # sleep 30 > # cat /debug/tracing/trace > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 14:38 ` Steven Rostedt 2010-05-20 15:04 ` Steven Rostedt @ 2010-05-20 20:12 ` Steven Rostedt 2010-05-21 17:49 ` Ingo Molnar 2 siblings, 0 replies; 24+ messages in thread From: Steven Rostedt @ 2010-05-20 20:12 UTC (permalink / raw) To: Ingo Molnar Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Thu, 2010-05-20 at 10:38 -0400, Steven Rostedt wrote: > On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote: > # modprobe ring_buffer_benchmark producer_fifo=10 > # sleep 30 > # cat /debug/tracing/trace > > > <...>-2636 [003] 761.969850: ring_buffer_producer_thread: Starting ring buffer hammer > <...>-2636 [003] 771.910121: ring_buffer_producer_thread: End ring buffer hammer > <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19 > <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Time: 9965321 (usecs) > <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Overruns: 0 > <...>-2636 [003] 771.910174: ring_buffer_producer_thread: Read: 76580325 (by pages) > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Entries: 75 > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Total: 76580400 > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Missed: 0 > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Hit: 76580400 > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Entries per millisec: 7684 > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: 130 ns per entry > <...>-2636 [003] 771.910177: ring_buffer_producer_thread: Sleeping for 10 secs > > <...>-2636 [003] 781.884686: ring_buffer_producer_thread: Starting ring buffer hammer > <...>-2636 [003] 790.862392: ring_buffer_producer_thread: End ring buffer hammer > <...>-2636 [003] 790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19 > <...>-2636 [003] 790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Time: 9000321 (usecs) > <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Overruns: 26581200 > <...>-2636 [003] 790.897402: ring_buffer_producer_thread: Read: 6767200 (by events) > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Entries: 0 > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Total: 33348400 > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Missed: 0 > <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Hit: 33348400 > <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Entries per millisec: 3705 > <...>-2636 [003] 790.897405: ring_buffer_producer_thread: 269 ns per entry > > While having the reader read by pages (splice mode) a write takes 130ns > per entry. Not that bad. > > But when the reader is reading each event, (I'm assuming we get some > cache line bouncing here, it is still lockless), a write takes 269ns. > > Note, the reader is an aggressive reader. That is, it goes into a strong > loop and just reads as fast as it can. Which would cause cache line > bouncing. > > When I disable the reader and just measure what a write takes: > > # rmmod ring_buffer_benchmark > # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10 > # sleep 30 > # cat /debug/tracing/trace > > <...>-2649 [003] 1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer > <...>-2649 [003] 1151.950840: ring_buffer_producer_thread: End ring buffer hammer > <...>-2649 [003] 1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Time: 9994238 (usecs) > <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Overruns: 90635670 > <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Read: (reader disabled) > <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Entries: 62480 > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Total: 90698150 > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Missed: 0 > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Hit: 90698150 > <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075 > <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: 110 ns per entry > <...>-2649 [003] 1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs > > It takes just a 110ns. I checked out latest tip/perf/core with commit: commit dfacc4d6c98b89609250269f518c1f54c30454ef Merge: f869097 85cb68b Author: Ingo Molnar <mingo@elte.hu> Date: Thu May 20 14:38:55 2010 +0200 Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core This seems to include the latest optimizations from Peter. I decided to modify the trace/kernel/ring_buffer_benchmark.c to benchmark the perf ring buffer. I spent an hour or so trying to get it to work but since the perf ring buffer is so coupled with perf, I gave up in trying to test it directly. It requires a reader, and it has no good way to set up a buffer and read it from the kernel. So instead, I figured I would test perf and ftrace with TRACE_EVENT(). Here's what I did, and my changes are at the bottom of this email as a patch too. I created a function within the kernel proper (since the benchmark is a module) called ring_buffer_benchmark_caller(). All this does is call: trace_rb_benchmark(raw_smp_processor_id()); Where trace_rb_benchmark is created with TRACE_EVENT() (as shown in the patch below). I modified the ring buffer benchmark to take a "trace_event" parameter where instead of testing the ring buffer itself, it would just time the calling of "ring_buffer_benchmark_caller()", which includes the trace event. This is nice way to see the difference between perf and ftrace since this is the code that they both use. Note, I noticed that the ring buffer slowed down since 2.6.34. Probably to do with the "lost events" recording that I added. But it went from 110ns to 153ns in the normal benchmark (not this one). Anyway, the benchmark would pause until the tracepoint was activated, as the tracepoint would enable rb_start_bench. When the tracepoint is called the first time, the benchmark would begin. Here's what I did: # modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1 # perf record -e rb_bench:rb_benchmark -c1 -a sleep 30 # cat /debug/tracing/trace Here's the results: # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | <...>-22836 [001] 2804.849264: : wait to start <...>-22836 [001] 2816.419909: : starting <...>-22836 [001] 2816.419911: : Starting ring buffer hammer <...>-22836 [002] 2826.056105: : End ring buffer hammer <...>-22836 [002] 2826.056108: : Running Producer at SCHED_FIFO 10 <...>-22836 [002] 2826.056109: : Time: 9660458 (usecs) <...>-22836 [002] 2826.056110: : Overruns: 0 <...>-22836 [002] 2826.056110: : Read: (reader disabled) <...>-22836 [002] 2826.056111: : Entries: 0 <...>-22836 [002] 2826.056111: : Total: 0 <...>-22836 [002] 2826.056112: : Missed: 0 <...>-22836 [002] 2826.056112: : Hit: 14637150 <...>-22836 [002] 2826.056113: : Entries per millisec: 1515 <...>-22836 [002] 2826.056113: : 660 ns per entry <...>-22836 [002] 2826.056114: : Sleeping for 10 secs <...>-22836 [002] 2836.030464: : wait to start <...>-22836 [002] 2836.030465: : starting <...>-22836 [002] 2836.030465: : Starting ring buffer hammer <...>-22836 [001] 2845.008379: : End ring buffer hammer <...>-22836 [001] 2845.008383: : Running Producer at SCHED_FIFO 10 <...>-22836 [001] 2845.008384: : Time: 9000518 (usecs) <...>-22836 [001] 2845.008384: : Overruns: 0 <...>-22836 [001] 2845.008384: : Read: (reader disabled) <...>-22836 [001] 2845.008385: : Entries: 0 <...>-22836 [001] 2845.008385: : Total: 0 <...>-22836 [001] 2845.008386: : Missed: 0 <...>-22836 [001] 2845.008387: : Hit: 13642200 <...>-22836 [001] 2845.008387: : Entries per millisec: 1515 <...>-22836 [001] 2845.008388: : 660 ns per entry Writing an event in perf took 660ns each. To do the same for ftrace, I first had to set a filter. This is because the events for ftrace write to the same place the ring buffer benchmark is writing its output, and I would have to search through thousands of called events to see the output. But since filtering is actually expensive, it should still give us a basic idea. # modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1 # echo mydata == 10 > /debug/tracing/events/rb_bench/rb_benchmark/filter # echo 1 > /debug/tracing/events/rb_bench/rb_benchmark/enable # cat /debug/tracing/trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | rb_producer-22849 [000] 3015.236994: ring_buffer_producer_thread: wait to start rb_producer-22849 [000] 3032.193995: ring_buffer_producer_thread: starting rb_producer-22849 [000] 3032.193996: ring_buffer_producer_thread: Starting ring buffer hammer rb_producer-22849 [000] 3041.513072: ring_buffer_producer_thread: End ring buffer hammer rb_producer-22849 [000] 3041.513073: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 rb_producer-22849 [000] 3041.513074: ring_buffer_producer_thread: Time: 9342579 (usecs) rb_producer-22849 [000] 3041.513074: ring_buffer_producer_thread: Overruns: 0 rb_producer-22849 [000] 3041.513075: ring_buffer_producer_thread: Read: (reader disabled) rb_producer-22849 [000] 3041.513075: ring_buffer_producer_thread: Entries: 0 rb_producer-22849 [000] 3041.513076: ring_buffer_producer_thread: Total: 0 rb_producer-22849 [000] 3041.513076: ring_buffer_producer_thread: Missed: 0 rb_producer-22849 [000] 3041.513077: ring_buffer_producer_thread: Hit: 51588000 rb_producer-22849 [000] 3041.513077: ring_buffer_producer_thread: Entries per millisec: 5522 rb_producer-22849 [000] 3041.513078: ring_buffer_producer_thread: 181 ns per entry rb_producer-22849 [000] 3041.513078: ring_buffer_producer_thread: Sleeping for 10 secs rb_producer-22849 [000] 3051.487356: ring_buffer_producer_thread: wait to start rb_producer-22849 [000] 3051.487356: ring_buffer_producer_thread: starting rb_producer-22849 [000] 3051.487357: ring_buffer_producer_thread: Starting ring buffer hammer rb_producer-22849 [000] 3060.465270: ring_buffer_producer_thread: End ring buffer hammer rb_producer-22849 [000] 3060.465271: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 rb_producer-22849 [000] 3060.465272: ring_buffer_producer_thread: Time: 9000562 (usecs) rb_producer-22849 [000] 3060.465272: ring_buffer_producer_thread: Overruns: 0 rb_producer-22849 [000] 3060.465272: ring_buffer_producer_thread: Read: (reader disabled) rb_producer-22849 [000] 3060.465273: ring_buffer_producer_thread: Entries: 0 rb_producer-22849 [000] 3060.465273: ring_buffer_producer_thread: Total: 0 rb_producer-22849 [000] 3060.465274: ring_buffer_producer_thread: Missed: 0 rb_producer-22849 [000] 3060.465274: ring_buffer_producer_thread: Hit: 49675400 rb_producer-22849 [000] 3060.465275: ring_buffer_producer_thread: Entries per millisec: 5519 rb_producer-22849 [000] 3060.465275: ring_buffer_producer_thread: 181 ns per entry Note, enabling an event in ftrace enables the comm line record, which shows the "rb_producer". Here we see each event took 181ns as called by TRACE_EVENT() and then discarded. But since I'm discarding the events this may be giving ftrace an unfair advantage. Thus, I modified the ring buffer benchmark code as follows: #undef trace_printk #define trace_printk(fmt, args...) printk(fmt, ##args) This makes the ring buffer benchmark write to the console instead of the ring buffer, and lets me run without filtering. # modprobe ring_buffer_benchmark producer_fifo=10 trace_event=1 # echo 0 > /debug/tracing/events/rb_bench/rb_benchmark/filter # echo 1 > /debug/tracing/events/rb_bench/rb_benchmark/enable I got this: wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972749 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 57079200 Entries per millisec: 5723 174 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972786 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 57086350 Entries per millisec: 5724 174 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972772 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 57058250 Entries per millisec: 5721 174 ns per entry Sleeping for 10 secs The non-filtering case (the one that actually records) was 174ns per event. Here's the patch: -- Steve diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index ffb1a5b..3b559d3 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -21,10 +21,12 @@ endif # obj-y += trace_clock.o +CFLAGS_ring_buffer_bm_event.o += -I$(src) + obj-$(CONFIG_FUNCTION_TRACER) += libftrace.o obj-$(CONFIG_RING_BUFFER) += ring_buffer.o obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o - +obj-y += ring_buffer_bm_event.o obj-$(CONFIG_TRACING) += trace.o obj-$(CONFIG_TRACING) += trace_output.o obj-$(CONFIG_TRACING) += trace_stat.o diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c index 302f8a6..9d5cdb1 100644 --- a/kernel/trace/ring_buffer_benchmark.c +++ b/kernel/trace/ring_buffer_benchmark.c @@ -4,12 +4,23 @@ * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com> */ #include <linux/ring_buffer.h> +#include <linux/ftrace_event.h> #include <linux/completion.h> #include <linux/kthread.h> #include <linux/module.h> #include <linux/time.h> #include <asm/local.h> +#include "ring_buffer_bm_event.h" + +/* Write the strings directly into the buffers. */ +#undef trace_printk +#define trace_printk(fmt, args...) \ +do { \ + __trace_printk_check_format(fmt, ##args); \ + __trace_printk(_THIS_IP_, fmt, ##args); \ +} while (0) + struct rb_page { u64 ts; local_t commit; @@ -32,6 +43,10 @@ static struct task_struct *producer; static struct task_struct *consumer; static unsigned long read; +static int trace_event; +module_param(trace_event, uint, 0644); +MODULE_PARM_DESC(trace_event, "record a trace event"); + static int disable_reader; module_param(disable_reader, uint, 0644); MODULE_PARM_DESC(disable_reader, "only run producer"); @@ -211,6 +226,50 @@ static void ring_buffer_consumer(void) complete(&read_done); } +static inline int ring_buffer_write_trace_event(void) +{ + ring_buffer_benchmark_caller(); + return 0; +} + +static inline int ring_buffer_write_rb_event(void) +{ + struct ring_buffer_event *event; + int *entry; + + event = ring_buffer_lock_reserve(buffer, 10); + if (!event) + return -1; + + entry = ring_buffer_event_data(event); + *entry = smp_processor_id(); + ring_buffer_unlock_commit(buffer, event); + + return 0; +} + +static inline int ring_buffer_write_event(void) +{ + if (trace_event) + return ring_buffer_write_trace_event(); + else + return ring_buffer_write_rb_event(); +} + +static int wait_to_start(void) +{ + while (!rb_start_bench) { + set_current_state(TASK_INTERRUPTIBLE); + if (kthread_should_stop()) + break; + schedule_timeout(HZ/10); + ring_buffer_benchmark_caller(); + } + __set_current_state(TASK_RUNNING); + + return kthread_should_stop(); +} + static void ring_buffer_producer(void) { struct timeval start_tv; @@ -223,6 +282,13 @@ static void ring_buffer_producer(void) unsigned long avg; int cnt = 0; + if (trace_event) { + trace_printk("wait to start\n"); + if (wait_to_start()) + return; + trace_printk("starting\n"); + } + /* * Hammer the buffer for 10 secs (this may * make the system stall) @@ -230,20 +296,13 @@ static void ring_buffer_producer(void) trace_printk("Starting ring buffer hammer\n"); do_gettimeofday(&start_tv); do { - struct ring_buffer_event *event; - int *entry; int i; for (i = 0; i < write_iteration; i++) { - event = ring_buffer_lock_reserve(buffer, 10); - if (!event) { + if (ring_buffer_write_event() < 0) missed++; - } else { + else hit++; - entry = ring_buffer_event_data(event); - *entry = smp_processor_id(); - ring_buffer_unlock_commit(buffer, event); - } } do_gettimeofday(&end_tv); @@ -425,6 +484,12 @@ static int __init ring_buffer_benchmark_init(void) if (!buffer) return -ENOMEM; + /* The reader is userspace */ + if (trace_event) { + disable_reader = 1; + rb_start_bench = 0; + } + if (!disable_reader) { consumer = kthread_create(ring_buffer_consumer_thread, NULL, "rb_consumer"); diff --git a/kernel/trace/ring_buffer_bm_event.c b/kernel/trace/ring_buffer_bm_event.c new file mode 100644 index 0000000..862e80c --- /dev/null +++ b/kernel/trace/ring_buffer_bm_event.c @@ -0,0 +1,13 @@ +#include <linux/module.h> + +#define CREATE_TRACE_POINTS +#include "ring_buffer_bm_event.h" + +int rb_start_bench; +EXPORT_SYMBOL(rb_start_bench); + +void ring_buffer_benchmark_caller(void) +{ + trace_rb_benchmark(raw_smp_processor_id()); +} +EXPORT_SYMBOL(ring_buffer_benchmark_caller); diff --git a/kernel/trace/ring_buffer_bm_event.h b/kernel/trace/ring_buffer_bm_event.h new file mode 100644 index 0000000..a167c6d --- /dev/null +++ b/kernel/trace/ring_buffer_bm_event.h @@ -0,0 +1,38 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM rb_bench + +#if !defined(_TRACE_RB_BENCHMARK_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_RB_BENCHMARK_H + +#include <linux/tracepoint.h> + +extern int rb_start_bench; +void ring_buffer_benchmark_caller(void); + +TRACE_EVENT(rb_benchmark, + + TP_PROTO(int mydata), + + TP_ARGS(mydata), + + TP_STRUCT__entry( + __field( int, mydata ) + ), + + TP_fast_assign( + rb_start_bench = 1; + __entry->mydata = mydata; + ), + + TP_printk("data=%d", __entry->mydata) +); + +#endif /* _TRACE_RB_BENCHMARK_H */ + +#undef TRACE_INCLUDE_FILE +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH . +#define TRACE_INCLUDE_FILE ring_buffer_bm_event + +/* This part must be outside protection */ +#include <trace/define_trace.h> ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 14:38 ` Steven Rostedt 2010-05-20 15:04 ` Steven Rostedt 2010-05-20 20:12 ` Steven Rostedt @ 2010-05-21 17:49 ` Ingo Molnar 2010-05-24 20:13 ` Steven Rostedt 2 siblings, 1 reply; 24+ messages in thread From: Ingo Molnar @ 2010-05-21 17:49 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter * Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 2010-05-20 at 11:31 +0200, Ingo Molnar wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > More than a year and a half ago (September 2008), at > > > Linux Plumbers, we had a meeting with several kernel > > > developers to come up with a unified ring buffer. A > > > generic ring buffer in the kernel that any subsystem > > > could use. After coming up with a set of requirements, I > > > worked on implementing it. One of the requirements was > > > to start off simple and work to become a more complete > > > buffering system. > > > > > > [...] > > > > The thing is, in tracing land and more broadly in > > instrumentation land we have _much_ more earthly problems > > these days: > > > > - Lets face it, performance of the ring-buffer sucks, in > > a big way. I've recently benchmarked it and it takes > > hundreds of instructions to trace a single event. > > Puh-lease ... > > "Puh-lease"? Yeah, we suck :-/ > Hmm, what were the timings? I just ran the > ring_buffer_benchmark module. It's in mainline, so > anyone could do the same. [...] i tested it with perf stat and got 700+ instructions for the sched:sched_wakeup tracepoint. That's without extracting the trace - that would add another significant chunk of overhead. Btw., it's not just ftrace, on the perf side i see big tracing overhead as well, it's even worse than this because we layer it on each other and because perf does a few stupid things (that need fixing). > [...] Remember to disable KERNEL_DEBUG because things > like lockdep can have a significant impact on it. I had lockdep off. > Note, instructions themselves are not that bad, but what > those instructions do count a lot more. I avoid all lock > functions and memory barriers which cause the code to be > a little bigger. But a single cmpxchg can take much more > time than a lot of other instructions put together. It's bad cycle-wise as well and obviously running through 700 instructions is bad no matter what. > On this machine: > > model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz > > It's a 2x2 (4 cores) > > With running: > > # modprobe ring_buffer_benchmark producer_fifo=10 > # sleep 30 > # cat /debug/tracing/trace > > > <...>-2636 [003] 761.969850: ring_buffer_producer_thread: Starting ring buffer hammer > <...>-2636 [003] 771.910121: ring_buffer_producer_thread: End ring buffer hammer > <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19 > <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Time: 9965321 (usecs) > <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Overruns: 0 > <...>-2636 [003] 771.910174: ring_buffer_producer_thread: Read: 76580325 (by pages) > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Entries: 75 > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Total: 76580400 > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Missed: 0 > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Hit: 76580400 > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Entries per millisec: 7684 > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: 130 ns per entry > <...>-2636 [003] 771.910177: ring_buffer_producer_thread: Sleeping for 10 secs > > <...>-2636 [003] 781.884686: ring_buffer_producer_thread: Starting ring buffer hammer > <...>-2636 [003] 790.862392: ring_buffer_producer_thread: End ring buffer hammer > <...>-2636 [003] 790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19 > <...>-2636 [003] 790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Time: 9000321 (usecs) > <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Overruns: 26581200 > <...>-2636 [003] 790.897402: ring_buffer_producer_thread: Read: 6767200 (by events) > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Entries: 0 > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Total: 33348400 > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Missed: 0 > <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Hit: 33348400 > <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Entries per millisec: 3705 > <...>-2636 [003] 790.897405: ring_buffer_producer_thread: 269 ns per entry > > While having the reader read by pages (splice mode) a > write takes 130ns per entry. Not that bad. I tried a simple loop of: int main(void) { int i; for (i = 0; i < 1000000; i++) getppid(); return 0; } with kernel/timer.c's getppid syscall having this: #undef CREATE_TRACE_POINTS #include <trace/events/sched.h> SYSCALL_DEFINE0(getppid) ... trace_sched_wakeup(current, 0); and then used: echo 1 > /debug/tracing/events/sched/sched_wakeup/enable echo sched_switch > /debug/tracing/current_tracer perf stat --repeat 10 -e instructions -e cycles ./getppid-1m with the tracepoint off i got: # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m Performance counter stats for './getppid-1m' (10 runs): 89.700.970 instructions # 0,614 IPC ( +- 0,146% ) 146.204.088 cycles ( +- 1,160% ) 0,046142868 seconds time elapsed ( +- 1,200% ) with it on i got: # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m Performance counter stats for './getppid-1m' (10 runs): 1.241.321.592 instructions # 1,617 IPC ( +- 0,054% ) 767.890.685 cycles ( +- 1,298% ) 0,241048084 seconds time elapsed ( +- 1,288% ) i.e. from 89 instructions to 1241 instructions. From 146 cycles to 767 cycles. And that's just the recording - doesnt count the recovering of the trace. (which can argued to be a slowpath in many tracing workflows when buffers are big enough.) Nehalem box. Can send the config. > But when the reader is reading each event, (I'm assuming > we get some cache line bouncing here, it is still > lockless), a write takes 269ns. That would be like 800 cycles? So it's roughly the same ballpark figure as mine - we suck. (this is a 3GHz box) > Note, the reader is an aggressive reader. That is, it > goes into a strong loop and just reads as fast as it > can. Which would cause cache line bouncing. > > When I disable the reader and just measure what a write > takes: > > # rmmod ring_buffer_benchmark > # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10 > # sleep 30 > # cat /debug/tracing/trace > > <...>-2649 [003] 1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer > <...>-2649 [003] 1151.950840: ring_buffer_producer_thread: End ring buffer hammer > <...>-2649 [003] 1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Time: 9994238 (usecs) > <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Overruns: 90635670 > <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Read: (reader disabled) > <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Entries: 62480 > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Total: 90698150 > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Missed: 0 > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Hit: 90698150 > <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075 > <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: 110 ns per entry > <...>-2649 [003] 1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs > > It takes just a 110ns. I dont seem to be able to get such low numbers - but even ~250 instructions are pretty expensive. > Lets now use perf to see what is happening: > > # perf record -a -c10000 -f > (wait 30 secs) > Ctrl-C > > # perf report > > no symbols found in /usr/sbin/irqbalance, maybe install a debug package? > # Samples: 1995496 > # > # Overhead Command Shared Object Symbol > # ........ ............... ................................................................ ...... > # > 23.80% rb_producer [kernel.kallsyms] [k] native_sched_clock > 14.83% rb_producer [kernel.kallsyms] [k] rb_reserve_next_event > 13.20% rb_producer [kernel.kallsyms] [k] ring_buffer_unlock_commit > 12.84% rb_producer [kernel.kallsyms] [k] ring_buffer_lock_reserve > 9.27% rb_producer [kernel.kallsyms] [k] rb_end_commit > 9.19% rb_producer [kernel.kallsyms] [k] __rb_reserve_next > 5.24% rb_producer [kernel.kallsyms] [k] trace_clock_local > 3.11% rb_producer [kernel.kallsyms] [k] ring_buffer_event_data > 2.88% rb_producer [ring_buffer_benchmark] [k] ring_buffer_benchmark_init > 2.53% rb_producer [kernel.kallsyms] [k] trace_recursive_unlock > 0.39% rb_producer [kernel.kallsyms] [k] do_gettimeofday > 0.34% rb_producer [kernel.kallsyms] [k] read_tsc > 0.18% rb_producer [kernel.kallsyms] [k] getnstimeofday > 0.14% swapper [kernel.kallsyms] [k] mwait_idle > > > Now that the sched_clock is %23.8 percent, it doesn't > look as bad. > > rb_reserve_next_event is 14.83, > ring_buffer_unlock_commit is 13.20, > ring_buffer_lock_reserve is 12.84, rb_end_commit is > 9.27, __rb_reserve_next is 8.18, trace_clock_local is > 5.24 (which only disables preemption and calls > sched_clock), ring_buffer_event_data is 3.11 which only > returns a pointer on a structure (once a event). > > ring_buffer_benchmark_init is the benchmark code itself, > and trace_recursive_unlock is the code you asked me to > add to check for cases where the tracer could have > recursed itself. > > 14.83 + 13.20 + 9.27 + 8.18 + 5.24 + 3.11 = 53.83% > > Considering that the system is basically idle except for > this benchmark, that makes sense to have it that high. > We need to compare it. The sched_clock() takes up 1/3 of > the entire trace event. Should be better on my box, which has a usable TSC and my sched_clock() is in essence an RDTSC. > > - It has grown a lot of slack. It's complex and hard to > > read. > > I admit, that it got over designed, and I'm all for > replacing it. We were all involved in the initial > design, [...] Yes, by implication i definitely suck too, no argument about that :-) > [...] and I posted several versions of RFC when I > created it. But I admit, it became complex, and now that > we know what we want, we can take a step back and > redesign a new one from scratch. > > But until we do so, I'm continuing to support the > current one, and adding small updates when needed. > > > > > - Over the last year or so the majority of bleeding-edge > > tracing developers have gradually migrated over to perf > > and 'perf trace' / 'perf probe' in particular. As far > > as the past two merge windows go they are > > out-developing the old ftrace UIs by a ratio of 4:1. > > Of course a new project will be in a high state of flux. > I'm quite satisfied with the interface with ftrace and > did not feel that it needed changing. > > > > > So this angle is becoming a prime thing to improve and > > users and developers are hurting from the ftrace/perf > > duality. > > I agree. > > > > > - [ While it's still a long way off, if this trend continues > > we eventually might even be able to get rid of the > > /debug/tracing/ temporary debug API and get rid of > > the ugly in-kernel pretty-printing bits. This is > > good: it may make Andrew very happy for a change ;-) > > So how do you plan on getting rid of the in-kernel > pretty-printing bits and keeping the trace_dump_on_oops? > There's a strong dependency between those two. One approach would be to add a library that can turn the specification into a pretty-print output - say in tools/perf/lib/printlib/ (or lib/printlib/), and co-build it both in perf and in the kernel. That way we get most of the pretty-printing functionality at oops (or critical event) time, but have it in a manageable form. > > The main detail here to be careful of is that lots of > > people are fond of the simplicity of the > > /debug/tracing/ debug UI, so when we replace it we > > want to do it by keeping that simple workflow (or > > best by making it even simpler). I have a few ideas > > how to do this. > > Note, I still would not want to remove the /debug > interface. It is a debug interface anyway. Having both a > syscall access the functionality of the code and having > a debug interface access it very trivial. This could > also be a way we can debug the tool too. > > We can officially support only the tool side, but why > throw away the baby with the bath water? As long as we declare it to go away the moment we have something equivalent or better on the 'perf ftrace' / 'ftrace' tool side, sure. I worry about what Andrew mentioned a few months (years?) ago: unintentional ABI tool bindings to /debug/tracing, just due to us offering it. We want to gravitate towards offering one superb solution for all the tracing goodies we've built and are going to build. > > There's also the detail that in some cases we want to > > print events in the kernel in a human readable way: > > for example EDAC/MCE and other critical events, > > trace-on-oops, etc. This too can be solved. ] > > But if the code for trace-on-oops is there, then the > pretty print code is also there. Not if we share it with the user-space tool (in a generic way) and drive it from the user-space side, not the kernel side. The executable instructions will still be there in the kernel, for those relatively rare cases that need it (lets face it, the usage proportion between ordinary tracing and dump-on-oops is probably 1:100 or worse). > > Regarding performance and complexity, which is our > > main worry atm, fortunately there's work going on in > > that direction - please see PeterZ's recent string of > > patches on lkml: > > > > 4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events > > a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware > > ef60777: perf: Optimize the perf_output() path by removing IRQ-disables > > fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t > > 6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg() > > Yes, I saw these, and it helps a little, but still not everything is > there. Most definitely. We still suck :-) > > And it may sound harsh but at this stage i'm > > personally not at all interested in big design talk. > > Rewriting ftrace on top of perf is a big design change. Yeah, and a pretty rewarding one. > > This isnt rocket science, we have developers and users > > and we know what they are doing and we know what we > > need to do: we need to improve our crap and we need to > > reduce complexity. Less is more. > > > > So i'd like to see iterative, useful action first, and > > i am somewhat sceptical about yet another grand > > tracing design trying to match 100 requirements. > > Here's a few things that I find required: > > The ability to use splice. You keep telling me this is > not important, but splice is magnitudes faster than > mmaping into userspace and then copying into a file. > [...] That claim i find hard to believe, have you got the numbers for that? It's faster for sure if implemented correctly, but i'd be surprised if it made more than 20% of a difference to the total tracing overhead. > [...] If the splice removes the extra copy as we are > working on, it will get even faster. It also makes > writing over the network much easier. Which trace-cmd > does, and it took me half a day to make it do so. > > We also need a way to easily read the buffer from the > kernel. This is for trace-on-oops. > > Overwrite mode must also be supported. I know that Peter > thinks its silly to do this and have it write to a file, > but I find it helpful to do so. But fine, I can live > without overwrite to a file, if no one else cares. But > overwrite mode itself must work for trace-on-oops. We can do overwrite mode, it doesnt look like a central issue. I'm sure we'll find a dozen other things missing as well. We have to start looking. > We need a way to start and stop tracing from the kernel > and userspace. Again, for trace-on-oops. When a bug is > hit, we need a way to stop it. We also must have a way > to easily start it again from userspace. > > A simple on/off switch. This is something that you and > Thomas hammered into me which was the creation of > tracing_on. > > The above can be done, but lets not just hack the > changes in. Lets really look at a proper design. This kind of capability would be met by persistent events attached to struct user. We have to start looking and doing. > > Steve, Mathieu, if you are interested please help out > > Peter with the performance and simplification work. > > The last thing we need is yet another > > replace-everything event. > > But that is what you are asking to happen to ftrace. > "replace-everything". We've got two ring-buffer implementations right now, one has to be replaced. It doesnt mean we should disrupt _two_ implementations and put in a third one, unless there are strong technical reasons for doing so. > > If we really want to create a new ring-buffer > > abstraction i'd suggest we start with Peter's, it has > > a quite sane design and stayed simple and flexible - > > if then it could be factored out a bit. > > I'm sorry, but it is very coupled with perf. [...] Then decouple it. > [...] It is simple and flexible with what perf does > today, but not what ftrace does today, or to make perf > do what ftrace does. Why? Neither overwrite mode, nor splice support, nor persistent buffering looks hard at all. > > Here are more bits of what i see as the 'action' going > > forward, in no particular order: > > > > 1) Push the /debug/tracing/events/ event description > > into sysfs, as per this thread on lkml: > > > > [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs > > > > http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e > > > > I.e. one more step towards integrating ftrace into perf. > > We could do something like this, but it is not as > trivial as it may seem. Never is. > > 2) Use 1) to unify the perf events and the ftrace > > ring-buffer. This, as things are standing is > > best done by factoring out Peter's ring-buffer > > in kernel/perf_event.c. It's properly abstracted > > and it _far_ simpler than kernel/tracing/ring_buffer.c, > > which has become a monstrosity. > > Sure, my code has become quite complex, I'll admit that. > I'm for refactoring. I really don't care what code is > used as long as I still have the requirements that make > ftrace useful for me and several others. > > And yes, any ring buffer needs to be factored out to a > defined interface. The current ring buffer in perf is > very coupled with the perf design. The event code > written to perf in include/trace/ftrace.h is still quite > a hack. Then lets make it less of a hack. > > (but i'm open to other simplifications as well) > > > > 3) Add the function-tracer and function-graph tracer > > as an event and integrate it into perf. > > Here's something that I think perf needs to do. It needs > to become a bit more modulized, and not absorb > everything into perf instead of having things use just a > part of perf. Like the breakpoint code requires full > perf to be done, and x86 has PERF_EVENTS permanently on. > Which is really a PITA. the Kconfig rule should be fixed so that PERF_EVENTS can be disabled on EMBEDDED. On non-embedded, why is it a PITA? > As I designed ftrace from day one, I tried to make it > modular. Thus, the function trace > (kernel/trace/ftrace.c) has always been a separate > entity from ftrace itself. Heck, LTTng uses it. It would > be actually trivial to implement the function tracer in > perf today. Just register a function you want called at > every function event and start running. Here i disagree: the last thing Linux needs is various fragmented pieces of instrumentation. We try hard with perf to unify. It takes time but has to happen. > I need to abstract out the "what function to call bit" > but that's about it. > > > > > This will live-test the efficiency of the unification > > and brings over the last big ftrace plugin to perf. > > > > 4) Gradually convert/port/migrate all the remaining > > plugins over as well. We need to do this very gently > > because there are users - but stop piling new > > functionality on to the old ftrace side. This usually > > involves: > > I have not been piling functionality into the ftrace > side, as you stated at the beginning of the email, that > the development has been quite slow. Yes, I've been > still working on things, but I would say it is a far cry > from "piling". Sorry, that was a poor choice of a word. We did manage to pile quite a bit of crap into the tracing hotpath though. > I did add some simple features, like the "lost events" > to the ring buffer, but that code can be looked at for > ideas for the future. It does not make merging any more > difficult than it already is. > > Fine, I'll stop some of the code I was going to work on > (multiple buffers for ftrace, event triggers, > configuring events to stop the tracer (like for > different kinds of BUG() and WARN_ON()), etc). I think those are excellent features that seem to fit rather well into the persisent buffering concept, what do you think? Via the 'ftrace' tool we could as a first step extend tracing sessions to per user. We could also bring the start/stop/dont-worry-about-where-the-trace-is workflow to perf trace that way. Hm? > > - Conversion of an explicit tracing callback to > > TRACE_EVENT (for example in the case of mmiotrace), > > while keeping all tool functionality. > > Not sure what this is. We have a lot of explicit ad-hoc callbacks, hooks, notifiers, function pointers in the kernel, which feed some sort of event logging mechanism. These should be unified while keeping their current functionality - but exposing the callback as an event source as well. This enriches total pool of events. For example the mmiotrace tracepoints. > > - Migrate any 'special' ftrace feature to perf > > capabilities so that it's available via the > > syscall interface as well. (for example > > 'latency maximum tracking' is something that we > > probably want to do with kernel-side help - we > > probably dont want to implement it via tracing > > everything all the time and finding the maximum > > based on terabytes of data.) > > > > (And there are other complications here too, but you > > get the idea.) > > Yes, the latency tracers are special, and I treat them > special in ftrace. They can be redesigned to keep their > current functionality. That needed to be done anyway. Yeah. > > All in one, i think we can reuse more than 50% of all > > current ftrace code (possibly up to 70-80%) - and we > > are already reusing bits of it. > > Ingo, as we have had this conversation, I don't know if > you realized that I agreed to merge ftrace to perf. > There's just some functionality about ftrace I really > want to keep, which does include accessing via /debugfs, > pretty-print (which is needed by trace-on-oops) and > splice. Ok, good :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-21 17:49 ` Ingo Molnar @ 2010-05-24 20:13 ` Steven Rostedt 0 siblings, 0 replies; 24+ messages in thread From: Steven Rostedt @ 2010-05-24 20:13 UTC (permalink / raw) To: Ingo Molnar Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter On Fri, 2010-05-21 at 19:49 +0200, Ingo Molnar wrote: > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > > The thing is, in tracing land and more broadly in > > > instrumentation land we have _much_ more earthly problems > > > these days: > > > > > > - Lets face it, performance of the ring-buffer sucks, in > > > a big way. I've recently benchmarked it and it takes > > > hundreds of instructions to trace a single event. > > > Puh-lease ... > > > > "Puh-lease"? > > Yeah, we suck :-/ But the question is, do we want to suck more? > > > Hmm, what were the timings? I just ran the > > ring_buffer_benchmark module. It's in mainline, so > > anyone could do the same. [...] > > i tested it with perf stat and got 700+ instructions for > the sched:sched_wakeup tracepoint. > > That's without extracting the trace - that would add > another significant chunk of overhead. > > Btw., it's not just ftrace, on the perf side i see big > tracing overhead as well, it's even worse than this > because we layer it on each other and because perf does a > few stupid things (that need fixing). But this has nothing to do with the ring buffer. It has to do with the way the trace event itself is set up. There's little room for improvement if we want to keep things automated. Otherwise, all trace events will need to be written by us, or make all writers of TRACE_EVENT() into tracing experts. The beauty of TRACE_EVENT() was to make it simple enough that anyone could add one without needing to know the details of how it appears. The are also flexible enough to have in kernel users attach directly to them. [coming back up after reading below ] Did you test only the event "echo 1 > /debug/tracing/event/sched/sched_wakeup" or did you only do it in conjunction with the sched_switch tracer? > > > [...] Remember to disable KERNEL_DEBUG because things > > like lockdep can have a significant impact on it. > > I had lockdep off. > > > Note, instructions themselves are not that bad, but what > > those instructions do count a lot more. I avoid all lock > > functions and memory barriers which cause the code to be > > a little bigger. But a single cmpxchg can take much more > > time than a lot of other instructions put together. > > It's bad cycle-wise as well and obviously running through > 700 instructions is bad no matter what. > > > On this machine: > > > > model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz > > > > It's a 2x2 (4 cores) > > > > With running: > > > > # modprobe ring_buffer_benchmark producer_fifo=10 > > # sleep 30 > > # cat /debug/tracing/trace > > > > > > <...>-2636 [003] 761.969850: ring_buffer_producer_thread: Starting ring buffer hammer > > <...>-2636 [003] 771.910121: ring_buffer_producer_thread: End ring buffer hammer > > <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Consumer at nice: 19 > > <...>-2636 [003] 771.910172: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > > <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Time: 9965321 (usecs) > > <...>-2636 [003] 771.910173: ring_buffer_producer_thread: Overruns: 0 > > <...>-2636 [003] 771.910174: ring_buffer_producer_thread: Read: 76580325 (by pages) > > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Entries: 75 > > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Total: 76580400 > > <...>-2636 [003] 771.910175: ring_buffer_producer_thread: Missed: 0 > > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Hit: 76580400 > > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: Entries per millisec: 7684 > > <...>-2636 [003] 771.910176: ring_buffer_producer_thread: 130 ns per entry > > <...>-2636 [003] 771.910177: ring_buffer_producer_thread: Sleeping for 10 secs > > > > <...>-2636 [003] 781.884686: ring_buffer_producer_thread: Starting ring buffer hammer > > <...>-2636 [003] 790.862392: ring_buffer_producer_thread: End ring buffer hammer > > <...>-2636 [003] 790.897399: ring_buffer_producer_thread: Running Consumer at nice: 19 > > <...>-2636 [003] 790.897400: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > > <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Time: 9000321 (usecs) > > <...>-2636 [003] 790.897401: ring_buffer_producer_thread: Overruns: 26581200 > > <...>-2636 [003] 790.897402: ring_buffer_producer_thread: Read: 6767200 (by events) > > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Entries: 0 > > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Total: 33348400 > > <...>-2636 [003] 790.897403: ring_buffer_producer_thread: Missed: 0 > > <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Hit: 33348400 > > <...>-2636 [003] 790.897404: ring_buffer_producer_thread: Entries per millisec: 3705 > > <...>-2636 [003] 790.897405: ring_buffer_producer_thread: 269 ns per entry > > > > While having the reader read by pages (splice mode) a > > write takes 130ns per entry. Not that bad. > > I tried a simple loop of: > > int main(void) > { > int i; > > for (i = 0; i < 1000000; i++) > getppid(); > > return 0; > } > > with kernel/timer.c's getppid syscall having this: > > #undef CREATE_TRACE_POINTS > #include <trace/events/sched.h> > > SYSCALL_DEFINE0(getppid) > ... > trace_sched_wakeup(current, 0); > > and then used: > > echo 1 > /debug/tracing/events/sched/sched_wakeup/enable > echo sched_switch > /debug/tracing/current_tracer This is not fair, you just enabled two probes, not just one. There's the sched_wakeup generic event "/debug/tracing/events/sched/sched_wakeup" and the sched_switch tracer that adds its own probe. Thus, the trace_sched_wakeup() calls two probes each time. Have you seen the code for the probe of sched_switch tracer's trace_sched_wakeup? What you did will cause the getppid to call both the generic TRACE_EVENT() probe as well as the sched_switch attached probe. Here is the sched_switch sched_wakeup probe: static void probe_sched_wakeup(struct rq *__rq, struct task_struct *wakee, int success) { struct trace_array_cpu *data; unsigned long flags; int cpu, pc; if (unlikely(!sched_ref)) return; tracing_record_cmdline(current); if (!tracer_enabled || sched_stopped) return; pc = preempt_count(); local_irq_save(flags); cpu = raw_smp_processor_id(); data = ctx_trace->data[cpu]; if (likely(!atomic_read(&data->disabled))) tracing_sched_wakeup_trace(ctx_trace, wakee, current, flags, pc); local_irq_restore(flags); } It also disables interrupts, but lets look at some of the functions it calls: void tracing_record_cmdline(struct task_struct *tsk) { if (atomic_read(&trace_record_cmdline_disabled) || !tracer_enabled || !tracing_is_on()) return; trace_save_cmdline(tsk); } static void trace_save_cmdline(struct task_struct *tsk) { unsigned pid, idx; if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT)) return; /* * It's not the end of the world if we don't get * the lock, but we also don't want to spin * nor do we want to disable interrupts, * so if we miss here, then better luck next time. */ if (!arch_spin_trylock(&trace_cmdline_lock)) return; idx = map_pid_to_cmdline[tsk->pid]; if (idx == NO_CMDLINE_MAP) { idx = (cmdline_idx + 1) % SAVED_CMDLINES; /* * Check whether the cmdline buffer at idx has a pid * mapped. We are going to overwrite that entry so we * need to clear the map_pid_to_cmdline. Otherwise we * would read the new comm for the old pid. */ pid = map_cmdline_to_pid[idx]; if (pid != NO_CMDLINE_MAP) map_pid_to_cmdline[pid] = NO_CMDLINE_MAP; map_cmdline_to_pid[idx] = tsk->pid; map_pid_to_cmdline[tsk->pid] = idx; cmdline_idx = idx; } memcpy(&saved_cmdlines[idx], tsk->comm, TASK_COMM_LEN); arch_spin_unlock(&trace_cmdline_lock); } The above is grabbing a spinlock! tracing_sched_wakeup_trace(struct trace_array *tr, struct task_struct *wakee, struct task_struct *curr, unsigned long flags, int pc) { struct ftrace_event_call *call = &event_wakeup; struct ring_buffer_event *event; struct ctx_switch_entry *entry; struct ring_buffer *buffer = tr->buffer; event = trace_buffer_lock_reserve(buffer, TRACE_WAKE, sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); entry->prev_pid = curr->pid; entry->prev_prio = curr->prio; entry->prev_state = curr->state; entry->next_pid = wakee->pid; entry->next_prio = wakee->prio; entry->next_state = wakee->state; entry->next_cpu = task_cpu(wakee); if (!filter_check_discard(call, entry, buffer, event)) ring_buffer_unlock_commit(buffer, event); ftrace_trace_stack(tr->buffer, flags, 6, pc); ftrace_trace_userstack(tr->buffer, flags, pc); } This has little to do with the ring buffer, or the way ftrace does events. It can certainly be made faster, but wakeups are rather rare (compared to other events that are traced). > > perf stat --repeat 10 -e instructions -e cycles ./getppid-1m > > with the tracepoint off i got: > > # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m > > Performance counter stats for './getppid-1m' (10 runs): > > 89.700.970 instructions # 0,614 IPC ( +- 0,146% ) > 146.204.088 cycles ( +- 1,160% ) > > 0,046142868 seconds time elapsed ( +- 1,200% ) > > > with it on i got: > > # perf stat -B --repeat 10 -e instructions -e cycles ./getppid-1m > > Performance counter stats for './getppid-1m' (10 runs): > > 1.241.321.592 instructions # 1,617 IPC ( +- 0,054% ) > 767.890.685 cycles ( +- 1,298% ) > > 0,241048084 seconds time elapsed ( +- 1,288% ) > > > i.e. from 89 instructions to 1241 instructions. From 146 > cycles to 767 cycles. But you enabled sched_switch too, right? That would cause two probes to execute, and one is a lot bigger than the other. > > And that's just the recording - doesnt count the > recovering of the trace. (which can argued to be a > slowpath in many tracing workflows when buffers are big > enough.) But still, you inflated the numbers by enabling the sched_switch tracer. > > Nehalem box. Can send the config. > > > But when the reader is reading each event, (I'm assuming > > we get some cache line bouncing here, it is still > > lockless), a write takes 269ns. > > That would be like 800 cycles? So it's roughly the same > ballpark figure as mine - we suck. (this is a 3GHz box) That was with an aggressive reader, which even though it is lockless, the reader is still dirtying the writer's cache lines. > > > Note, the reader is an aggressive reader. That is, it > > goes into a strong loop and just reads as fast as it > > can. Which would cause cache line bouncing. > > > > When I disable the reader and just measure what a write > > takes: > > > > # rmmod ring_buffer_benchmark > > # modprobe ring_buffer_benchmark disable_reader=1 producer_fifo=10 > > # sleep 30 > > # cat /debug/tracing/trace > > > > <...>-2649 [003] 1141.981639: ring_buffer_producer_thread: Starting ring buffer hammer > > <...>-2649 [003] 1151.950840: ring_buffer_producer_thread: End ring buffer hammer > > <...>-2649 [003] 1151.950841: ring_buffer_producer_thread: Running Producer at SCHED_FIFO 10 > > <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Time: 9994238 (usecs) > > <...>-2649 [003] 1151.950842: ring_buffer_producer_thread: Overruns: 90635670 > > <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Read: (reader disabled) > > <...>-2649 [003] 1151.950843: ring_buffer_producer_thread: Entries: 62480 > > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Total: 90698150 > > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Missed: 0 > > <...>-2649 [003] 1151.950844: ring_buffer_producer_thread: Hit: 90698150 > > <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: Entries per millisec: 9075 > > <...>-2649 [003] 1151.950845: ring_buffer_producer_thread: 110 ns per entry > > <...>-2649 [003] 1151.950846: ring_buffer_producer_thread: Sleeping for 10 secs > > > > It takes just a 110ns. > > I dont seem to be able to get such low numbers Well, don't run the sched_switch tracer ;-) > - but even > ~250 instructions are pretty expensive. That bench mark only tested the ring buffer by allocating its own ring buffer. It did not use the ftrace infrastructure. I tried to do the same with perf but gave up because of the limitations of the perf ring buffer. > > > Lets now use perf to see what is happening: > > > > # perf record -a -c10000 -f > > (wait 30 secs) > > Ctrl-C > > > > # perf report > > > > no symbols found in /usr/sbin/irqbalance, maybe install a debug package? > > # Samples: 1995496 > > # > > # Overhead Command Shared Object Symbol > > # ........ ............... ................................................................ ...... > > # > > 23.80% rb_producer [kernel.kallsyms] [k] native_sched_clock > > 14.83% rb_producer [kernel.kallsyms] [k] rb_reserve_next_event > > 13.20% rb_producer [kernel.kallsyms] [k] ring_buffer_unlock_commit > > 12.84% rb_producer [kernel.kallsyms] [k] ring_buffer_lock_reserve > > 9.27% rb_producer [kernel.kallsyms] [k] rb_end_commit > > 9.19% rb_producer [kernel.kallsyms] [k] __rb_reserve_next > > 5.24% rb_producer [kernel.kallsyms] [k] trace_clock_local > > 3.11% rb_producer [kernel.kallsyms] [k] ring_buffer_event_data > > 2.88% rb_producer [ring_buffer_benchmark] [k] ring_buffer_benchmark_init > > 2.53% rb_producer [kernel.kallsyms] [k] trace_recursive_unlock > > 0.39% rb_producer [kernel.kallsyms] [k] do_gettimeofday > > 0.34% rb_producer [kernel.kallsyms] [k] read_tsc > > 0.18% rb_producer [kernel.kallsyms] [k] getnstimeofday > > 0.14% swapper [kernel.kallsyms] [k] mwait_idle > > > > > > Now that the sched_clock is %23.8 percent, it doesn't > > look as bad. > > > > rb_reserve_next_event is 14.83, > > ring_buffer_unlock_commit is 13.20, > > ring_buffer_lock_reserve is 12.84, rb_end_commit is > > 9.27, __rb_reserve_next is 8.18, trace_clock_local is > > 5.24 (which only disables preemption and calls > > sched_clock), ring_buffer_event_data is 3.11 which only > > returns a pointer on a structure (once a event). > > > > ring_buffer_benchmark_init is the benchmark code itself, > > and trace_recursive_unlock is the code you asked me to > > add to check for cases where the tracer could have > > recursed itself. > > > > 14.83 + 13.20 + 9.27 + 8.18 + 5.24 + 3.11 = 53.83% > > > > Considering that the system is basically idle except for > > this benchmark, that makes sense to have it that high. > > We need to compare it. The sched_clock() takes up 1/3 of > > the entire trace event. > > Should be better on my box, which has a usable TSC and my > sched_clock() is in essence an RDTSC. Hmm, this box does have an unstable TSC, but doesn't sched_clock still do rdtsc? > > > > > - It has grown a lot of slack. It's complex and hard to > > > read. > > > > I admit, that it got over designed, and I'm all for > > replacing it. We were all involved in the initial > > design, [...] > > Yes, by implication i definitely suck too, no argument > about that :-) > > > [...] and I posted several versions of RFC when I > > created it. But I admit, it became complex, and now that > > we know what we want, we can take a step back and > > redesign a new one from scratch. > > > > But until we do so, I'm continuing to support the > > current one, and adding small updates when needed. > > > > > > > > - Over the last year or so the majority of bleeding-edge > > > tracing developers have gradually migrated over to perf > > > and 'perf trace' / 'perf probe' in particular. As far > > > as the past two merge windows go they are > > > out-developing the old ftrace UIs by a ratio of 4:1. > > > > Of course a new project will be in a high state of flux. > > I'm quite satisfied with the interface with ftrace and > > did not feel that it needed changing. > > > > > > > > So this angle is becoming a prime thing to improve and > > > users and developers are hurting from the ftrace/perf > > > duality. > > > > I agree. > > > > > > > > - [ While it's still a long way off, if this trend continues > > > we eventually might even be able to get rid of the > > > /debug/tracing/ temporary debug API and get rid of > > > the ugly in-kernel pretty-printing bits. This is > > > good: it may make Andrew very happy for a change ;-) > > > > So how do you plan on getting rid of the in-kernel > > pretty-printing bits and keeping the trace_dump_on_oops? > > There's a strong dependency between those two. > > One approach would be to add a library that can turn the > specification into a pretty-print output - say in > tools/perf/lib/printlib/ (or lib/printlib/), and co-build > it both in perf and in the kernel. Not sure how this would work. But wouldn't this be a nightmare for package developers? What about other tools that use the perf interface. You allow them right? An ideal world, the kernel should have an awesome tracing infrastructure that several tools could use. > > That way we get most of the pretty-printing functionality > at oops (or critical event) time, but have it in a > manageable form. > > > > The main detail here to be careful of is that lots of > > > people are fond of the simplicity of the > > > /debug/tracing/ debug UI, so when we replace it we > > > want to do it by keeping that simple workflow (or > > > best by making it even simpler). I have a few ideas > > > how to do this. > > > > Note, I still would not want to remove the /debug > > interface. It is a debug interface anyway. Having both a > > syscall access the functionality of the code and having > > a debug interface access it very trivial. This could > > also be a way we can debug the tool too. > > > > We can officially support only the tool side, but why > > throw away the baby with the bath water? > > As long as we declare it to go away the moment we have > something equivalent or better on the 'perf ftrace' / > 'ftrace' tool side, sure. > > I worry about what Andrew mentioned a few months (years?) > ago: unintentional ABI tool bindings to /debug/tracing, > just due to us offering it. Well, it is in the /debug fs. Depending on something called "debug" is always a risk. > > We want to gravitate towards offering one superb solution > for all the tracing goodies we've built and are going to > build. Right, and lets make it so any utility could bind to it and perform well. > > > > > There's also the detail that in some cases we want to > > > print events in the kernel in a human readable way: > > > for example EDAC/MCE and other critical events, > > > trace-on-oops, etc. This too can be solved. ] > > > > But if the code for trace-on-oops is there, then the > > pretty print code is also there. > > Not if we share it with the user-space tool (in a generic > way) and drive it from the user-space side, not the kernel > side. > > The executable instructions will still be there in the > kernel, for those relatively rare cases that need it (lets > face it, the usage proportion between ordinary tracing and > dump-on-oops is probably 1:100 or worse). But there should be a system wide tracing, that would most likely be used within the kernel. This would require a global buffer and a way to write to (and read from) in the kernel. > > > > Regarding performance and complexity, which is our > > > main worry atm, fortunately there's work going on in > > > that direction - please see PeterZ's recent string of > > > patches on lkml: > > > > > > 4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events > > > a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware > > > ef60777: perf: Optimize the perf_output() path by removing IRQ-disables > > > fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t > > > 6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg() > > > > Yes, I saw these, and it helps a little, but still not everything is > > there. > > Most definitely. We still suck :-) > > > > And it may sound harsh but at this stage i'm > > > personally not at all interested in big design talk. > > > > Rewriting ftrace on top of perf is a big design change. > > Yeah, and a pretty rewarding one. > > > > This isnt rocket science, we have developers and users > > > and we know what they are doing and we know what we > > > need to do: we need to improve our crap and we need to > > > reduce complexity. Less is more. > > > > > > So i'd like to see iterative, useful action first, and > > > i am somewhat sceptical about yet another grand > > > tracing design trying to match 100 requirements. > > > > Here's a few things that I find required: > > > > The ability to use splice. You keep telling me this is > > not important, but splice is magnitudes faster than > > mmaping into userspace and then copying into a file. > > [...] > > That claim i find hard to believe, have you got the > numbers for that? Well, it's hard to compare since perf does only mmap, and trace-cmd does only splice. But I can compare the two, although the way things are done differently, which affect the results. In my email I showed how I tested both perf and ftrace, by using the one common denominator of both: TRACE_EVENT(). I modified the ring_buffer_benchmark to test the time it takes to call a simple event. I showed what I did in the previous email. This takes a look at how both ftrace and perf handles events. I just tested both perf and trace-cmd against 2.6.34 and latest tip/perf/core. Here's the results I have: On vanilla 2.6.34: [root@ixf9 linux-trace.git]# perf record -c1 -e rb_bench:rb_benchmark -a sleep 60 [ perf record: Woken up 62 times to write data ] [ perf record: Captured and wrote 1031.015 MB perf.data (~45045722 samples) ] >From console: ------------- wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9603958 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 11041600 Entries per millisec: 1149 870 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9973544 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 11518850 Entries per millisec: 1155 865 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9973545 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 10900500 Entries per millisec: 1093 914 ns per entry Sleeping for 10 secs [root@ixf9 linux-trace.git]# perf trace | grep rb_producer | wc -l 16495123 [root@ixf9 linux-trace.git]# trace-cmd record -b 512 -s10 -e rb_bench:rb_benchmark sleep 60 disable all enable rb_bench:rb_benchmark path = /debug/tracing/events/rb_bench/rb_benchmark/enable offset=163000 offset=163000 offset=47a5e000 offset=47a5e000 Kernel buffer statistics: Note: "entries" are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero. CPU: 0 entries: 0 overrun: 0 commit overrun: 0 CPU: 1 entries: 0 overrun: 102010404 commit overrun: 0 CPU: 2 entries: 0 overrun: 0 commit overrun: 0 CPU: 3 entries: 0 overrun: 0 commit overrun: 0 >From console: ------------- wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9162808 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 49703550 Entries per millisec: 5424 184 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9973570 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 54054950 Entries per millisec: 5420 184 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9973501 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 53908300 Entries per millisec: 5405 185 ns per entry Sleeping for 10 secs [root@ixf9 linux-trace.git]# trace-cmd report | grep rb_producer | wc -l 59795170 [root@ixf9 linux-trace.git]# ls -l perf.data trace.dat -rw------- 1 root root 1082104985 2010-05-24 12:05 perf.data -rw-r--r--. 1 root root 1202053120 2010-05-24 12:10 trace.dat On latest tip/perf/core: commit d67f088e084755bdceb4f15bc6e05e309db1eea7 Author: Arnaldo Carvalho de Melo <acme@redhat.com> Date: Sun May 23 22:36:51 2010 -0300 perf report: Support multiple events on the TUI [root@ixf9 linux-trace.git]# perf record -c1 -e rb_bench:rb_benchmark -a sleep 60 [ perf record: Woken up 394 times to write data ] [ perf record: Captured and wrote 1093.414 MB perf.data (~47771989 samples) ] >From console: ------------- wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9514214 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 19149150 Entries per millisec: 2012 497 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972824 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 20701150 Entries per millisec: 2075 481 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972802 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 20365250 Entries per millisec: 2042 489 ns per entry Sleeping for 10 secs [root@ixf9 linux-trace.git]# perf trace | grep rb_producer | wc -l 17694461 [root@ixf9 linux-trace.git]# trace-cmd record -b 512 -s10 -e rb_bench:rb_benchmark sleep 60 disable all enable rb_bench:rb_benchmark path = /debug/tracing/events/rb_bench/rb_benchmark/enable offset=15d000 offset=15d000 offset=343b2000 offset=343b2000 Kernel buffer statistics: Note: "entries" are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero. CPU: 0 entries: 0 overrun: 0 commit overrun: 0 CPU: 1 entries: 0 overrun: 63560076 commit overrun: 0 CPU: 2 entries: 0 overrun: 0 commit overrun: 0 CPU: 3 entries: 0 overrun: 31949460 commit overrun: 0 >From console: ------------- wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9461863 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 50454750 Entries per millisec: 5332 187 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972793 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 54203850 Entries per millisec: 5435 183 ns per entry Sleeping for 10 secs wait to start starting Starting ring buffer hammer End ring buffer hammer Running Producer at SCHED_FIFO 10 Time: 9972746 (usecs) Overruns: 0 Read: (reader disabled) Entries: 0 Total: 0 Missed: 0 Hit: 53295900 Entries per millisec: 5344 187 ns per entry Sleeping for 10 secs [root@ixf9 linux-trace.git]# trace-cmd report | grep rb_producer | wc -l 65393428 [root@ixf9 linux-trace.git]# ls -l perf.data trace.dat -rw------- 1 root root 1147509728 2010-05-24 12:56 perf.data -rw-r--r--. 1 root root 1314439168 2010-05-24 13:01 trace.dat In vanilla 2.6.34 we have: perf: 865ns - 914ns per event ftrace: 184ns - 185ns per event events recorded to file: perf: 16,495,123 events trace-cmd: 59,795,170 events file size: perf: 1,082,104,985 bytes ftrace: 1,202,053,120 bytes which gives us: perf: ~65 bytes per event ftrace: ~20 bytes per event That was vanilla, now lets look at the results with all the latest optimizations. This has improved things in perf: In lastest tip/perf/core we have: perf: 497ns - 489ns per event ftrace: 187ns per event events recorded to file: perf: 17,694,461 events trace-cmd: 65,393,428 events file size: perf: 1,147,509,728 bytes trace-cmd: 1,314,439,168 bytes which gives us: perf: ~64 bytes per event ftrace: ~20 bytes per event Looks like Peter's lastest optimizations have cut the time to do a perf trace event almost in half. But it is still 2 1/2 times longer to do a perf event than a ftrace event. But I noticed that even though perf is recording events faster, the number of events that were saved to file hardly changed at all. Which means the reader is still a bottle neck for both cases. I use to think that ftrace had bloated events, but perf triples the size of each event. Perhaps this is also part of the reason perf events are so slow. > > It's faster for sure if implemented correctly, but i'd be > surprised if it made more than 20% of a difference to the > total tracing overhead. It's hard to compare since there's so many variables. trace-cmd recorded ~13% more file than perf. But unless we can compare straight mmap vs splice, I still take this with a grain of salt. > > > [...] If the splice removes the extra copy as we are > > working on, it will get even faster. It also makes > > writing over the network much easier. Which trace-cmd > > does, and it took me half a day to make it do so. > > > > We also need a way to easily read the buffer from the > > kernel. This is for trace-on-oops. > > > > Overwrite mode must also be supported. I know that Peter > > thinks its silly to do this and have it write to a file, > > but I find it helpful to do so. But fine, I can live > > without overwrite to a file, if no one else cares. But > > overwrite mode itself must work for trace-on-oops. > > We can do overwrite mode, it doesnt look like a central > issue. > > I'm sure we'll find a dozen other things missing as well. > We have to start looking. > > > We need a way to start and stop tracing from the kernel > > and userspace. Again, for trace-on-oops. When a bug is > > hit, we need a way to stop it. We also must have a way > > to easily start it again from userspace. > > > > A simple on/off switch. This is something that you and > > Thomas hammered into me which was the creation of > > tracing_on. > > > > The above can be done, but lets not just hack the > > changes in. Lets really look at a proper design. > > This kind of capability would be met by persistent events > attached to struct user. > > We have to start looking and doing. > > > > Steve, Mathieu, if you are interested please help out > > > Peter with the performance and simplification work. > > > The last thing we need is yet another > > > replace-everything event. > > > > But that is what you are asking to happen to ftrace. > > "replace-everything". > > We've got two ring-buffer implementations right now, one > has to be replaced. And you seem to be asking to replace the more efficient one. I admit that my ring buffer can be replaced due to its complexity, but perf needs a major face lift in its buffering scheme to keep up as a competitive tracer. > > It doesnt mean we should disrupt _two_ implementations and > put in a third one, unless there are strong technical > reasons for doing so. But we have to disrupt perf anyway to make it come close to what ftrace does today. Since neither of the buffers seem to satisfy the other's requirements then perhaps that's the best thing to do. Replace both of them. I have yet to see that perf's ring buffer is any better than ftrace's. On the contrary, the performance looks worse. Each time we add a feature, it gets more complex, and the final result of the perf ring buffer may end up being slower and more complex than the ftrace one. I can't see hacking in changes to the perf ring buffer trying to make it work. Perhaps the best thing to do is redesign it, start from scratch. I'm willing to let the ftrace ring buffer be redesigned, why can't we do that to perf as well, and merge the two utilities as we do it? Lets start with a good design instead of just patching the current design trying to make it work. That is what the O(1) scheduler suffered from. Too many, "let's try this" hacks to fix inefficiencies. Replacing it with a new design was the proper approach. > > > > If we really want to create a new ring-buffer > > > abstraction i'd suggest we start with Peter's, it has > > > a quite sane design and stayed simple and flexible - > > > if then it could be factored out a bit. > > > > I'm sorry, but it is very coupled with perf. [...] > > Then decouple it. I don't understand the coupling between perf and its ring buffer enough to do it. I just tried to make it work with the ring_buffer_benchmark and it became much more involved than I would like. It would require Peter and/or Frederic to do this work. > > > [...] It is simple and flexible with what perf does > > today, but not what ftrace does today, or to make perf > > do what ftrace does. > > Why? Neither overwrite mode, nor splice support, nor > persistent buffering looks hard at all. I tried to make it work by itself in the ring buffer benchmark. But it I gave up because it is so coupled with perf's user interface. 1) I had no idea how to create a separate buffer. 2) I had no idea how to read the buffer from the kernel. I'm sure I can figure these out if I had the time, but I don't. > > > > Here are more bits of what i see as the 'action' going > > > forward, in no particular order: > > > > > > 1) Push the /debug/tracing/events/ event description > > > into sysfs, as per this thread on lkml: > > > > > > [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs > > > > > > http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e > > > > > > I.e. one more step towards integrating ftrace into perf. > > > > We could do something like this, but it is not as > > trivial as it may seem. > > Never is. Right, and moving perf to splice/overwrite/etc, is not going to be as easy as you believe it will. > > > > 2) Use 1) to unify the perf events and the ftrace > > > ring-buffer. This, as things are standing is > > > best done by factoring out Peter's ring-buffer > > > in kernel/perf_event.c. It's properly abstracted > > > and it _far_ simpler than kernel/tracing/ring_buffer.c, > > > which has become a monstrosity. > > > > Sure, my code has become quite complex, I'll admit that. > > I'm for refactoring. I really don't care what code is > > used as long as I still have the requirements that make > > ftrace useful for me and several others. > > > > And yes, any ring buffer needs to be factored out to a > > defined interface. The current ring buffer in perf is > > very coupled with the perf design. The event code > > written to perf in include/trace/ftrace.h is still quite > > a hack. > > Then lets make it less of a hack. Agreed, lets redesign it :-) > > > > (but i'm open to other simplifications as well) > > > > > > 3) Add the function-tracer and function-graph tracer > > > as an event and integrate it into perf. > > > > Here's something that I think perf needs to do. It needs > > to become a bit more modulized, and not absorb > > everything into perf instead of having things use just a > > part of perf. Like the breakpoint code requires full > > perf to be done, and x86 has PERF_EVENTS permanently on. > > Which is really a PITA. > > the Kconfig rule should be fixed so that PERF_EVENTS can > be disabled on EMBEDDED. > > On non-embedded, why is it a PITA? It was a PITA for me since I was not able to test !PERF_EVENTS generic changes on x86. > > > As I designed ftrace from day one, I tried to make it > > modular. Thus, the function trace > > (kernel/trace/ftrace.c) has always been a separate > > entity from ftrace itself. Heck, LTTng uses it. It would > > be actually trivial to implement the function tracer in > > perf today. Just register a function you want called at > > every function event and start running. > > Here i disagree: the last thing Linux needs is various > fragmented pieces of instrumentation. > > We try hard with perf to unify. It takes time but has to > happen. I'm all for making it unified, it is that I do not want to have to learn every aspect of perf just to change part of it. Or perhaps there's another subsystem where it could use a part of perf nicely. This is difficult to do with its design today. > > > I need to abstract out the "what function to call bit" > > but that's about it. > > > > > > > > This will live-test the efficiency of the unification > > > and brings over the last big ftrace plugin to perf. > > > > > > 4) Gradually convert/port/migrate all the remaining > > > plugins over as well. We need to do this very gently > > > because there are users - but stop piling new > > > functionality on to the old ftrace side. This usually > > > involves: > > > > I have not been piling functionality into the ftrace > > side, as you stated at the beginning of the email, that > > the development has been quite slow. Yes, I've been > > still working on things, but I would say it is a far cry > > from "piling". > > Sorry, that was a poor choice of a word. We did manage to > pile quite a bit of crap into the tracing hotpath though. I was looking at the hot path of the ring buffer, and I think it was a major design mistake in embedding the timestamp into the ring buffer level. That should have been done in a higher level. By removing that requirement, it would have made the ring buffer code hot path a lot nicer. We all failed in that design mistake. (I'm as guilty as anyone) > > > I did add some simple features, like the "lost events" > > to the ring buffer, but that code can be looked at for > > ideas for the future. It does not make merging any more > > difficult than it already is. > > > > Fine, I'll stop some of the code I was going to work on > > (multiple buffers for ftrace, event triggers, > > configuring events to stop the tracer (like for > > different kinds of BUG() and WARN_ON()), etc). > > > I think those are excellent features that seem to fit > rather well into the persisent buffering concept, what do > you think? > > Via the 'ftrace' tool we could as a first step extend > tracing sessions to per user. We could also bring the > start/stop/dont-worry-about-where-the-trace-is workflow to > perf trace that way. > > Hm? > > > > - Conversion of an explicit tracing callback to > > > TRACE_EVENT (for example in the case of mmiotrace), > > > while keeping all tool functionality. > > > > Not sure what this is. > > We have a lot of explicit ad-hoc callbacks, hooks, > notifiers, function pointers in the kernel, which feed > some sort of event logging mechanism. These should be > unified while keeping their current functionality - but > exposing the callback as an event source as well. This > enriches total pool of events. > > For example the mmiotrace tracepoints. I'm not against any of this. Just need to take some time in finding a proper design to this. One thing we have to consider is looking at the use of the tracing infrastructure in the kernel outside of kernel developers. At the Linux Collaboration Summit, several people (non kernel developers) wanted to have some amount of tracing happening in a production system. To have this happen, the overhead must be at least configured to a minimum. Especially if we want a real competitor to DTrace. -- Steve > > > > - Migrate any 'special' ftrace feature to perf > > > capabilities so that it's available via the > > > syscall interface as well. (for example > > > 'latency maximum tracking' is something that we > > > probably want to do with kernel-side help - we > > > probably dont want to implement it via tracing > > > everything all the time and finding the maximum > > > based on terabytes of data.) > > > > > > (And there are other complications here too, but you > > > get the idea.) > > > > Yes, the latency tracers are special, and I treat them > > special in ftrace. They can be redesigned to keep their > > current functionality. That needed to be done anyway. > > Yeah. > > > > All in one, i think we can reuse more than 50% of all > > > current ftrace code (possibly up to 70-80%) - and we > > > are already reusing bits of it. > > > > Ingo, as we have had this conversation, I don't know if > > you realized that I agreed to merge ftrace to perf. > > There's just some functionality about ftrace I really > > want to keep, which does include accessing via /debugfs, > > pretty-print (which is needed by trace-on-oops) and > > splice. > > Ok, good :-) > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFD] Future tracing/instrumentation directions 2010-05-20 9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar 2010-05-20 10:07 ` Frederic Weisbecker 2010-05-20 14:38 ` Steven Rostedt @ 2010-05-21 9:24 ` Li Zefan 2 siblings, 0 replies; 24+ messages in thread From: Li Zefan @ 2010-05-21 9:24 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming, Cyrill Gorcunov, Mike Galbraith, Paul Mackerras, Hitoshi Mitake, Robert Richter Ingo Molnar wrote: > * Steven Rostedt <rostedt@goodmis.org> wrote: > >> More than a year and a half ago (September 2008), at >> Linux Plumbers, we had a meeting with several kernel >> developers to come up with a unified ring buffer. A >> generic ring buffer in the kernel that any subsystem >> could use. After coming up with a set of requirements, I >> worked on implementing it. One of the requirements was >> to start off simple and work to become a more complete >> buffering system. >> >> [...] > > The thing is, in tracing land and more broadly in > instrumentation land we have _much_ more earthly problems > these days: > > - Lets face it, performance of the ring-buffer sucks, in > a big way. I've recently benchmarked it and it takes > hundreds of instructions to trace a single event. > Puh-lease ... > We ran some benchmarks with all the trace events enabled except lock and kmem events, and the results showed the overhead was quite small and acceptable. But that was 2.6.31, we didn't benchmark newer kernels which have more tracepoints. ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2010-05-24 20:13 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-05-19 17:51 [RFC] Unified Ring Buffer (Next Generation) Steven Rostedt 2010-05-19 18:10 ` Andi Kleen 2010-05-19 18:44 ` Steven Rostedt 2010-05-19 19:25 ` Andi Kleen 2010-05-19 19:38 ` Steven Rostedt 2010-05-19 18:47 ` Mathieu Desnoyers 2010-05-20 6:41 ` Andi Kleen 2010-05-20 13:48 ` Mathieu Desnoyers 2010-05-19 19:05 ` Mathieu Desnoyers 2010-05-20 9:31 ` [RFD] Future tracing/instrumentation directions Ingo Molnar 2010-05-20 10:07 ` Frederic Weisbecker 2010-05-20 11:07 ` Thomas Gleixner 2010-05-20 11:42 ` Ingo Molnar 2010-05-20 11:48 ` Ingo Molnar 2010-05-20 12:15 ` Frederic Weisbecker 2010-05-20 12:19 ` Theodore Tso 2010-05-20 11:36 ` Ingo Molnar 2010-05-20 13:06 ` Frederic Weisbecker 2010-05-20 14:38 ` Steven Rostedt 2010-05-20 15:04 ` Steven Rostedt 2010-05-20 20:12 ` Steven Rostedt 2010-05-21 17:49 ` Ingo Molnar 2010-05-24 20:13 ` Steven Rostedt 2010-05-21 9:24 ` Li Zefan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).