From: Mathieu Desnoyers <compudj@krystal.dyndns.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: "Frank Ch. Eigler" <fche@redhat.com>,
Martin Bligh <mbligh@google.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
od@novell.com
Subject: Re: Unified tracing buffer
Date: Mon, 22 Sep 2008 14:45:38 -0400 [thread overview]
Message-ID: <20080922184538.GB6349@Krystal> (raw)
In-Reply-To: <alpine.DEB.1.10.0809201043490.13681@gandalf.stny.rr.com>
* Steven Rostedt (rostedt@goodmis.org) wrote:
>
>
> On Sat, 20 Sep 2008, Steven Rostedt wrote:
>
> >
> >
> > > > >
> > > > Markers and the buffers are two separate things. Perhaps I'm just tired,
> > > > but I'm thinking that you are thinking we are going to remove markers and
> > > > trace points.
> > > >
> > > > This code is only to give the kernel a ring buffer to use. Not a way to
> > > > put hooks into kernel code. We have tracepoints and markers for that.
> > > >
> > >
> > > I think what Frank tries to express is that we would not lose any
> > > flexibility, but make life much easier for everyone, if we use the
> > > markers as the API to register event ids, keep their type table and to
> > > export the data at runtime.
> >
> > No, absolutely not!
> >
> > Sorry, I don't want to touch markers. I'm fine with tracepoints, but
> > there should be no need to use a damn marker if I want to use the tracer.
> > I shouldn't need to even touch tracepoints to use the trace buffer.
> > That is making things too complicated again. The tracepoints and markers
> > should allow you to hook into the buffers. They are separate. I can
> > imagine using tracepoints without needing buffers and I can see using the
> > buffers without using tracepoints or markers. They are separate things. Do
> > not bind the use of the buffers around markers.
> >
> >
> > Markers are great for you and for many others, but this is about the
> > tracing mechanism and one should not be forced to use markers if they want
> > to do a trace.
> >
>
Hi Steven,
As I expressed above, this is merely one way I propose data could be
exported to user-space. If you have other simpler design ideas in mind,
I look forward to hear them so we can discuss the technical difficulties
associated with that kind of exercice : sending binary data across the
kernel-userspace boundary.
See below for comments.
> Mathieu,
>
> Think about the function tracer itself. It gets called at every funtion,
> where I record the interrupts enabled state, task pid, preempt state,
> function addr, and parent function addr. (that's just off the top of my
> head, I may even record more).
>
> What I don't want is a:
>
> function_call(unsigned long func, unsigned long parent)
> {
> struct ftrace_event event;
>
> event.pid = current->pid;
> event.pc = preempt_count();
> event.irq = local_irq_flags();
> event.func = func;
> event.parent = parent;
>
> trace_mark(func_event_id, "%p",
> sizeof(event), &event);
> }
>
>
> and then to turn on function tracing, I need to hook into this marker. I'd
> rather just push the data right into the buffer here without having to
> make another function call to hook into this.
>
> I'd rather have instead a simple:
>
> struct ftrace_event *event;
>
> event = ring_buffer_reserve(func_event_id,
> sizeof(*event));
>
> event->pid = current->pid;
> event->pc = preempt_count();
> event->irq = local_irq_flags();
> event->func = func;
> event->parent = parent;
>
> ring_buffer_commit(event);
>
The scheme you propose here is based on a few inherent assumptions :
- You assume ring_buffer_reserve() and ring_buffer_commit() are static
inline and thus does not turn into function calls.
- You assume these are small enough so they can be inlined without
causing L1 insn cache trashing when tracing is activated.
- You therefore assume they use a locking scheme that lets them be
really really compact (e.g. interrupt disable and spin lock).
- You assume that the performance impact of doing a function call is
bigger than the impact of locking, which is false by at least a factor
10.
Interrupt disable and spin locks are _really_ slow. So I think putting
the function call concern up front here is really a matter of premature
optimization gone wrong.
I've got burned in the past history of LTTng. The first versions has a
code generator which created specialized code to serialize the
information into the buffers, exactly like you propose to do. But the
overall impact on kernel code size ended up being too big because we
have to repeat all the code to deal with the buffers for every different
type.
However, I think there might be a way to satisfy us both. An information
source like dynamic function trace happen to fit in a particular
use-case where one single execution site is used to format the data
received as parameter for a _lot_ of instrumented sites, and the type
and event names happen to be the same everywhere. This would therefore
benefit widely of having the capability to write directly into the
buffers.
The thing is that I would like ftrace to expose the types it expects to
write into the trace buffers so a generic trace buffer userspace
consumer could read it.
One way to do it, which would let you write data directly into the
buffers, would be something like : (before anyone says "that's many
lines of code", please compile it and look at the assembly result. A lot
of this translates in precomputed values, especially for the event size
computation). (And no, the following code has not been compile-tested)
include/linux/someheader.h :
/* Calculate the offset needed to align the type */
static inline unsigned int var_align(size_t align_drift,
size_t size_of_type)
{
size_t alignment = min(sizeof(void *), size_of_type);
return ((alignment - align_drift) & (alignment-1));
}
kernel/trace/ftrace.c :
/*
* the following macro would only do the "declaration" part of the
* markers, without doing all the function call stuff.
*/
DECLARE_MARKER(function_entry,
"pid %d pc %d flags %lu func 0x%lX parent 0x%lX");
void ftrace_mcount(unsigned long ip, unsigned long parent_ip)
{
size_t ev_size = 0;
char *buffer;
/*
* We assume event payload aligned on sizeof(void *).
* Event size calculated statically.
*/
ev_size += sizeof(int);
ev_size += var_align(ev_size, sizeof(int));
ev_size += sizeof(int);
ev_size += var_align(ev_size, sizeof(unsigned long));
ev_size += sizeof(unsigned long);
ev_size += var_align(ev_size, sizeof(unsigned long));
ev_size += sizeof(unsigned long);
ev_size += var_align(ev_size, sizeof(unsigned long));
ev_size += sizeof(unsigned long);
/*
* Now reserve space and copy data.
*/
buffer = ring_buffer_reserve(func_event_id, ev_size);
/* Write pid */
*(int *)buffer = current->pid;
buffer += sizeof(int);
/* Write pc */
buffer += var_align(buffer, sizeof(int));
*(int *)buffer = preempt_count();
buffer += sizeof(int);
/* Write flags */
buffer += var_align(buffer, sizeof(unsigned long));
*(unsigned long *)buffer = local_irq_flags();
buffer += sizeof(unsigned long);
/* Write func */
buffer += var_align(buffer, sizeof(unsigned long));
*(unsigned long *)buffer = func;
buffer += sizeof(unsigned long);
/* Write parent */
buffer += var_align(buffer, sizeof(unsigned long));
*(unsigned long *)buffer = parent;
buffer += sizeof(unsigned long);
ring_buffer_commit(buffer, ev_size);
}
Would that be suitable for you ?
We could also think of passing the function pointer of the bin to ascii
converter to DECLARE_MARKER(), such as :
void function_entry_show(struct seq_file *m, char *buffer);
DECLARE_MARKER(function_entry, function_entry_show,
"pid %d pc %d flags %lu func 0x%lX parent 0x%lX");
void function_entry_show(struct seq_file *m, char *buffer)
{
/* Read pid */
seq_printf(m, "pid = %d ", *(int *)buffer);
buffer += sizeof(int);
/* Read pc */
buffer += var_align(buffer, sizeof(int));
seq_printf(m, "pc = %d ", *(int *)buffer);
buffer += sizeof(int);
/* Read flags */
buffer += var_align(buffer, sizeof(unsigned long));
seq_printf(m, "flags = %lu ", *(unsigned long *)buffer);
buffer += sizeof(unsigned long);
/* Read func */
buffer += var_align(buffer, sizeof(unsigned long));
seq_printf(m, "func = 0x%lX ", *(unsigned long *)buffer);
buffer += sizeof(unsigned long);
/* Read parent */
buffer += var_align(buffer, sizeof(unsigned long));
seq_printf(m, "parent = 0x%lX ", *(unsigned long *)buffer);
buffer += sizeof(unsigned long);
}
Note that in this particular case, given we would not need any special
"dump everything as if it was an unorganized array of bytes", the
function_entry_show() would be totally useless if we provide a sane
vsnprintf-like decoder based on the format string.
I did this example to show you how we could deal with the special cases
where people would be interested to write a whole network packet (or any
similar structure) directly to the trace (given it has field structures
which are not too tied to the compiler internals and has field sizes
portable across architectures). We could do this without much problem by
adding a format string type which specified such a binary blob, and we
could even leave room for people to provide their ascii formatting
function pointer, as shows my second example here.
Mathieu
>
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
next prev parent reply other threads:[~2008-09-22 18:50 UTC|newest]
Thread overview: 125+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-09-19 21:33 Unified tracing buffer Martin Bligh
2008-09-19 21:42 ` Randy Dunlap
2008-09-19 21:57 ` Martin Bligh
2008-09-19 22:41 ` Olaf Dabrunz
2008-09-19 22:19 ` Martin Bligh
2008-09-20 8:10 ` Olaf Dabrunz
2008-09-20 8:29 ` Steven Rostedt
2008-09-20 11:40 ` Mathieu Desnoyers
2008-09-20 8:26 ` Steven Rostedt
2008-09-20 11:44 ` Mathieu Desnoyers
2008-09-19 22:28 ` Olaf Dabrunz
2008-09-19 22:09 ` Martin Bligh
2008-09-19 23:18 ` Frank Ch. Eigler
2008-09-20 8:50 ` Steven Rostedt
2008-09-20 13:37 ` Mathieu Desnoyers
2008-09-20 13:51 ` Steven Rostedt
2008-09-20 14:54 ` Steven Rostedt
2008-09-22 18:45 ` Mathieu Desnoyers [this message]
2008-09-22 21:39 ` Steven Rostedt
2008-09-23 3:27 ` Mathieu Desnoyers
2008-09-20 0:07 ` Peter Zijlstra
2008-09-22 14:07 ` K.Prasad
2008-09-22 14:45 ` Peter Zijlstra
2008-09-22 16:29 ` Martin Bligh
2008-09-22 16:36 ` Peter Zijlstra
2008-09-22 20:50 ` Masami Hiramatsu
2008-09-23 3:05 ` Mathieu Desnoyers
2008-09-23 2:49 ` Mathieu Desnoyers
2008-09-23 5:25 ` Tom Zanussi
2008-09-23 9:31 ` Peter Zijlstra
2008-09-23 18:13 ` Mathieu Desnoyers
2008-09-23 18:13 ` Mathieu Desnoyers
2008-09-23 18:33 ` Christoph Lameter
2008-09-23 18:33 ` Christoph Lameter
2008-09-23 18:56 ` Linus Torvalds
2008-09-23 18:56 ` Linus Torvalds
2008-09-23 13:50 ` Mathieu Desnoyers
2008-09-23 14:00 ` Martin Bligh
2008-09-23 17:55 ` K.Prasad
2008-09-23 18:27 ` Martin Bligh
2008-09-24 3:50 ` Tom Zanussi
2008-09-24 5:42 ` K.Prasad
2008-09-25 6:07 ` [RFC PATCH 0/8] current relay cleanup patchset Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al Tom Zanussi
2008-09-25 6:08 ` [RFC PATCH 8/8] relay - Clean up remaining padding-related junk Tom Zanussi
2008-09-23 5:27 ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi
2008-09-23 20:15 ` Andrew Morton
2008-09-23 5:27 ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi
2008-09-23 20:17 ` Andrew Morton
2008-09-23 5:27 ` [PATCH 3/3] relay - add channel flags Tom Zanussi
2008-09-23 20:20 ` Andrew Morton
2008-09-24 3:57 ` Tom Zanussi
2008-09-20 0:26 ` Unified tracing buffer Marcel Holtmann
2008-09-20 9:03 ` Steven Rostedt
2008-09-20 13:55 ` Mathieu Desnoyers
2008-09-20 14:12 ` Arjan van de Ven
2008-09-22 18:52 ` Mathieu Desnoyers
2008-10-02 15:28 ` Jason Baron
2008-10-03 16:11 ` Mathieu Desnoyers
2008-10-03 18:37 ` Jason Baron
2008-10-03 19:10 ` Mathieu Desnoyers
2008-10-03 19:25 ` Jason Baron
2008-10-03 19:56 ` Mathieu Desnoyers
2008-10-03 20:25 ` Jason Baron
2008-10-03 21:52 ` Frank Ch. Eigler
2008-09-22 3:09 ` KOSAKI Motohiro
2008-09-22 9:57 ` Peter Zijlstra
2008-09-23 2:36 ` Mathieu Desnoyers
2008-09-22 13:57 ` K.Prasad
2008-09-22 19:45 ` Masami Hiramatsu
2008-09-22 20:13 ` Martin Bligh
2008-09-22 22:25 ` Masami Hiramatsu
2008-09-22 23:11 ` Darren Hart
2008-09-23 0:04 ` Masami Hiramatsu
2008-09-22 23:16 ` Martin Bligh
2008-09-23 0:05 ` Masami Hiramatsu
2008-09-23 0:12 ` Martin Bligh
2008-09-23 14:49 ` Masami Hiramatsu
2008-09-23 15:04 ` Mathieu Desnoyers
2008-09-23 15:30 ` Masami Hiramatsu
2008-09-23 16:01 ` Linus Torvalds
2008-09-23 17:04 ` Masami Hiramatsu
2008-09-23 17:30 ` Thomas Gleixner
2008-09-23 18:59 ` Masami Hiramatsu
2008-09-23 19:36 ` Thomas Gleixner
2008-09-23 19:38 ` Martin Bligh
2008-09-23 19:41 ` Thomas Gleixner
2008-09-23 19:50 ` Martin Bligh
2008-09-23 20:03 ` Thomas Gleixner
2008-09-23 21:02 ` Martin Bligh
2008-09-23 20:03 ` Masami Hiramatsu
2008-09-23 20:08 ` Thomas Gleixner
2008-09-23 15:46 ` Linus Torvalds
2008-09-23 0:39 ` Linus Torvalds
2008-09-23 1:26 ` Roland Dreier
2008-09-23 1:39 ` Steven Rostedt
2008-09-23 2:02 ` Mathieu Desnoyers
2008-09-23 2:26 ` Darren Hart
2008-09-23 2:31 ` Mathieu Desnoyers
2008-09-23 3:26 ` Linus Torvalds
2008-09-23 3:36 ` Mathieu Desnoyers
2008-09-23 4:05 ` Linus Torvalds
2008-09-23 3:43 ` Steven Rostedt
2008-09-23 4:10 ` Masami Hiramatsu
2008-09-23 4:17 ` Martin Bligh
2008-09-23 15:23 ` Masami Hiramatsu
2008-09-23 10:53 ` Steven Rostedt
2008-09-23 4:19 ` Linus Torvalds
2008-09-23 14:12 ` Mathieu Desnoyers
2008-09-23 2:30 ` Mathieu Desnoyers
2008-09-23 3:06 ` Masami Hiramatsu
2008-09-23 14:36 ` KOSAKI Motohiro
2008-09-23 15:02 ` Frank Ch. Eigler
2008-09-23 15:21 ` Masami Hiramatsu
2008-09-23 17:59 ` KOSAKI Motohiro
2008-09-23 18:28 ` Martin Bligh
2008-09-23 3:33 ` Andi Kleen
2008-09-23 3:47 ` Martin Bligh
2008-09-23 5:04 ` Andi Kleen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080922184538.GB6349@Krystal \
--to=compudj@krystal.dyndns.org \
--cc=fche@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mbligh@google.com \
--cc=od@novell.com \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.