From: Mathieu Desnoyers <compudj@krystal.dyndns.org>
To: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>,
linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Frederic Weisbecker <fweisbec@gmail.com>,
Theodore Tso <tytso@mit.edu>,
Arjan van de Ven <arjan@infradead.org>,
Pekka Paalanen <pq@iki.fi>,
Arnaldo Carvalho de Melo <acme@redhat.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Martin Bligh <mbligh@google.com>,
"Frank Ch. Eigler" <fche@redhat.com>,
Tom Zanussi <tzanussi@gmail.com>,
Masami Hiramatsu <mhiramat@redhat.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Jason Baron <jbaron@redhat.com>,
Christoph Hellwig <hch@infradead.org>,
Jiaying Zhang <jiayingz@google.com>,
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>,
mrubin@google.com, md@google.com
Subject: Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
Date: Wed, 4 Mar 2009 10:39:35 -0500
Message-ID: <20090304153934.GA31573@Krystal>
In-Reply-To: <20090304102640.GB31239@elte.hu>
* Ingo Molnar (mingo@elte.hu) wrote:
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > RFC only, not for pulling, unless everyone is fine with these
> > :-)
> >
> > After telling the folks from Google that I had splice code
> > that I needed to get working, I finally got around to doing
> > it.
> >
> > Not only did I get the splice code working, but I also made a
> > perl script (might want to cover your eyes if you look at that
> > code) that can automagically create the output from the binary
> > files reading the format arguments in the
> > /debugfs/tracing/events/<subsys>/<event>/format file.
>
> Very cool stuff!
>
> The ftrace splice path allows the following tracing path: after
> we create a trace ringbuffer page in the kernel, we don't touch
> it _ever again_. We can shuffle it to disk via DMA or over the
> network via DMA without bringing it again into the CPU's caches.
>
> This feature allows low-overhead high-throughput tracing on
> unprecedented levels. There's no format string overhead, no
> instruction patching/trapping overhead - straight raw binary
> tracing with C syntax tracepoints and a zero-copy path to
> storage.
>
> Have you had any chance to measure tracing overhead and the
> maximum throughput we can reach with the ftrace splice code? I'd
> expect to see some really nice numbers.
>
"unprecedented levels" -> LTTng has been using splice for about 5 months.
The only reason I have not posted the patchset that does this is that I am
currently adapting an LTTng module to optionally format the binary
buffers as text output. And yes, it provides very, very good
performance results.
Mathieu
> > From previous patches, we have in include/trace/sched_event_types.h:
> >
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM sched
> >
> > TRACE_EVENT_FORMAT(sched_switch,
> > TPPROTO(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next),
> > TPARGS(rq, prev, next),
> > TPFMT("task %s:%d ==> %s:%d",
> > prev->comm, prev->pid, next->comm, next->pid),
> > TRACE_STRUCT(
> > TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > TRACE_FIELD(int, prev_prio, prev->prio)
> > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > next_comm,
> > TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > next->comm,
> > TASK_COMM_LEN)))
> > TRACE_FIELD(pid_t, next_pid, next->pid)
> > TRACE_FIELD(int, next_prio, next->prio)
> > ),
> > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > );
> >
> > By making that, we automagically get this file:
> >
> > # cat /debug/tracing/events/sched/sched_switch/format
> > name: sched_switch
> > ID: 29
> > format:
> > field:unsigned char type; offset:0; size:1;
> > field:unsigned char flags; offset:1; size:1;
> > field:unsigned char preempt_count; offset:2; size:1;
> > field:int pid; offset:4; size:4;
> > field:int tgid; offset:8; size:4;
> >
> > field:pid_t prev_pid; offset:12; size:4;
> > field:int prev_prio; offset:16; size:4;
> > field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
> > field:pid_t next_pid; offset:36; size:4;
> > field:int next_prio; offset:40; size:4;
> >
> > print fmt: "prev %d:%d ==> next %s:%d:%d"
> >
> >
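To make the format file above concrete, here is a minimal user-space sketch (not taken from rb-read.pl or the patch set) that pulls the sched_switch fields out of one raw entry using the listed offsets and sizes. The struct and function names are hypothetical. It assumes `rec` already points at the start of the entry described by the format file (any ring-buffer page or event header has been skipped) and that the trace was recorded with the same byte order as the reader.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets and sizes copied from the sched_switch format file quoted above. */
#define OFF_PREV_PID   12
#define OFF_PREV_PRIO  16
#define OFF_NEXT_COMM  20
#define OFF_NEXT_PID   36
#define OFF_NEXT_PRIO  40
#define COMM_LEN       16
#define ENTRY_LEN      44   /* last offset (40) + last size (4) */

/* Hypothetical decoded view of one sched_switch entry. */
struct sched_switch_rec {
    int32_t prev_pid;
    int32_t prev_prio;
    char    next_comm[COMM_LEN + 1];   /* +1 so it is always NUL-terminated */
    int32_t next_pid;
    int32_t next_prio;
};

/* Read a 4-byte field without assuming the buffer is suitably aligned. */
static int32_t load_i32(const unsigned char *rec, size_t off)
{
    int32_t v;
    memcpy(&v, rec + off, sizeof(v));
    return v;
}

/* Returns 0 on success, -1 if the entry is too short to hold every field. */
int decode_sched_switch(const unsigned char *rec, size_t len,
                        struct sched_switch_rec *out)
{
    assert(rec != NULL && out != NULL);
    if (len < ENTRY_LEN)
        return -1;

    out->prev_pid  = load_i32(rec, OFF_PREV_PID);
    out->prev_prio = load_i32(rec, OFF_PREV_PRIO);
    memcpy(out->next_comm, rec + OFF_NEXT_COMM, COMM_LEN);
    out->next_comm[COMM_LEN] = '\0';
    out->next_pid  = load_i32(rec, OFF_NEXT_PID);
    out->next_prio = load_i32(rec, OFF_NEXT_PRIO);
    return 0;
}
```

A real reader would parse the offsets from the format file at run time rather than hard-coding them, since the layout can differ between kernels.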
> > Now with this patch set, we create a way to read the ftrace
> > ring buffers directly, as a binary page. Splice has been used
> > such that the user could splice the ring buffers without need
> > to copy the pages. The pages are taken from the ring buffers
> > and can be placed directly into files, without extra copies.
> >
> > # ls /debug/tracing/binary_buffers/
> > 0 1 2 3 4 5 6 7
> >
> > One can either just use the read/write to grab live data from
> > these buffers, or they could use splice. I have a simple file
> > that reads these buffers using splice. Note, it only runs on
> > one file, you can make multiple copies to run more. The ring
> > buffers in ftrace are per cpu and they are not dependent on
> > each other.
> >
> > Also, if there is no data in the buffer, it returns -EAGAIN.
> >
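For readers unfamiliar with the mechanism, the following is a rough sketch of a splice-based copy loop. It is not Steven's splice.c; `splice_all` is a hypothetical helper name. splice(2) needs a pipe on one side of each call, so the data goes source -> pipe -> destination. The sketch assumes a blocking pipe and treats EAGAIN from the source as "no data right now", matching the behaviour described above.

```c
#define _GNU_SOURCE           /* splice() is a Linux-specific extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move whatever is currently readable from in_fd to out_fd through a pipe.
 * Returns the number of bytes moved, or -1 on error with errno set.
 * "chunk" is the largest amount requested per splice() call.
 */
ssize_t splice_all(int in_fd, int out_fd, size_t chunk)
{
    int p[2];
    ssize_t total = 0;
    int failed = 0;

    assert(chunk > 0);
    if (pipe(p) < 0)
        return -1;

    while (!failed) {
        /* Source -> pipe: pages are handed over, not copied to user space. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, chunk, SPLICE_F_MOVE);
        if (n == 0 || (n < 0 && errno == EAGAIN))
            break;                      /* end of data, or buffer empty */
        if (n < 0) {
            failed = 1;
            break;
        }
        /* Pipe -> destination: drain everything that was just queued. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                failed = 1;
                break;
            }
            n -= m;
            total += m;
        }
    }

    int saved = errno;                  /* keep errno across close() */
    close(p[0]);
    close(p[1]);
    errno = saved;
    return failed ? -1 : total;
}
```

One instance of such a loop per binary_buffers file would correspond to the "multiple copies" mentioned above, since the per-cpu buffers are independent.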
> > # find /debug/tracing/events -name 'type' | while read f; do
> > > echo raw > $f; done
> >
> > # find /debug/tracing/events -name 'enable' | while read f; do
> > > echo 1 > $f; done
> >
> > # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
> >
> > Yes you can run multiple instances of this on different buffers.
> >
> > # ./rb-read.pl /tmp/buf-0
> >
> > produces:
> >
> > 0 [000] 7071.936459: (irq_handler_entry) irq 48
> > 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> > 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> > 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> > 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> > 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> > 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> > 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> > 16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> >
> > Notice the "(sched_switch)" lines.
> >
> > The splice C code:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> >
> > The perl script to parse:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> >
> > And yes I know, they are both ugly :-p
>
> Any chance of merging them into a single perl script? I.e. the
> perl script would embed the .c code and build it on every
> invocation (into /tmp), so that one can do single-tool tracing
> with no immediate binary components. Such self-sufficiency is
> very handy when probing systems in a minimally invasive way.
>
> And we could also have twice the ugliness for the same price.
>
> > The following patches are in:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> >
> > branch: rfc/splice/tip/tracing/ftrace
>
> Looks like the only problem with the code is the -EFAULT
> handling bug it exposed in simple_read_from_buffer() - but that
> fix should go on a separate track IMO.
>
> So i've pulled it into tip:tracing - thanks Steve! We definitely
> want this in v2.6.30.
>
> Ingo
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68