From: Karim Yaghmour <karim@opersys.com>
To: Roman Zippel <zippel@linux-m68k.org>
Cc: Andi Kleen <ak@muc.de>, Nikita Danilov <nikita@clusterfs.com>,
linux-kernel@vger.kernel.org, Tom Zanussi <zanussi@us.ibm.com>
Subject: Re: 2.6.11-rc1-mm1
Date: Fri, 14 Jan 2005 23:18:52 -0500
Message-ID: <41E899AC.3070705@opersys.com>
In-Reply-To: <Pine.LNX.4.61.0501150101010.30794@scrub.home>
Hello Roman,
Roman Zippel wrote:
> This doesn't mean everything has to be put into a single call. Several
> parameters can still be set after creation.
I don't have a problem with that. If that's preferable, then we can do
it this way too.
> Why should a subsystem care about the details of the buffer management?
Because it wants to enforce a data format on buffer boundaries.
Let me explain how this applies in the case of LTT, but this easily
generalizes itself to any sort of subsystem that needs to transfer
large amounts of information between the kernel and user-space. And
to avoid any confusion, let me repeat that relayfs is not intended
just for conveying debug/performance/trace info.
Basically, in the case of LTT at least, the kernel tracing infrastructure
must provide a stream of data to the user-space tools that they will in
turn process and display to the user. At this point it must be said that
what you write and how you write it in the trace depends largely on a
few key issues. Namely:
- How much data you expect to be generating.
- What you intend to do with it.
Given ltt's target audience (mainstream developers, sysadmins, and power-
users), one of the goals was to have a trace format that provided
easy browsing forward and backwards, and random access. Initially,
this was implemented using two 1MB buffers, one that was being written to
while the other one was being written to disk. So, in essence, we had
random access at 1MB boundaries. For reading backwards, the size of the
event is written at the end of the event and we just need to read
2 bytes prior to the current event to know where the previous event
started.
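To make the backward walk concrete, here is a minimal user-space sketch.
It is not taken from the LTT tools. It assumes the trailing 2-byte size
counts the whole event (including the size field itself) and is stored
little-endian; neither detail is stated above. The names read_u16_le()
and prev_event_offset() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit little-endian value (the byte order is an assumption). */
static uint16_t read_u16_le(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Given the offset at which the current event starts, return the offset
 * at which the previous event starts. "first" is the offset of the first
 * event in the buffer. Assumes the trailing size field counts the whole
 * event, size field included. Returns (size_t)-1 when there is no valid
 * previous event.
 */
static size_t prev_event_offset(const uint8_t *buf, size_t first, size_t cur)
{
	uint16_t size;

	assert(buf != NULL);
	if (cur < first + 2)
		return (size_t)-1;	/* no room for a size field */
	size = read_u16_le(buf + cur - 2);
	if (size < 2 || size > cur - first)
		return (size_t)-1;	/* corrupt or out-of-range size */
	return cur - size;
}
```

Calling prev_event_offset() repeatedly, starting from the end of the
data, visits every event in reverse order without any index.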
Eventually we found that this format was rather bulky, and that it
recorded superfluous data. Amongst other things we relied on a single
buffer, so with each event we logged the CPU-ID of the processor on
which the event occurred. So, in order to reduce the amount of data
recorded and in trying to obtain better performance at runtime by
avoiding a call to do_gettimeofday for every event, we did the
following:
- Eliminate the CPU-ID => use per-cpu buffers instead.
- Stop calling do_gettimeofday when possible => instead write a
complete time-stamp at sub-buffer boundaries (beginning and end;
because of clock drift) and only read the lower-half of the TSC
for each event. Determining an event's actual time is done in
post-mortem in user-space.
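One way the post-mortem step could work is linear interpolation between
the two boundary time-stamps. The sketch below is an assumption about
how a user-space tool might do it, not the actual LTT code. The struct
boundary_stamp and event_time_usec() names are hypothetical, and
wall-clock time is flattened to microseconds instead of using a
struct timeval.

```c
#include <assert.h>
#include <stdint.h>

/* Sub-buffer boundary stamp: wall-clock microseconds plus low 32 TSC bits. */
struct boundary_stamp {
	uint64_t usec;
	uint32_t tsc;
};

/*
 * Estimate an event's wall-clock time by interpolating its TSC value
 * between the start and end stamps of its sub-buffer. Unsigned 32-bit
 * subtraction absorbs a single TSC wrap, so this assumes fewer than
 * 2^32 cycles elapse within one sub-buffer, and that a sub-buffer spans
 * less than 2^32 microseconds so the product below cannot overflow.
 */
static uint64_t event_time_usec(const struct boundary_stamp *start,
				const struct boundary_stamp *end,
				uint32_t event_tsc)
{
	uint32_t span = end->tsc - start->tsc;		/* modulo 2^32 */
	uint32_t delta = event_tsc - start->tsc;	/* modulo 2^32 */
	uint64_t wall;

	assert(end->usec >= start->usec);
	wall = end->usec - start->usec;
	if (span == 0)
		return start->usec;	/* degenerate sub-buffer */
	return start->usec + (uint64_t)delta * wall / span;
}
```

Having a stamp at both ends is what lets the reader calibrate cycles
against wall-clock time per sub-buffer instead of trusting a single
global TSC frequency.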
So how does this translate in practice? Here's the trace header. This
is written only once at the start of the trace:
/* Information logged when a trace is started */
typedef struct _ltt_trace_start {
	u32 magic_number;
	u32 arch_type;
	u32 arch_variant;
	u32 system_type;
	u8 major_version;
	u8 minor_version;
	u32 buffer_size;
	ltt_event_mask event_mask;
	ltt_event_mask details_mask;
	u8 log_cpuid;
	u8 use_tsc;
	u8 flight_recorder;
} LTT_PACKED_STRUCT ltt_trace_start;
This is written at the beginning of every new sub-buffer:
/* Start of trace buffer information */
typedef struct _ltt_buffer_start {
	struct timeval time;	/* Time stamp of this buffer */
	u32 tsc;		/* TSC of this buffer, if applicable */
	u32 id;			/* Unique buffer ID */
} LTT_PACKED_STRUCT ltt_buffer_start;
This is written at the end of every sub-buffer:
typedef struct _ltt_buffer_end {
	struct timeval time;	/* Time stamp of this buffer */
	u32 tsc;		/* TSC of this buffer, if applicable */
} LTT_PACKED_STRUCT ltt_buffer_end;
As you can see, we can't just dump this information in an event channel.
This is really intrinsic to how the trace data is going to be read
later on. Removing this data would require more data for each event to
be logged, and require parsing through the trace before reading it in
order to obtain markers allowing random access. This wouldn't be so
bad if we were expecting users to use LTT sporadically for very short
periods of time. However, given ltt's target audience (i.e. need to
run traces for hours, maybe days, weeks), traces would rapidly become
useless because while plowing through a few hundred KBs of data and
allocating RAM for building internal structures as you go is fine,
plowing through tens of GBs of data, possibly hundreds, requires that
you come up with a format that won't require unreasonable resources
from your system, while incurring negligible runtime costs for generating
it. We believe the format we currently have achieves the right balance
here.
So what happens now is that ltt tells relayfs when creating a channel
how much space it needs for these basic structures, and provides it
with callbacks which are invoked at boundaries for filling the actual
reserved space. In all other circumstances, here's what we are writing
into the relayfs buffer for each event:
- Event ID (1 byte)
- Time delta (4 bytes) => this is the low 32 bits of the TSC or a
diff between the current do_gettimeofday and the one at buffer start.
- Event details (variable length, see include/linux/ltt-events.h)
- Event size (2 bytes)
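The per-event layout listed above can be sketched as a small encoder.
This is only an illustration written as ordinary user-space C, not the
in-kernel LTT code. The encode_event() name is hypothetical, and two
details are assumptions: fields are stored little-endian, and the
trailing size counts the whole record including itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed bytes per event: ID (1) + time delta (4) + trailing size (2). */
#define EVENT_OVERHEAD (1 + 4 + 2)

/*
 * Write one event record into dst and return the number of bytes
 * written, or 0 if the record does not fit in "cap" bytes or its total
 * size cannot be represented in the 2-byte size field.
 */
static size_t encode_event(uint8_t *dst, size_t cap, uint8_t id,
			   uint32_t time_delta,
			   const void *details, size_t details_len)
{
	size_t total = EVENT_OVERHEAD + details_len;
	size_t pos = 0;
	int i;

	assert(dst != NULL);
	if (total > UINT16_MAX || total > cap)
		return 0;

	dst[pos++] = id;				/* event ID */
	for (i = 0; i < 4; i++)				/* time delta, LE */
		dst[pos++] = (uint8_t)(time_delta >> (8 * i));
	if (details_len > 0)				/* event details */
		memcpy(dst + pos, details, details_len);
	pos += details_len;
	dst[pos++] = (uint8_t)(total & 0xff);		/* event size, LE */
	dst[pos++] = (uint8_t)(total >> 8);
	return pos;
}
```

An event with three bytes of details therefore occupies ten bytes, seven
of which are fixed overhead. That overhead is the cost the discussion
about dropping the event size is trying to reduce.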
Of course there are possible improvements. For one thing, we've
discussed dropping the "event size" altogether and relying on smaller
buffers and dynamically created sub-buffer indexing tables for reading
backwards. This is still part of a work in progress which aims at
creating an even better and more flexible format. Of course in an
ideal world this new format and the corresponding user tools would
be available as we speak, but there's only so much that can be done
without having an existing solid base to work from. As usual,
we're open to any other outside suggestions.
> You could move all this into the relay layer by making a relay channel
> an event channel. I know you want to save space, but having a magic
> event_struct_size array is not a good idea. If you have that much events,
> that a little more overhead causes problems, the tracing results won't be
> reliable anymore anyway.
I hope what I said above explains why this isn't possible.
> Simplicity and maintainability are far more important than saving a few
> bytes, the general case should be fast and simple, leave the complexity to
> the special cases.
I agree. I also realize that not all relayfs clients will have the
same requirements as ltt. Already, ltt uses a few things from relayfs
that others are unlikely to need. For example, it invokes
relay_lock_channel() to lock a channel and relay_write_direct()
to write straight to the buffers, instead of relying on the usual
relay_write(), which takes care of both. This allows LTT to do
zero-copy (i.e. no need to pack a buffer before committing it). Other
subsystems may actually not use any relayfs function to write, but
instead write directly to a channel as if it was an allocated buffer
(which in fact it is). In all cases, though, the open(), mmap(),
write() semantics make it very simple for user-space applications
to process channeled data.
So here's a suggested change. Instead of the current relay_open()
API, here are three replacement functions (inspired by Tim's input
and your comments above):
relay_open(channel_path, mode, bufsize, nbufs);
relay_set_property(property, value);
relay_get_property(property, &value);
Is this more palatable?
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || karim@opersys.com || 1-866-677-4546