* Unified tracing buffer
@ 2008-09-19 21:33 Martin Bligh
2008-09-19 21:42 ` Randy Dunlap
` (8 more replies)
0 siblings, 9 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-19 21:33 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
Steven Rostedt, od, Frank Ch. Eigler
During kernel summit and Plumbers conference, Linus and others expressed a
desire for a unified tracing buffer system for multiple tracing applications
(eg ftrace, lttng, systemtap, blktrace, etc) to use.
This provides several advantages, including the ability to interleave data
from multiple sources, not having to learn 200 different tools, less
duplicated code/effort, etc.

Several of us got together last night and tried to cut this down to the
simplest usable system we could agree on (and nobody got hurt!). This will
form version 1. I've sketched out a few enhancements we know that we want,
but have agreed to leave these until version 2.
The answer to most questions about the below is "yes we know, we'll fix that
in version 2" (or 3). Simplicity was the rule ...

Sketch of design. Enjoy flaming me. Code will follow shortly.

STORAGE
-------
We will support multiple buffers for different tracing systems, with
separate names, event id spaces.
Event ids are 16 bit, dynamically allocated.
A "one line of text" print function will be provided for each event,
or use the default (probably hex printf)
Will provide a "flight data recorder" mode, and a "spool to disk" mode.
Circular buffer per cpu, protected by per-cpu spinlock_irq
Word aligned records.
Variable record length, header will start with length record.
Timestamps in fixed timebase, monotonically increasing (across all CPUs)
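
To make the "word aligned, variable length, header starts with length" rules concrete, here is a minimal sketch of the size arithmetic. The header layout, the field names and the 8-byte alignment are assumptions for illustration only (the post says just "word aligned"; 8 bytes is what the follow-up replies lean towards), not the proposed on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record header: layout and field names are assumptions. */
struct trace_rec_hdr {
    uint16_t length;     /* total record size: header + payload + padding */
    uint16_t event_id;   /* 16 bit event id */
    uint32_t reserved;   /* keeps the timestamp naturally aligned */
    uint64_t timestamp;  /* full timestamp in every record (v1) */
};

#define TRACE_ALIGN 8u   /* assumed alignment, see lead-in */

/* Size a record occupies in the buffer for a given payload length. */
static size_t trace_rec_size(size_t payload_len)
{
    size_t raw = sizeof(struct trace_rec_hdr) + payload_len;

    /* The header itself must not break alignment of the next record. */
    assert(sizeof(struct trace_rec_hdr) % TRACE_ALIGN == 0);

    /* Round up to the next multiple of TRACE_ALIGN. */
    return (raw + TRACE_ALIGN - 1) & ~(size_t)(TRACE_ALIGN - 1);
}
```

With this layout a 1-byte payload costs as much as an 8-byte one, which is the kind of overhead the "known issues" list defers to v2.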
INPUT_FUNCTIONS
---------------
allocate_buffer (name, size)
    return buffer_handle
register_event (buffer_handle, event_id, print_function)
    You can pass in a requested event_id from a fixed set, and
    will be given it, or an error
    0 means allocate me one dynamically
    returns event_id (or -E_ERROR)
record_event (buffer_handle, event_id, length, *buf)
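
The three calls above could be mocked up in userspace roughly as follows. This is only a sketch under loud assumptions: the struct, the table sizes, the split between "fixed" ids (1..15) and dynamic ids, the specific errno values, and the tiny (length, id) header are all invented here; there is no per-cpu buffer, no locking and no timestamp.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_EVENTS   64   /* assumption: tiny id table for illustration */
#define FIXED_ID_MAX 15   /* assumption: ids 1..15 are the "fixed set" */

typedef void (*print_fn)(const void *payload, uint16_t len);

struct trace_buffer {
    char     name[32];
    uint8_t *data;                  /* circular storage */
    size_t   size;
    size_t   head;                  /* next write offset */
    uint8_t  in_use[MAX_EVENTS];
    print_fn printers[MAX_EVENTS];  /* NULL: use the default printer */
};

struct trace_buffer *allocate_buffer(const char *name, size_t size)
{
    struct trace_buffer *b;

    if (size == 0)
        return NULL;
    b = calloc(1, sizeof(*b));
    if (!b)
        return NULL;
    b->data = malloc(size);
    if (!b->data) {
        free(b);
        return NULL;
    }
    strncpy(b->name, name, sizeof(b->name) - 1);
    b->size = size;
    return b;
}

/* event_id == 0 asks for a dynamic id; returns the id or a negative errno. */
int register_event(struct trace_buffer *b, uint16_t event_id, print_fn fn)
{
    if (event_id == 0) {
        for (event_id = FIXED_ID_MAX + 1; event_id < MAX_EVENTS; event_id++)
            if (!b->in_use[event_id])
                break;
        if (event_id == MAX_EVENTS)
            return -ENOSPC;
    } else if (event_id > FIXED_ID_MAX) {
        return -EINVAL;             /* not in the fixed set */
    } else if (b->in_use[event_id]) {
        return -EBUSY;              /* fixed id already taken */
    }
    b->in_use[event_id] = 1;
    b->printers[event_id] = fn;
    return event_id;
}

/* Copy bytes into the ring, wrapping and overwriting the oldest data. */
static void ring_write(struct trace_buffer *b, const void *src, size_t len)
{
    const uint8_t *p = src;

    assert(b->size > 0);
    for (size_t i = 0; i < len; i++) {
        b->data[b->head] = p[i];
        b->head = (b->head + 1) % b->size;
    }
}

int record_event(struct trace_buffer *b, uint16_t event_id,
                 uint16_t length, const void *buf)
{
    uint16_t hdr[2] = { length, event_id };   /* length comes first */

    if (event_id >= MAX_EVENTS || !b->in_use[event_id])
        return -EINVAL;
    if (sizeof(hdr) + (size_t)length > b->size)
        return -E2BIG;
    ring_write(b, hdr, sizeof(hdr));
    ring_write(b, buf, length);
    return 0;
}
```

A tracer would call allocate_buffer() and register_event() once at initialization and record_event() from its probe functions.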
OUTPUT
------
Data will be output via debugfs, and provide the following output streams:
/debugfs/tracing/<name>/buffers/text
    clear text stream (will merge the per-cpu streams via insertion
    sort, and use the print functions)
/debugfs/tracing/<name>/buffers/binary[cpu_number]
    per-cpu binary data
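
One way to picture the merge behind the text stream: each per-cpu stream is already ordered by timestamp, so the reader only has to pick the oldest head event each time. The sketch below uses a plain linear scan over the stream heads (a k-way merge); the post says "insertion sort" without further detail, so this is a stand-in rather than the planned implementation, and the struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ev {
    uint64_t ts;    /* timestamp in the shared timebase */
    int      cpu;
    uint16_t id;
};

struct stream {
    const struct ev *evs;  /* events of one cpu, oldest first */
    size_t n;              /* number of events */
    size_t pos;            /* next unread event */
};

/* Merge up to cap events from ns streams into out; returns count written. */
static size_t merge_streams(struct stream *s, size_t ns,
                            struct ev *out, size_t cap)
{
    size_t w = 0;

    while (w < cap) {
        size_t best = ns;   /* ns means "no stream has data left" */

        for (size_t i = 0; i < ns; i++) {
            if (s[i].pos == s[i].n)
                continue;
            if (best == ns ||
                s[i].evs[s[i].pos].ts < s[best].evs[s[best].pos].ts)
                best = i;
        }
        if (best == ns)
            break;

        /* Precondition: each per-cpu stream is sorted by timestamp. */
        assert(s[best].pos == 0 ||
               s[best].evs[s[best].pos - 1].ts <= s[best].evs[s[best].pos].ts);

        out[w++] = s[best].evs[s[best].pos++];
    }
    return w;
}
```

This only gives a sensible global order if timestamps are comparable across CPUs, which is why the storage section asks for a fixed, monotonic timebase.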
CONTROL
-------
Sysfs style tree under debugfs
/debugfs/tracing/<name>/buffers/enabled <--- binary value
/debugfs/tracing/<name>/<event1>
/debugfs/tracing/<name>/<event2>
etc ...
provides a way to enable/disable events, see what's available, and
what's enabled.
KNOWN ISSUES / PLANS
-------------------
No way to unregister buffers and events.
    Will provide an unregister_buffer and unregister_event call
Generating systemwide time is hard on some platforms
    Yes. Time-based output provides a lot of simplicity for the user though
    We won't support these platforms at first, we'll add functionality
    to make it work for them later.
    (plan based on tick-based ms timing, plus counter offset from that
    if needed).
Spinlock_irq is inefficient, and doesn't support tracing in NMIs
    True. We'll implement a lockless scheme later (see lttng)
Putting a length record in every event is inefficient
    True. Fixed record length with optional extensions is better, but
    more complex. v2.
Putting a full timestamp rather than an offset in every event is inefficient
    See above. True, but v2.
Relayfs already exists! use that!
    People were universally not keen on that idea. Complexity, interface, etc.
    We're also providing some higher level shared functions for time &
    event ids.
There's no way to decode the binary data stream
    Code will be shared from the kernel to decode it, so that we can
    get the compact binary format and decode it later. That code will
    be kept in the kernel tree (it's a trivial piece of C).
    Version 1.1 ;-)
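
As a rough idea of what such a "trivial piece of C" decoder might look like, the sketch below walks a per-cpu binary stream record by record. The header layout, host-endian encoding and 8-byte padding are assumptions made for this example; the real format had not been written down at this point in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-buffer header; not the format from the proposal. */
struct rec_hdr {
    uint16_t length;     /* whole record, padded to 8 bytes */
    uint16_t event_id;
    uint32_t pad;
    uint64_t timestamp;
};

typedef void (*rec_cb)(const struct rec_hdr *h, const uint8_t *payload,
                       size_t payload_len, void *ctx);

/* Helper for the example: append one record at dst, return its size. */
static size_t encode_record(uint8_t *dst, uint16_t id, uint64_t ts,
                            const void *payload, size_t plen)
{
    size_t total = (sizeof(struct rec_hdr) + plen + 7) & ~(size_t)7;
    struct rec_hdr h = { (uint16_t)total, id, 0, ts };

    assert(total <= UINT16_MAX);
    memset(dst, 0, total);
    memcpy(dst, &h, sizeof(h));
    if (plen > 0)
        memcpy(dst + sizeof(h), payload, plen);
    return total;
}

/* Returns the number of records decoded, or -1 on a malformed stream. */
static int decode_stream(const uint8_t *buf, size_t len, rec_cb cb, void *ctx)
{
    size_t off = 0;
    int count = 0;

    while (off + sizeof(struct rec_hdr) <= len) {
        struct rec_hdr h;

        memcpy(&h, buf + off, sizeof(h));   /* avoids unaligned access */
        if (h.length < sizeof(h) || h.length % 8 != 0 ||
            off + h.length > len)
            return -1;
        if (cb)   /* payload_len includes any trailing padding */
            cb(&h, buf + off + sizeof(h), h.length - sizeof(h), ctx);
        off += h.length;
        count++;
    }
    return off == len ? count : -1;
}
```

Because every record carries its own length, the decoder can skip event types it has no print function for.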

* Re: Unified tracing buffer
From: Randy Dunlap @ 2008-09-19 21:42 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

On Fri, 19 Sep 2008 14:33:42 -0700 Martin Bligh wrote:

> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.

What are these (like)?

> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
>
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.

Arch-specific "word"?
or some fixed-size-for-all-systems (so that trace buffers can be
shared/used on other systems?) Preferably the latter.

> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)

what timestamp resolution?

---
~Randy

* Re: Unified tracing buffer
From: Martin Bligh @ 2008-09-19 21:57 UTC
To: Randy Dunlap
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

>> Event ids are 16 bit, dynamically allocated.
>
> What are these (like)?

u16

Sorry, probably lots of implicit assumptions in there that I forgot to explain

> Arch-specific "word"?
> or some fixed-size-for-all-systems (so that trace buffers can be
> shared/used on other systems?) Preferably the latter.

Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
Steven - this was your thing?

>> Variable record length, header will start with length record.
>> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
>
> what timestamp resolution?

ns is probably sufficient for output, but may need to be higher
internally to get correct ordering of events across CPUs. So, as long as
we record this in the buffer header, the internal resolution shouldn't be
critical. The text print output ... I'd say ns? We can put it relative to
wall time in there, as long as we record it in the buffer header at trace
start. I guess we should document the buffer header ;-)

* Re: Unified tracing buffer
From: Olaf Dabrunz @ 2008-09-19 22:41 UTC
To: Martin Bligh
Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
    Frank Ch. Eigler

On 19-Sep-08, Martin Bligh wrote:
> >> Event ids are 16 bit, dynamically allocated.
> >
> > What are these (like)?
>
> u16
>
> Sorry, probably lots of implicit assumptions in there that I forgot to explain

Ids for event types. Either allocated dynamically, if the tracer needs
new ids on each use, or statically assigned for others (like my fctrace
or Steven's ftrace, I believe). Should we have a reserved range / registry
for static allocation, maybe something like a very simple version of
devices.txt?

> > Arch-specific "word"?
> > or some fixed-size-for-all-systems (so that trace buffers can be
> > shared/used on other systems?) Preferably the latter.
>
> Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
> Steven - this was your thing?

Unaligned can be much slower. I guess some very quick tracers can
benefit from alignment.

> as long as we record it in the buffer header at trace start. I guess we should
> document the buffer header ;-)

Yes. :)

--
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg

* Re: Unified tracing buffer
From: Martin Bligh @ 2008-09-19 22:19 UTC
To: Martin Bligh, Randy Dunlap, Linux Kernel Mailing List,
    Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
    od, Frank Ch. Eigler

>> Sorry, probably lots of implicit assumptions in there that I forgot to explain
>
> Ids for event types. Either allocated dynamically, if the tracer needs
> new ids on each use, or statically assigned for others (like my fctrace
> or Steven's ftrace, I believe). Should we have a reserved range / registry
> for static allocation, maybe something like a very simple version of
> devices.txt?

Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
would have a different event-id namespace from blktrace)

* Re: Unified tracing buffer
From: Olaf Dabrunz @ 2008-09-20 8:10 UTC
To: Martin Bligh
Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
    Frank Ch. Eigler

On 19-Sep-08, Martin Bligh wrote:
> >> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> >
> > Ids for event types. Either allocated dynamically, if the tracer needs
> > new ids on each use, or statically assigned for others (like my fctrace
> > or Steven's ftrace, I believe). Should we have a reserved range / registry
> > for static allocation, maybe something like a very simple version of
> > devices.txt?
>
> Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
> would have a different event-id namespace from blktrace)

Ah, that is right. We can distinguish them.

--
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg

* Re: Unified tracing buffer
From: Steven Rostedt @ 2008-09-20 8:29 UTC
To: Martin Bligh
Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, od, Frank Ch. Eigler

On Fri, 19 Sep 2008, Martin Bligh wrote:

> >> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> >
> > Ids for event types. Either allocated dynamically, if the tracer needs
> > new ids on each use, or statically assigned for others (like my fctrace
> > or Steven's ftrace, I believe). Should we have a reserved range / registry
> > for static allocation, maybe something like a very simple version of
> > devices.txt?
>
> Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
> would have a different event-id namespace from blktrace)
>

Right!

We stated in our little meeting that the true event id association is
buffer id / event id tuple. We will not be assigning ranges for events
for specific tracers. Ftrace will not have its own range. The static ids
are reserved for the static trace points and some various static trace
types that the average kernel developer may use.

Think "string event type" for an event type that will simply hold an ASCII
string.

-- Steve

* Re: Unified tracing buffer
From: Mathieu Desnoyers @ 2008-09-20 11:40 UTC
To: Steven Rostedt
Cc: Martin Bligh, Randy Dunlap, Linux Kernel Mailing List,
    Linus Torvalds, Thomas Gleixner, Frank Ch. Eigler

* Steven Rostedt (rostedt@goodmis.org) wrote:
>
> On Fri, 19 Sep 2008, Martin Bligh wrote:
>
> > >> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> > >
> > > Ids for event types. Either allocated dynamically, if the tracer needs
> > > new ids on each use, or statically assigned for others (like my fctrace
> > > or Steven's ftrace, I believe). Should we have a reserved range / registry
> > > for static allocation, maybe something like a very simple version of
> > > devices.txt?
> >
> > Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
> > would have a different event-id namespace from blktrace)
> >
>
> Right!
>
> We stated in our little meeting that the true event id association is
> buffer id / event id tuple. We will not be assigning ranges for events
> for specific tracers. Ftrace will not have its own range. The static ids
> are reserved for the static trace points and some various static trace
> types that the average kernel developer may use.
>

Just to be sure of the "that the average kernel developer may use"
meaning : I would recommend keeping that static ID range only for
internal buffering mechanism events. E.g., if we need to add a
periodical event in every stream so we can detect timestamp wrap-around,
that would be part of the buffering infrastructure itself, and thus
reserve an event ID.

core_heartbeat (u64 timestamp)

Same thing if we want to export the table that maps:

event name <-> event ID <-> event typing (includes event size info)

That table can be presented into the buffers (possibly a single metadata
buffer) in the form of two static event IDs :

core_id_name (u16 id, const char *name)
core_id_type (u16 id, const char *type)

Here another assumption for portability is to declare event type as a
possibly extended format string. Alternative suggestions are welcome.

Mathieu

> Think "string event type" for a event type that will simply hold an ASCII
> string.
>
> -- Steve
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
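
The wrap-around detection that a periodic core_heartbeat event would enable can be sketched as below. It assumes something the thread has not decided: that ordinary events store only the low 32 bits of the timestamp, while the heartbeat guarantees at least one event per 2^32 ticks in every stream. Under that assumption a reader can rebuild full 64-bit timestamps from the previous full value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rebuild a 64-bit timestamp from its low 32 bits, given the full
 * timestamp of the previous event in the same stream. Valid only if
 * consecutive events are less than 2^32 ticks apart, which is what a
 * periodic heartbeat event would guarantee (assumption, see lead-in).
 */
static uint64_t extend_ts(uint64_t last_full, uint32_t low)
{
    uint64_t full = (last_full & ~0xffffffffULL) | low;

    if (full < last_full)       /* low bits went backwards: one wrap */
        full += 1ULL << 32;

    assert(full >= last_full);  /* time never runs backwards in a stream */
    return full;
}
```

The first event of a stream would still need a full timestamp (for instance from the buffer header) to seed last_full.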

* Re: Unified tracing buffer
From: Steven Rostedt @ 2008-09-20 8:26 UTC
To: Martin Bligh
Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, od, Frank Ch. Eigler

On Fri, 19 Sep 2008, Martin Bligh wrote:

> >> Event ids are 16 bit, dynamically allocated.
> >
> > What are these (like)?
>
> u16
>
> Sorry, probably lots of implicit assumptions in there that I forgot to explain
>
> > Arch-specific "word"?
> > or some fixed-size-for-all-systems (so that trace buffers can be
> > shared/used on other systems?) Preferably the latter.
>
> Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
> Steven - this was your thing?

I'm fine with 8 byte aligned for all.

>
> >> Variable record length, header will start with length record.
> >> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> >
> > what timestamp resolution?
>
> ns is probably sufficient for output, but may need to be higher
> internally to get
> correct ordering of events across CPUs. So, as long as we record this in
> the buffer header, the internal resolution shouldn't be critical. The text
> print output ... I'd say ns? We can put it relative to wall time in there,
> as long as we record it in the buffer header at trace start. I guess we should
> document the buffer header ;-)
>

ftrace has two outputs. One is ns from start of boot, the other is ns from
start of trace. Not sure we need to make wall time in there, unless we add
it in the trace header (as you stated). Probably best for Version #2.

-- Steve

* Re: Unified tracing buffer
From: Mathieu Desnoyers @ 2008-09-20 11:44 UTC
To: Steven Rostedt
Cc: Martin Bligh, Randy Dunlap, Linux Kernel Mailing List,
    Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

* Steven Rostedt (rostedt@goodmis.org) wrote:
>
>
> On Fri, 19 Sep 2008, Martin Bligh wrote:
>
> > >> Event ids are 16 bit, dynamically allocated.
> > >
> > > What are these (like)?
> >
> > u16
> >
> > Sorry, probably lots of implicit assumptions in there that I forgot to explain
> >
> > > Arch-specific "word"?
> > > or some fixed-size-for-all-systems (so that trace buffers can be
> > > shared/used on other systems?) Preferably the latter.
> >
> > Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
> > Steven - this was your thing?
>
> I'm fine with 8 byte aligned for all.
>
> >
> > >> Variable record length, header will start with length record.
> > >> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> > >
> > > what timestamp resolution?
> >
> > ns is probably sufficient for output, but may need to be higher
> > internally to get
> > correct ordering of events across CPUs. So, as long as we record this in
> > the buffer header, the internal resolution shouldn't be critical. The text
> > print output ... I'd say ns? We can put it relative to wall time in there,
> > as long as we record it in the buffer header at trace start. I guess we should
> > document the buffer header ;-)
> >
>
> ftrace has two outputs. One is ns from start of boot, the other is ns from
> start of trace. Not sure we need to make wall time in there, unless we add
> it in the trace header (as you stated). Probably best for Version #2.
>
> -- Steve
>

For simplicity and efficiency, I would try to keep the timestamp
recorded for every event as close as possible to what we are reading
from the hardware, without manipulation. We can save the timestamp value
at the beginning of the trace (in the buffer header) and we can also
save the scaling value in the same header so we can transform stuff like
cycle counts into ns, ps from boot time or from trace start when we
pretty-print.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
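
A minimal sketch of that idea: events keep the raw counter value, and the buffer header stores the counter value at trace start plus a scaling factor, so conversion to ns happens only when pretty-printing. The header fields and the mult/shift fixed-point encoding are assumptions picked for illustration (the encoding mirrors the style the kernel uses for clocksources); the thread does not define the header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-buffer header fields; names are assumptions. */
struct trace_buf_hdr {
    uint64_t start_cycles;  /* raw counter value at trace start */
    uint32_t mult;          /* ns = (cycles * mult) >> shift */
    uint32_t shift;
};

/* Convert a raw event timestamp into ns since trace start. */
static uint64_t cycles_to_ns(const struct trace_buf_hdr *h, uint64_t cycles)
{
    assert(cycles >= h->start_cycles);
    assert(h->shift < 64);

    /*
     * The multiplication can overflow 64 bits for very long traces
     * or large mult values; a real implementation has to bound the
     * delta or use wider arithmetic.
     */
    return ((cycles - h->start_cycles) * h->mult) >> h->shift;
}
```

For a 2 GHz counter, mult = 1 and shift = 1 gives exactly one ns per two cycles; other frequencies need a larger shift to approximate the ratio.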

* Re: Unified tracing buffer
From: Olaf Dabrunz @ 2008-09-19 22:28 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, Olaf Dabrunz, Frank Ch. Eigler

Hi Martin and everyone,

thanks for writing this up. Just one comment:

> /debugfs/tracing/<name>/buffers/enabled <--- binary value

ASCII "0" and "1"?

And my e-mail is odabrunz@novell.com or preferably od@suse.de, as in
this mail.

Thanks, :)

--
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg

* Re: Unified tracing buffer
From: Martin Bligh @ 2008-09-19 22:09 UTC
To: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, Frank Ch. Eigler
Cc: Olaf Dabrunz

>> /debugfs/tracing/<name>/buffers/enabled <--- binary value
>
> ASCII "0" and "1"?

Oops. s/binary/boolean/ and yes, in ASCII.

* Re: Unified tracing buffer
From: Frank Ch. Eigler @ 2008-09-19 23:18 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od

Hi -

On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified tracing buffer system for multiple
> tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to
> use.

OK.

> [...]
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces. [...]

OK. This is completely orthogonal to ...

> INPUT_FUNCTIONS
> ---------------
>
> allocate_buffer (name, size)
>     return buffer_handle
>
> register_event (buffer_handle, event_id, print_function)
>     You can pass in a requested event_id from a fixed set, and
>     will be given it, or an error
>     0 means allocate me one dynamically
>     returns event_id (or -E_ERROR)
>
> record_event (buffer_handle, event_id, length, *buf)

How do you imagine record_event() being used from the point of view of
the instrumented module? Is it to be protected by some sort of test
of the control variable? Is the little binary event buffer supposed
to be constructed unconditionally? (On the stack?)

You should compare this to markers and tracepoints. It sounds to me like
this is not that different from

    trace_mark (event_name, "%*b", length, buf);

where the goofy "%*b" could be some magic to identify the proposed
"everything is a short blob" event record type.

By the way, systemtap supports formatted printing that generates
binary records via directives like "%4b" for 4-byte ints. I wonder if
that would be a suitable facility for this and/or markers to allow
instrumentation code to *generate* those binary event records.

Do you believe that fans of tracepoints would support a single
void*/length struct parametrization?

> Data will be output via debugfs, and provide the following output streams:
>
> /debugfs/tracing/<name>/buffers/text
>     clear text stream (will merge the per-cpu streams via insertion
>     sort, and use the print functions)

Can you spell out this part a little more? I wonder because at the
tracing miniconf on Wednesday we talked about systemtap's likely need
to *consume* these trace events as they are being generated.

If systemtap can only see them as a binary blob or a rendered ascii
string, they would not be as useful as if the record was decomposable
in kernel. Perhaps the event-type-registration call can declare the
binary struct, like a perl pack directive ... or a marker (binary)
format string.

> CONTROL
>
> Sysfs style tree under debugfs
>
> /debugfs/tracing/<name>/buffers/enabled <--- binary value
>
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
> etc ...
> provides a way to enable/disable events, see what's available, and
> what's enabled.

This sort of control is (or should be) already available for markers.

- FChE

* Re: Unified tracing buffer
From: Steven Rostedt @ 2008-09-20 8:50 UTC
To: Frank Ch. Eigler
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, od

[ Note, It is very late for me and I probably should be finishing up
  packing for my trip home, but I'm stupid and decided to respond now
  instead ]

On Fri, 19 Sep 2008, Frank Ch. Eigler wrote:

> Hi -
>
>
> On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
>
> > During kernel summit and Plumbers conference, Linus and others
> > expressed a desire for a unified tracing buffer system for multiple
> > tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to
> > use.
>
> OK.
>
>
> > [...]
> > STORAGE
> > -------
> >
> > We will support multiple buffers for different tracing systems, with
> > separate names, event id spaces. [...]
>
> OK. This is completely orthogonal to ...
>
>
> > INPUT_FUNCTIONS
> > ---------------
> >
> > allocate_buffer (name, size)
> >     return buffer_handle
> >
> > register_event (buffer_handle, event_id, print_function)
> >     You can pass in a requested event_id from a fixed set, and
> >     will be given it, or an error
> >     0 means allocate me one dynamically
> >     returns event_id (or -E_ERROR)
> >
> > record_event (buffer_handle, event_id, length, *buf)
>
> How do you imagine record_event() being used from the point of view of
> the instrumented module? Is it to be protected by some sort of test
> of the control variable? Is the little binary event buffer supposed
> to be constructed unconditionally? (On the stack?)

The event buffer is allocated when you create the buffer. The tracer will
do that on initialization.

>
>
> You should compare this to markers and tracepoints. It sounds to me like
> this is not that different from
>
>     trace_mark (event_name, "%*b", length, buf);
>
> where the goofy "%*b" could be some magic to identify the proposed
> "everything is a short blob" event record type.

This is completely separate from the trace buffer itself. The trace points
and trace markers simply write whatever they want into the trace buffer.
The tracepoints and trace markers are something to hook points of code to
do some type of tracing. The trace buffer record_event is how it will do
it if it chooses to do so.

>
> By the way, systemtap supports formatted printing that generates
> binary records via directives like "%4b" for 4-byte ints. I wonder if
> that would be a suitable facility for this and/or markers to allow
> instrumentation code to *generate* those binary event records.
>

The tracepoints and markers can generate anything they want.

>
> Do you believe that fans of tracepoints would support a single
> void*/length struct parametrization?

The "record_event" would not (I repeat, "not") be in general code. It will
be used by the different tracers. The markers/tracepoints will be in the
code that can be hooked to do things like profiling, or if you want, record
the data into the trace buffer via the record event.

>
>
> > Data will be output via debugfs, and provide the following output streams:
> >
> > /debugfs/tracing/<name>/buffers/text
> >     clear text stream (will merge the per-cpu streams via insertion
> >     sort, and use the print functions)
>
> Can you spell out this part a little more? I wonder because at the
> tracing miniconf on Wednesday we talked about systemtap's likely need
> to *consume* these trace events as they are being generated.

What does this mean exactly? I should have asked more details about this
but I was too worried about time constraints (since Linus was speaking
next) to think about it at the time.

That is, when you read a marker, who and when does that data get consumed?
Where does that data go? Do you even need to store it in this ring buffer?

>
> If systemtap can only see them as a binary blob or a rendered ascii
> string, they would not be as useful as if the record was decomposable
> in kernel. Perhaps the event-type-registration call can declare the
> binary struct, like a perl pack directive ... or a marker (binary)
> format string.

I don't understand the above. I'm also thinking that we are
miscommunicating a bit here.

Let me make an example with something that I know, ftrace.

We hit the tracepoint in, let's say, scheduler. Previously at
initialization time, ftrace would have registered with this trace point
and would have allocated a buffer. When the tracepoint is actually hit, it
jumps to the function that ftrace registered. Then this function would
record the event into the buffer using 'record_event'. At a later time,
the user could read the buffer from the filesystem using either the pretty
print format or raw binary format.

This code is not replacing tracepoints or markers. When you say you will
consume at reading, it sounds like you don't even need to use the buffer
mechanism and the trace points/markers is good enough for you. The
tracepoints and markers are not something that is being replaced. The
trace buffers are just something to use that we can record data to that we
can retrieve at a later time.

>
>
> > CONTROL
> >
> > Sysfs style tree under debugfs
> >
> > /debugfs/tracing/<name>/buffers/enabled <--- binary value
> >
> > /debugfs/tracing/<name>/<event1>
> > /debugfs/tracing/<name>/<event2>
> > etc ...
> > provides a way to enable/disable events, see what's available, and
> > what's enabled.
>
> This sort of control is (or should be) already available for markers.

Markers and the buffers are two separate things.

Perhaps I'm just tired, but I'm thinking that you are thinking we are
going to remove markers and trace points. This code is only to give the
kernel a ring buffer to use. Not a way to put hooks into kernel code. We
have tracepoints and markers for that.

-- Steve
* Re: Unified tracing buffer 2008-09-20 8:50 ` Steven Rostedt @ 2008-09-20 13:37 ` Mathieu Desnoyers 2008-09-20 13:51 ` Steven Rostedt 0 siblings, 1 reply; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-20 13:37 UTC (permalink / raw) To: Steven Rostedt Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od * Steven Rostedt (rostedt@goodmis.org) wrote: > > [ Note, It is very late for me and I probably should be finishing up > packing for my trip home, but I'm stupid and decided to respond now > instead ] > > > On Fri, 19 Sep 2008, Frank Ch. Eigler wrote: > > > Hi - > > > > > > On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote: > > > > > During kernel summit and Plumbers conference, Linus and others > > > expressed a desire for a unified tracing buffer system for multiple > > > tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to > > > use. > > > > OK. > > > > > > > [...] > > > STORAGE > > > ------- > > > > > > We will support multiple buffers for different tracing systems, with > > > separate names, event id spaces. [...] > > > > OK. This is completely orthogonal to ... > > > > > > > INPUT_FUNCTIONS > > > --------------- > > > > > > allocate_buffer (name, size) > > > return buffer_handle > > > > > > register_event (buffer_handle, event_id, print_function) > > > You can pass in a requested event_id from a fixed set, and > > > will be given it, or an error > > > 0 means allocate me one dynamically > > > returns event_id (or -E_ERROR) > > > > > > record_event (buffer_handle, event_id, length, *buf) > > > > How do you imagine record_event() being used from the point of view of > > the instrumented module? Is it to be protected by some sort of test > > of the control variable? Is the little binary event buffer supposed > > to be constructed unconditionally? (On the stack?) > > The event buffer is allocated when you create the buffer. The tracer will > do that on initialization. 
> I would propose disable the event source ASAP when disabled, without any conditional test if possible. This is actually what the Kernel Markers does with the help of Immediate Values. It should also come with a finer-grained filtering based on a global "enable" variable and per-buffer "enable" variables so a tracer can atomically start collecting all of its event types. We could then require both global tracing and per-buffer tracing to be enabled to write into the buffer. I think I see where Frank is going, and I agree that this is more or less what I have had in mind : the Markers could be used as a global event ID registry and could hold the event name, ID, types (format string) table. Therefore, the markers would simply become the "Write to buffer" interface, which would associate IDs automatically and keep the format strings into a table dumped into a metainformation trace buffer at trace start and whenever IDs are dynamically registered while the trace is active. > > > > > > You should compare this to markers and tracepoints. It sounds to me like > > this is not that different from > > > > trace_mark (event_name, "%*b", length, buf); > > > > where the goofy "%*b" could be some magic to identify the proposed > > "everything is a short blob" event record type. > > This is completely separate from the trace buffer itself. The trace points > and trace markers simply write whatever they want into the trace buffer. > The tracepoints and trace markers are something to hook points of code to > do some type of tracing. The trace buffer record_event is how it will do > it if it chooses to do so. > As I said above, the tracepoints are meant to be a in-kernel API which instruments the kernel code. It leaves the markers, which are meant to be exposed to userspace anyway, for such record_event use. 
They would actually accomplish two things : they would register the event (just declaring the marker puts the event in a special section, which is our mapping table) and would also record the event when enabled and executed. > > > > By the way, systemtap supports formatted printing that generates > > binary records via directives like "%4b" for 4-byte ints. I wonder if > > that would be a suitable facility for this and/or markers to allow > > instrumentation code to *generate* those binary event records. > > > > The tracepoints and markers can generate anything they want. > > > > > Do you believe that fans of tracepoints would support a single > > void*/length struct parametrization? > > The "record_event" would not (I repeat, "not") be in general code. It will > be used by the different tracers. The markers/tracepoints will be in the > code that can hooked to do things like profiling, or if you want, record > the data into the trace buffer via the record event. > > > > > > > > > > Data will be output via debugfs, and provide the following output streams: > > > > > > /debugfs/tracing/<name>/buffers/text > > > clear text stream (will merge the per-cpu streams via insertion > > > sort, and use the print functions) > > > > Can you spell out this part a little more? I wonder because at the > > tracing miniconf on Wednesday we talked about systemtap's likely need > > to *consume* these trace events as they are being generated. > > What does this mean exactly. I should have asked more details about this > but I was too worried about time constraints (since Linus was speaking > next to think about it at the time. > > That is, when you read a marker, who and when does that data get consumed? > Where does that data go? Do you even need to store it in this ring buffer? > Given that systemtap may need to access the kernel state at the moment the instrumentation is reached, I think it implies they have to be called _before_ the data is written to the buffers. 
We can thus see SystemTAP as a very powerful filtering mechanism.
SystemTAP could also choose to directly use the instrumentation
available (kprobes, tracepoints, ...) when it does not need to act as a
filter, but more like a statistic gathering module. So I think we should
simply provide a callback filter registration mechanism in the filtering
chain, thus having a filtering pseudo-code looking like :

if (unlikely(marker_enabled))
  if (likely(global_tracing_enabled))
    if (likely(buffer_tracing_enabled))
      if (likely(call_filter_chain()))
        write to buffer

> >
> > If systemtap can only see them as a binary blob or a rendered ascii
> > string, they would not be as useful as if the record was decomposable
> > in kernel. Perhaps the event-type-registration call can declare the
> > binary struct, like a perl pack directive ... or a marker (binary)
> > format string.
>
> I don't understand the above. I'm also thinking that we are
> miscommunicating a bit here.
>
> Let make make an example with something that I know, ftrace.
>
> We hit the tracepoint in, lets say, scheduler. Previously at
> initialization time, ftrace would have registered with this trace point
> and would have allocated a buffer. When the tracepoint is actually hit, it
> jumps to the function that ftrace registered. Then this function would
> record the event into the buffer using 'record_event'. At a later time,
> the user could read the buffer from the filesystem using either the pretty
> print format or raw binary format.
>
> This code is not replacing tracepoints or markers. When you say you will
> consume at reading, it sounds like you don't even need to use the buffer
> mechanism and the trace points/markers is good enough for you. The
> tracepoints and markers are not something that is being replaced. The
> trace buffers are just something to use that we can record data to that we
> can retrieve at a later time.
> If we export the data to record through markers, we would be able to let SystemTAP (registered in our filter chain) look at the event types being passed to the buffers and thus perform clever filtering on the information. > > > > > > > CONTROL > > > > > > Sysfs style tree under debugfs > > > > > > /debugfs/tracing/<name>/buffers/enabed <--- binary value > > > > > > /debugfs/tracing/<name>/<event1> > > > /debugfs/tracing/<name>/<event2> > > > etc ... > > > provides a way to enable/disable events, see what's available, and > > > what's enabled. > > > > This sort of control is (or should be) already available for markers. > > Markers and the buffers are two separate things. Perhaps I'm just tired, > but I'm thinking that you are thinking we are going to remove markers and > trace points. > > This code is only to give the kernel a ring buffer to use. Not a way to > put hooks into kernel code. We have tracepoints and markers for that. > I think what Frank tries to express is that we would not lose any flexibility, but make life much easier for everyone, if we use the markers as the API to register event ids, keep their type table and to export the data at runtime. One addition that would be needed to the markers is to create a "binary blob" type (size %lu binary %pW ? W would be for "Write") which would express that this data is actually only parsable by specialized functions which know the types embedded within the binary blob. If necessary, systemtap could then incrementally develop specialized functions to deal with these blobs, only if required. All the other basic types would be easy to print or filter with a vsnprintf-like function. Mathieu > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
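As an illustration of the nested enable checks and filter chain sketched in the message above, here is a minimal user-space sketch in C. It is not kernel code and not the markers API: `trace_buffer`, `register_filter`, `filter_even_ids` and `trace_event` are hypothetical names, the filter signature is an assumption, and the final "write to buffer" step is replaced by a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Branch hints in the style of the kernel macros (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define MAX_FILTERS 4	/* hypothetical limit for this sketch */

/* Hypothetical filter signature: return true to keep the event. */
typedef bool (*trace_filter_fn)(int event_id, const void *data, size_t len);

struct trace_buffer {
	bool enabled;				/* per-buffer "enable" */
	trace_filter_fn filters[MAX_FILTERS];	/* registered filter chain */
	size_t nr_filters;
	unsigned long nr_written;		/* stands in for the real write */
};

bool global_tracing_enabled;			/* global "enable" */

/* Append a filter to the chain; returns 0 on success, -1 if it is full. */
int register_filter(struct trace_buffer *b, trace_filter_fn fn)
{
	if (b->nr_filters >= MAX_FILTERS)
		return -1;
	b->filters[b->nr_filters++] = fn;
	return 0;
}

/* Every registered filter must accept the event for it to be kept. */
bool call_filter_chain(const struct trace_buffer *b, int event_id,
		       const void *data, size_t len)
{
	size_t i;

	for (i = 0; i < b->nr_filters; i++)
		if (!b->filters[i](event_id, data, len))
			return false;
	return true;
}

/* Example filter (illustration only): keep only even event ids. */
bool filter_even_ids(int event_id, const void *data, size_t len)
{
	(void)data;
	(void)len;
	return (event_id % 2) == 0;
}

/* Returns true if the event passed every check and was "written". */
bool trace_event(struct trace_buffer *b, bool marker_enabled,
		 int event_id, const void *data, size_t len)
{
	if (unlikely(marker_enabled))
		if (likely(global_tracing_enabled))
			if (likely(b->enabled))
				if (likely(call_filter_chain(b, event_id, data, len))) {
					b->nr_written++;	/* write to buffer */
					return true;
				}
	return false;
}
```

The ordering mirrors the pseudo-code: the cheapest and least likely test comes first, so a disabled marker costs one branch, and the filter chain only runs once both enable flags are set.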
* Re: Unified tracing buffer 2008-09-20 13:37 ` Mathieu Desnoyers @ 2008-09-20 13:51 ` Steven Rostedt 2008-09-20 14:54 ` Steven Rostedt 0 siblings, 1 reply; 122+ messages in thread From: Steven Rostedt @ 2008-09-20 13:51 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od On Sat, 20 Sep 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > [ Note, It is very late for me and I probably should be finishing up > > packing for my trip home, but I'm stupid and decided to respond now > > instead ] [ Note, I'm now at the airport sipping on coffee, and I had 2 hours of horrible sleep. ] > > > > > > On Fri, 19 Sep 2008, Frank Ch. Eigler wrote: > > > > > Hi - > > > > > > > > > On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote: > > > > > > > During kernel summit and Plumbers conference, Linus and others > > > > expressed a desire for a unified tracing buffer system for multiple > > > > tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to > > > > use. > > > > > > OK. > > > > > > > > > > [...] > > > > STORAGE > > > > ------- > > > > > > > > We will support multiple buffers for different tracing systems, with > > > > separate names, event id spaces. [...] > > > > > > OK. This is completely orthogonal to ... > > > > > > > > > > INPUT_FUNCTIONS > > > > --------------- > > > > > > > > allocate_buffer (name, size) > > > > return buffer_handle > > > > > > > > register_event (buffer_handle, event_id, print_function) > > > > You can pass in a requested event_id from a fixed set, and > > > > will be given it, or an error > > > > 0 means allocate me one dynamically > > > > returns event_id (or -E_ERROR) > > > > > > > > record_event (buffer_handle, event_id, length, *buf) > > > > > > How do you imagine record_event() being used from the point of view of > > > the instrumented module? 
Is it to be protected by some sort of test > > > of the control variable? Is the little binary event buffer supposed > > > to be constructed unconditionally? (On the stack?) > > > > The event buffer is allocated when you create the buffer. The tracer will > > do that on initialization. > > > > I would propose disable the event source ASAP when disabled, without any > conditional test if possible. This is actually what the Kernel Markers > does with the help of Immediate Values. It should also come with a > finer-grained filtering based on a global "enable" variable and > per-buffer "enable" variables so a tracer can atomically start > collecting all of its event types. We could then require both > global tracing and per-buffer tracing to be enabled to write into the > buffer. > > I think I see where Frank is going, and I agree that this is more or > less what I have had in mind : the Markers could be used as a global > event ID registry and could hold the event name, ID, types (format > string) table. Therefore, the markers would simply become the "Write to > buffer" interface, which would associate IDs automatically and keep the > format strings into a table dumped into a metainformation trace buffer > at trace start and whenever IDs are dynamically registered while the > trace is active. > > > > > > > > > > You should compare this to markers and tracepoints. It sounds to me like > > > this is not that different from > > > > > > trace_mark (event_name, "%*b", length, buf); > > > > > > where the goofy "%*b" could be some magic to identify the proposed > > > "everything is a short blob" event record type. > > > > This is completely separate from the trace buffer itself. The trace points > > and trace markers simply write whatever they want into the trace buffer. > > The tracepoints and trace markers are something to hook points of code to > > do some type of tracing. The trace buffer record_event is how it will do > > it if it chooses to do so. 
> > > > As I said above, the tracepoints are meant to be a in-kernel API which > instruments the kernel code. It leaves the markers, which are meant to > be exposed to userspace anyway, for such record_event use. They would > actually accomplish two things : they would register the event (just > declaring the marker puts the event in a special section, which is our > mapping table) and would also record the event when enabled and > executed. > > > > > > > > By the way, systemtap supports formatted printing that generates > > > binary records via directives like "%4b" for 4-byte ints. I wonder if > > > that would be a suitable facility for this and/or markers to allow > > > instrumentation code to *generate* those binary event records. > > > > > > > The tracepoints and markers can generate anything they want. > > > > > > > > Do you believe that fans of tracepoints would support a single > > > void*/length struct parametrization? > > > > The "record_event" would not (I repeat, "not") be in general code. It will > > be used by the different tracers. The markers/tracepoints will be in the > > code that can hooked to do things like profiling, or if you want, record > > the data into the trace buffer via the record event. > > > > > > > > > > > > > > > Data will be output via debugfs, and provide the following output streams: > > > > > > > > /debugfs/tracing/<name>/buffers/text > > > > clear text stream (will merge the per-cpu streams via insertion > > > > sort, and use the print functions) > > > > > > Can you spell out this part a little more? I wonder because at the > > > tracing miniconf on Wednesday we talked about systemtap's likely need > > > to *consume* these trace events as they are being generated. > > > > What does this mean exactly. I should have asked more details about this > > but I was too worried about time constraints (since Linus was speaking > > next to think about it at the time. 
> > > > That is, when you read a marker, who and when does that data get consumed? > > Where does that data go? Do you even need to store it in this ring buffer? > > > > Given that systemtap may need to access the kernel state and the moment > the instrumentation is reached, I think it implies they have to be > called _before_ the data is writter to the buffers. We can thus see > SystemTAP as a very powerful filtering mechanism. SystemTAP could also > choose to directly use the instrumentation available (kprobes, > tracepoints, ...) when it does not need to act as a filter, but more > like a statistic gathering module. So I think we should simply provide a > callback filter registration mechanism in the filtering chain, thus > having a filtering pseudo-code looking like : > > if (unlikely(marker_enabled)) > if (likely(global_tracing_enabled)) > if (likely(buffer_tracing_enabled)) > if (likely(call_filter_chain())) > write to buffer > > > > > > > If systemtap can only see them as a binary blob or a rendered ascii > > > string, they would not be as useful as if the record was decomposable > > > in kernel. Perhaps the event-type-registration call can declare the > > > binary struct, like a perl pack directive ... or a marker (binary) > > > format string. > > > > I don't understand the above. I'm also thinking that we are > > miscommunicating a bit here. > > > > Let make make an example with something that I know, ftrace. > > > > We hit the tracepoint in, lets say, scheduler. Previously at > > initialization time, ftrace would have registered with this trace point > > and would have allocated a buffer. When the tracepoint is actually hit, it > > jumps to the function that ftrace registered. Then this function would > > record the event into the buffer using 'record_event'. At a later time, > > the user could read the buffer from the filesystem using either the pretty > > print format or raw binary format. > > > > This code is not replacing tracepoints or markers. 
When you say you will > > consume at reading, it sounds like you don't even need to use the buffer > > mechanism and the trace points/markers is good enough for you. The > > tracepoints and markers are not something that is being replaced. The > > trace buffers are just something to use that we can record data to that we > > can retrieve at a later time. > > > > If we export the data to record through markers, we would be able to let > SystemTAP (registered in our filter chain) look at the event types being > passed to the buffers and thus perform clever filtering on the > information. > > > > > > > > > > > CONTROL > > > > > > > > Sysfs style tree under debugfs > > > > > > > > /debugfs/tracing/<name>/buffers/enabed <--- binary value > > > > > > > > /debugfs/tracing/<name>/<event1> > > > > /debugfs/tracing/<name>/<event2> > > > > etc ... > > > > provides a way to enable/disable events, see what's available, and > > > > what's enabled. > > > > > > This sort of control is (or should be) already available for markers. > > > > Markers and the buffers are two separate things. Perhaps I'm just tired, > > but I'm thinking that you are thinking we are going to remove markers and > > trace points. > > > > This code is only to give the kernel a ring buffer to use. Not a way to > > put hooks into kernel code. We have tracepoints and markers for that. > > > > I think what Frank tries to express is that we would not lose any > flexibility, but make life much easier for everyone, if we use the > markers as the API to register event ids, keep their type table and to > export the data at runtime. No, absolutely not! Sorry, I don't want to touch markers. I'm fine with tracepoints, but there should be no need to use a damn marker if I want to use the tracer. I shouldn't need to even touch tracepoints to use the trace buffer. That is making things too complicated again. The tracepoints and markers should allow you to hook into the buffers. They are separate. 
I can imagine using tracepoints without needing buffers and I can see using the buffers without using tracepoints or markers. They are separate things. Do not bind the use of the buffers around markers. Markers are great for you and for many others, but this is about the tracing mechanism and one should not be forced to use markers if they want to do a trace. -- Steve > > One addition that would be needed to the markers is to create a "binary > blob" type (size %lu binary %pW ? W would be for "Write") which would > express that this data is actually only parsable by specialized > functions which knows the types embedded within the binary blob. If > necessary, systemtap could then develop incrementally specialized > functions to deal with these blob, only if required. All the other basic > types would be easy to print or filter with a vsnprintf-like function. > > Mathieu > > > -- Steve > > > > -- > Mathieu Desnoyers > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 > ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-20 13:51 ` Steven Rostedt
@ 2008-09-20 14:54 ` Steven Rostedt
2008-09-22 18:45 ` Mathieu Desnoyers
0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20 14:54 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, od

On Sat, 20 Sep 2008, Steven Rostedt wrote:
> > >
> > > Markers and the buffers are two separate things. Perhaps I'm just tired,
> > > but I'm thinking that you are thinking we are going to remove markers and
> > > trace points.
> > >
> > > This code is only to give the kernel a ring buffer to use. Not a way to
> > > put hooks into kernel code. We have tracepoints and markers for that.
> > >
> >
> > I think what Frank tries to express is that we would not lose any
> > flexibility, but make life much easier for everyone, if we use the
> > markers as the API to register event ids, keep their type table and to
> > export the data at runtime.
>
> No, absolutely not!
>
> Sorry, I don't want to touch markers. I'm fine with tracepoints, but
> there should be no need to use a damn marker if I want to use the tracer.
> I shouldn't need to even touch tracepoints to use the trace buffer.
> That is making things too complicated again. The tracepoints and markers
> should allow you to hook into the buffers. They are separate. I can
> imagine using tracepoints without needing buffers and I can see using the
> buffers without using tracepoints or markers. They are separate things. Do
> not bind the use of the buffers around markers.
>
>
> Markers are great for you and for many others, but this is about the
> tracing mechanism and one should not be forced to use markers if they want
> to do a trace.
>

Mathieu,

Think about the function tracer itself. It gets called at every function,
where I record the interrupts enabled state, task pid, preempt state,
function addr, and parent function addr. (that's just off the top of my
head, I may even record more).

What I don't want is a:

function_call(unsigned long func, unsigned long parent)
{
	struct ftrace_event event;

	event.pid = current->pid;
	event.pc = preempt_count();
	event.irq = local_irq_flags();
	event.func = func;
	event.parent = parent;

	trace_mark(func_event_id, "%p",
		sizeof(event), &event);
}

and then to turn on function tracing, I need to hook into this marker. I'd
rather just push the data right into the buffer here without having to
make another function call to hook into this.

I'd rather have instead a simple:

	struct ftrace_event *event;

	event = ring_buffer_reserve(func_event_id,
				sizeof(*event));

	event->pid = current->pid;
	event->pc = preempt_count();
	event->irq = local_irq_flags();
	event->func = func;
	event->parent = parent;

	ring_buffer_commit(event);

-- Steve

^ permalink raw reply [flat|nested] 122+ messages in thread
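The reserve/commit pattern in the message above can be tried out in user space. What follows is a minimal sketch under loud assumptions, not the proposed kernel API: `toy_buffer`, `toy_init`, `toy_reserve`, `toy_commit`, `toy_peek` and `toy_payload` are hypothetical names, the buffer handle is passed explicitly, there is no locking and no wrap-around (a full buffer simply refuses the reservation), and the record header layout is invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Records are padded to pointer size (an assumption of this sketch). */
#define REC_ALIGN sizeof(void *)

struct record_header {
	uint16_t event_id;	/* the proposal uses 16-bit event ids */
	uint16_t length;	/* payload length in bytes */
};

struct toy_buffer {
	unsigned char *data;	/* malloc()ed storage */
	size_t size;
	size_t reserved;	/* end of the space handed out so far */
	size_t committed;	/* end of the records readers may see */
};

/* Payload layout borrowed from the ftrace example in the message. */
struct ftrace_event {
	int pid;
	int pc;
	unsigned long irq;
	unsigned long func;
	unsigned long parent;
};

/* Round n up to the next multiple of REC_ALIGN (a power of two). */
size_t rec_round_up(size_t n)
{
	return (n + REC_ALIGN - 1) & ~(REC_ALIGN - 1);
}

/* Allocate the storage; returns 0 on success, -1 on allocation failure. */
int toy_init(struct toy_buffer *b, size_t size)
{
	b->data = malloc(size);
	b->size = size;
	b->reserved = 0;
	b->committed = 0;
	return b->data ? 0 : -1;
}

/* Reserve one record; returns the payload pointer, or NULL if it won't fit. */
void *toy_reserve(struct toy_buffer *b, uint16_t event_id, size_t len)
{
	struct record_header *hdr;
	size_t need;

	if (len > UINT16_MAX)
		return NULL;
	need = rec_round_up(sizeof(*hdr)) + rec_round_up(len);
	if (need > b->size - b->reserved)
		return NULL;

	hdr = (struct record_header *)(b->data + b->reserved);
	hdr->event_id = event_id;
	hdr->length = (uint16_t)len;
	b->reserved += need;
	return (unsigned char *)hdr + rec_round_up(sizeof(*hdr));
}

/* Publish everything reserved so far to readers. */
void toy_commit(struct toy_buffer *b)
{
	b->committed = b->reserved;
}

/* Header of the record at byte offset off, or NULL if not committed yet. */
const struct record_header *toy_peek(const struct toy_buffer *b, size_t off)
{
	if (off >= b->committed)
		return NULL;
	return (const struct record_header *)(b->data + off);
}

/* Payload that follows a record header. */
const void *toy_payload(const struct record_header *hdr)
{
	return (const unsigned char *)hdr + rec_round_up(sizeof(*hdr));
}
```

The point of the pattern is that the caller fills the event in place, so there is no intermediate copy on the stack; the record only becomes visible to a reader once the commit has run.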
* Re: Unified tracing buffer 2008-09-20 14:54 ` Steven Rostedt @ 2008-09-22 18:45 ` Mathieu Desnoyers 2008-09-22 21:39 ` Steven Rostedt 0 siblings, 1 reply; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-22 18:45 UTC (permalink / raw) To: Steven Rostedt Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od * Steven Rostedt (rostedt@goodmis.org) wrote: > > > On Sat, 20 Sep 2008, Steven Rostedt wrote: > > > > > > > > > > > > > > Markers and the buffers are two separate things. Perhaps I'm just tired, > > > > but I'm thinking that you are thinking we are going to remove markers and > > > > trace points. > > > > > > > > This code is only to give the kernel a ring buffer to use. Not a way to > > > > put hooks into kernel code. We have tracepoints and markers for that. > > > > > > > > > > I think what Frank tries to express is that we would not lose any > > > flexibility, but make life much easier for everyone, if we use the > > > markers as the API to register event ids, keep their type table and to > > > export the data at runtime. > > > > No, absolutely not! > > > > Sorry, I don't want to touch markers. I'm fine with tracepoints, but > > there should be no need to use a damn marker if I want to use the tracer. > > I shouldn't need to even touch tracepoints to use the trace buffer. > > That is making things too complicated again. The tracepoints and markers > > should allow you to hook into the buffers. They are separate. I can > > imagine using tracepoints without needing buffers and I can see using the > > buffers without using tracepoints or markers. They are separate things. Do > > not bind the use of the buffers around markers. > > > > > > Markers are great for you and for many others, but this is about the > > tracing mechanism and one should not be forced to use markers if they want > > to do a trace. 
> >
>

Hi Steven,

As I expressed above, this is merely one way I propose data could be
exported to user-space. If you have other simpler design ideas in mind,
I look forward to hearing them so we can discuss the technical
difficulties associated with that kind of exercise : sending binary data
across the kernel-userspace boundary. See below for comments.

> Mathieu,
>
> Think about the function tracer itself. It gets called at every function,
> where I record the interrupts enabled state, task pid, preempt state,
> function addr, and parent function addr. (that's just off the top of my
> head, I may even record more).
>
> What I don't want is a:
>
> function_call(unsigned long func, unsigned long parent)
> {
> 	struct ftrace_event event;
>
> 	event.pid = current->pid;
> 	event.pc = preempt_count();
> 	event.irq = local_irq_flags();
> 	event.func = func;
> 	event.parent = parent;
>
> 	trace_mark(func_event_id, "%p",
> 		sizeof(event), &event);
> }
>
>
> and then to turn on function tracing, I need to hook into this marker. I'd
> rather just push the data right into the buffer here without having to
> make another function call to hook into this.
>
> I'd rather have instead a simple:
>
> 	struct ftrace_event *event;
>
> 	event = ring_buffer_reserve(func_event_id,
> 				sizeof(*event));
>
> 	event->pid = current->pid;
> 	event->pc = preempt_count();
> 	event->irq = local_irq_flags();
> 	event->func = func;
> 	event->parent = parent;
>
> 	ring_buffer_commit(event);
>

The scheme you propose here is based on a few inherent assumptions :

- You assume ring_buffer_reserve() and ring_buffer_commit() are static
  inline and thus do not turn into function calls.
- You assume these are small enough so they can be inlined without
  causing L1 insn cache thrashing when tracing is activated.
- You therefore assume they use a locking scheme that lets them be
  really really compact (e.g. interrupt disable and spin lock).
- You assume that the performance impact of doing a function call is
  bigger than the impact of locking, which is false by at least a factor
  of 10. Interrupt disable and spin locks are _really_ slow.

So I think putting the function call concern up front here is really a
matter of premature optimization gone wrong. I got burned by this
earlier in the history of LTTng. The first versions had a code generator
which created specialized code to serialize the information into the
buffers, exactly like you propose to do. But the overall impact on
kernel code size ended up being too big because we had to repeat all the
code to deal with the buffers for every different type.

However, I think there might be a way to satisfy us both. An information
source like the dynamic function tracer happens to fit a particular
use-case where one single execution site is used to format the data
received as parameter for a _lot_ of instrumented sites, and the type
and event names happen to be the same everywhere. This would therefore
benefit greatly from having the capability to write directly into the
buffers. The thing is that I would like ftrace to expose the types it
expects to write into the trace buffers so a generic trace buffer
userspace consumer could read them.

One way to do it, which would let you write data directly into the
buffers, would be something like :

(before anyone says "that's many lines of code", please compile it and
look at the assembly result. A lot of this translates into precomputed
values, especially for the event size computation).
(And no, the following code has not been compile-tested)


include/linux/someheader.h :

/* Calculate the offset needed to align the type */
static inline unsigned int var_align(size_t align_drift,
		size_t size_of_type)
{
	size_t alignment = min(sizeof(void *), size_of_type);

	return ((alignment - align_drift) & (alignment-1));
}

kernel/trace/ftrace.c :

/*
 * the following macro would only do the "declaration" part of the
 * markers, without doing all the function call stuff.
 */
DECLARE_MARKER(function_entry,
	"pid %d pc %d flags %lu func 0x%lX parent 0x%lX");

void ftrace_mcount(unsigned long ip, unsigned long parent_ip)
{
	size_t ev_size = 0;
	char *buffer;

	/*
	 * We assume event payload aligned on sizeof(void *).
	 * Event size calculated statically.
	 */
	ev_size += sizeof(int);
	ev_size += var_align(ev_size, sizeof(int));
	ev_size += sizeof(int);
	ev_size += var_align(ev_size, sizeof(unsigned long));
	ev_size += sizeof(unsigned long);
	ev_size += var_align(ev_size, sizeof(unsigned long));
	ev_size += sizeof(unsigned long);
	ev_size += var_align(ev_size, sizeof(unsigned long));
	ev_size += sizeof(unsigned long);

	/*
	 * Now reserve space and copy data.
	 */
	buffer = ring_buffer_reserve(func_event_id, ev_size);
	/* Write pid */
	*(int *)buffer = current->pid;
	buffer += sizeof(int);

	/* Write pc (the drift is the low bits of the write address) */
	buffer += var_align((size_t)buffer, sizeof(int));
	*(int *)buffer = preempt_count();
	buffer += sizeof(int);

	/* Write flags */
	buffer += var_align((size_t)buffer, sizeof(unsigned long));
	*(unsigned long *)buffer = local_irq_flags();
	buffer += sizeof(unsigned long);

	/* Write func */
	buffer += var_align((size_t)buffer, sizeof(unsigned long));
	*(unsigned long *)buffer = ip;
	buffer += sizeof(unsigned long);

	/* Write parent */
	buffer += var_align((size_t)buffer, sizeof(unsigned long));
	*(unsigned long *)buffer = parent_ip;
	buffer += sizeof(unsigned long);

	ring_buffer_commit(buffer, ev_size);
}


Would that be suitable for you ?
We could also think of passing the function pointer of the bin to ascii
converter to DECLARE_MARKER(), such as :

void function_entry_show(struct seq_file *m, char *buffer);

DECLARE_MARKER(function_entry, function_entry_show,
	"pid %d pc %d flags %lu func 0x%lX parent 0x%lX");

void function_entry_show(struct seq_file *m, char *buffer)
{
	/* Read pid */
	seq_printf(m, "pid = %d ", *(int *)buffer);
	buffer += sizeof(int);

	/* Read pc */
	buffer += var_align((size_t)buffer, sizeof(int));
	seq_printf(m, "pc = %d ", *(int *)buffer);
	buffer += sizeof(int);

	/* Read flags */
	buffer += var_align((size_t)buffer, sizeof(unsigned long));
	seq_printf(m, "flags = %lu ", *(unsigned long *)buffer);
	buffer += sizeof(unsigned long);

	/* Read func */
	buffer += var_align((size_t)buffer, sizeof(unsigned long));
	seq_printf(m, "func = 0x%lX ", *(unsigned long *)buffer);
	buffer += sizeof(unsigned long);

	/* Read parent */
	buffer += var_align((size_t)buffer, sizeof(unsigned long));
	seq_printf(m, "parent = 0x%lX ", *(unsigned long *)buffer);
	buffer += sizeof(unsigned long);
}

Note that in this particular case, given we would not need any special
"dump everything as if it was an unorganized array of bytes", the
function_entry_show() would be totally useless if we provide a sane
vsnprintf-like decoder based on the format string.

I did this example to show you how we could deal with the special cases
where people would be interested in writing a whole network packet (or
any similar structure) directly to the trace (given it has field
structures which are not too tied to the compiler internals and has
field sizes portable across architectures). We could do this without
much problem by adding a format string type which specifies such a
binary blob, and we could even leave room for people to provide their
ascii formatting function pointer, as my second example here shows.
Mathieu

>
> -- Steve
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 122+ messages in thread
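The padding arithmetic behind var_align() in the message above can be checked in user space. The sketch below keeps the same formula but is otherwise an assumption-laden illustration: `min_size`, `append_field` and `function_entry_size` are hypothetical helpers, and the formula only works when the type size is a power of two.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's min() macro. */
static inline size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Padding needed so that a field of size_of_type bytes, placed at offset
 * align_drift, starts on its natural alignment (capped at pointer size).
 * Relies on unsigned wrap-around and on size_of_type being a power of two.
 */
size_t var_align(size_t align_drift, size_t size_of_type)
{
	size_t alignment = min_size(sizeof(void *), size_of_type);

	return (alignment - align_drift) & (alignment - 1);
}

/* Append one field at the current offset; returns the offset after it. */
size_t append_field(size_t offset, size_t size_of_type)
{
	offset += var_align(offset, size_of_type);
	return offset + size_of_type;
}

/* Size of the "function_entry" payload: pid, pc, flags, func, parent. */
size_t function_entry_size(void)
{
	size_t off = 0;

	off = append_field(off, sizeof(int));		/* pid */
	off = append_field(off, sizeof(int));		/* pc */
	off = append_field(off, sizeof(unsigned long));	/* flags */
	off = append_field(off, sizeof(unsigned long));	/* func */
	off = append_field(off, sizeof(unsigned long));	/* parent */
	return off;
}
```

For this particular field order no padding is inserted on the usual ABIs, because the two ints together already end on an unsigned long boundary; a payload starting with a single int followed by an unsigned long would pick up padding on a 64-bit build.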
* Re: Unified tracing buffer 2008-09-22 18:45 ` Mathieu Desnoyers @ 2008-09-22 21:39 ` Steven Rostedt 2008-09-23 3:27 ` Mathieu Desnoyers 0 siblings, 1 reply; 122+ messages in thread From: Steven Rostedt @ 2008-09-22 21:39 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od On Mon, 22 Sep 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: [...] > > > > Think about the function tracer itself. It gets called at every funtion, > > where I record the interrupts enabled state, task pid, preempt state, > > function addr, and parent function addr. (that's just off the top of my > > head, I may even record more). > > > > What I don't want is a: > > > > function_call(unsigned long func, unsigned long parent) > > { > > struct ftrace_event event; > > > > event.pid = current->pid; > > event.pc = preempt_count(); > > event.irq = local_irq_flags(); > > event.func = func; > > event.parent = parent; > > > > trace_mark(func_event_id, "%p", > > sizeof(event), &event); > > } > > > > > > and then to turn on function tracing, I need to hook into this marker. I'd > > rather just push the data right into the buffer here without having to > > make another function call to hook into this. > > > > I'd rather have instead a simple: > > > > struct ftrace_event *event; > > > > event = ring_buffer_reserve(func_event_id, > > sizeof(*event)); > > > > event->pid = current->pid; > > event->pc = preempt_count(); > > event->irq = local_irq_flags(); > > event->func = func; > > event->parent = parent; > > > > ring_buffer_commit(event); > > > > The scheme you propose here is based on a few inherent assumptions : > > - You assume ring_buffer_reserve() and ring_buffer_commit() are static > inline and thus does not turn into function calls. > - You assume these are small enough so they can be inlined without > causing L1 insn cache trashing when tracing is activated. 
> - You therefore assume they use a locking scheme that lets them be > really really compact (e.g. interrupt disable and spin lock). > - You assume that the performance impact of doing a function call is > bigger than the impact of locking, which is false by at least a factor > 10. I don't assume anything. I will have the requirement that reserve and commit must be paired, and for the first version, hold locks. Maybe I should rename it to: ring_buffer_lock_reserve and ring_buffer_unlock_commit. To show this. [...] > kernel/trace/ftrace.c : > > /* > * the following macro would only do the "declaration" part of the > * markers, without doing all the function call stuff. > */ > DECLARE_MARKER(function_entry, > "pid %d pc %d flags %lu func 0x%lX parent 0x%lX"); > > void ftrace_mcount(unsigned long ip, unsigned long parent_ip) > { > size_t ev_size = 0; > char *buffer; > > /* > * We assume event payload aligned on sizeof(void *). > * Event size calculated statically. > */ > ev_size += sizeof(int); > ev_size += var_align(ev_size, sizeof(int)); > ev_size += sizeof(int); > ev_size += var_align(ev_size, sizeof(unsigned long)); > ev_size += sizeof(unsigned long); > ev_size += var_align(ev_size, sizeof(unsigned long)); > ev_size += sizeof(unsigned long); > ev_size += var_align(ev_size, sizeof(unsigned long)); > ev_size += sizeof(unsigned long); > > /* > * Now reserve space and copy data. 
> */ > buffer = ring_buffer_reserve(func_event_id, ev_size); > /* Write pid */ > *(int *)buffer = current->pid; > buffer += sizeof(int); > > /* Write pc */ > buffer += var_align(buffer, sizeof(int)); > *(int *)buffer = preempt_count(); > buffer += sizeof(int); > > /* Write flags */ > buffer += var_align(buffer, sizeof(unsigned long)); > *(unsigned long *)buffer = local_irq_flags(); > buffer += sizeof(unsigned long); > > /* Write func */ > buffer += var_align(buffer, sizeof(unsigned long)); > *(unsigned long *)buffer = func; > buffer += sizeof(unsigned long); > > /* Write parent */ > buffer += var_align(buffer, sizeof(unsigned long)); > *(unsigned long *)buffer = parent; > buffer += sizeof(unsigned long); > > ring_buffer_commit(buffer, ev_size); > } > > > Would that be suitable for you ? YUCK YUCK YUCK!!!! Mathieu, Do I have to bring up the argument of simplicity again? I will never use such an API. Mine was very simple, I have to spend 10 minutes trying to figure out what the above is. I only spent 5 so I'm still at a lost. 
> > We could also think of passing the function pointer of the bin to ascii > converter to DECLARE_MARKER(), such as : > > void function_entry_show(struct seq_file *m, char *buffer); > > DECLARE_MARKER(function_entry, function_entry_show, > "pid %d pc %d flags %lu func 0x%lX parent 0x%lX"); > > void function_entry_show(struct seq_file *m, char *buffer) > { > /* Read pid */ > seq_printf(m, "pid = %d ", *(int *)buffer); > buffer += sizeof(int); > > /* Read pc */ > buffer += var_align(buffer, sizeof(int)); > seq_printf(m, "pc = %d ", *(int *)buffer); > buffer += sizeof(int); > > /* Read flags */ > buffer += var_align(buffer, sizeof(unsigned long)); > seq_printf(m, "flags = %lu ", *(unsigned long *)buffer); > buffer += sizeof(unsigned long); > > /* Read func */ > buffer += var_align(buffer, sizeof(unsigned long)); > seq_printf(m, "func = 0x%lX ", *(unsigned long *)buffer); > buffer += sizeof(unsigned long); > > /* Read parent */ > buffer += var_align(buffer, sizeof(unsigned long)); > seq_printf(m, "parent = 0x%lX ", *(unsigned long *)buffer); > buffer += sizeof(unsigned long); > } > > Note that in this particular case, given we would not need any special > "dump everything as if it was an unorganized array of bytes", the > function_entry_show() would be totally useless if we provide a sane > vsnprintf-like decoder based on the format string. > > I did this example to show you how we could deal with the special cases > where people would be interested to write a whole network packet (or any > similar structure) directly to the trace (given it has field structures > which are not too tied to the compiler internals and has field sizes > portable across architectures). We could do this without much problem by > adding a format string type which specified such a binary blob, and we > could even leave room for people to provide their ascii formatting > function pointer, as shows my second example here. > These are not special cases, these are what I use often. 
They are not special for me.

-- Steve
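The paired reserve/commit interface Steven argues for can be sketched in plain userspace C. This is only an illustration under stated assumptions: the struct layout, the integer `locked` flag standing in for the per-CPU spinlock, `ring_buffer_init()`, and the exact signatures are hypothetical and not taken from any posted patch. The field types of `ftrace_event` are guessed from the mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for one per-CPU trace buffer. */
struct ring_buffer {
    unsigned char *data;
    size_t size;       /* capacity in bytes */
    size_t head;       /* offset of the next free byte */
    size_t reserved;   /* length of the record currently being written */
    int locked;        /* stands in for the per-CPU spinlock */
};

/* Same fields as the ftrace_event in the mail; the types are assumed. */
struct ftrace_event {
    int pid;
    int pc;
    unsigned long irq;
    unsigned long func;
    unsigned long parent;
};

static int ring_buffer_init(struct ring_buffer *rb, size_t size)
{
    rb->data = malloc(size);
    rb->size = size;
    rb->head = 0;
    rb->reserved = 0;
    rb->locked = 0;
    return rb->data ? 0 : -1;
}

/* Take the lock and return space for len bytes, or NULL if none is left. */
static void *ring_buffer_lock_reserve(struct ring_buffer *rb, size_t len)
{
    /* Round up so that every record starts word aligned. */
    len = (len + sizeof(long) - 1) & ~(sizeof(long) - 1);
    if (rb->locked || len > rb->size - rb->head)
        return NULL;
    rb->locked = 1;
    rb->reserved = len;
    return rb->data + rb->head;
}

/* Publish the reserved record and drop the lock; pairs with a reserve. */
static void ring_buffer_unlock_commit(struct ring_buffer *rb)
{
    assert(rb->locked);
    rb->head += rb->reserved;
    rb->reserved = 0;
    rb->locked = 0;
}

/* Fill the event in place; returns 0 on success, -1 if it was dropped. */
static int trace_function(struct ring_buffer *rb, int pid,
                          unsigned long func, unsigned long parent)
{
    struct ftrace_event *event;

    event = ring_buffer_lock_reserve(rb, sizeof(*event));
    if (!event)
        return -1;
    event->pid = pid;
    event->pc = 0;      /* preempt_count() in the kernel */
    event->irq = 0;     /* local_irq_flags() in the kernel */
    event->func = func;
    event->parent = parent;
    ring_buffer_unlock_commit(rb);
    return 0;
}
```

The point of the shape is that the caller writes straight into buffer memory between the two calls, with no intermediate copy and no marker hook. The sketch does not wrap around or overwrite old records; a flight-recorder mode would need that on top.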
* Re: Unified tracing buffer 2008-09-22 21:39 ` Steven Rostedt @ 2008-09-23 3:27 ` Mathieu Desnoyers 0 siblings, 0 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 3:27 UTC (permalink / raw) To: Steven Rostedt Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Mon, 22 Sep 2008, Mathieu Desnoyers wrote: > > * Steven Rostedt (rostedt@goodmis.org) wrote: [...] > > > and then to turn on function tracing, I need to hook into this marker. I'd > > > rather just push the data right into the buffer here without having to > > > make another function call to hook into this. > > > > > > > The scheme you propose here is based on a few inherent assumptions : > > > > - You assume ring_buffer_reserve() and ring_buffer_commit() are static > > inline and thus does not turn into function calls. > > - You assume these are small enough so they can be inlined without > > causing L1 insn cache trashing when tracing is activated. > > - You therefore assume they use a locking scheme that lets them be > > really really compact (e.g. interrupt disable and spin lock). > > - You assume that the performance impact of doing a function call is > > bigger than the impact of locking, which is false by at least a factor > > 10. > > I don't assume anything. I will have the requirement that reserve and > commit must be paired, and for the first version, hold locks. > By saying you don't want to do any function call, the only technical reason I see for you wanting that is performance, and thus you would assume the above. If not, why don't you want to make another function call ? This all I mean by "assumption" here. > Maybe I should rename it to: ring_buffer_lock_reserve and > ring_buffer_unlock_commit. To show this. > > [...] 
> > > kernel/trace/ftrace.c : > > > > /* > > * the following macro would only do the "declaration" part of the > > * markers, without doing all the function call stuff. > > */ > > DECLARE_MARKER(function_entry, > > "pid %d pc %d flags %lu func 0x%lX parent 0x%lX"); > > > > void ftrace_mcount(unsigned long ip, unsigned long parent_ip) > > { > > size_t ev_size = 0; > > char *buffer; > > > > /* > > * We assume event payload aligned on sizeof(void *). > > * Event size calculated statically. > > */ > > ev_size += sizeof(int); > > ev_size += var_align(ev_size, sizeof(int)); > > ev_size += sizeof(int); > > ev_size += var_align(ev_size, sizeof(unsigned long)); > > ev_size += sizeof(unsigned long); > > ev_size += var_align(ev_size, sizeof(unsigned long)); > > ev_size += sizeof(unsigned long); > > ev_size += var_align(ev_size, sizeof(unsigned long)); > > ev_size += sizeof(unsigned long); > > > > /* > > * Now reserve space and copy data. > > */ > > buffer = ring_buffer_reserve(func_event_id, ev_size); > > /* Write pid */ > > *(int *)buffer = current->pid; > > buffer += sizeof(int); > > > > /* Write pc */ > > buffer += var_align(buffer, sizeof(int)); > > *(int *)buffer = preempt_count(); > > buffer += sizeof(int); > > > > /* Write flags */ > > buffer += var_align(buffer, sizeof(unsigned long)); > > *(unsigned long *)buffer = local_irq_flags(); > > buffer += sizeof(unsigned long); > > > > /* Write func */ > > buffer += var_align(buffer, sizeof(unsigned long)); > > *(unsigned long *)buffer = func; > > buffer += sizeof(unsigned long); > > > > /* Write parent */ > > buffer += var_align(buffer, sizeof(unsigned long)); > > *(unsigned long *)buffer = parent; > > buffer += sizeof(unsigned long); > > > > ring_buffer_commit(buffer, ev_size); > > } > > > > > > Would that be suitable for you ? > > YUCK YUCK YUCK!!!! > > Mathieu, > > Do I have to bring up the argument of simplicity again? I will never use > such an API. 
Mine was very simple, I have to spend 10 minutes trying to
> figure out what the above is. I only spent 5 so I'm still at a lost.
>

I was actually waiting for you to propose an alternative, but I fear you
already did without me noticing :)

How do you deal with exporting data across the kernel/user boundary in
your proposal, exactly? How does this work on architectures with a 64-bit
kernel and 32-bit userland?

A simple C structure copy might be simple to _code_, but it is hellish to
export to userspace and leads to hard-to-debug binary incompatibilities
(different gcc flags, 32/64-bit user/kernel). And that is without even
mentioning the non-portability of the exported data. If gcc/icc-knowledgeable
people can reassure me by certifying it won't generate a mess, fine, but
until then, I remain very doubtful about solutions that rely on binary
compatibility between kernel and userland.

And come on... 10 minutes to understand the above code? You _are_ kidding
me, right? Would it help if I created a small 4-line-ish wrapper around
the buffer write?

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
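The small wrapper Mathieu offers here, combined with his 32/64-bit concern, might look roughly like the following. This is a sketch, not code from LTTng: `trace_write()`, `trace_read()` and `write_function_entry()` are hypothetical names, `var_align()` takes an offset rather than a pointer as in the quoted example, and the fixed-width field types are an assumption made so that the payload layout does not depend on the size of `unsigned long`. Byte order is still the host's and is not addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Padding needed to bring offset up to a multiple of align
 * (align must be a power of two). */
static size_t var_align(size_t offset, size_t align)
{
    return (align - offset) & (align - 1);
}

/* Append one naturally aligned field at *offset and advance *offset. */
static void trace_write(unsigned char *buf, size_t *offset,
                        const void *field, size_t size)
{
    size_t pad = var_align(*offset, size);

    memset(buf + *offset, 0, pad);      /* keep padding bytes defined */
    *offset += pad;
    memcpy(buf + *offset, field, size);
    *offset += size;
}

/* Mirror of trace_write(): skip padding, then copy the field out. */
static void trace_read(const unsigned char *buf, size_t *offset,
                       void *field, size_t size)
{
    *offset += var_align(*offset, size);
    memcpy(field, buf + *offset, size);
    *offset += size;
}

/* Serialize one function-entry event; returns the payload length. */
static size_t write_function_entry(unsigned char *buf, int32_t pid,
                                   int32_t pc, uint64_t flags,
                                   uint64_t func, uint64_t parent)
{
    size_t off = 0;

    trace_write(buf, &off, &pid, sizeof(pid));
    trace_write(buf, &off, &pc, sizeof(pc));
    trace_write(buf, &off, &flags, sizeof(flags));
    trace_write(buf, &off, &func, sizeof(func));
    trace_write(buf, &off, &parent, sizeof(parent));
    return off;
}
```

With fixed-width fields the payload is 32 bytes whether the writer is a 32-bit or a 64-bit build, which is the property a raw `struct` copy containing `unsigned long` does not give you.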
* Re: Unified tracing buffer 2008-09-19 21:33 Unified tracing buffer Martin Bligh ` (2 preceding siblings ...) 2008-09-19 23:18 ` Frank Ch. Eigler @ 2008-09-20 0:07 ` Peter Zijlstra 2008-09-22 14:07 ` K.Prasad 2008-09-20 0:26 ` Unified tracing buffer Marcel Holtmann ` (4 subsequent siblings) 8 siblings, 1 reply; 122+ messages in thread From: Peter Zijlstra @ 2008-09-20 0:07 UTC (permalink / raw) To: Martin Bligh Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler Oddly whitespace damaged mail.. On Fri, 2008-09-19 at 14:33 -0700, Martin Bligh wrote: > During kernel summit and Plumbers conference, Linus and others > expressed a desire for a unified > tracing buffer system for multiple tracing applications (eg ftrace, > lttng, systemtap, blktrace, etc) to use. > This provides several advantages, including the ability to interleave > data from multiple sources, > not having to learn 200 different tools, duplicated code/effort, etc. > > Several of us got together last night and tried to cut this down to > the simplest usable system > we could agree on (and nobody got hurt!). This will form version 1. > I've sketched out a few > enhancements we know that we want, but have agreed to leave these > until version 2. > The answer to most questions about the below is "yes we know, we'll > fix that in version 2" > (or 3). Simplicity was the rule ... > > Sketch of design. Enjoy flaming me. Code will follow shortly. > > > STORAGE > ------- > > We will support multiple buffers for different tracing systems, with > separate names, event id spaces. > Event ids are 16 bit, dynamically allocated. > A "one line of text" print function will be provided for each event, > or use the default (probably hex printf) > Will provide a "flight data recorder" mode, and a "spool to disk" mode. > > Circular buffer per cpu, protected by per-cpu spinlock_irq > Word aligned records. 
> Variable record length, header will start with length record. > Timestamps in fixed timebase, monotonically increasing (across all CPUs) > > > INPUT_FUNCTIONS > --------------- > > allocate_buffer (name, size) > return buffer_handle > > register_event (buffer_handle, event_id, print_function) > You can pass in a requested event_id from a fixed set, and > will be given it, or an error > 0 means allocate me one dynamically > returns event_id (or -E_ERROR) > > record_event (buffer_handle, event_id, length, *buf) I'd hoped for an interface like: struct ringbuffer *ringbuffer_alloc(const char *name, size_t size); void ringbuffer_free(struct ringbuffer *buffer); int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size); int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size); On top of which you'd do the event thing, the register event with a callback idea makes sense, except I'd split the consumption into two: - one method to pull the binary event out, which knows how long it ought to be etc.. - one method to convert the binary event to ASCII You don't always need the latter one, esp if you're dumping to disk. You can also generalize the merge sorting forward iterator when you have that, by providing an event compare function. By doing it like this folks can focus on utterly optimizing the ringbuffer to death, and other folks can toy around with doing fancy event encodings (/me pitches asn.1-der encoded structured data and runs like crazy) ^ permalink raw reply [flat|nested] 122+ messages in thread
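Peter's layering (a dumb byte ring buffer underneath, events on top, and consumption split into "pull the binary event" and "convert it to ASCII") can be sketched as follows. Everything here is an assumption for illustration: the fixed 128-byte buffer, the `event_header` layout, and the names `event_record()`, `event_pull()` and `print_u32_event()` are hypothetical, the `cpu` argument of his `ringbuffer_read()` is dropped, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RB_BYTES 128    /* power of two, so the counters may wrap safely */

/* Layer 1: a byte FIFO that knows nothing about events. */
struct ringbuffer {
    unsigned char data[RB_BYTES];
    size_t head;        /* total bytes ever written */
    size_t tail;        /* total bytes ever read */
};

static size_t ringbuffer_space(const struct ringbuffer *rb)
{
    return RB_BYTES - (rb->head - rb->tail);
}

static int ringbuffer_write(struct ringbuffer *rb, const void *buf, size_t size)
{
    const unsigned char *src = buf;

    if (size > ringbuffer_space(rb))
        return -1;              /* this sketch never overwrites old data */
    for (size_t i = 0; i < size; i++)
        rb->data[(rb->head + i) % RB_BYTES] = src[i];
    rb->head += size;
    return 0;
}

static int ringbuffer_read(struct ringbuffer *rb, void *buf, size_t size)
{
    unsigned char *dst = buf;

    if (size > rb->head - rb->tail)
        return -1;
    for (size_t i = 0; i < size; i++)
        dst[i] = rb->data[(rb->tail + i) % RB_BYTES];
    rb->tail += size;
    return 0;
}

/* Layer 2: an event is a small header followed by an opaque payload. */
struct event_header {
    uint16_t id;
    uint16_t len;               /* payload length in bytes */
};

static int event_record(struct ringbuffer *rb, uint16_t id,
                        const void *payload, uint16_t len)
{
    struct event_header hdr = { id, len };

    if (sizeof(hdr) + len > ringbuffer_space(rb))
        return -1;              /* write header and payload, or neither */
    ringbuffer_write(rb, &hdr, sizeof(hdr));
    ringbuffer_write(rb, payload, len);
    return 0;
}

/* Method 1: pull the binary event out; the header says how long it is. */
static int event_pull(struct ringbuffer *rb, struct event_header *hdr,
                      void *payload, size_t max)
{
    size_t saved_tail = rb->tail;

    if (ringbuffer_read(rb, hdr, sizeof(*hdr)) == 0 &&
        hdr->len <= max &&
        ringbuffer_read(rb, payload, hdr->len) == 0)
        return 0;
    rb->tail = saved_tail;      /* leave the stream untouched on failure */
    return -1;
}

/* Method 2: convert one binary event to ASCII; only needed for display. */
static int print_u32_event(const void *payload, size_t len,
                           char *out, size_t outlen)
{
    uint32_t value;

    if (len != sizeof(value))
        return -1;
    memcpy(&value, payload, sizeof(value));
    return snprintf(out, outlen, "value=%u", (unsigned int)value);
}
```

A spool-to-disk consumer would call only `event_pull()` and write the bytes out; the print callback is needed only when a human reads the trace, which is the reason for keeping the two methods separate.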
* Re: Unified tracing buffer 2008-09-20 0:07 ` Peter Zijlstra @ 2008-09-22 14:07 ` K.Prasad 2008-09-22 14:45 ` Peter Zijlstra 0 siblings, 1 reply; 122+ messages in thread From: K.Prasad @ 2008-09-22 14:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi On Sat, Sep 20, 2008 at 02:07:58AM +0200, Peter Zijlstra wrote: > Oddly whitespace damaged mail.. > > On Fri, 2008-09-19 at 14:33 -0700, Martin Bligh wrote: > > During kernel summit and Plumbers conference, Linus and others > > expressed a desire for a unified > > tracing buffer system for multiple tracing applications (eg ftrace, > > lttng, systemtap, blktrace, etc) to use. > > This provides several advantages, including the ability to interleave > > data from multiple sources, > > not having to learn 200 different tools, duplicated code/effort, etc. > > > > Several of us got together last night and tried to cut this down to > > the simplest usable system > > we could agree on (and nobody got hurt!). This will form version 1. > > I've sketched out a few > > enhancements we know that we want, but have agreed to leave these > > until version 2. > > The answer to most questions about the below is "yes we know, we'll > > fix that in version 2" > > (or 3). Simplicity was the rule ... > > > > Sketch of design. Enjoy flaming me. Code will follow shortly. > > > > > > STORAGE > > ------- > > > > We will support multiple buffers for different tracing systems, with > > separate names, event id spaces. > > Event ids are 16 bit, dynamically allocated. > > A "one line of text" print function will be provided for each event, > > or use the default (probably hex printf) > > Will provide a "flight data recorder" mode, and a "spool to disk" mode. > > > > Circular buffer per cpu, protected by per-cpu spinlock_irq > > Word aligned records. 
> > Variable record length, header will start with length record. > > Timestamps in fixed timebase, monotonically increasing (across all CPUs) > > > > > > INPUT_FUNCTIONS > > --------------- > > > > allocate_buffer (name, size) > > return buffer_handle > > > > register_event (buffer_handle, event_id, print_function) > > You can pass in a requested event_id from a fixed set, and > > will be given it, or an error > > 0 means allocate me one dynamically > > returns event_id (or -E_ERROR) > > > > record_event (buffer_handle, event_id, length, *buf) > > I'd hoped for an interface like: > > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size); > void ringbuffer_free(struct ringbuffer *buffer); > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size); > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size); > > On top of which you'd do the event thing, the register event with a > callback idea makes sense, except I'd split the consumption into two: > - one method to pull the binary event out, which knows how long it > ought to be etc.. > - one method to convert the binary event to ASCII > In conjunction with the previous email on this thread (http://lkml.org/lkml/2008/9/22/160), may I suggest the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be: relay_printk(<some struct with default filenames/pathnames>, <string>, ...) ; relay_dump(<some struct with default filenames/pathnames>, <binary data>); and relay_cleanup_all(<the struct name>); - Single interface that cleans up all files/directories/output data created under a logical entity. Thanks, K.Prasad ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-22 14:07 ` K.Prasad @ 2008-09-22 14:45 ` Peter Zijlstra 2008-09-22 16:29 ` Martin Bligh ` (5 more replies) 0 siblings, 6 replies; 122+ messages in thread From: Peter Zijlstra @ 2008-09-22 14:45 UTC (permalink / raw) To: prasad Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi On Mon, 2008-09-22 at 19:37 +0530, K.Prasad wrote: > > > INPUT_FUNCTIONS > > > --------------- > > > > > > allocate_buffer (name, size) > > > return buffer_handle > > > > > > register_event (buffer_handle, event_id, print_function) > > > You can pass in a requested event_id from a fixed set, and > > > will be given it, or an error > > > 0 means allocate me one dynamically > > > returns event_id (or -E_ERROR) > > > > > > record_event (buffer_handle, event_id, length, *buf) > > > > I'd hoped for an interface like: > > > > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size); > > void ringbuffer_free(struct ringbuffer *buffer); > > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size); > > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size); > > > > On top of which you'd do the event thing, the register event with a > > callback idea makes sense, except I'd split the consumption into two: > > - one method to pull the binary event out, which knows how long it > > ought to be etc.. > > - one method to convert the binary event to ASCII > > > In conjunction with the previous email on this thread > (http://lkml.org/lkml/2008/9/22/160), may I suggest > the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be: > > relay_printk(<some struct with default filenames/pathnames>, <string>, > ....) 
;
> relay_dump(<some struct with default filenames/pathnames>, <binary
> data>);
> and
> relay_cleanup_all(<the struct name>); - Single interface that cleans up
> all files/directories/output data created under a logical entity.

Dude, relayfs is such a bad performing mess that extending it seems like
a bad idea. Better to write something new and delete everything relayfs
related.

Also, it seems prudent to separate the ring-buffer implementation from
the event encoding/decoding facilities.
* Re: Unified tracing buffer
2008-09-22 14:45 ` Peter Zijlstra
@ 2008-09-22 16:29 ` Martin Bligh
2008-09-22 16:36 ` Peter Zijlstra
2008-09-23 2:49 ` Mathieu Desnoyers
` (4 subsequent siblings)
5 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-22 16:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
Andrew Morton, hch, David Wilder, zanussi

>> In conjunction with the previous email on this thread
>> (http://lkml.org/lkml/2008/9/22/160), may I suggest
>> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
>>
>> relay_printk(<some struct with default filenames/pathnames>, <string>,
>> ....) ;
>> relay_dump(<some struct with default filenames/pathnames>, <binary
>> data>);
>> and
>> relay_cleanup_all(<the struct name>); - Single interface that cleans up
>> all files/directories/output data created under a logical entity.
>
> Dude, relayfs is such a bad performing mess that extending it seems like
> a bad idea. Better to write something new and delete everything relayfs
> related.

There did seem to be pretty universal agreement that we'd rather not
use relayfs.

> Also, it seems prudent to separate the ring-buffer implementation from
> the event encoding/decoding facilities.

Right - in a conversation I had with Mathieu later, he suggested cleaning up
relayfs - I fear this will delay us far too long, and get bogged down.
If we can get one clean circular buffer implementation, then both
relayfs and the tracing could share that common solution.
* Re: Unified tracing buffer
2008-09-22 16:29 ` Martin Bligh
@ 2008-09-22 16:36 ` Peter Zijlstra
2008-09-22 20:50 ` Masami Hiramatsu
2008-09-23 3:05 ` Mathieu Desnoyers
0 siblings, 2 replies; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-22 16:36 UTC (permalink / raw)
To: Martin Bligh
Cc: prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
Andrew Morton, hch, David Wilder, zanussi

On Mon, 2008-09-22 at 09:29 -0700, Martin Bligh wrote:
> >> In conjunction with the previous email on this thread
> >> (http://lkml.org/lkml/2008/9/22/160), may I suggest
> >> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> >>
> >> relay_printk(<some struct with default filenames/pathnames>, <string>,
> >> ....) ;
> >> relay_dump(<some struct with default filenames/pathnames>, <binary
> >> data>);
> >> and
> >> relay_cleanup_all(<the struct name>); - Single interface that cleans up
> >> all files/directories/output data created under a logical entity.
> >
> > Dude, relayfs is such a bad performing mess that extending it seems like
> > a bad idea. Better to write something new and delete everything relayfs
> > related.
>
> There did seem to be pretty universal agreement that we'd rather not
> use relayfs.
>
> > Also, it seems prudent to separate the ring-buffer implementation from
> > the event encoding/decoding facilities.
>
> Right - in conversation I had with Mathieu later, he suggested cleaning up
> relayfs - I fear this will delay us far too long, and get bogged down.
> If we can get one clean circular buffer implementation, then both
> relayfs and the tracing could share that common solution,

Currently only blktrace and kvmtrace use relayfs, and I've heard people
talk about converting both to use lttng/ftrace infrastructure. At which
point relayfs is orphaned and ready for removal.
* Re: Unified tracing buffer 2008-09-22 16:36 ` Peter Zijlstra @ 2008-09-22 20:50 ` Masami Hiramatsu 2008-09-23 3:05 ` Mathieu Desnoyers 1 sibling, 0 replies; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-22 20:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Martin Bligh, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi Peter Zijlstra wrote: > On Mon, 2008-09-22 at 09:29 -0700, Martin Bligh wrote: >>>> In conjunction with the previous email on this thread >>>> (http://lkml.org/lkml/2008/9/22/160), may I suggest >>>> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be: >>>> >>>> relay_printk(<some struct with default filenames/pathnames>, <string>, >>>> ....) ; >>>> relay_dump(<some struct with default filenames/pathnames>, <binary >>>> data>); >>>> and >>>> relay_cleanup_all(<the struct name>); - Single interface that cleans up >>>> all files/directories/output data created under a logical entity. >>> Dude, relayfs is such a bad performing mess that extending it seems like >>> a bad idea. Better to write something new and delete everything relayfs >>> related. >> There did seem to be pretty universal agreement that we'd rather not >> use relayfs. >> >>> Also, it seems prudent to separate the ring-buffer implementation from >>> the event encoding/decoding facilities. >> Right - in conversation I had with Mathieu later, he suggested cleaning up >> relayfs - I fear this will delay us far too long, and get bogged down. >> If we can get one clean circular buffer implementation, then both >> relayfs and the tracing could share that common solution, > > Currently only blktrace and kvmtrace use relayfs, and I've heard people > talk about converting both to use lttng/ftrace infrastructure. At which > point relayfs is orphaned and ready for removal. Hi Peter, Systemtap is still a heavy user of relayfs. 
:-)
Anyway, if the new buffering mechanism is sufficient for us, I think we'll
be happy to move to it.

Thank you,

--
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer 2008-09-22 16:36 ` Peter Zijlstra 2008-09-22 20:50 ` Masami Hiramatsu @ 2008-09-23 3:05 ` Mathieu Desnoyers 1 sibling, 0 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 3:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Martin Bligh, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: > On Mon, 2008-09-22 at 09:29 -0700, Martin Bligh wrote: > > >> In conjunction with the previous email on this thread > > >> (http://lkml.org/lkml/2008/9/22/160), may I suggest > > >> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be: > > >> > > >> relay_printk(<some struct with default filenames/pathnames>, <string>, > > >> ....) ; > > >> relay_dump(<some struct with default filenames/pathnames>, <binary > > >> data>); > > >> and > > >> relay_cleanup_all(<the struct name>); - Single interface that cleans up > > >> all files/directories/output data created under a logical entity. > > > > > > Dude, relayfs is such a bad performing mess that extending it seems like > > > a bad idea. Better to write something new and delete everything relayfs > > > related. > > > > There did seem to be pretty universal agreement that we'd rather not > > use relayfs. > > > > > Also, it seems prudent to separate the ring-buffer implementation from > > > the event encoding/decoding facilities. > > > > Right - in conversation I had with Mathieu later, he suggested cleaning up > > relayfs - I fear this will delay us far too long, and get bogged down. > > If we can get one clean circular buffer implementation, then both > > relayfs and the tracing could share that common solution, > > Currently only blktrace and kvmtrace use relayfs, and I've heard people > talk about converting both to use lttng/ftrace infrastructure. At which > point relayfs is orphaned and ready for removal. 
>

LTTng sits on top of relay for buffer allocation and for the mmap
operation (that's about it, it overrides the rest).

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Unified tracing buffer 2008-09-22 14:45 ` Peter Zijlstra 2008-09-22 16:29 ` Martin Bligh @ 2008-09-23 2:49 ` Mathieu Desnoyers 2008-09-23 5:25 ` Tom Zanussi ` (3 subsequent siblings) 5 siblings, 0 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:49 UTC (permalink / raw) To: Peter Zijlstra Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: > On Mon, 2008-09-22 at 19:37 +0530, K.Prasad wrote: > > > > > INPUT_FUNCTIONS > > > > --------------- > > > > > > > > allocate_buffer (name, size) > > > > return buffer_handle > > > > > > > > register_event (buffer_handle, event_id, print_function) > > > > You can pass in a requested event_id from a fixed set, and > > > > will be given it, or an error > > > > 0 means allocate me one dynamically > > > > returns event_id (or -E_ERROR) > > > > > > > > record_event (buffer_handle, event_id, length, *buf) > > > > > > I'd hoped for an interface like: > > > > > > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size); > > > void ringbuffer_free(struct ringbuffer *buffer); > > > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size); > > > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size); > > > > > > On top of which you'd do the event thing, the register event with a > > > callback idea makes sense, except I'd split the consumption into two: > > > - one method to pull the binary event out, which knows how long it > > > ought to be etc.. > > > - one method to convert the binary event to ASCII > > > > > In conjunction with the previous email on this thread > > (http://lkml.org/lkml/2008/9/22/160), may I suggest > > the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be: > > > > relay_printk(<some struct with default filenames/pathnames>, <string>, > > ....) 
; > > relay_dump(<some struct with default filenames/pathnames>, <binary > > data>); > > and > > relay_cleanup_all(<the struct name>); - Single interface that cleans up > > all files/directories/output data created under a logical entity. > > Dude, relayfs is such a bad performing mess that extending it seems like > a bad idea. Better to write something new and delete everything relayfs > related. > LTTng only uses relay for buffer mapping and mmap to userspace. The rest of internal buffer management is done within LTTng by overriding relay callbacks. One thing we could think of is to incrementally fix relay rather than deleting it completely. > Also, it seems prudent to separate the ring-buffer implementation from > the event encoding/decoding facilities. > Sure, but still I think both are needed, even if they are separated as two different layers (as they should). Mathieu > > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-22 14:45 ` Peter Zijlstra 2008-09-22 16:29 ` Martin Bligh 2008-09-23 2:49 ` Mathieu Desnoyers @ 2008-09-23 5:25 ` Tom Zanussi 2008-09-23 9:31 ` Peter Zijlstra ` (2 more replies) 2008-09-23 5:27 ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi ` (2 subsequent siblings) 5 siblings, 3 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-23 5:25 UTC (permalink / raw) To: Peter Zijlstra Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder On Mon, 2008-09-22 at 16:45 +0200, Peter Zijlstra wrote: > On Mon, 2008-09-22 at 19:37 +0530, K.Prasad wrote: > > > > > INPUT_FUNCTIONS > > > > --------------- > > > > > > > > allocate_buffer (name, size) > > > > return buffer_handle > > > > > > > > register_event (buffer_handle, event_id, print_function) > > > > You can pass in a requested event_id from a fixed set, and > > > > will be given it, or an error > > > > 0 means allocate me one dynamically > > > > returns event_id (or -E_ERROR) > > > > > > > > record_event (buffer_handle, event_id, length, *buf) > > > > > > I'd hoped for an interface like: > > > > > > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size); > > > void ringbuffer_free(struct ringbuffer *buffer); > > > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size); > > > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size); > > > > > > On top of which you'd do the event thing, the register event with a > > > callback idea makes sense, except I'd split the consumption into two: > > > - one method to pull the binary event out, which knows how long it > > > ought to be etc.. 
> > > - one method to convert the binary event to ASCII
> > >
> > In conjunction with the previous email on this thread
> > (http://lkml.org/lkml/2008/9/22/160), may I suggest
> > the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> >
> > relay_printk(<some struct with default filenames/pathnames>, <string>,
> > ....) ;
> > relay_dump(<some struct with default filenames/pathnames>, <binary
> > data>);
> > and
> > relay_cleanup_all(<the struct name>); - Single interface that cleans up
> > all files/directories/output data created under a logical entity.
>
> Dude, relayfs is such a bad performing mess that extending it seems like
> a bad idea. Better to write something new and delete everything relayfs
> related.

Hmm, I haven't seen complaints lately about relayfs being 'bad
performing'. The write/reserve functions are pretty fast - they don't
do much else in the fast path other than update an index, but if they're
still too slow, please let me know how to make them faster. In any case,
I'll post a couple of patches in a few minutes that give complete control
over the write path for anyone who doesn't want to be hampered by the
existing versions.

As for the interface, yeah, it has gathered some cruft over time and has
turned out to be too complex for most people. The reason a lot of that
complexity is there in the first place though, ironically, is that it
was put there in explicit support of the requirements of LTT/LTTng
(sub-buffers, padding, mmap, etc), which supposedly represented the
needs of all 'industrial-strength' tracers at the time.

Well, four years after the 'troll merge' that initially got relayfs
streamlined and into the kernel, in anticipation of a soon-to-follow
streamlined LTT/LTTng which has yet to emerge, apparently those
requirements are no longer valid and neither LTTng nor anything else
needs the capabilities of relayfs. That's fine, if it isn't needed, it
isn't needed.
But since it no longer has to conform to the requirements of any
imaginary tracer, maybe it should be put through yet another
streamlining effort and everything that's not required to support
current users removed:

- get rid of anything having to do with padding, nobody needs it and its
only effect has been to horribly distort and complicate a lot of the
code
- get rid of sub-buffers, they just cause confusion
- get rid of mmap, nobody uses it
- no sub-buffers and no mmap support means we can get rid of most of the
callbacks, and a lot of API confusion along with them
- add relay flags - they probably should have been used from the
beginning and options made explicit instead of being shoehorned into the
callback functions.

Going even further, why not just replace the current write functions
with versions that write into pages and SPLICE_F_MOVE them to their
destination - normally userspace doesn't want to see the data anyway -
and get rid of everything else. Add support for splice_write() and
maybe you have an elegant way to do userspace tracing (via vmsplice)
too.

Another source of complexity has turned out to be the removal of the
'fs' part of relayfs - it basically meant adding callback hooks so relay
files could be used in other pseudo filesystems, which is great, but it
further complicated the API and scared away users. We could add back
the fs part, but that would be going backwards, so those callbacks at
least would have to stay I guess.

Well, I'll post some patches shortly for a few of these things, but I
doubt I'll do much more than that, since on the one hand I only have a
few nights a week to work on this stuff and it's become a not-very-fun
hobby, and since I think you guys have already decided on the way
forward and anything I post would be removed soon anyway.
As for the relay_printk() etc stuff, the part that adds the common code from blktrace for all tracers would definitely be a benefit, but I still don't think it goes far enough in providing generic trace control - see e.g. the kmemtrace-on-utt code where I still had to add code to add a bunch of control files - it would be nice to have a standard and easy way to do that. For the printk() functionality itself, we submitted something similar a year ago (dti_printk) and nobody was interested: http://lwn.net/Articles/240330/ http://dti.sourceforge.net/ I told the folks in charge at IBM then that doing that kind of in-kernel filtering and sorting might be interesting and useful for ad hoc kernel hacking, but was basically a sideshow; the really useful part of the blktrace tracing code and 90% of the work needed to make it into a generically usable tracing system wasn't in the kernel at all, but in the unglamorous userspace code that did the streaming and display of the trace data via disk/network/live, etc. Eventually I did go ahead and do that 90%, which wasn't a small task, and now anyone can use the blktrace code for generic tracing: http://utt.sourceforge.net/ I can't say I did it justice, but it does work, and in fact, it didn't take much time at all to convert the kmemtrace code to using it: http://utt.sourceforge.net/kmemtrace-utt-kernel.patch http://utt.sourceforge.net/kmemtrace-utt-user.patch It should also be pretty straightforward to extend it to handle the output from any number of trace sources as has been mentioned, assuming you have a common sequencing source, so regardless of what you guys end up replacing relayfs with, you might consider using it anyway... > > Also, it seems prudent to separate the ring-buffer implementation from > the event encoding/decoding facilities. > > > ^ permalink raw reply [flat|nested] 122+ messages in thread
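Peter's proposed ringbuffer_alloc() / ringbuffer_free() /
ringbuffer_write() / ringbuffer_read() interface, quoted in the message
above, could look roughly like the following in userspace. This is only a
sketch under assumptions: a single buffer stands in for the per-CPU
buffers, so the cpu argument is ignored; writes that do not fit are
rejected rather than overwriting old data; and the struct layout and
return conventions are invented for illustration, not taken from any
posted patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; a kernel version would be per-CPU. */
struct ringbuffer {
	const char *name;
	size_t size;	/* capacity in bytes */
	size_t start;	/* index of the oldest unread byte */
	size_t count;	/* number of unread bytes */
	char *data;
};

struct ringbuffer *ringbuffer_alloc(const char *name, size_t size)
{
	struct ringbuffer *rb;

	if (size == 0)
		return NULL;
	rb = malloc(sizeof(*rb));
	if (!rb)
		return NULL;
	rb->data = malloc(size);
	if (!rb->data) {
		free(rb);
		return NULL;
	}
	rb->name = name;
	rb->size = size;
	rb->start = 0;
	rb->count = 0;
	return rb;
}

void ringbuffer_free(struct ringbuffer *rb)
{
	if (rb) {
		free(rb->data);
		free(rb);
	}
}

/* Assumed convention: 0 on success, -1 if the data does not fit. */
int ringbuffer_write(struct ringbuffer *rb, const char *buf, size_t size)
{
	size_t i;

	if (size > rb->size - rb->count)
		return -1;
	for (i = 0; i < size; i++)
		rb->data[(rb->start + rb->count + i) % rb->size] = buf[i];
	rb->count += size;
	return 0;
}

/* Assumed convention: returns the number of bytes copied out. */
int ringbuffer_read(struct ringbuffer *rb, int cpu, char *buf, size_t size)
{
	size_t i;

	(void)cpu;	/* single buffer in this model */
	if (size > rb->count)
		size = rb->count;
	for (i = 0; i < size; i++)
		buf[i] = rb->data[(rb->start + i) % rb->size];
	rb->start = (rb->start + size) % rb->size;
	rb->count -= size;
	return (int)size;
}
```

Event registration and the binary-to-ASCII conversion Peter mentions would
sit on top of these four calls and are deliberately left out here.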
* Re: Unified tracing buffer
2008-09-23 5:25 ` Tom Zanussi
@ 2008-09-23 9:31 ` Peter Zijlstra
2008-09-23 18:13 ` Mathieu Desnoyers
2008-09-23 13:50 ` Mathieu Desnoyers
2008-09-23 14:00 ` Martin Bligh
2 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-23 9:31 UTC (permalink / raw)
To: Tom Zanussi
Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder

On Tue, 2008-09-23 at 00:25 -0500, Tom Zanussi wrote:

> - get rid of anything having to do with padding, nobody needs it and its
> only effect has been to horribly distort and complicate a lot of the
> code
> - get rid of sub-buffers, they just cause confusion
> - get rid of mmap, nobody uses it
> - no sub-buffers and no mmap support means we can get rid of most of the
> callbacks, and a lot of API confusion along with them
> - add relay flags - they probably should have been used from the
> beginning and options made explicit instead of being shoehorned into the
> callback functions.

- get rid of the vmap buffers as they cause tlb pressure and eat up
precious vspace on 32 bit platforms.

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 9:31 ` Peter Zijlstra @ 2008-09-23 18:13 ` Mathieu Desnoyers 2008-09-23 18:33 ` Christoph Lameter 0 siblings, 1 reply; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 18:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Tom Zanussi, prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, Christoph Lameter, linux-mm * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: > On Tue, 2008-09-23 at 00:25 -0500, Tom Zanussi wrote: > > > - get rid of anything having to do with padding, nobody needs it and its > > only affect has been to horribly distort and complicate a lot of the > > code > > - get rid of sub-buffers, they just cause confusion > > - get rid of mmap, nobody uses it > > - no sub-buffers and no mmap support means we can get rid of most of the > > callbacks, and a lot of API confusion along with them > > - add relay flags - they probably should have been used from the > > beginning and options made explicit instead of being shoehorned into the > > callback functions. > > - get rid of the vmap buffers as they cause tlb pressure and eat up > precious vspace on 32 bit platforms. > Although I agree on the basic idea, namely to use a sane amount of TLB entries for tracing, I disagree on the way proposed to reach this goal. Such memory management concerns belong to the mm field and should not be done "oh so cleverly" by a buffer management infrastructure in the back of the kernel memory management infrastructure. I think we should instead try to figure out what is currently missing in the kernel vmap mechanism (probably the ability to vmap from large 4MB pages after boot), and fix _that_ instead (if possible), which would not only benefit to tracing, but also to module support. 
Also, I would like to keep a contiguous address mapping within buffers so we could keep the buffer read/write code as simple as possible, leveraging the existing CPU MM unit. I added Christoph Lameter to the CC list, he always comes with clever ideas. :) Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 18:13 ` Mathieu Desnoyers @ 2008-09-23 18:33 ` Christoph Lameter 2008-09-23 18:56 ` Linus Torvalds 0 siblings, 1 reply; 122+ messages in thread From: Christoph Lameter @ 2008-09-23 18:33 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Tom Zanussi, prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder, linux-mm Mathieu Desnoyers wrote: > > I think we should instead try to figure out what is currently missing in > the kernel vmap mechanism (probably the ability to vmap from large 4MB > pages after boot), and fix _that_ instead (if possible), which would not > only benefit to tracing, but also to module support. With some custom code one can vmap 2MB pages on x86. See the VMEMMAP support in the x86 arch. The code in mm/sparse-vmemmap.c could be abstracted for a general 2MB mapping API to reduce TLB pressure for the buffers. If there are concerns about fragmentation then one could fallback to 4kb TLBs. See the virtualizable compound page patchset which does something similar. > I added Christoph Lameter to the CC list, he always comes with clever > ideas. :) Oh mostly we are just recycling the old ideas. ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 18:33 ` Christoph Lameter
@ 2008-09-23 18:56 ` Linus Torvalds
0 siblings, 0 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 18:56 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mathieu Desnoyers, Peter Zijlstra, Tom Zanussi, prasad,
Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch,
David Wilder, linux-mm

On Tue, 23 Sep 2008, Christoph Lameter wrote:

> Mathieu Desnoyers wrote:
> >
> > I think we should instead try to figure out what is currently missing in
> > the kernel vmap mechanism (probably the ability to vmap from large 4MB
> > pages after boot), and fix _that_ instead (if possible), which would not
> > only benefit to tracing, but also to module support.

No. Don't go there. Piece of absolute shit.

The problem with VMAP is that it's _limited_. We don't have reasonable
virtual address space holes for x86-32.

The other is that physically contiguous buffers are hard to come by.
Certainly not an acceptable solution.

The third is that if you have multiple buffers, you need to look them up
in software anyway, so the whole notion of mis-using the TLB to avoid a
software lookup is TOTAL CRAP.

Don't do virtual mapping. IT IS BROKEN. IT IS A TOTAL AND UTTER PIECE OF
SHIT.

I will absolutely not take any general-purpose tracing code if I'm aware
of it mis-using the TLB to play games.

Linus

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 5:25 ` Tom Zanussi 2008-09-23 9:31 ` Peter Zijlstra @ 2008-09-23 13:50 ` Mathieu Desnoyers 2008-09-23 14:00 ` Martin Bligh 2 siblings, 0 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 13:50 UTC (permalink / raw) To: Tom Zanussi Cc: Peter Zijlstra, prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder * Tom Zanussi (zanussi@comcast.net) wrote: > - get rid of anything having to do with padding, nobody needs it and its > only affect has been to horribly distort and complicate a lot of the > code > - get rid of sub-buffers, they just cause confusion > - get rid of mmap, nobody uses it LTTng uses relay mmap. That's about the only feature of relay it uses along with memory allocation. It however implements its own buffer management mechanism with poll() and ioctl GET_SUBBUF/PUT_SUBBUF to read subbuffers. But these ops are all within LTTng. BTW it would be good to change relay so it can take a buffer pointer as input for relay_open. That would help getting memory mapped in the linear mapping to be used for tracing when known at boot time. Mathieu > - no sub-buffers and no mmap support means we can get rid of most of the > callbacks, and a lot of API confusion along with them > - add relay flags - they probably should have been used from the > beginning and options made explicit instead of being shoehorned into the > callback functions. > > Going even further, why not just replace the current write functions > with versions that write into pages and SPLICE_F_MOVE them to their > destination - normally userspace doesn't want to see the data anyway - > and get rid of everything else. Add support for splice_write() and > maybe you have an elegant way to do userspace tracing (via vmsplice) > too. Sounds interesting. So then vmsplice would be used to support sending trace data over the network or to disk ? 
> > Another source of complexity has turned out to be the removal of the > 'fs' part of relayfs - it basically meant adding callback hooks so relay > files could be used in other pseudo filesystems, which is great, but it > further complicated the API and scared away users. We could add back > the fs part, but that would be going backwards, so those callbacks at > least would have to stay I guess. > > Well, I'll post some patches shortly for a few of these things, but I > doubt I'll do much more than that, since on the one hand I only have a > few nights a week to work on this stuff and it's become a not-very-fun > hobby, and since I think you guys have already decided on the way > forward and anything I post would be removed soon anyway. > I am not sure of that. I think there is some room for relay improvements we could work on. As for the mechanism used to insure data coherency, I think relay does not provide any. Could it be changed to an interrupt disable+spinlock ? Then, in a second phase, we can optimize it by using a lockless mechanism like LTTng does. > As for the relay_printk() etc stuff, the part that adds the common code > from blktrace for all tracers would definitely be a benefit, but I still > don't think it goes far enough in providing generic trace control - see > e.g. the kmemtrace-on-utt code where I still had to add code to add a > bunch of control files - it would be nice to have a standard and easy > way to do that. 
> For the printk() functionality itself, we submitted
> something similar a year ago (dti_printk) and nobody was interested:
>
> http://lwn.net/Articles/240330/
> http://dti.sourceforge.net/
>
> I told the folks in charge at IBM then that doing that kind of in-kernel
> filtering and sorting might be interesting and useful for ad hoc kernel
> hacking, but was basically a sideshow; the really useful part of the
> blktrace tracing code and 90% of the work needed to make it into a
> generically usable tracing system wasn't in the kernel at all, but in
> the unglamorous userspace code that did the streaming and display of the
> trace data via disk/network/live, etc. Eventually I did go ahead and do
> that 90%, which wasn't a small task, and now anyone can use the blktrace
> code for generic tracing:
>
> http://utt.sourceforge.net/
>
> I can't say I did it justice, but it does work, and in fact, it didn't
> take much time at all to convert the kmemtrace code to using it:
>
> http://utt.sourceforge.net/kmemtrace-utt-kernel.patch
> http://utt.sourceforge.net/kmemtrace-utt-user.patch
>
> It should also be pretty straightforward to extend it to handle the
> output from any number of trace sources as has been mentioned, assuming
> you have a common sequencing source, so regardless of what you guys end
> up replacing relayfs with, you might consider using it anyway...
>

I did the same with LTTV :) Writing userspace tools, including GUIs and
everything, can be quite a big task.

It would be good to keep a layered infrastructure in mind.
A bit like network packet encapsulation, we could have :

Layer 2 : Event payload (dealt by a unified event encoding infrastructure)
  - Structure defined by event ID/type mapping table
Layer 1 : Events (dealt by a unified buffer layout infrastructure)
  - Event header
    - Timestamp
    - Event ID
    - Event size
Layer 0 : Buffers (dealt by a unified buffering infrastructure)
  - Buffer header
  - Subbuffers

Mathieu

> >
> > Also, it seems prudent to separate the ring-buffer implementation from
> > the event encoding/decoding facilities.
> >
> >
> >
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 122+ messages in thread
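To make the Layer 1 idea from the message above concrete, here is a
minimal userspace sketch of an event header (timestamp, event ID, event
size) wrapped around an opaque Layer 2 payload. The field widths, the
12-byte packed layout, the native byte order, and the event_encode() /
event_decode() names are all assumptions made for illustration; the
16-bit ID only mirrors the "Event ids are 16 bit" line in the original
proposal, and nothing here is taken from LTTng or relay.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layer 1: hypothetical event header. */
struct event_header {
	uint64_t timestamp;
	uint16_t event_id;
	uint16_t size;		/* payload length in bytes */
};

/* Packed on-buffer size: 8 + 2 + 2 bytes, no struct padding. */
#define EVENT_HEADER_BYTES 12

/* Writes header + payload to dst; returns bytes used, or 0 if no room. */
size_t event_encode(uint8_t *dst, size_t room,
		    const struct event_header *hdr, const void *payload)
{
	size_t total = EVENT_HEADER_BYTES + (size_t)hdr->size;

	if (total > room)
		return 0;
	memcpy(dst, &hdr->timestamp, 8);
	memcpy(dst + 8, &hdr->event_id, 2);
	memcpy(dst + 10, &hdr->size, 2);
	memcpy(dst + EVENT_HEADER_BYTES, payload, hdr->size);
	return total;
}

/*
 * Reads one event from src. The payload pointer refers into src and is
 * left uninterpreted: decoding it is Layer 2's job. Returns bytes
 * consumed, or 0 if the record is truncated.
 */
size_t event_decode(const uint8_t *src, size_t avail,
		    struct event_header *hdr, const void **payload)
{
	if (avail < EVENT_HEADER_BYTES)
		return 0;
	memcpy(&hdr->timestamp, src, 8);
	memcpy(&hdr->event_id, src + 8, 2);
	memcpy(&hdr->size, src + 10, 2);
	if (avail - EVENT_HEADER_BYTES < hdr->size)
		return 0;
	*payload = src + EVENT_HEADER_BYTES;
	return EVENT_HEADER_BYTES + (size_t)hdr->size;
}
```

Because the reader only needs the size field to skip a record, a Layer 0
consumer can walk a buffer without knowing any event types.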
* Re: Unified tracing buffer
2008-09-23 5:25 ` Tom Zanussi
2008-09-23 9:31 ` Peter Zijlstra
2008-09-23 13:50 ` Mathieu Desnoyers
@ 2008-09-23 14:00 ` Martin Bligh
2008-09-23 17:55 ` K.Prasad
2008-09-24 3:50 ` Tom Zanussi
2 siblings, 2 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 14:00 UTC (permalink / raw)
To: Tom Zanussi
Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder

> - get rid of anything having to do with padding, nobody needs it and its
> only effect has been to horribly distort and complicate a lot of the
> code
> - get rid of sub-buffers, they just cause confusion
> - get rid of mmap, nobody uses it
> - no sub-buffers and no mmap support means we can get rid of most of the
> callbacks, and a lot of API confusion along with them
> - add relay flags - they probably should have been used from the
> beginning and options made explicit instead of being shoehorned into the
> callback functions.

Actually, I think if you did all that, it'd be pretty close to what we
want anyway ...

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:00 ` Martin Bligh @ 2008-09-23 17:55 ` K.Prasad 2008-09-23 18:27 ` Martin Bligh 2008-09-24 3:50 ` Tom Zanussi 1 sibling, 1 reply; 122+ messages in thread From: K.Prasad @ 2008-09-23 17:55 UTC (permalink / raw) To: Martin Bligh, Tom Zanussi, Mathieu Desnoyers Cc: Peter Zijlstra, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder On Tue, Sep 23, 2008 at 07:00:38AM -0700, Martin Bligh wrote: > > - get rid of anything having to do with padding, nobody needs it and its > > only affect has been to horribly distort and complicate a lot of the > > code > > - get rid of sub-buffers, they just cause confusion > > - get rid of mmap, nobody uses it > > - no sub-buffers and no mmap support means we can get rid of most of the > > callbacks, and a lot of API confusion along with them > > - add relay flags - they probably should have been used from the > > beginning and options made explicit instead of being shoehorned into the > > callback functions. > > Actually, I think if you did all that, it'd be pretty close to what we > want anyway ... In the perspective of having a layered infrastructure, can we consider the interfaces later added over relay (to be used as a wrapper), namely relay_printk() and relay_dump()? Also add the following features to it and we get close to the functionality that is sought: - Add callbacks to append fine-granular timestamp information depending upon user's requirement - Ability to provision more custom-defined control files that can suit independent tracer's requirements These interfaces already come along with the following features (some repetition here from my previous email for the sake of completeness): - Very minimal work required to log data using the interfaces. Usage is made simple to resemble the printk(). 
Like

	struct relay_printk_data *tpk;
	tpk->parent_dir = "PARENT";
	tpk->dir = "DIR";
	relay_printk(tpk, <String to be output>);
	relay_dump(tpk, <Some binary data to output>);

Output at: <debugfs_mount>/PARENT/DIR/<TRACE FILES>

- Assumes default values for most tunables, such as per-CPU buffer size,
relay-flags (such as global vs local per-cpu buffers, flight recorder
vs overwrite mode), thus reducing the work required for setting up
these interfaces. They can be over-written in case of advanced needs.
- Well defined control operations to start, stop tracing operations,
status files to indicate buffer overflow, etc.
- Given the recent patches that Tom Zanussi has sent to bring in the
erstwhile 'trace' functionality into relay itself, there can be a lot
of code-reduction in relay.c (in -mm) thereby leading to a light-weight
implementation.

Thanks,
K.Prasad

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 17:55 ` K.Prasad
@ 2008-09-23 18:27 ` Martin Bligh
0 siblings, 0 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 18:27 UTC (permalink / raw)
To: prasad
Cc: Tom Zanussi, Mathieu Desnoyers, Peter Zijlstra,
Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
Andrew Morton, hch, David Wilder

On Tue, Sep 23, 2008 at 10:55 AM, K.Prasad <prasad@linux.vnet.ibm.com> wrote:
> On Tue, Sep 23, 2008 at 07:00:38AM -0700, Martin Bligh wrote:
>> > - get rid of anything having to do with padding, nobody needs it and its
>> > only effect has been to horribly distort and complicate a lot of the
>> > code
>> > - get rid of sub-buffers, they just cause confusion
>> > - get rid of mmap, nobody uses it
>> > - no sub-buffers and no mmap support means we can get rid of most of the
>> > callbacks, and a lot of API confusion along with them
>> > - add relay flags - they probably should have been used from the
>> > beginning and options made explicit instead of being shoehorned into the
>> > callback functions.
>>
>> Actually, I think if you did all that, it'd be pretty close to what we
>> want anyway ...
>
> In the perspective of having a layered infrastructure, can we consider
> the interfaces later added over relay (to be used as a wrapper), namely
> relay_printk() and relay_dump()?

Might well work, but let's see what relayfs comes out looking like.
If it's heavily simplified, hopefully people will like it.

> - Very minimal work required to log data using the interfaces. Usage is
> made simple to resemble the printk(). Like
>
> struct relay_printk_data *tpk;
> tpk->parent_dir = "PARENT";
> tpk->dir = "DIR";
> relay_printk(tpk, <String to be output>);
> relay_dump(tpk, <Some binary data to output>);

You really don't want to store strings in the buffer, it's horribly
inefficient.
I think the intent was to store binary data from tagged events, along with the format strings, and do all the expansion later. ^ permalink raw reply [flat|nested] 122+ messages in thread
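The approach Martin describes (binary data from tagged events in the
buffer, format strings kept alongside, expansion done later) can be
sketched in userspace as follows. Everything here is hypothetical: the
register_format() / record_event() / expand_event() names, the fixed
table sizes, and the restriction to unsigned integer arguments are
simplifications for illustration, not part of any proposed kernel API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 16	/* hypothetical limits for this sketch */
#define MAX_ARGS 4

/* Format strings live outside the trace buffer, one per event id. */
static const char *event_formats[MAX_EVENTS];

/* What actually lands in the buffer: an id plus raw binary arguments. */
struct raw_event {
	unsigned short id;
	unsigned int args[MAX_ARGS];
};

int register_format(unsigned short id, const char *fmt)
{
	if (id >= MAX_EVENTS)
		return -1;
	event_formats[id] = fmt;
	return 0;
}

/* Fast path: copy a few integers, no string formatting at all. */
void record_event(struct raw_event *ev, unsigned short id,
		  const unsigned int *args, int nargs)
{
	int i;

	ev->id = id;
	for (i = 0; i < MAX_ARGS; i++)
		ev->args[i] = (i < nargs) ? args[i] : 0;
}

/*
 * Slow path, run when the trace is read: expand the record with its
 * format string. Formats may use at most MAX_ARGS "%u" conversions;
 * arguments beyond what the format consumes are ignored by snprintf().
 */
int expand_event(const struct raw_event *ev, char *out, size_t len)
{
	if (ev->id >= MAX_EVENTS || !event_formats[ev->id])
		return -1;
	return snprintf(out, len, event_formats[ev->id],
			ev->args[0], ev->args[1], ev->args[2], ev->args[3]);
}
```

The point of the split is that the recording side touches only a handful
of words per event, while the comparatively expensive formatting happens
once, on the consumer side.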
* Re: Unified tracing buffer 2008-09-23 14:00 ` Martin Bligh 2008-09-23 17:55 ` K.Prasad @ 2008-09-24 3:50 ` Tom Zanussi 2008-09-24 5:42 ` K.Prasad ` (9 more replies) 1 sibling, 10 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-24 3:50 UTC (permalink / raw) To: Martin Bligh Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder On Tue, 2008-09-23 at 07:00 -0700, Martin Bligh wrote: > > - get rid of anything having to do with padding, nobody needs it and its > > only affect has been to horribly distort and complicate a lot of the > > code > > - get rid of sub-buffers, they just cause confusion > > - get rid of mmap, nobody uses it > > - no sub-buffers and no mmap support means we can get rid of most of the > > callbacks, and a lot of API confusion along with them > > - add relay flags - they probably should have been used from the > > beginning and options made explicit instead of being shoehorned into the > > callback functions. > > Actually, I think if you did all that, it'd be pretty close to what we > want anyway ... OK, then, I'll continue with the cleanup patchset and see where it goes... Tom ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-24 3:50 ` Tom Zanussi @ 2008-09-24 5:42 ` K.Prasad 2008-09-25 6:07 ` [RFC PATCH 0/8] current relay cleanup patchset Tom Zanussi ` (8 subsequent siblings) 9 siblings, 0 replies; 122+ messages in thread From: K.Prasad @ 2008-09-24 5:42 UTC (permalink / raw) To: Tom Zanussi Cc: Martin Bligh, Peter Zijlstra, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder On Tue, Sep 23, 2008 at 10:50:15PM -0500, Tom Zanussi wrote: > > On Tue, 2008-09-23 at 07:00 -0700, Martin Bligh wrote: > > > - get rid of anything having to do with padding, nobody needs it and its > > > only affect has been to horribly distort and complicate a lot of the > > > code > > > - get rid of sub-buffers, they just cause confusion > > > - get rid of mmap, nobody uses it > > > - no sub-buffers and no mmap support means we can get rid of most of the > > > callbacks, and a lot of API confusion along with them > > > - add relay flags - they probably should have been used from the > > > beginning and options made explicit instead of being shoehorned into the > > > callback functions. > > > > Actually, I think if you did all that, it'd be pretty close to what we > > want anyway ... > > OK, then, I'll continue with the cleanup patchset and see where it > goes... > > Tom > Hi Tom, Kindly let us know if the patches are available in some downloadable location or have been maintained in a git tree. I'm planning to re-base the relay_* interfaces (erstwhile 'trace' code), to work on top of your patches. Thanks, K.Prasad ^ permalink raw reply [flat|nested] 122+ messages in thread
* [RFC PATCH 0/8] current relay cleanup patchset 2008-09-24 3:50 ` Tom Zanussi 2008-09-24 5:42 ` K.Prasad @ 2008-09-25 6:07 ` Tom Zanussi 2008-09-25 6:07 ` [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional Tom Zanussi ` (7 subsequent siblings) 9 siblings, 0 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw) To: Martin Bligh Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Here's the current relay cleanup patchset. The first two patches make the write path completely replaceable, the third adds flags along with some related cleanup, and the next 5 remove the padding in several stages. It's a work in progress, but because I wanted the intermediate stages to actually work and not break anything, some of these patches, especially 05, are just temporary and will be removed in the next iteration. I didn't have time to clean up the first 3 either - I'll also do that the next time around. Anyway, removing the padding has simplified the read/splice code significantly; it's always been a source of headaches so I'm glad it's gone, and it doesn't seem to have broken anything - a quick test using blktrace in both read() and splice() modes didn't show any problems. In the next round I plan to do vmap and sub-buffer removal. Tom ^ permalink raw reply [flat|nested] 122+ messages in thread
* [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional.
2008-09-24 3:50 ` Tom Zanussi
2008-09-24 5:42 ` K.Prasad
2008-09-25 6:07 ` [RFC PATCH 0/8] current relay cleanup patchset Tom Zanussi
@ 2008-09-25 6:07 ` Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable Tom Zanussi
` (6 subsequent siblings)
9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw)
To: Martin Bligh
Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Clean up relay_switch_subbuf() and make waking up consumers optional.

Over time, relay_switch_subbuf() has accumulated some cruft - this
patch cleans it up and at the same time makes some of it available as
common functions that any subbuf-switch implementor would need (this
is partially in preparation for the next patch, which makes the
subbuf-switch function completely replaceable). It also removes the
hard-coded reader wakeup and moves it into a replaceable callback
called wakeup_readers(); this allows any given tracer to implement
consumer notification as it sees fit.
---
 include/linux/relay.h |   51 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/relay.c        |   43 +++++++++++++++++++++++------------------
 2 files changed, 75 insertions(+), 19 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 953fc05..2242004 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -159,6 +159,15 @@ struct rchan_callbacks
 	 * The callback should return 0 if successful, negative if not.
 	 */
 	int (*remove_buf_file)(struct dentry *dentry);
+
+	/*
+	 * wakeup_readers - sub-buffer was switched, let readers know
+	 * @buf: the channel buffer
+	 *
+	 * Called during sub-buffer switch. Users who don't want any
+	 * wakeups should implement an empty version.
+ */ + void (*wakeup_readers)(struct rchan_buf *buf); }; /* @@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length); /** + * relay_event_toobig - is event too big to fit in a sub-buffer? + * @buf: relay channel buffer + * @length: length of event + * + * Returns 1 if too big, 0 otherwise. + * + * switch_subbuf() helper function + */ +static inline int relay_event_toobig(struct rchan_buf *buf, size_t length) +{ + return length > buf->chan->subbuf_size; +} + +/** + * relay_update_filesize - add to filesize of relay file + * @buf: relay channel buffer + * @length: length to add + * + * switch_subbuf() helper function + */ +static inline void relay_update_filesize(struct rchan_buf *buf, size_t length) +{ + if (buf->dentry) + buf->dentry->d_inode->i_size += length; + else + buf->early_bytes += length; + + smp_mb(); +} + +/** + * relay_inc_produced - add 1 to buf->produced + * @buf: relay channel buffer + * + * switch_subbuf() helper function + */ +static inline void relay_inc_produced(struct rchan_buf *buf) +{ + buf->subbufs_produced++; +} + +/** * relay_write - write data into the channel * @chan: relay channel * @data: data to be written diff --git a/kernel/relay.c b/kernel/relay.c index 8d13a78..7d588fe 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -324,6 +324,21 @@ static int remove_buf_file_default_callback(struct dentry *dentry) return -EINVAL; } +/* + * wakeup_readers() default callback. + */ +static void wakeup_readers_default_callback(struct rchan_buf *buf) +{ + if (waitqueue_active(&buf->read_wait)) + /* + * Calling wake_up_interruptible() from here + * will deadlock if we happen to be logging + * from the scheduler (trying to re-grab + * rq->lock), so defer it. 
+ */ + __mod_timer(&buf->timer, jiffies + 1); +} + /* relay channel default callbacks */ static struct rchan_callbacks default_channel_callbacks = { .subbuf_start = subbuf_start_default_callback, @@ -331,6 +346,7 @@ static struct rchan_callbacks default_channel_callbacks = { .buf_unmapped = buf_unmapped_default_callback, .create_buf_file = create_buf_file_default_callback, .remove_buf_file = remove_buf_file_default_callback, + .wakeup_readers = wakeup_readers_default_callback, }; /** @@ -508,6 +524,8 @@ static void setup_callbacks(struct rchan *chan, cb->create_buf_file = create_buf_file_default_callback; if (!cb->remove_buf_file) cb->remove_buf_file = remove_buf_file_default_callback; + if (!cb->wakeup_readers) + cb->wakeup_readers = wakeup_readers_default_callback; chan->cb = cb; } @@ -726,30 +744,17 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) void *old, *new; size_t old_subbuf, new_subbuf; - if (unlikely(length > buf->chan->subbuf_size)) + if (unlikely(relay_event_toobig(buf, length))) goto toobig; if (buf->offset != buf->chan->subbuf_size + 1) { buf->prev_padding = buf->chan->subbuf_size - buf->offset; old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; buf->padding[old_subbuf] = buf->prev_padding; - buf->subbufs_produced++; - if (buf->dentry) - buf->dentry->d_inode->i_size += - buf->chan->subbuf_size - - buf->padding[old_subbuf]; - else - buf->early_bytes += buf->chan->subbuf_size - - buf->padding[old_subbuf]; - smp_mb(); - if (waitqueue_active(&buf->read_wait)) - /* - * Calling wake_up_interruptible() from here - * will deadlock if we happen to be logging - * from the scheduler (trying to re-grab - * rq->lock), so defer it. 
- */ - __mod_timer(&buf->timer, jiffies + 1); + relay_inc_produced(buf); + relay_update_filesize(buf, buf->chan->subbuf_size - + buf->padding[old_subbuf]); + buf->chan->cb->wakeup_readers(buf); } old = buf->data; @@ -763,7 +768,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) buf->data = new; buf->padding[new_subbuf] = 0; - if (unlikely(length + buf->offset > buf->chan->subbuf_size)) + if (unlikely(relay_event_toobig(buf, length + buf->offset))) goto toobig; return length; -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
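The optional-callback pattern that the patch above applies to
wakeup_readers() (install a default when the tracer leaves the hook
NULL, let a tracer silence wakeups with an empty function) can be
modelled in userspace like this. The model_* names and the counter that
stands in for the deferred timer wakeup are assumptions for
illustration; this is not the kernel code, which operates on struct
rchan_buf and calls __mod_timer().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct rchan_buf; only counts wakeups. */
struct model_buf {
	int wakeups;
};

/* Stand-in for struct rchan_callbacks with the one new hook. */
struct model_callbacks {
	void (*wakeup_readers)(struct model_buf *buf);
};

/* Default: pretend to schedule a reader wakeup. */
void model_wakeup_default(struct model_buf *buf)
{
	buf->wakeups++;
}

/* A tracer that wants no wakeups supplies an empty version. */
void model_wakeup_none(struct model_buf *buf)
{
	(void)buf;
}

/* Fill in any hook the tracer did not provide. */
void model_setup_callbacks(struct model_callbacks *cb)
{
	if (!cb->wakeup_readers)
		cb->wakeup_readers = model_wakeup_default;
}

/* The switch path calls through the hook unconditionally. */
void model_switch_subbuf(struct model_buf *buf,
			 const struct model_callbacks *cb)
{
	cb->wakeup_readers(buf);
}
```

Filling in the default at setup time keeps the switch path free of NULL
checks, which matters because that path runs on every sub-buffer switch.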
* [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable. 2008-09-24 3:50 ` Tom Zanussi ` (2 preceding siblings ...) 2008-09-25 6:07 ` [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional Tom Zanussi @ 2008-09-25 6:07 ` Tom Zanussi 2008-09-25 6:07 ` [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param Tom Zanussi ` (5 subsequent siblings) 9 siblings, 0 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw) To: Martin Bligh Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Make the relay sub-buffer switch code replaceable. With this patch, tracers now have complete control over the relay write (or reserve) path if they choose to do so, by implementing their own version of the sub-buffer switch function (switch_subbuf()), in addition to their own local write/reserve functions. Tracers who choose not to do so automatically default to the normal behavior. --- include/linux/relay.h | 22 +++++++++++++++++----- kernel/relay.c | 13 ++++++++----- 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/include/linux/relay.h b/include/linux/relay.h index 2242004..a1dcfc1 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -168,6 +168,18 @@ struct rchan_callbacks * wakeups should implement an empty version. */ void (*wakeup_readers)(struct rchan_buf *buf); + + /* + * switch_subbuf - sub-buffer switch callback + * @buf: the channel buffer + * @length: size of current event + * + * Returns either the length passed in or 0 if full. + * + * Performs sub-buffer-switch tasks such as updating filesize, + * waking up readers, etc. 
+ */ + size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length); }; /* @@ -191,8 +203,8 @@ extern void relay_subbufs_consumed(struct rchan *chan, extern void relay_reset(struct rchan *chan); extern int relay_buf_full(struct rchan_buf *buf); -extern size_t relay_switch_subbuf(struct rchan_buf *buf, - size_t length); +extern size_t switch_subbuf_default_callback(struct rchan_buf *buf, + size_t length); /** * relay_event_toobig - is event too big to fit in a sub-buffer? @@ -259,7 +271,7 @@ static inline void relay_write(struct rchan *chan, local_irq_save(flags); buf = chan->buf[smp_processor_id()]; if (unlikely(buf->offset + length > chan->subbuf_size)) - length = relay_switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length); memcpy(buf->data + buf->offset, data, length); buf->offset += length; local_irq_restore(flags); @@ -285,7 +297,7 @@ static inline void __relay_write(struct rchan *chan, buf = chan->buf[get_cpu()]; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) - length = relay_switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length); memcpy(buf->data + buf->offset, data, length); buf->offset += length; put_cpu(); @@ -308,7 +320,7 @@ static inline void *relay_reserve(struct rchan *chan, size_t length) struct rchan_buf *buf = chan->buf[smp_processor_id()]; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) { - length = relay_switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length); if (!length) return NULL; } diff --git a/kernel/relay.c b/kernel/relay.c index 7d588fe..f1f55ae 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -347,6 +347,7 @@ static struct rchan_callbacks default_channel_callbacks = { .create_buf_file = create_buf_file_default_callback, .remove_buf_file = remove_buf_file_default_callback, .wakeup_readers = wakeup_readers_default_callback, + .switch_subbuf = switch_subbuf_default_callback, }; /** @@ -526,6 +527,8 @@ static void setup_callbacks(struct rchan *chan, 
cb->remove_buf_file = remove_buf_file_default_callback; if (!cb->wakeup_readers) cb->wakeup_readers = wakeup_readers_default_callback; + if (!cb->switch_subbuf) + cb->switch_subbuf = switch_subbuf_default_callback; chan->cb = cb; } @@ -730,7 +733,7 @@ int relay_late_setup_files(struct rchan *chan, } /** - * relay_switch_subbuf - switch to a new sub-buffer + * switch_subbuf_default_callback - switch to a new sub-buffer * @buf: channel buffer * @length: size of current event * @@ -739,7 +742,7 @@ int relay_late_setup_files(struct rchan *chan, * Performs sub-buffer-switch tasks such as invoking callbacks, * updating padding counts, waking up readers, etc. */ -size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) +size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length) { void *old, *new; size_t old_subbuf, new_subbuf; @@ -777,7 +780,7 @@ toobig: buf->chan->last_toobig = length; return 0; } -EXPORT_SYMBOL_GPL(relay_switch_subbuf); +EXPORT_SYMBOL_GPL(switch_subbuf_default_callback); /** * relay_subbufs_consumed - update the buffer's sub-buffers-consumed count @@ -857,14 +860,14 @@ void relay_flush(struct rchan *chan) return; if (chan->is_global && chan->buf[0]) { - relay_switch_subbuf(chan->buf[0], 0); + chan->cb->switch_subbuf(chan->buf[0], 0); return; } mutex_lock(&relay_channels_mutex); for_each_possible_cpu(i) if (chan->buf[i]) - relay_switch_subbuf(chan->buf[i], 0); + chan->cb->switch_subbuf(chan->buf[i], 0); mutex_unlock(&relay_channels_mutex); } EXPORT_SYMBOL_GPL(relay_flush); -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
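
The pattern in patches 1/8 and 2/8 (an optional callback that falls back to a default when the client leaves it NULL) can be sketched in user space as below. This is only an illustration under stated assumptions: the demo_* names and the simplified structs are hypothetical stand-ins, not the relay API, and the default switch here only resets the offset rather than doing the real sub-buffer bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rchan_buf, reduced to what the demo needs. */
struct demo_buf {
	size_t offset;		/* write offset within the current sub-buffer */
	size_t subbuf_size;	/* size of one sub-buffer */
	size_t switches;	/* number of sub-buffer switches performed */
};

/* Hypothetical stand-in for struct rchan_callbacks. */
struct demo_callbacks {
	size_t (*switch_subbuf)(struct demo_buf *buf, size_t length);
};

/* Default: start a fresh sub-buffer, or return 0 if the event can never fit. */
size_t demo_switch_default(struct demo_buf *buf, size_t length)
{
	if (length > buf->subbuf_size)
		return 0;
	buf->offset = 0;
	buf->switches++;
	return length;
}

/* A tracer-supplied replacement: drop every event that would need a switch. */
size_t demo_switch_drop(struct demo_buf *buf, size_t length)
{
	(void)buf;
	(void)length;
	return 0;
}

/* Same shape as setup_callbacks(): fill in whatever the client left NULL. */
void demo_setup_callbacks(struct demo_callbacks *cb)
{
	if (!cb->switch_subbuf)
		cb->switch_subbuf = demo_switch_default;
}

/* Same shape as the write fast path: only the slow path goes through cb. */
size_t demo_write(struct demo_buf *buf, const struct demo_callbacks *cb,
		  size_t length)
{
	if (buf->offset + length > buf->subbuf_size)
		length = cb->switch_subbuf(buf, length);
	buf->offset += length;
	return length;
}
```

A client that zero-initializes its demo_callbacks gets the default behavior after demo_setup_callbacks(); one that sets .switch_subbuf keeps its own function. The indirect call is only taken when the event does not fit in the current sub-buffer, so the common case stays a compare and an add.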
* [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param.
2008-09-24 3:50 ` Tom Zanussi
` (3 preceding siblings ...)
2008-09-25 6:07 ` [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable Tom Zanussi
@ 2008-09-25 6:07 ` Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve Tom Zanussi
` (4 subsequent siblings)
9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw)
To: Martin Bligh
Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder
relay should probably have had a flags param from the beginning; it
wasn't originally added because it wasn't originally needed - it
probably would have helped avoid some of the callback contortions
that were added due to a lack of flags. This adds them and does a
small amount of low-hanging cleanup, and is also in preparation for
some new flags in future patches.
--- block/blktrace.c | 5 ++--- include/linux/relay.h | 19 ++++++++++--------- kernel/relay.c | 20 ++++++++++---------- virt/kvm/kvm_trace.c | 9 ++++----- 4 files changed, 26 insertions(+), 27 deletions(-) diff --git a/block/blktrace.c b/block/blktrace.c index eb9651c..150c5f7 100644 --- a/block/blktrace.c +++ b/block/blktrace.c @@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry) static struct dentry *blk_create_buf_file_callback(const char *filename, struct dentry *parent, int mode, - struct rchan_buf *buf, - int *is_global) + struct rchan_buf *buf) { return debugfs_create_file(filename, mode, parent, buf, &relay_file_operations); @@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, goto err; bt->rchan = relay_open("trace", dir, buts->buf_size, - buts->buf_nr, &blk_relay_callbacks, bt); + buts->buf_nr, &blk_relay_callbacks, bt, 0UL); if (!bt->rchan) goto err; diff --git a/include/linux/relay.h b/include/linux/relay.h index a1dcfc1..648b4da 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -28,6 +28,12 @@ #define RELAYFS_CHANNEL_VERSION 7 /* + * relay channel flags + */ +#define RCHAN_MODE_OVERWRITE 0x00000001 /* 'flight' mode */ +#define RCHAN_GLOBAL_BUFFER 0x00000002 /* not using per-cpu */ + +/* * Per-cpu relay channel buffer */ struct rchan_buf @@ -66,11 +72,11 @@ struct rchan void *private_data; /* for user-defined data */ size_t last_toobig; /* tried to log event > subbuf size */ struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */ - int is_global; /* One global buffer ? */ struct list_head list; /* for channel list */ struct dentry *parent; /* parent dentry passed to open */ int has_base_filename; /* has a filename associated? 
*/ char base_filename[NAME_MAX]; /* saved base filename */ + unsigned long flags; /* relay flags for this channel */ }; /* @@ -125,7 +131,6 @@ struct rchan_callbacks * @parent: the parent of the file to create * @mode: the mode of the file to create * @buf: the channel buffer - * @is_global: outparam - set non-zero if the buffer should be global * * Called during relay_open(), once for each per-cpu buffer, * to allow the client to create a file to be used to @@ -136,17 +141,12 @@ struct rchan_callbacks * The callback should return the dentry of the file created * to represent the relay buffer. * - * Setting the is_global outparam to a non-zero value will - * cause relay_open() to create a single global buffer rather - * than the default set of per-cpu buffers. - * * See Documentation/filesystems/relayfs.txt for more info. */ struct dentry *(*create_buf_file)(const char *filename, struct dentry *parent, int mode, - struct rchan_buf *buf, - int *is_global); + struct rchan_buf *buf); /* * remove_buf_file - remove file representing a relay channel buffer @@ -191,7 +191,8 @@ struct rchan *relay_open(const char *base_filename, size_t subbuf_size, size_t n_subbufs, struct rchan_callbacks *cb, - void *private_data); + void *private_data, + unsigned long rchan_flags); extern int relay_late_setup_files(struct rchan *chan, const char *base_filename, struct dentry *parent); diff --git a/kernel/relay.c b/kernel/relay.c index f1f55ae..d7a6458 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -310,8 +310,7 @@ static void buf_unmapped_default_callback(struct rchan_buf *buf, static struct dentry *create_buf_file_default_callback(const char *filename, struct dentry *parent, int mode, - struct rchan_buf *buf, - int *is_global) + struct rchan_buf *buf) { return NULL; } @@ -411,7 +410,7 @@ void relay_reset(struct rchan *chan) if (!chan) return; - if (chan->is_global && chan->buf[0]) { + if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) { __relay_reset(chan->buf[0], 0); return; } 
@@ -445,8 +444,7 @@ static struct dentry *relay_create_buf_file(struct rchan *chan, /* Create file in fs */ dentry = chan->cb->create_buf_file(tmpname, chan->parent, - S_IRUSR, buf, - &chan->is_global); + S_IRUSR, buf); kfree(tmpname); @@ -463,7 +461,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) struct rchan_buf *buf = NULL; struct dentry *dentry; - if (chan->is_global) + if (chan->flags & RCHAN_GLOBAL_BUFFER) return chan->buf[0]; buf = relay_create_buf(chan); @@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) buf->cpu = cpu; __relay_reset(buf, 1); - if(chan->is_global) { + if(chan->flags & RCHAN_GLOBAL_BUFFER) { chan->buf[0] = buf; buf->cpu = 0; } @@ -595,7 +593,8 @@ struct rchan *relay_open(const char *base_filename, size_t subbuf_size, size_t n_subbufs, struct rchan_callbacks *cb, - void *private_data) + void *private_data, + unsigned long rchan_flags) { unsigned int i; struct rchan *chan; @@ -612,6 +611,7 @@ struct rchan *relay_open(const char *base_filename, chan->subbuf_size = subbuf_size; chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs); chan->parent = parent; + chan->flags = rchan_flags; chan->private_data = private_data; if (base_filename) { chan->has_base_filename = 1; @@ -828,7 +828,7 @@ void relay_close(struct rchan *chan) return; mutex_lock(&relay_channels_mutex); - if (chan->is_global && chan->buf[0]) + if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) relay_close_buf(chan->buf[0]); else for_each_possible_cpu(i) @@ -859,7 +859,7 @@ void relay_flush(struct rchan *chan) if (!chan) return; - if (chan->is_global && chan->buf[0]) { + if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) { chan->cb->switch_subbuf(chan->buf[0], 0); return; } diff --git a/virt/kvm/kvm_trace.c b/virt/kvm/kvm_trace.c index 58141f3..d0a9e1c 100644 --- a/virt/kvm/kvm_trace.c +++ b/virt/kvm/kvm_trace.c @@ -130,10 +130,9 @@ static int kvm_subbuf_start_callback(struct rchan_buf *buf, void 
*subbuf, } static struct dentry *kvm_create_buf_file_callack(const char *filename, - struct dentry *parent, - int mode, - struct rchan_buf *buf, - int *is_global) + struct dentry *parent, + int mode, + struct rchan_buf *buf) { return debugfs_create_file(filename, mode, parent, buf, &relay_file_operations); @@ -171,7 +170,7 @@ static int do_kvm_trace_enable(struct kvm_user_trace_setup *kuts) goto err; kt->rchan = relay_open("trace", kvm_debugfs_dir, kuts->buf_size, - kuts->buf_nr, &kvm_relay_callbacks, kt); + kuts->buf_nr, &kvm_relay_callbacks, kt, 0UL); if (!kt->rchan) goto err; -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
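
As a rough illustration of how the channel flags from 3/8 would be tested, here is a small user-space sketch. The two flag values are copied from the patch; the demo_* helpers are hypothetical and exist only to show the bit tests. Note that the patch writes conditions such as `chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]` without parentheses; that parses as intended because `&` binds tighter than `&&` in C.

```c
#include <assert.h>

/* Flag values as defined in the patch. */
#define RCHAN_MODE_OVERWRITE	0x00000001	/* 'flight' mode */
#define RCHAN_GLOBAL_BUFFER	0x00000002	/* not using per-cpu */

/* Hypothetical helper: how many buffers a channel with these flags owns. */
unsigned int demo_nr_buffers(unsigned long flags, unsigned int nr_cpus)
{
	return (flags & RCHAN_GLOBAL_BUFFER) ? 1 : nr_cpus;
}

/* Hypothetical helper: may the writer overwrite unread sub-buffers? */
int demo_may_overwrite(unsigned long flags)
{
	return (flags & RCHAN_MODE_OVERWRITE) != 0;
}
```

Existing callers that pass 0UL, as blktrace and kvm_trace do in the patch, keep per-cpu buffers and no-overwrite behavior.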
* [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve. 2008-09-24 3:50 ` Tom Zanussi ` (4 preceding siblings ...) 2008-09-25 6:07 ` [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param Tom Zanussi @ 2008-09-25 6:07 ` Tom Zanussi 2008-09-25 6:07 ` [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience Tom Zanussi ` (3 subsequent siblings) 9 siblings, 0 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw) To: Martin Bligh Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Add reserved param to switch-subbuf, in preparation for non-pad write/reserve. Because a write/reserve can now cross sub-buffer boundaries, we use the length returned as a remainder for the new sub-buffer, and use the reserved param to return a pointer to the reserved space, or NULL if it couldn't be reserved. This patch also changes write/reserve to preserve their current behavior despite that change. This all goes away in a future patch, but is here now so things don't break. --- include/linux/relay.h | 24 ++++++++++++++++-------- kernel/relay.c | 12 +++++++++--- 2 files changed, 25 insertions(+), 11 deletions(-) diff --git a/include/linux/relay.h b/include/linux/relay.h index 648b4da..13163b0 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -173,13 +173,16 @@ struct rchan_callbacks * switch_subbuf - sub-buffer switch callback * @buf: the channel buffer * @length: size of current event + * @reserved: a pointer to the space reserved * * Returns either the length passed in or 0 if full. * * Performs sub-buffer-switch tasks such as updating filesize, * waking up readers, etc. 
*/ - size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length); + size_t (*switch_subbuf)(struct rchan_buf *buf, + size_t length, + void **reserved); }; /* @@ -205,7 +208,8 @@ extern void relay_reset(struct rchan *chan); extern int relay_buf_full(struct rchan_buf *buf); extern size_t switch_subbuf_default_callback(struct rchan_buf *buf, - size_t length); + size_t length, + void **reserved); /** * relay_event_toobig - is event too big to fit in a sub-buffer? @@ -268,12 +272,14 @@ static inline void relay_write(struct rchan *chan, { unsigned long flags; struct rchan_buf *buf; + void *reserved; local_irq_save(flags); buf = chan->buf[smp_processor_id()]; + reserved = buf->data + buf->offset; if (unlikely(buf->offset + length > chan->subbuf_size)) - length = chan->cb->switch_subbuf(buf, length); - memcpy(buf->data + buf->offset, data, length); + length = chan->cb->switch_subbuf(buf, length, &reserved); + memcpy(reserved, data, length); buf->offset += length; local_irq_restore(flags); } @@ -295,11 +301,13 @@ static inline void __relay_write(struct rchan *chan, size_t length) { struct rchan_buf *buf; + void *reserved; buf = chan->buf[get_cpu()]; + reserved = buf->data + buf->offset; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) - length = chan->cb->switch_subbuf(buf, length); - memcpy(buf->data + buf->offset, data, length); + length = chan->cb->switch_subbuf(buf, length, &reserved); + memcpy(reserved, data, length); buf->offset += length; put_cpu(); } @@ -320,12 +328,12 @@ static inline void *relay_reserve(struct rchan *chan, size_t length) void *reserved; struct rchan_buf *buf = chan->buf[smp_processor_id()]; + reserved = buf->data + buf->offset; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) { - length = chan->cb->switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length, &reserved); if (!length) return NULL; } - reserved = buf->data + buf->offset; buf->offset += length; return reserved; diff --git a/kernel/relay.c 
b/kernel/relay.c index d7a6458..9ea9240 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -736,13 +736,16 @@ int relay_late_setup_files(struct rchan *chan, * switch_subbuf_default_callback - switch to a new sub-buffer * @buf: channel buffer * @length: size of current event + * @reserved: a pointer to the space reserved * * Returns either the length passed in or 0 if full. * * Performs sub-buffer-switch tasks such as invoking callbacks, * updating padding counts, waking up readers, etc. */ -size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length) +size_t switch_subbuf_default_callback(struct rchan_buf *buf, + size_t length, + void **reserved) { void *old, *new; size_t old_subbuf, new_subbuf; @@ -774,6 +777,9 @@ size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length) if (unlikely(relay_event_toobig(buf, length + buf->offset))) goto toobig; + if (reserved) + *reserved = buf->data; + return length; toobig: @@ -860,14 +866,14 @@ void relay_flush(struct rchan *chan) return; if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) { - chan->cb->switch_subbuf(chan->buf[0], 0); + chan->cb->switch_subbuf(chan->buf[0], 0, NULL); return; } mutex_lock(&relay_channels_mutex); for_each_possible_cpu(i) if (chan->buf[i]) - chan->cb->switch_subbuf(chan->buf[i], 0); + chan->cb->switch_subbuf(chan->buf[i], 0, NULL); mutex_unlock(&relay_channels_mutex); } EXPORT_SYMBOL_GPL(relay_flush); -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
* [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience. 2008-09-24 3:50 ` Tom Zanussi ` (5 preceding siblings ...) 2008-09-25 6:07 ` [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve Tom Zanussi @ 2008-09-25 6:07 ` Tom Zanussi 2008-09-25 6:07 ` [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions Tom Zanussi ` (2 subsequent siblings) 9 siblings, 0 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw) To: Martin Bligh Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Map the first sub-buffer at the end of the buffer, for temporary convenience. Make relay buffers 'circular' for writing by mapping the first subbuf at end of last subbuf. This is so we can do writes across last subbuf boundary without adding special write logic. This is a temporary state of affairs and it all goes away in a future patch, but it's here now so things will still work. --- kernel/relay.c | 26 +++++++++++++++----------- 1 files changed, 15 insertions(+), 11 deletions(-) diff --git a/kernel/relay.c b/kernel/relay.c index 9ea9240..9a08fec 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -125,20 +125,20 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma) /** * relay_alloc_buf - allocate a channel buffer * @buf: the buffer struct - * @size: total size of the buffer * * Returns a pointer to the resulting buffer, %NULL if unsuccessful. The * passed in size will get page aligned, if it isn't already. 
*/ -static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size) +static void *relay_alloc_buf(struct rchan_buf *buf) { void *mem; - unsigned int i, j, n_pages; + unsigned int i, j, n_pages, n_subbuf_pages; - *size = PAGE_ALIGN(*size); - n_pages = *size >> PAGE_SHIFT; + buf->chan->alloc_size = PAGE_ALIGN(buf->chan->alloc_size); + n_pages = buf->chan->alloc_size >> PAGE_SHIFT; + n_subbuf_pages = PAGE_ALIGN(buf->chan->subbuf_size) >> PAGE_SHIFT; - buf->page_array = relay_alloc_page_array(n_pages); + buf->page_array = relay_alloc_page_array(n_pages + n_subbuf_pages); if (!buf->page_array) return NULL; @@ -148,11 +148,14 @@ static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size) goto depopulate; set_page_private(buf->page_array[i], (unsigned long)buf); } - mem = vmap(buf->page_array, n_pages, VM_MAP, PAGE_KERNEL); + for (i = 0; i < n_subbuf_pages; i++) + buf->page_array[n_pages + i] = buf->page_array[i]; + mem = vmap(buf->page_array, n_pages + n_subbuf_pages, VM_MAP, + PAGE_KERNEL); if (!mem) goto depopulate; - memset(mem, 0, *size); + memset(mem, 0, buf->chan->alloc_size); buf->page_count = n_pages; return mem; @@ -179,12 +182,13 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan) if (!buf->padding) goto free_buf; - buf->start = relay_alloc_buf(buf, &chan->alloc_size); + buf->chan = chan; + kref_get(&buf->chan->kref); + + buf->start = relay_alloc_buf(buf); if (!buf->start) goto free_buf; - buf->chan = chan; - kref_get(&buf->chan->kref); return buf; free_buf: -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
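
The trick in 5/8 is that the page array handed to vmap() carries the first sub-buffer's pages a second time at the end, so a write running past the last sub-buffer lands at the start of the buffer without any wrap logic in the writer. A minimal user-space model of that aliasing, assuming a toy 16-byte "page" and a lookup function standing in for the vmap()ed address range (all demo_* names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 16u	/* tiny "page" so the demo stays readable */

struct demo_map {
	unsigned char **pages;	/* n_pages real entries + n_alias aliases */
	size_t n_pages;
	size_t n_alias;
};

/* Allocate n_pages zeroed pages, then repeat the first n_alias at the end. */
int demo_map_init(struct demo_map *m, size_t n_pages, size_t n_alias)
{
	size_t i;

	if (n_alias > n_pages)
		return -1;
	m->pages = calloc(n_pages + n_alias, sizeof(*m->pages));
	if (!m->pages)
		return -1;
	m->n_pages = n_pages;
	m->n_alias = n_alias;
	for (i = 0; i < n_pages; i++) {
		m->pages[i] = calloc(1, DEMO_PAGE_SIZE);
		if (!m->pages[i]) {
			while (i--)
				free(m->pages[i]);
			free(m->pages);
			m->pages = NULL;
			return -1;
		}
	}
	/* Same loop as in the patch: tail entries point at the head pages. */
	for (i = 0; i < n_alias; i++)
		m->pages[n_pages + i] = m->pages[i];
	return 0;
}

/* Stand-in for an access through the mapped range: offset -> byte. */
unsigned char *demo_map_at(struct demo_map *m, size_t off)
{
	return &m->pages[off / DEMO_PAGE_SIZE][off % DEMO_PAGE_SIZE];
}

/* Linear write, no wrap check; off + len must stay inside the mapping. */
void demo_map_write(struct demo_map *m, size_t off, const void *src,
		    size_t len)
{
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i < len; i++)
		*demo_map_at(m, off + i) = s[i];
}

void demo_map_free(struct demo_map *m)
{
	size_t i;

	for (i = 0; i < m->n_pages; i++)	/* aliases are not freed twice */
		free(m->pages[i]);
	free(m->pages);
	m->pages = NULL;
}
```

With four pages and one alias, an 8-byte write starting 4 bytes before the end of the buffer puts its last 4 bytes at offsets 0 to 3, which is the behavior the non-padded write path in 6/8 depends on.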
* [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions.
2008-09-24 3:50 ` Tom Zanussi
` (6 preceding siblings ...)
2008-09-25 6:07 ` [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience Tom Zanussi
@ 2008-09-25 6:07 ` Tom Zanussi
2008-09-25 6:07 ` [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al Tom Zanussi
2008-09-25 6:08 ` [RFC PATCH 8/8] relay - Clean up remaining padding-related junk Tom Zanussi
9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw)
To: Martin Bligh
Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder
Replace relay_reserve/relay_write with non-padded versions.
The old versions of relay_reserve/relay_write would write/reserve an
event only if the whole thing could fit in the remaining space of the
current sub-buffer; if it couldn't, they would add padding to the
current sub-buffer and reserve in the next. The new versions don't
add padding but use up all the space in a sub-buffer and write the
remainder in the next sub-buffer. They won't, however, write a partial
event - if there's not enough space for the event in the current
sub-buffer and the next sub-buffer isn't free, the whole reserve/write
will fail.
--- include/linux/relay.h | 41 +++++++++++++++++++---------- kernel/relay.c | 69 +++++++++++++++++++++++++++---------------------- 2 files changed, 65 insertions(+), 45 deletions(-) diff --git a/include/linux/relay.h b/include/linux/relay.h index 13163b0..c42b2d3 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -207,9 +207,9 @@ extern void relay_subbufs_consumed(struct rchan *chan, extern void relay_reset(struct rchan *chan); extern int relay_buf_full(struct rchan_buf *buf); -extern size_t switch_subbuf_default_callback(struct rchan_buf *buf, - size_t length, - void **reserved); +extern size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf, + size_t length, + void **reserved); /** * relay_event_toobig - is event too big to fit in a sub-buffer? @@ -270,17 +270,23 @@ static inline void relay_write(struct rchan *chan, const void *data, size_t length) { - unsigned long flags; + size_t remainder = length; struct rchan_buf *buf; + unsigned long flags; void *reserved; local_irq_save(flags); buf = chan->buf[smp_processor_id()]; reserved = buf->data + buf->offset; - if (unlikely(buf->offset + length > chan->subbuf_size)) - length = chan->cb->switch_subbuf(buf, length, &reserved); + if (unlikely(buf->offset + length > buf->chan->subbuf_size)) { + remainder = chan->cb->switch_subbuf(buf, length, &reserved); + if (unlikely(!reserved)) { + local_irq_restore(flags); + return; + } + } memcpy(reserved, data, length); - buf->offset += length; + buf->offset += remainder; local_irq_restore(flags); } @@ -300,15 +306,22 @@ static inline void __relay_write(struct rchan *chan, const void *data, size_t length) { + size_t remainder = length; struct rchan_buf *buf; + unsigned long flags; void *reserved; buf = chan->buf[get_cpu()]; reserved = buf->data + buf->offset; - if (unlikely(buf->offset + length > buf->chan->subbuf_size)) - length = chan->cb->switch_subbuf(buf, length, &reserved); + if (unlikely(buf->offset + length > buf->chan->subbuf_size)) { + remainder 
= chan->cb->switch_subbuf(buf, length, &reserved); + if (unlikely(!reserved)) { + local_irq_restore(flags); + return; + } + } memcpy(reserved, data, length); - buf->offset += length; + buf->offset += remainder; put_cpu(); } @@ -323,15 +336,15 @@ static inline void __relay_write(struct rchan *chan, * Does not protect the buffer at all - caller must provide * appropriate synchronization. */ -static inline void *relay_reserve(struct rchan *chan, size_t length) +static inline void *relay_reserve(struct rchan *chan, + size_t length) { - void *reserved; struct rchan_buf *buf = chan->buf[smp_processor_id()]; + void *reserved = buf->data + buf->offset; - reserved = buf->data + buf->offset; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) { length = chan->cb->switch_subbuf(buf, length, &reserved); - if (!length) + if (unlikely(!reserved)) return NULL; } buf->offset += length; diff --git a/kernel/relay.c b/kernel/relay.c index 9a08fec..15e4de2 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -350,7 +350,7 @@ static struct rchan_callbacks default_channel_callbacks = { .create_buf_file = create_buf_file_default_callback, .remove_buf_file = remove_buf_file_default_callback, .wakeup_readers = wakeup_readers_default_callback, - .switch_subbuf = switch_subbuf_default_callback, + .switch_subbuf = relay_switch_subbuf_default_callback, }; /** @@ -530,7 +530,7 @@ static void setup_callbacks(struct rchan *chan, if (!cb->wakeup_readers) cb->wakeup_readers = wakeup_readers_default_callback; if (!cb->switch_subbuf) - cb->switch_subbuf = switch_subbuf_default_callback; + cb->switch_subbuf = relay_switch_subbuf_default_callback; chan->cb = cb; } @@ -736,8 +736,20 @@ int relay_late_setup_files(struct rchan *chan, return err; } +static inline int next_subbuf_free(struct rchan_buf *buf) +{ + size_t full_subbufs; + + if (buf->chan->flags & RCHAN_MODE_OVERWRITE) + return 1; + + full_subbufs = buf->subbufs_produced - buf->subbufs_consumed; + + return (full_subbufs < 
buf->chan->n_subbufs - 1); +} + /** - * switch_subbuf_default_callback - switch to a new sub-buffer + * relay_switch_subbuf_default_callback - switch to a new sub-buffer * @buf: channel buffer * @length: size of current event * @reserved: a pointer to the space reserved @@ -747,50 +759,45 @@ int relay_late_setup_files(struct rchan *chan, * Performs sub-buffer-switch tasks such as invoking callbacks, * updating padding counts, waking up readers, etc. */ -size_t switch_subbuf_default_callback(struct rchan_buf *buf, - size_t length, - void **reserved) +size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf, + size_t length, + void **reserved) { - void *old, *new; - size_t old_subbuf, new_subbuf; + size_t remainder, new_subbuf; + void *new_data; if (unlikely(relay_event_toobig(buf, length))) goto toobig; - if (buf->offset != buf->chan->subbuf_size + 1) { - buf->prev_padding = buf->chan->subbuf_size - buf->offset; - old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; - buf->padding[old_subbuf] = buf->prev_padding; - relay_inc_produced(buf); - relay_update_filesize(buf, buf->chan->subbuf_size - - buf->padding[old_subbuf]); - buf->chan->cb->wakeup_readers(buf); + /* don't write anything unless we can write it all. 
*/ + if (!next_subbuf_free(buf)) { + *reserved = NULL; + return 0; } - old = buf->data; + if (reserved) + *reserved = buf->data + buf->offset; + + remainder = length - (buf->chan->subbuf_size - buf->offset); + relay_inc_produced(buf); + relay_update_filesize(buf, buf->chan->subbuf_size + remainder); + buf->chan->cb->wakeup_readers(buf); + new_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; - new = buf->start + new_subbuf * buf->chan->subbuf_size; - buf->offset = 0; - if (!buf->chan->cb->subbuf_start(buf, new, old, buf->prev_padding)) { - buf->offset = buf->chan->subbuf_size + 1; - return 0; - } - buf->data = new; - buf->padding[new_subbuf] = 0; + new_data = buf->start + new_subbuf * buf->chan->subbuf_size; + + buf->data = new_data; + buf->offset = 0; /* remainder will be added by caller */ if (unlikely(relay_event_toobig(buf, length + buf->offset))) goto toobig; - if (reserved) - *reserved = buf->data; - - return length; - + return remainder; toobig: buf->chan->last_toobig = length; return 0; } -EXPORT_SYMBOL_GPL(switch_subbuf_default_callback); +EXPORT_SYMBOL_GPL(relay_switch_subbuf_default_callback); /** * relay_subbufs_consumed - update the buffer's sub-buffers-consumed count -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
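
The remainder arithmetic and the all-or-nothing rule in 6/8 can be modeled in user space roughly as follows. This is a sketch under assumptions, not the patch's code: the demo_* names and the fixed 4 x 8-byte layout are hypothetical, and the copy wraps with an explicit modulo where the real code relies on the mirrored mapping from 5/8. The free-sub-buffer test follows next_subbuf_free() in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SUBBUF_SIZE	8u
#define DEMO_N_SUBBUFS		4u
#define DEMO_BUF_SIZE		(DEMO_SUBBUF_SIZE * DEMO_N_SUBBUFS)

struct demo_rbuf {
	unsigned char mem[DEMO_BUF_SIZE];
	size_t cur;		/* index of the sub-buffer being written */
	size_t offset;		/* write offset inside that sub-buffer */
	size_t produced;	/* sub-buffers filled so far */
	size_t consumed;	/* sub-buffers read so far */
	int overwrite;		/* models RCHAN_MODE_OVERWRITE */
};

/* Same test as next_subbuf_free(): always free in overwrite mode. */
int demo_next_subbuf_free(const struct demo_rbuf *b)
{
	if (b->overwrite)
		return 1;
	return (b->produced - b->consumed) < DEMO_N_SUBBUFS - 1;
}

/* Write all of data or nothing. Returns 1 on success, 0 if dropped. */
int demo_write(struct demo_rbuf *b, const void *data, size_t length)
{
	const unsigned char *src = data;
	size_t room = DEMO_SUBBUF_SIZE - b->offset;
	size_t pos = b->cur * DEMO_SUBBUF_SIZE + b->offset;
	size_t remainder, i;

	if (length > DEMO_SUBBUF_SIZE)		/* the "toobig" case */
		return 0;

	if (length <= room) {			/* fast path: fits as-is */
		memcpy(b->mem + pos, src, length);
		b->offset += length;
		return 1;
	}

	if (!demo_next_subbuf_free(b))		/* never write a partial event */
		return 0;

	remainder = length - room;		/* bytes spilling into the next */
	for (i = 0; i < length; i++)		/* modulo replaces the mirror map */
		b->mem[(pos + i) % DEMO_BUF_SIZE] = src[i];

	b->produced++;
	b->cur = b->produced % DEMO_N_SUBBUFS;
	b->offset = remainder;			/* new sub-buffer starts part-used */
	return 1;
}
```

For example, with 6 bytes already in an 8-byte sub-buffer, a 5-byte event uses the last 2 bytes there and leaves the next sub-buffer with an offset of 3. If the next sub-buffer is not free and the channel is not in overwrite mode, nothing is written and the offsets stay untouched.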
* [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al. 2008-09-24 3:50 ` Tom Zanussi ` (7 preceding siblings ...) 2008-09-25 6:07 ` [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions Tom Zanussi @ 2008-09-25 6:07 ` Tom Zanussi 2008-09-25 6:08 ` [RFC PATCH 8/8] relay - Clean up remaining padding-related junk Tom Zanussi 9 siblings, 0 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-25 6:07 UTC (permalink / raw) To: Martin Bligh Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Remove padding-related code from relay_read()/relay_splice_read() et al. Because we no longer write padding, we no longer have to read it or account for it anywhere else, greatly simplifying the related code. --- kernel/relay.c | 149 ++++++++------------------------------------------------ 1 files changed, 20 insertions(+), 129 deletions(-) diff --git a/kernel/relay.c b/kernel/relay.c index 15e4de2..21b3e19 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -966,72 +966,13 @@ static void relay_file_read_consume(struct rchan_buf *buf, size_t bytes_consumed) { size_t subbuf_size = buf->chan->subbuf_size; - size_t n_subbufs = buf->chan->n_subbufs; - size_t read_subbuf; - - if (buf->subbufs_produced == buf->subbufs_consumed && - buf->offset == buf->bytes_consumed) - return; - - if (buf->bytes_consumed + bytes_consumed > subbuf_size) { - relay_subbufs_consumed(buf->chan, buf->cpu, 1); - buf->bytes_consumed = 0; - } buf->bytes_consumed += bytes_consumed; - if (!read_pos) - read_subbuf = buf->subbufs_consumed % n_subbufs; - else - read_subbuf = read_pos / buf->chan->subbuf_size; - if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) { - if ((read_subbuf == buf->subbufs_produced % n_subbufs) && - (buf->offset == subbuf_size)) - return; - relay_subbufs_consumed(buf->chan, 
buf->cpu, 1); - buf->bytes_consumed = 0; - } -} -/* - * relay_file_read_avail - boolean, are there unconsumed bytes available? - */ -static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos) -{ - size_t subbuf_size = buf->chan->subbuf_size; - size_t n_subbufs = buf->chan->n_subbufs; - size_t produced = buf->subbufs_produced; - size_t consumed = buf->subbufs_consumed; - - relay_file_read_consume(buf, read_pos, 0); - - consumed = buf->subbufs_consumed; - - if (unlikely(buf->offset > subbuf_size)) { - if (produced == consumed) - return 0; - return 1; - } - - if (unlikely(produced - consumed >= n_subbufs)) { - consumed = produced - n_subbufs + 1; - buf->subbufs_consumed = consumed; + if (buf->bytes_consumed == subbuf_size) { + relay_subbufs_consumed(buf->chan, buf->cpu, 1); buf->bytes_consumed = 0; } - - produced = (produced % n_subbufs) * subbuf_size + buf->offset; - consumed = (consumed % n_subbufs) * subbuf_size + buf->bytes_consumed; - - if (consumed > produced) - produced += n_subbufs * subbuf_size; - - if (consumed == produced) { - if (buf->offset == subbuf_size && - buf->subbufs_produced > buf->subbufs_consumed) - return 1; - return 0; - } - - return 1; } /** @@ -1042,21 +983,19 @@ static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos) static size_t relay_file_read_subbuf_avail(size_t read_pos, struct rchan_buf *buf) { - size_t padding, avail = 0; + size_t avail; size_t read_subbuf, read_offset, write_subbuf, write_offset; size_t subbuf_size = buf->chan->subbuf_size; write_subbuf = (buf->data - buf->start) / subbuf_size; - write_offset = buf->offset > subbuf_size ? 
subbuf_size : buf->offset; + write_offset = buf->offset; read_subbuf = read_pos / subbuf_size; read_offset = read_pos % subbuf_size; - padding = buf->padding[read_subbuf]; - if (read_subbuf == write_subbuf) { - if (read_offset + padding < write_offset) - avail = write_offset - (read_offset + padding); - } else - avail = (subbuf_size - padding) - read_offset; + avail = subbuf_size - read_offset; + + if (read_subbuf == write_subbuf && read_offset < write_offset) + avail = write_offset - read_offset; return avail; } @@ -1066,28 +1005,17 @@ static size_t relay_file_read_subbuf_avail(size_t read_pos, * @read_pos: file read position * @buf: relay channel buffer * - * If the @read_pos is in the middle of padding, return the - * position of the first actually available byte, otherwise - * return the original value. + * If the @read_pos is 0, return the position of the first + * unconsumed byte, otherwise return the original value. */ static size_t relay_file_read_start_pos(size_t read_pos, struct rchan_buf *buf) { - size_t read_subbuf, padding, padding_start, padding_end; size_t subbuf_size = buf->chan->subbuf_size; - size_t n_subbufs = buf->chan->n_subbufs; - size_t consumed = buf->subbufs_consumed % n_subbufs; + size_t consumed = buf->subbufs_consumed % buf->chan->n_subbufs; if (!read_pos) read_pos = consumed * subbuf_size + buf->bytes_consumed; - read_subbuf = read_pos / subbuf_size; - padding = buf->padding[read_subbuf]; - padding_start = (read_subbuf + 1) * subbuf_size - padding; - padding_end = (read_subbuf + 1) * subbuf_size; - if (read_pos >= padding_start && read_pos < padding_end) { - read_subbuf = (read_subbuf + 1) % n_subbufs; - read_pos = read_subbuf * subbuf_size; - } return read_pos; } @@ -1102,17 +1030,9 @@ static size_t relay_file_read_end_pos(struct rchan_buf *buf, size_t read_pos, size_t count) { - size_t read_subbuf, padding, end_pos; - size_t subbuf_size = buf->chan->subbuf_size; - size_t n_subbufs = buf->chan->n_subbufs; + size_t end_pos = read_pos + 
count; - read_subbuf = read_pos / subbuf_size; - padding = buf->padding[read_subbuf]; - if (read_pos % subbuf_size + count + padding == subbuf_size) - end_pos = (read_subbuf + 1) * subbuf_size; - else - end_pos = read_pos + count; - if (end_pos >= subbuf_size * n_subbufs) + if (end_pos >= buf->chan->subbuf_size * buf->chan->n_subbufs) end_pos = 0; return end_pos; @@ -1166,9 +1086,6 @@ static ssize_t relay_file_read_subbufs(struct file *filp, loff_t *ppos, mutex_lock(&filp->f_path.dentry->d_inode->i_mutex); do { - if (!relay_file_read_avail(buf, *ppos)) - break; - read_start = relay_file_read_start_pos(*ppos, buf); avail = relay_file_read_subbuf_avail(read_start, buf); if (!avail) @@ -1243,8 +1160,7 @@ static int subbuf_splice_actor(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, - unsigned int flags, - int *nonpad_ret) + unsigned int flags) { unsigned int pidx, poff, total_len, subbuf_pages, nr_pages, ret; struct rchan_buf *rbuf = in->private_data; @@ -1252,9 +1168,6 @@ static int subbuf_splice_actor(struct file *in, uint64_t pos = (uint64_t) *ppos; uint32_t alloc_size = (uint32_t) rbuf->chan->alloc_size; size_t read_start = (size_t) do_div(pos, alloc_size); - size_t read_subbuf = read_start / subbuf_size; - size_t padding = rbuf->padding[read_subbuf]; - size_t nonpad_end = read_subbuf * subbuf_size + subbuf_size - padding; struct page *pages[PIPE_BUFFERS]; struct partial_page partial[PIPE_BUFFERS]; struct splice_pipe_desc spd = { @@ -1266,7 +1179,8 @@ static int subbuf_splice_actor(struct file *in, .spd_release = relay_page_release, }; - if (rbuf->subbufs_produced == rbuf->subbufs_consumed) + if (rbuf->subbufs_produced == rbuf->subbufs_consumed && + rbuf->offset == rbuf->bytes_consumed) return 0; /* @@ -1281,46 +1195,25 @@ static int subbuf_splice_actor(struct file *in, nr_pages = min_t(unsigned int, subbuf_pages, PIPE_BUFFERS); for (total_len = 0; spd.nr_pages < nr_pages; spd.nr_pages++) { - unsigned int this_len, this_end, private; - 
unsigned int cur_pos = read_start + total_len; + unsigned int this_len; if (!len) break; this_len = min_t(unsigned long, len, PAGE_SIZE - poff); - private = this_len; spd.pages[spd.nr_pages] = rbuf->page_array[pidx]; spd.partial[spd.nr_pages].offset = poff; - this_end = cur_pos + this_len; - if (this_end >= nonpad_end) { - this_len = nonpad_end - cur_pos; - private = this_len + padding; - } spd.partial[spd.nr_pages].len = this_len; - spd.partial[spd.nr_pages].private = private; len -= this_len; total_len += this_len; poff = 0; pidx = (pidx + 1) % subbuf_pages; - - if (this_end >= nonpad_end) { - spd.nr_pages++; - break; - } } - if (!spd.nr_pages) - return 0; - - ret = *nonpad_ret = splice_to_pipe(pipe, &spd); - if (ret < 0 || ret < total_len) - return ret; - - if (read_start + ret == nonpad_end) - ret += padding; + ret = splice_to_pipe(pipe, &spd); return ret; } @@ -1333,13 +1226,12 @@ static ssize_t relay_file_splice_read(struct file *in, { ssize_t spliced; int ret; - int nonpad_ret = 0; ret = 0; spliced = 0; while (len && !spliced) { - ret = subbuf_splice_actor(in, ppos, pipe, len, flags, &nonpad_ret); + ret = subbuf_splice_actor(in, ppos, pipe, len, flags); if (ret < 0) break; else if (!ret) { @@ -1356,8 +1248,7 @@ static ssize_t relay_file_splice_read(struct file *in, len = 0; else len -= ret; - spliced += nonpad_ret; - nonpad_ret = 0; + spliced += ret; } if (spliced) -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
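Editorial sketch, not part of the patch: the simplified availability calculation in relay_file_read_subbuf_avail() can be modelled in user space as below. The struct and function names (model_buf, model_read_subbuf_avail) are hypothetical stand-ins for the rchan_buf fields the patch reads. The logic follows the patched function as posted, including the case where reader and writer offsets are equal, which falls through to "rest of the sub-buffer".

```c
#include <stddef.h>

/* Hypothetical stand-in for the rchan_buf state used by the read path. */
struct model_buf {
	size_t subbuf_size;   /* bytes per sub-buffer */
	size_t write_subbuf;  /* index of the sub-buffer being written */
	size_t write_offset;  /* bytes written so far in that sub-buffer */
};

/*
 * Bytes readable starting at read_pos.  With no padding to skip, this is
 * the rest of the sub-buffer, unless the reader is behind the writer in
 * the same sub-buffer, in which case it is the gap up to the write offset.
 */
static size_t model_read_subbuf_avail(size_t read_pos,
				      const struct model_buf *buf)
{
	size_t read_subbuf = read_pos / buf->subbuf_size;
	size_t read_offset = read_pos % buf->subbuf_size;
	size_t avail = buf->subbuf_size - read_offset;

	if (read_subbuf == buf->write_subbuf &&
	    read_offset < buf->write_offset)
		avail = buf->write_offset - read_offset;

	return avail;
}
```

For example, with 4096-byte sub-buffers and the writer 100 bytes into sub-buffer 1, a reader at position 4136 sees 60 bytes, while a reader at position 0 sees a full 4096.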
* [RFC PATCH 8/8] relay - Clean up remaining padding-related junk.
2008-09-24 3:50 ` Tom Zanussi
` (8 preceding siblings ...)
2008-09-25 6:07 ` [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al Tom Zanussi
@ 2008-09-25 6:08 ` Tom Zanussi
9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25 6:08 UTC (permalink / raw)
To: Martin Bligh
Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Clean up remaining padding-related junk.

Removes the rest of the padding-related junk. Also simplifies the
subbuf_start callback a bit.
---
 block/blktrace.c      |    5 +++--
 include/linux/relay.h |   12 ++----------
 kernel/relay.c        |   19 ++++---------------
 virt/kvm/kvm_trace.c  |    7 ++++---
 4 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/block/blktrace.c b/block/blktrace.c
index 150c5f7..271b7b7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -334,8 +334,9 @@ static const struct file_operations blk_msg_fops = {
  * Keep track of how many times we encountered a full subbuffer, to aid
  * the user space app in telling how many lost events there were.
*/ -static int blk_subbuf_start_callback(struct rchan_buf *buf, void *subbuf, - void *prev_subbuf, size_t prev_padding) +static int blk_subbuf_start_callback(struct rchan_buf *buf, + void *subbuf, + int first_subbuf) { struct blk_trace *bt; diff --git a/include/linux/relay.h b/include/linux/relay.h index c42b2d3..85f43a0 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -51,8 +51,6 @@ struct rchan_buf struct page **page_array; /* array of current buffer pages */ unsigned int page_count; /* number of current buffer pages */ unsigned int finalized; /* buffer has been finalized */ - size_t *padding; /* padding counts per sub-buffer */ - size_t prev_padding; /* temporary variable */ size_t bytes_consumed; /* bytes consumed in cur read subbuf */ size_t early_bytes; /* bytes consumed before VFS inited */ unsigned int cpu; /* this buf's cpu */ @@ -88,23 +86,17 @@ struct rchan_callbacks * subbuf_start - called on buffer-switch to a new sub-buffer * @buf: the channel buffer containing the new sub-buffer * @subbuf: the start of the new sub-buffer - * @prev_subbuf: the start of the previous sub-buffer - * @prev_padding: unused space at the end of previous sub-buffer + * @first_subbuf: boolean, is this the first subbuf? * * The client should return 1 to continue logging, 0 to stop * logging. * - * NOTE: subbuf_start will also be invoked when the buffer is - * created, so that the first sub-buffer can be initialized - * if necessary. In this case, prev_subbuf will be NULL. - * * NOTE: the client can reserve bytes at the beginning of the new * sub-buffer by calling subbuf_start_reserve() in this callback. 
*/ int (*subbuf_start) (struct rchan_buf *buf, void *subbuf, - void *prev_subbuf, - size_t prev_padding); + int first_subbuf); /* * buf_mapped - relay buffer mmap notification diff --git a/kernel/relay.c b/kernel/relay.c index 21b3e19..b2bf510 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -178,10 +178,6 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan) if (!buf) return NULL; - buf->padding = kmalloc(chan->n_subbufs * sizeof(size_t *), GFP_KERNEL); - if (!buf->padding) - goto free_buf; - buf->chan = chan; kref_get(&buf->chan->kref); @@ -192,7 +188,6 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan) return buf; free_buf: - kfree(buf->padding); kfree(buf); return NULL; } @@ -225,7 +220,6 @@ static void relay_destroy_buf(struct rchan_buf *buf) relay_free_page_array(buf->page_array); } chan->buf[buf->cpu] = NULL; - kfree(buf->padding); kfree(buf); kref_put(&chan->kref, relay_destroy_channel); } @@ -283,8 +277,7 @@ EXPORT_SYMBOL_GPL(relay_buf_full); */ static int subbuf_start_default_callback (struct rchan_buf *buf, void *subbuf, - void *prev_subbuf, - size_t prev_padding) + int first_subbuf) { if (relay_buf_full(buf)) return 0; @@ -374,8 +367,6 @@ static void wakeup_readers(unsigned long data) */ static void __relay_reset(struct rchan_buf *buf, unsigned int init) { - size_t i; - if (init) { init_waitqueue_head(&buf->read_wait); kref_init(&buf->kref); @@ -390,10 +381,7 @@ static void __relay_reset(struct rchan_buf *buf, unsigned int init) buf->data = buf->start; buf->offset = 0; - for (i = 0; i < buf->chan->n_subbufs; i++) - buf->padding[i] = 0; - - buf->chan->cb->subbuf_start(buf, buf->data, NULL, 0); + buf->chan->cb->subbuf_start(buf, buf->data, 1); } /** @@ -757,7 +745,7 @@ static inline int next_subbuf_free(struct rchan_buf *buf) * Returns either the length passed in or 0 if full. * * Performs sub-buffer-switch tasks such as invoking callbacks, - * updating padding counts, waking up readers, etc. + * waking up readers, etc. 
*/ size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf, size_t length, @@ -788,6 +776,7 @@ size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf, buf->data = new_data; buf->offset = 0; /* remainder will be added by caller */ + buf->chan->cb->subbuf_start(buf, new_data, 0); if (unlikely(relay_event_toobig(buf, length + buf->offset))) goto toobig; diff --git a/virt/kvm/kvm_trace.c b/virt/kvm/kvm_trace.c index d0a9e1c..4626caa 100644 --- a/virt/kvm/kvm_trace.c +++ b/virt/kvm/kvm_trace.c @@ -105,13 +105,14 @@ DEFINE_SIMPLE_ATTRIBUTE(kvm_trace_lost_ops, lost_records_get, NULL, "%llu\n"); * many times we encountered a full subbuffer, to tell user space app the * lost records there were. */ -static int kvm_subbuf_start_callback(struct rchan_buf *buf, void *subbuf, - void *prev_subbuf, size_t prev_padding) +static int kvm_subbuf_start_callback(struct rchan_buf *buf, + void *subbuf, + int first_subbuf) { struct kvm_trace *kt; if (!relay_buf_full(buf)) { - if (!prev_subbuf) { + if (first_subbuf) { /* * executed only once when the channel is opened * save metadata as first record -- 1.5.3.5 ^ permalink raw reply related [flat|nested] 122+ messages in thread
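Editorial sketch, not part of the patch: with the new signature a client tests first_subbuf instead of checking prev_subbuf for NULL. The user-space model below shows that shape, loosely following the kvm_trace callback above: count a lost event when the buffer is full, and write a header only into the first sub-buffer. The names model_buf and model_subbuf_start and the 4-byte "TRC1" header are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the client state behind a relay buffer. */
struct model_buf {
	int full;            /* stands in for relay_buf_full(buf) */
	unsigned long lost;  /* times a full buffer was encountered */
	size_t reserved;     /* bytes reserved at the start of the sub-buffer */
};

/* Returns 1 to continue logging, 0 to stop, as subbuf_start does. */
static int model_subbuf_start(struct model_buf *buf, void *subbuf,
			      int first_subbuf)
{
	static const char magic[4] = { 'T', 'R', 'C', '1' };

	if (buf->full) {
		buf->lost++;
		return 0;
	}

	buf->reserved = 0;
	if (first_subbuf) {
		/* Runs once, when the channel buffer is first set up. */
		memcpy(subbuf, magic, sizeof(magic));
		buf->reserved = sizeof(magic);
	}
	return 1;
}
```

In the kernel the reservation would go through subbuf_start_reserve(); the model just records the byte count.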
* [PATCH 1/3] relay - clean up subbuf switch
2008-09-22 14:45 ` Peter Zijlstra
` (2 preceding siblings ...)
2008-09-23 5:25 ` Tom Zanussi
@ 2008-09-23 5:27 ` Tom Zanussi
2008-09-23 20:15 ` Andrew Morton
2008-09-23 5:27 ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi
2008-09-23 5:27 ` [PATCH 3/3] relay - add channel flags Tom Zanussi
5 siblings, 1 reply; 122+ messages in thread
From: Tom Zanussi @ 2008-09-23 5:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Clean up relay_switch_subbuf() and make waking up consumers optional.

Over time, relay_switch_subbuf() has accumulated some cruft - this
patch cleans it up and at the same time makes some of it available
as common functions that any subbuf-switch implementor would need
(this is partially in preparation for the next patch, which makes
the subbuf-switch function completely replaceable). It also removes
the hard-coded reader wakeup and moves it into a replaceable callback
called notify_consumers(); this allows any given tracer to implement
consumer notification as it sees fit.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 953fc05..17f0515 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -159,6 +159,15 @@ struct rchan_callbacks
 	 * The callback should return 0 if successful, negative if not.
 	 */
 	int (*remove_buf_file)(struct dentry *dentry);
+
+	/*
+	 * notify_consumers - new sub-buffer available, let consumers know
+	 * @buf: the channel buffer
+	 *
+	 * Called during sub-buffer switch. Applications which don't
+	 * want to notify anyone should implement an empty version.
+ */ + void (*notify_consumers)(struct rchan_buf *buf); }; /* @@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length); /** + * relay_event_toobig - is event too big to fit in a sub-buffer? + * @buf: relay channel buffer + * @length: length of event + * + * Returns 1 if too big, 0 otherwise. + * + * switch_subbuf() helper function. + */ +static inline int relay_event_toobig(struct rchan_buf *buf, size_t length) +{ + return length > buf->chan->subbuf_size; +} + +/** + * relay_update_filesize - increase relay file i_size by length + * @buf: relay channel buffer + * @length: length to add + * + * switch_subbuf() helper function. + */ +static inline void relay_update_filesize(struct rchan_buf *buf, size_t length) +{ + if (buf->dentry) + buf->dentry->d_inode->i_size += length; + else + buf->early_bytes += length; + + smp_mb(); +} + +/** + * relay_inc_produced - increase number of sub-buffers produced by 1 + * @buf: relay channel buffer + * + * switch_subbuf() helper function. + */ +static inline void relay_inc_produced(struct rchan_buf *buf) +{ + buf->subbufs_produced++; +} + +/** * relay_write - write data into the channel * @chan: relay channel * @data: data to be written diff --git a/kernel/relay.c b/kernel/relay.c index 8d13a78..53652f1 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -324,6 +324,21 @@ static int remove_buf_file_default_callback(struct dentry *dentry) return -EINVAL; } +/* + * notify_consumers() default callback. + */ +static void notify_consumers_default_callback(struct rchan_buf *buf) +{ + if (waitqueue_active(&buf->read_wait)) + /* + * Calling wake_up_interruptible() from here + * will deadlock if we happen to be logging + * from the scheduler (trying to re-grab + * rq->lock), so defer it. 
+ */ + __mod_timer(&buf->timer, jiffies + 1); +} + /* relay channel default callbacks */ static struct rchan_callbacks default_channel_callbacks = { .subbuf_start = subbuf_start_default_callback, @@ -331,6 +346,7 @@ static struct rchan_callbacks default_channel_callbacks = { .buf_unmapped = buf_unmapped_default_callback, .create_buf_file = create_buf_file_default_callback, .remove_buf_file = remove_buf_file_default_callback, + .notify_consumers = notify_consumers_default_callback, }; /** @@ -508,6 +524,8 @@ static void setup_callbacks(struct rchan *chan, cb->create_buf_file = create_buf_file_default_callback; if (!cb->remove_buf_file) cb->remove_buf_file = remove_buf_file_default_callback; + if (!cb->notify_consumers) + cb->notify_consumers = notify_consumers_default_callback; chan->cb = cb; } @@ -726,30 +744,17 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) void *old, *new; size_t old_subbuf, new_subbuf; - if (unlikely(length > buf->chan->subbuf_size)) + if (unlikely(relay_event_toobig(buf, length))) goto toobig; if (buf->offset != buf->chan->subbuf_size + 1) { buf->prev_padding = buf->chan->subbuf_size - buf->offset; old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; buf->padding[old_subbuf] = buf->prev_padding; - buf->subbufs_produced++; - if (buf->dentry) - buf->dentry->d_inode->i_size += - buf->chan->subbuf_size - - buf->padding[old_subbuf]; - else - buf->early_bytes += buf->chan->subbuf_size - - buf->padding[old_subbuf]; - smp_mb(); - if (waitqueue_active(&buf->read_wait)) - /* - * Calling wake_up_interruptible() from here - * will deadlock if we happen to be logging - * from the scheduler (trying to re-grab - * rq->lock), so defer it. 
- */ - __mod_timer(&buf->timer, jiffies + 1); + relay_inc_produced(buf); + relay_update_filesize(buf, buf->chan->subbuf_size - + buf->padding[old_subbuf]); + buf->chan->cb->notify_consumers(buf); } old = buf->data; @@ -763,7 +768,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) buf->data = new; buf->padding[new_subbuf] = 0; - if (unlikely(length + buf->offset > buf->chan->subbuf_size)) + if (unlikely(relay_event_toobig(buf, length + buf->offset))) goto toobig; return length; ^ permalink raw reply related [flat|nested] 122+ messages in thread
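Editorial sketch, not part of the patch: the notify_consumers() arrangement is the usual "fill in a default if the client left the pointer NULL" callback pattern. The user-space model below shows a client that accepts the default wakeup and one that supplies an empty version to suppress it. All model_* names are hypothetical, and the wakeup is reduced to a counter rather than the deferred timer the real default uses.

```c
#include <stddef.h>

struct model_buf;

/* Hypothetical stand-in for struct rchan_callbacks. */
struct model_callbacks {
	void (*notify_consumers)(struct model_buf *buf);
};

struct model_buf {
	const struct model_callbacks *cb;
	unsigned int wakeups;   /* times the default notifier ran */
	unsigned int produced;  /* sub-buffers produced */
};

/* Default: stands in for scheduling the deferred reader wakeup. */
static void model_notify_default(struct model_buf *buf)
{
	buf->wakeups++;
}

/* Empty version for clients that do not want to notify anyone. */
static void model_notify_none(struct model_buf *buf)
{
	(void)buf;
}

/* Mirrors setup_callbacks(): only fill in what the client left unset. */
static void model_setup_callbacks(struct model_callbacks *cb)
{
	if (!cb->notify_consumers)
		cb->notify_consumers = model_notify_default;
}

/* The part of the sub-buffer switch that finishes the old sub-buffer. */
static void model_finish_subbuf(struct model_buf *buf)
{
	buf->produced++;
	buf->cb->notify_consumers(buf);
}
```

Because the pointer is always set after setup, the switch path can call it unconditionally, which is what the patched relay_switch_subbuf() does.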
* Re: [PATCH 1/3] relay - clean up subbuf switch
2008-09-23 5:27 ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi
@ 2008-09-23 20:15 ` Andrew Morton
0 siblings, 0 replies; 122+ messages in thread
From: Andrew Morton @ 2008-09-23 20:15 UTC (permalink / raw)
To: Tom Zanussi
Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx,
compudj, rostedt, od, fche, hch, dwilder

On Tue, 23 Sep 2008 00:27:02 -0500 Tom Zanussi <zanussi@comcast.net> wrote:

> Clean up relay_switch_subbuf() and make waking up consumers optional.
>
> Over time, relay_switch_subbuf() has accumulated some cruft - this
> patch cleans it up and at the same time makes some of it available
> as common functions that any subbuf-switch implementor would need
> (this is partially in preparation for the next patch, which makes
> the subbuf-switch function completely replaceable). It also removes
> the hard-coded reader wakeup and moves it into a replaceable callback
> called notify_consumers(); this allows any given tracer to implement
> consumer notification as it sees fit.
>
> Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
>
> diff --git a/include/linux/relay.h b/include/linux/relay.h
> index 953fc05..17f0515 100644
> --- a/include/linux/relay.h
> +++ b/include/linux/relay.h
> @@ -159,6 +159,15 @@ struct rchan_callbacks
>  	 * The callback should return 0 if successful, negative if not.
>  	 */
>  	int (*remove_buf_file)(struct dentry *dentry);
> +
> +	/*
> +	 * notify_consumers - new sub-buffer available, let consumers know
> +	 * @buf: the channel buffer
> +	 *
> +	 * Called during sub-buffer switch. Applications which don't
> +	 * want to notify anyone should implement an empty version.
> +	 */
> +	void (*notify_consumers)(struct rchan_buf *buf);
>  };

Does this comment format and placement get properly processed by the
kerneldoc tools?

> /* > @@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf, > size_t length); > > /** > + * relay_event_toobig - is event too big to fit in a sub-buffer? > + * @buf: relay channel buffer > + * @length: length of event > + * > + * Returns 1 if too big, 0 otherwise. > + * > + * switch_subbuf() helper function. > + */ > +static inline int relay_event_toobig(struct rchan_buf *buf, size_t length) > +{ > + return length > buf->chan->subbuf_size; > +} > + > +/** > + * relay_update_filesize - increase relay file i_size by length > + * @buf: relay channel buffer > + * @length: length to add > + * > + * switch_subbuf() helper function. > + */ > +static inline void relay_update_filesize(struct rchan_buf *buf, size_t length) > +{ > + if (buf->dentry) > + buf->dentry->d_inode->i_size += length; > + else > + buf->early_bytes += length; > + > + smp_mb(); > +} What locking protects the non-atomic modification of the 64-bit i_size here? > +/** > + * relay_inc_produced - increase number of sub-buffers produced by 1 > + * @buf: relay channel buffer > + * > + * switch_subbuf() helper function. > + */ > +static inline void relay_inc_produced(struct rchan_buf *buf) > +{ > + buf->subbufs_produced++; > +} This also needs caller-provided locking. That's part of the function's interface and should be documented, > +/** > * relay_write - write data into the channel > * @chan: relay channel > * @data: data to be written > diff --git a/kernel/relay.c b/kernel/relay.c > index 8d13a78..53652f1 100644 > --- a/kernel/relay.c > +++ b/kernel/relay.c > @@ -324,6 +324,21 @@ static int remove_buf_file_default_callback(struct dentry *dentry) > return -EINVAL; > } > > +/* > + * notify_consumers() default callback. 
> + */ > +static void notify_consumers_default_callback(struct rchan_buf *buf) > +{ > + if (waitqueue_active(&buf->read_wait)) > + /* > + * Calling wake_up_interruptible() from here > + * will deadlock if we happen to be logging > + * from the scheduler (trying to re-grab > + * rq->lock), so defer it. > + */ > + __mod_timer(&buf->timer, jiffies + 1); > +} > + > /* relay channel default callbacks */ > static struct rchan_callbacks default_channel_callbacks = { > .subbuf_start = subbuf_start_default_callback, > @@ -331,6 +346,7 @@ static struct rchan_callbacks default_channel_callbacks = { > .buf_unmapped = buf_unmapped_default_callback, > .create_buf_file = create_buf_file_default_callback, > .remove_buf_file = remove_buf_file_default_callback, > + .notify_consumers = notify_consumers_default_callback, > }; > > /** > @@ -508,6 +524,8 @@ static void setup_callbacks(struct rchan *chan, > cb->create_buf_file = create_buf_file_default_callback; > if (!cb->remove_buf_file) > cb->remove_buf_file = remove_buf_file_default_callback; > + if (!cb->notify_consumers) > + cb->notify_consumers = notify_consumers_default_callback; > chan->cb = cb; > } > > @@ -726,30 +744,17 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) > void *old, *new; > size_t old_subbuf, new_subbuf; > > - if (unlikely(length > buf->chan->subbuf_size)) > + if (unlikely(relay_event_toobig(buf, length))) > goto toobig; > > if (buf->offset != buf->chan->subbuf_size + 1) { > buf->prev_padding = buf->chan->subbuf_size - buf->offset; > old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; > buf->padding[old_subbuf] = buf->prev_padding; > - buf->subbufs_produced++; > - if (buf->dentry) > - buf->dentry->d_inode->i_size += > - buf->chan->subbuf_size - > - buf->padding[old_subbuf]; > - else > - buf->early_bytes += buf->chan->subbuf_size - > - buf->padding[old_subbuf]; > - smp_mb(); > - if (waitqueue_active(&buf->read_wait)) > - /* > - * Calling wake_up_interruptible() from here > - * will 
deadlock if we happen to be logging > - * from the scheduler (trying to re-grab > - * rq->lock), so defer it. > - */ > - __mod_timer(&buf->timer, jiffies + 1); > + relay_inc_produced(buf); > + relay_update_filesize(buf, buf->chan->subbuf_size - > + buf->padding[old_subbuf]); > + buf->chan->cb->notify_consumers(buf); > } > > old = buf->data; > @@ -763,7 +768,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) > buf->data = new; > buf->padding[new_subbuf] = 0; > > - if (unlikely(length + buf->offset > buf->chan->subbuf_size)) > + if (unlikely(relay_event_toobig(buf, length + buf->offset))) > goto toobig; I think you can put the unlikely() into relay_event_toobig() and gcc will dtrt. If that has any value. > return length; > ^ permalink raw reply [flat|nested] 122+ messages in thread
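Editorial sketch, not from the thread: the suggestion to fold unlikely() into relay_event_toobig() could look roughly like the user-space model below. model_unlikely() is a hypothetical stand-in for the kernel macro, built on __builtin_expect, so this assumes GCC or Clang. Whether the hint changes the generated code at the call sites is left open in the review ("If that has any value").

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's unlikely(); assumes GCC or Clang. */
#define model_unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Returns 1 if an event of the given length cannot fit in a sub-buffer,
 * 0 otherwise.  The branch hint lives in the helper, so callers can
 * write a plain if (model_event_toobig(...)).
 */
static inline int model_event_toobig(size_t subbuf_size, size_t length)
{
	return model_unlikely(length > subbuf_size);
}
```

A caller would then drop its own unlikely() wrapper and test the helper's result directly.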
* [PATCH 2/3] relay - make subbuf switch replaceable 2008-09-22 14:45 ` Peter Zijlstra ` (3 preceding siblings ...) 2008-09-23 5:27 ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi @ 2008-09-23 5:27 ` Tom Zanussi 2008-09-23 20:17 ` Andrew Morton 2008-09-23 5:27 ` [PATCH 3/3] relay - add channel flags Tom Zanussi 5 siblings, 1 reply; 122+ messages in thread From: Tom Zanussi @ 2008-09-23 5:27 UTC (permalink / raw) To: Peter Zijlstra Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Make the relay sub-buffer switch code replaceable. With this patch, tracers now have complete control over the relay write (or reserve) path if they choose to do so, by implementing their own version of the sub-buffer switch function (switch_subbuf()), in addition to their own local write/reserve functions. Tracers who choose not to do so automatically default to the normal behavior. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> diff --git a/include/linux/relay.h b/include/linux/relay.h index 17f0515..52e4d61 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -168,6 +168,18 @@ struct rchan_callbacks * want to notify anyone should implement an empty version. */ void (*notify_consumers)(struct rchan_buf *buf); + + /* + * switch_subbuf - sub-buffer switch callback + * @buf: the channel buffer + * @length: size of current event + * + * Returns either the length passed in or 0 if full. + * + * Performs sub-buffer-switch tasks such as updating filesize, + * waking up readers, etc. 
+ */ + size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length); }; /* @@ -191,8 +203,9 @@ extern void relay_subbufs_consumed(struct rchan *chan, extern void relay_reset(struct rchan *chan); extern int relay_buf_full(struct rchan_buf *buf); -extern size_t relay_switch_subbuf(struct rchan_buf *buf, - size_t length); +extern size_t switch_subbuf_default_callback(struct rchan_buf *buf, + size_t length); + /** * relay_event_toobig - is event too big to fit in a sub-buffer? @@ -259,7 +272,7 @@ static inline void relay_write(struct rchan *chan, local_irq_save(flags); buf = chan->buf[smp_processor_id()]; if (unlikely(buf->offset + length > chan->subbuf_size)) - length = relay_switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length); memcpy(buf->data + buf->offset, data, length); buf->offset += length; local_irq_restore(flags); @@ -285,7 +298,7 @@ static inline void __relay_write(struct rchan *chan, buf = chan->buf[get_cpu()]; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) - length = relay_switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length); memcpy(buf->data + buf->offset, data, length); buf->offset += length; put_cpu(); @@ -308,7 +321,7 @@ static inline void *relay_reserve(struct rchan *chan, size_t length) struct rchan_buf *buf = chan->buf[smp_processor_id()]; if (unlikely(buf->offset + length > buf->chan->subbuf_size)) { - length = relay_switch_subbuf(buf, length); + length = chan->cb->switch_subbuf(buf, length); if (!length) return NULL; } diff --git a/kernel/relay.c b/kernel/relay.c index 53652f1..e299f49 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -347,6 +347,7 @@ static struct rchan_callbacks default_channel_callbacks = { .create_buf_file = create_buf_file_default_callback, .remove_buf_file = remove_buf_file_default_callback, .notify_consumers = notify_consumers_default_callback, + .switch_subbuf = switch_subbuf_default_callback, }; /** @@ -526,6 +527,8 @@ static void setup_callbacks(struct rchan 
*chan, cb->remove_buf_file = remove_buf_file_default_callback; if (!cb->notify_consumers) cb->notify_consumers = notify_consumers_default_callback; + if (!cb->switch_subbuf) + cb->switch_subbuf = switch_subbuf_default_callback; chan->cb = cb; } @@ -730,7 +733,7 @@ int relay_late_setup_files(struct rchan *chan, } /** - * relay_switch_subbuf - switch to a new sub-buffer + * switch_subbuf_default_callback - switch to a new sub-buffer * @buf: channel buffer * @length: size of current event * @@ -739,7 +742,7 @@ int relay_late_setup_files(struct rchan *chan, * Performs sub-buffer-switch tasks such as invoking callbacks, * updating padding counts, waking up readers, etc. */ -size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) +size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length) { void *old, *new; size_t old_subbuf, new_subbuf; @@ -777,7 +780,7 @@ toobig: buf->chan->last_toobig = length; return 0; } -EXPORT_SYMBOL_GPL(relay_switch_subbuf); +EXPORT_SYMBOL_GPL(switch_subbuf_default_callback); /** * relay_subbufs_consumed - update the buffer's sub-buffers-consumed count @@ -857,14 +860,14 @@ void relay_flush(struct rchan *chan) return; if (chan->is_global && chan->buf[0]) { - relay_switch_subbuf(chan->buf[0], 0); + chan->cb->switch_subbuf(chan->buf[0], 0); return; } mutex_lock(&relay_channels_mutex); for_each_possible_cpu(i) if (chan->buf[i]) - relay_switch_subbuf(chan->buf[i], 0); + chan->cb->switch_subbuf(chan->buf[i], 0); mutex_unlock(&relay_channels_mutex); } EXPORT_SYMBOL_GPL(relay_flush); ^ permalink raw reply related [flat|nested] 122+ messages in thread
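Editorial sketch, not part of the patch: after this change the write path reaches the sub-buffer switch through chan->cb->switch_subbuf() rather than calling relay_switch_subbuf() directly. The user-space model below follows the shape of relay_write(): if the event does not fit, ask the callback for a new sub-buffer, then copy whatever length it returns (0 drops the event). The model_* names and the tiny buffer sizes are assumptions for illustration, and the default switch here simply wraps around without the buffer-full check the real one performs.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_SUBBUF_SIZE 8
#define MODEL_N_SUBBUFS   2

struct model_buf;

/* Hypothetical stand-in for struct rchan_callbacks. */
struct model_callbacks {
	size_t (*switch_subbuf)(struct model_buf *buf, size_t length);
};

struct model_buf {
	const struct model_callbacks *cb;
	char storage[MODEL_SUBBUF_SIZE * MODEL_N_SUBBUFS];
	char *data;       /* start of the current sub-buffer */
	size_t offset;    /* write offset within the current sub-buffer */
	size_t produced;  /* sub-buffers completed so far */
};

/* Default switch: returns length, or 0 if the event can never fit. */
static size_t model_switch_default(struct model_buf *buf, size_t length)
{
	if (length > MODEL_SUBBUF_SIZE)
		return 0;

	buf->produced++;
	buf->data = buf->storage +
		(buf->produced % MODEL_N_SUBBUFS) * MODEL_SUBBUF_SIZE;
	buf->offset = 0;
	return length;
}

/* Write path: the switch policy is whatever the callback table holds. */
static void model_write(struct model_buf *buf, const void *data,
			size_t length)
{
	if (buf->offset + length > MODEL_SUBBUF_SIZE)
		length = buf->cb->switch_subbuf(buf, length);
	memcpy(buf->data + buf->offset, data, length);
	buf->offset += length;
}
```

A tracer wanting different behaviour would supply its own function in the callback table and leave model_write() untouched, which is the point of making the switch replaceable.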
* Re: [PATCH 2/3] relay - make subbuf switch replaceable
2008-09-23 5:27 ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi
@ 2008-09-23 20:17 ` Andrew Morton
0 siblings, 0 replies; 122+ messages in thread
From: Andrew Morton @ 2008-09-23 20:17 UTC (permalink / raw)
To: Tom Zanussi
Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx,
compudj, rostedt, od, fche, hch, dwilder

On Tue, 23 Sep 2008 00:27:30 -0500 Tom Zanussi <zanussi@comcast.net> wrote:

> Make the relay sub-buffer switch code replaceable.
>
> With this patch, tracers now have complete control over the relay
> write (or reserve) path if they choose to do so, by implementing their
> own version of the sub-buffer switch function (switch_subbuf()), in
> addition to their own local write/reserve functions. Tracers who
> choose not to do so automatically default to the normal behavior.
>
>
> ...
>
> -EXPORT_SYMBOL_GPL(relay_switch_subbuf);
> +EXPORT_SYMBOL_GPL(switch_subbuf_default_callback);

It would be nice to keep the `relay_' prefix on the exported relay
interface? Something called `switch_subbuf_default_callback' could
belong pretty much anywhere in the kernel.

^ permalink raw reply [flat|nested] 122+ messages in thread
* [PATCH 3/3] relay - add channel flags 2008-09-22 14:45 ` Peter Zijlstra ` (4 preceding siblings ...) 2008-09-23 5:27 ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi @ 2008-09-23 5:27 ` Tom Zanussi 2008-09-23 20:20 ` Andrew Morton 5 siblings, 1 reply; 122+ messages in thread From: Tom Zanussi @ 2008-09-23 5:27 UTC (permalink / raw) To: Peter Zijlstra Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch, David Wilder Add channel flags to relay, remove global callback param. relay should probably have had a flags param from the beginning; it wasn't originally added because it wasn't originally needed - it probably would have helped avoid some of the callback contortions that were added due to a lack of flags. This adds them and does a small amount of low-hanging cleanup, and is also in preparation for some new flags in future patches. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> diff --git a/block/blktrace.c b/block/blktrace.c index eb9651c..150c5f7 100644 --- a/block/blktrace.c +++ b/block/blktrace.c @@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry) static struct dentry *blk_create_buf_file_callback(const char *filename, struct dentry *parent, int mode, - struct rchan_buf *buf, - int *is_global) + struct rchan_buf *buf) { return debugfs_create_file(filename, mode, parent, buf, &relay_file_operations); @@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, goto err; bt->rchan = relay_open("trace", dir, buts->buf_size, - buts->buf_nr, &blk_relay_callbacks, bt); + buts->buf_nr, &blk_relay_callbacks, bt, 0UL); if (!bt->rchan) goto err; diff --git a/include/linux/relay.h b/include/linux/relay.h index 52e4d61..18fd269 100644 --- a/include/linux/relay.h +++ b/include/linux/relay.h @@ -25,7 +25,13 @@ /* * Tracks changes to rchan/rchan_buf structs */ -#define 
RELAYFS_CHANNEL_VERSION 7 +#define RELAYFS_CHANNEL_VERSION 8 + +/* + * relay channel flags + */ +#define RCHAN_MODE_OVERWRITE 0x00000001 /* 'flight' mode */ +#define RCHAN_GLOBAL_BUFFER 0x00000002 /* not using per-cpu */ /* * Per-cpu relay channel buffer @@ -66,11 +72,11 @@ struct rchan void *private_data; /* for user-defined data */ size_t last_toobig; /* tried to log event > subbuf size */ struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */ - int is_global; /* One global buffer ? */ struct list_head list; /* for channel list */ struct dentry *parent; /* parent dentry passed to open */ int has_base_filename; /* has a filename associated? */ char base_filename[NAME_MAX]; /* saved base filename */ + unsigned long flags; /* relay flags for this channel */ }; /* @@ -125,7 +131,6 @@ struct rchan_callbacks * @parent: the parent of the file to create * @mode: the mode of the file to create * @buf: the channel buffer - * @is_global: outparam - set non-zero if the buffer should be global * * Called during relay_open(), once for each per-cpu buffer, * to allow the client to create a file to be used to @@ -136,17 +141,12 @@ struct rchan_callbacks * The callback should return the dentry of the file created * to represent the relay buffer. * - * Setting the is_global outparam to a non-zero value will - * cause relay_open() to create a single global buffer rather - * than the default set of per-cpu buffers. - * * See Documentation/filesystems/relayfs.txt for more info. 
*/ struct dentry *(*create_buf_file)(const char *filename, struct dentry *parent, int mode, - struct rchan_buf *buf, - int *is_global); + struct rchan_buf *buf); /* * remove_buf_file - remove file representing a relay channel buffer @@ -191,7 +191,8 @@ struct rchan *relay_open(const char *base_filename, size_t subbuf_size, size_t n_subbufs, struct rchan_callbacks *cb, - void *private_data); + void *private_data, + unsigned long rchan_flags); extern int relay_late_setup_files(struct rchan *chan, const char *base_filename, struct dentry *parent); diff --git a/kernel/relay.c b/kernel/relay.c index e299f49..a2e06b0 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -310,8 +310,7 @@ static void buf_unmapped_default_callback(struct rchan_buf *buf, static struct dentry *create_buf_file_default_callback(const char *filename, struct dentry *parent, int mode, - struct rchan_buf *buf, - int *is_global) + struct rchan_buf *buf) { return NULL; } @@ -411,7 +410,7 @@ void relay_reset(struct rchan *chan) if (!chan) return; - if (chan->is_global && chan->buf[0]) { + if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) { __relay_reset(chan->buf[0], 0); return; } @@ -445,8 +444,7 @@ static struct dentry *relay_create_buf_file(struct rchan *chan, /* Create file in fs */ dentry = chan->cb->create_buf_file(tmpname, chan->parent, - S_IRUSR, buf, - &chan->is_global); + S_IRUSR, buf); kfree(tmpname); @@ -463,7 +461,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) struct rchan_buf *buf = NULL; struct dentry *dentry; - if (chan->is_global) + if (chan->flags & RCHAN_GLOBAL_BUFFER) return chan->buf[0]; buf = relay_create_buf(chan); @@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) buf->cpu = cpu; __relay_reset(buf, 1); - if(chan->is_global) { + if(chan->flags & RCHAN_GLOBAL_BUFFER) { chan->buf[0] = buf; buf->cpu = 0; } @@ -595,7 +593,8 @@ struct rchan *relay_open(const char *base_filename, size_t subbuf_size, 
size_t n_subbufs, struct rchan_callbacks *cb, - void *private_data) + void *private_data, + unsigned long rchan_flags) { unsigned int i; struct rchan *chan; @@ -612,6 +611,7 @@ struct rchan *relay_open(const char *base_filename, chan->subbuf_size = subbuf_size; chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs); chan->parent = parent; + chan->flags = rchan_flags; chan->private_data = private_data; if (base_filename) { chan->has_base_filename = 1; @@ -828,7 +828,7 @@ void relay_close(struct rchan *chan) return; mutex_lock(&relay_channels_mutex); - if (chan->is_global && chan->buf[0]) + if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) relay_close_buf(chan->buf[0]); else for_each_possible_cpu(i) @@ -859,7 +859,7 @@ void relay_flush(struct rchan *chan) if (!chan) return; - if (chan->is_global && chan->buf[0]) { + if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) { chan->cb->switch_subbuf(chan->buf[0], 0); return; } diff --git a/virt/kvm/kvm_trace.c b/virt/kvm/kvm_trace.c index 58141f3..d0a9e1c 100644 --- a/virt/kvm/kvm_trace.c +++ b/virt/kvm/kvm_trace.c @@ -130,10 +130,9 @@ static int kvm_subbuf_start_callback(struct rchan_buf *buf, void *subbuf, } static struct dentry *kvm_create_buf_file_callack(const char *filename, - struct dentry *parent, - int mode, - struct rchan_buf *buf, - int *is_global) + struct dentry *parent, + int mode, + struct rchan_buf *buf) { return debugfs_create_file(filename, mode, parent, buf, &relay_file_operations); @@ -171,7 +170,7 @@ static int do_kvm_trace_enable(struct kvm_user_trace_setup *kuts) goto err; kt->rchan = relay_open("trace", kvm_debugfs_dir, kuts->buf_size, - kuts->buf_nr, &kvm_relay_callbacks, kt); + kuts->buf_nr, &kvm_relay_callbacks, kt, 0UL); if (!kt->rchan) goto err; ^ permalink raw reply related [flat|nested] 122+ messages in thread
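A rough sketch of the flag test this patch introduces, written as standalone userspace C rather than kernel code. The two flag values are copied from the include/linux/relay.h hunk above; struct demo_chan and the demo_* helpers are hypothetical stand-ins and are not part of relay. Note that `chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]` in the patch parses as `(chan->flags & RCHAN_GLOBAL_BUFFER) && chan->buf[0]`, since & binds tighter than && in C; the explicit parentheses below only make that visible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined by the patch in include/linux/relay.h. */
#define RCHAN_MODE_OVERWRITE 0x00000001UL /* 'flight' mode */
#define RCHAN_GLOBAL_BUFFER  0x00000002UL /* not using per-cpu */

/* Hypothetical stand-in for struct rchan; only the fields used here. */
struct demo_chan {
	unsigned long flags; /* relay flags for this channel */
	void *buf0;          /* plays the role of chan->buf[0] */
};

/* Mirrors the "flags & RCHAN_GLOBAL_BUFFER && buf[0]" test in the patch. */
static bool demo_uses_global_buffer(const struct demo_chan *chan)
{
	return (chan->flags & RCHAN_GLOBAL_BUFFER) && chan->buf0 != NULL;
}

/* True when the channel was opened in overwrite ('flight') mode. */
static bool demo_overwrites(const struct demo_chan *chan)
{
	return (chan->flags & RCHAN_MODE_OVERWRITE) != 0;
}
```

Because the mode is now fixed by the caller when the channel is opened, the create_buf_file() callback no longer needs an out-parameter to report it.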
* Re: [PATCH 3/3] relay - add channel flags 2008-09-23 5:27 ` [PATCH 3/3] relay - add channel flags Tom Zanussi @ 2008-09-23 20:20 ` Andrew Morton 2008-09-24 3:57 ` Tom Zanussi 0 siblings, 1 reply; 122+ messages in thread From: Andrew Morton @ 2008-09-23 20:20 UTC (permalink / raw) To: Tom Zanussi Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx, compudj, rostedt, od, fche, hch, dwilder On Tue, 23 Sep 2008 00:27:56 -0500 Tom Zanussi <zanussi@comcast.net> wrote: > Add channel flags to relay, remove global callback param. > > relay should probably have had a flags param from the beginning; it > wasn't originally added because it wasn't originally needed - it > probably would have helped avoid some of the callback contortions > that were added due to a lack of flags. This adds them and does a > small amount of low-hanging cleanup, and is also in preparation for > some new flags in future patches. > > Signed-off-by: Tom Zanussi <tzanussi@gmail.com> > > diff --git a/block/blktrace.c b/block/blktrace.c > index eb9651c..150c5f7 100644 > --- a/block/blktrace.c > +++ b/block/blktrace.c > @@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry) > static struct dentry *blk_create_buf_file_callback(const char *filename, > struct dentry *parent, > int mode, > - struct rchan_buf *buf, > - int *is_global) > + struct rchan_buf *buf) > { > return debugfs_create_file(filename, mode, parent, buf, > &relay_file_operations); > @@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, > goto err; > > bt->rchan = relay_open("trace", dir, buts->buf_size, > - buts->buf_nr, &blk_relay_callbacks, bt); > + buts->buf_nr, &blk_relay_callbacks, bt, 0UL); > if (!bt->rchan) > goto err; > > diff --git a/include/linux/relay.h b/include/linux/relay.h > index 52e4d61..18fd269 100644 > --- a/include/linux/relay.h > +++ b/include/linux/relay.h > @@ -25,7 +25,13 @@ > /* > * Tracks changes to rchan/rchan_buf structs > */ > -#define 
RELAYFS_CHANNEL_VERSION 7 > +#define RELAYFS_CHANNEL_VERSION 8 What is the significance of this change? Does it affect the kernel<->userspace interface? Is it back-compatible with existing userspace? > +/* > + * relay channel flags > + */ > +#define RCHAN_MODE_OVERWRITE 0x00000001 /* 'flight' mode */ > +#define RCHAN_GLOBAL_BUFFER 0x00000002 /* not using per-cpu */ > > > ... > > @@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) > buf->cpu = cpu; > __relay_reset(buf, 1); > > - if(chan->is_global) { > + if(chan->flags & RCHAN_GLOBAL_BUFFER) { Please use checkpatch. It's a little thing, but it's so easy to fix.. > chan->buf[0] = buf; > buf->cpu = 0; > } > > ... ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 3/3] relay - add channel flags 2008-09-23 20:20 ` Andrew Morton @ 2008-09-24 3:57 ` Tom Zanussi 0 siblings, 0 replies; 122+ messages in thread From: Tom Zanussi @ 2008-09-24 3:57 UTC (permalink / raw) To: Andrew Morton Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx, compudj, rostedt, od, fche, hch, dwilder Thanks for the comments on these patches - I'll include the changes in the next go-round. On Tue, 2008-09-23 at 13:20 -0700, Andrew Morton wrote: > On Tue, 23 Sep 2008 00:27:56 -0500 > Tom Zanussi <zanussi@comcast.net> wrote: > > > Add channel flags to relay, remove global callback param. > > > > relay should probably have had a flags param from the beginning; it > > wasn't originally added because it wasn't originally needed - it > > probably would have helped avoid some of the callback contortions > > that were added due to a lack of flags. This adds them and does a > > small amount of low-hanging cleanup, and is also in preparation for > > some new flags in future patches. 
> > > > Signed-off-by: Tom Zanussi <tzanussi@gmail.com> > > > > diff --git a/block/blktrace.c b/block/blktrace.c > > index eb9651c..150c5f7 100644 > > --- a/block/blktrace.c > > +++ b/block/blktrace.c > > @@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry) > > static struct dentry *blk_create_buf_file_callback(const char *filename, > > struct dentry *parent, > > int mode, > > - struct rchan_buf *buf, > > - int *is_global) > > + struct rchan_buf *buf) > > { > > return debugfs_create_file(filename, mode, parent, buf, > > &relay_file_operations); > > @@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, > > goto err; > > > > bt->rchan = relay_open("trace", dir, buts->buf_size, > > - buts->buf_nr, &blk_relay_callbacks, bt); > > + buts->buf_nr, &blk_relay_callbacks, bt, 0UL); > > if (!bt->rchan) > > goto err; > > > > diff --git a/include/linux/relay.h b/include/linux/relay.h > > index 52e4d61..18fd269 100644 > > --- a/include/linux/relay.h > > +++ b/include/linux/relay.h > > @@ -25,7 +25,13 @@ > > /* > > * Tracks changes to rchan/rchan_buf structs > > */ > > -#define RELAYFS_CHANNEL_VERSION 7 > > +#define RELAYFS_CHANNEL_VERSION 8 > > What is the significance of this change? Does it affect the > kernel<->userspace interface? Is it back-compatible with existing > userspace? > No, nothing to do with the kernel-userspace interface. The channel version is included in the channel struct and was meant as an aid in deciphering channel data in crash dumps. Tom > > > +/* > > + * relay channel flags > > + */ > > +#define RCHAN_MODE_OVERWRITE 0x00000001 /* 'flight' mode */ > > +#define RCHAN_GLOBAL_BUFFER 0x00000002 /* not using per-cpu */ > > > > > > ... > > > > @@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu) > > buf->cpu = cpu; > > __relay_reset(buf, 1); > > > > - if(chan->is_global) { > > + if(chan->flags & RCHAN_GLOBAL_BUFFER) { > > Please use checkpatch. 
It's a little thing, but it's so easy to fix.. > > > chan->buf[0] = buf; > > buf->cpu = 0; > > } > > > > ... ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-19 21:33 Unified tracing buffer Martin Bligh
` (3 preceding siblings ...)
2008-09-20 0:07 ` Peter Zijlstra
@ 2008-09-20 0:26 ` Marcel Holtmann
2008-09-20 9:03 ` Steven Rostedt
` (3 subsequent siblings)
8 siblings, 0 replies; 122+ messages in thread
From: Marcel Holtmann @ 2008-09-20 0:26 UTC (permalink / raw)
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

Hi Martin,

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.

I talked with Thomas and Steven about it during lunch and we might wanna
also use it for sniffing/monitoring/capturing efforts. Inaky and I
talked about unifying things like usbmon, hcidump, tcpdump etc. with a
common interface into the kernel. Currently every subsystem does it
differently. Especially when it comes to adding monitor support to SDIO,
we ran into the problem that we don't just wanna invent another
interface.

Right now I haven't spent much time on this and regret that I didn't sit
together with you guys yesterday, but it was not that high priority on
my list of nice things to have.

Anyway, since I am mostly interested in subsystems that copy a lot of
packets around, it would be nice to only reference the data, for example
for structures like SKBs and URBs, and only have to copy the data if we
have a consumer.

My thoughts :)

Regards

Marcel

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-19 21:33 Unified tracing buffer Martin Bligh
` (4 preceding siblings ...)
2008-09-20 0:26 ` Unified tracing buffer Marcel Holtmann
@ 2008-09-20 9:03 ` Steven Rostedt
2008-09-20 13:55 ` Mathieu Desnoyers
2008-09-22 9:57 ` Peter Zijlstra
2008-09-22 13:57 ` K.Prasad
` (2 subsequent siblings)
8 siblings, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20 9:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, od, Frank Ch. Eigler


Martin,

First I'd like to express my appreciation to you for writing this up.
Not only that, but for being the one person keeping us from killing
each other ;-)


On Fri, 19 Sep 2008, Martin Bligh wrote:

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
>
> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.

Yes, we kept the chairs on the floor the whole time.

> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
>
> Sketch of design. Enjoy flaming me. Code will follow shortly.
>
>
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.

I don't remember talking about the "spool to disk" for version 1.
We still want to do this? I thought we would have overwrite mode (flight
data recorder), and a "throw all new data away when the producer fills
the buffer before the consumer takes" mode.

>
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.

As stated in another email "8 byte aligned" words should be fine.

> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
>
>
> INPUT_FUNCTIONS
> ---------------
>
> allocate_buffer (name, size)
> return buffer_handle
>
> register_event (buffer_handle, event_id, print_function)
> You can pass in a requested event_id from a fixed set, and
> will be given it, or an error
> 0 means allocate me one dynamically
> returns event_id (or -E_ERROR)
>
> record_event (buffer_handle, event_id, length, *buf)

I was talking with Thomas about this, and we probably want (and I'm sure
Mathieu and others would agree), a...

event_handle = reserve_event(buffer_handle, event_id, length)

as well as a...

commit_event(event_handle).


Oh, and all commands should start with the namespace.

ring_buffer_alloc()
ring_buffer_free()
ring_buffer_record_event()

etc.

>
>
> OUTPUT
> ------
>
> Data will be output via debugfs, and provide the following output streams:
>
> /debugfs/tracing/<name>/buffers/text
> clear text stream (will merge the per-cpu streams via insertion
> sort, and use the print functions)
>
> /debugfs/tracing/<name>/buffers/binary[cpu_number]
> per-cpu binary data

Ah, I thought we were going to have:

/debugfs/tracing/buffers/<name>/<buffer crap>

and each tracer would have

/debugfs/tracing/<name>/<trace command crap>

This way we can easily see all the allocated buffers in one place
without having to see a tracer name first.

The reason I like the layout I propose is that a utility that needs to
read all the buffers doesn't need to go into directories that don't even
have buffers. Not all tracers will allocate a buffer.

>
>
> CONTROL
> -------
>
> Sysfs style tree under debugfs
>
> /debugfs/tracing/<name>/buffers/enabled <--- binary value
>
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
> etc ...

I wonder if we should make this another sub dir:

/debugfs/tracing/buffers/events/<event-name>

> provides a way to enable/disable events, see what's available, and
> what's enabled.
>
>
> KNOWN ISSUES / PLANS
> -------------------
>
> No way to unregister buffers and events.
> Will provide an unregister_buffer and unregister_event call

I can see registering events, but shouldn't we "allocate" buffers?

>
>
> Generating systemwide time is hard on some platforms
> Yes. Time-based output provides a lot of simplicity for the user though
> We won't support these platforms at first, we'll add functionality
> to make it work for them later.
> (plan based on tick-based ms timing, plus counter offset from that
> if needed).
>
> Spinlock_irq is inefficient, and doesn't support tracing in NMIs
> True. We'll implement a lockless scheme later (see lttng)
>
> Putting a length record in every event is inefficient
> True. Fixed record length with optional extensions is better, but
> more complex. v2.
>
> Putting a full timestamp rather than an offset in every event is inefficient
> See above. True, but v2.
>
> Relayfs already exists! use that!
> People were universally not keen on that idea. Complexity, interface, etc.
> We're also providing some higher level shared functions for time &
> event ids.
>
> There's no way to decode the binary data stream
> Code will be shared from the kernel to decode it, so that we can
> get the compact binary
> format and decode it later. That code will be kept in the kernel
> tree (it's a trivial piece of C).
> Version 1.1 ;-)
>

Sounds good,

Thanks!

-- Steve

^ permalink raw reply [flat|nested] 122+ messages in thread
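The reserve/commit split suggested above could look roughly like the following. This is a minimal userspace sketch, not the proposed kernel API: the demo_rb_* names, the record header layout and the "committed" field are all assumptions made for illustration, there is no locking, and a full buffer simply rejects new events (the "throw all new data away" mode) instead of wrapping.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_RB_ALIGN 8u /* "8 byte aligned" records, as discussed above */

/* Hypothetical record header: 8 bytes, followed by the payload. */
struct demo_event {
	uint16_t length;    /* payload length in bytes */
	uint16_t event_id;  /* 16-bit event id */
	uint32_t committed; /* 0 while reserved, 1 once committed */
	unsigned char payload[];
};

struct demo_rb {
	unsigned char *data; /* malloc()ed storage, suitably aligned */
	size_t size;         /* capacity in bytes */
	size_t head;         /* offset of the next free byte */
};

static int demo_rb_init(struct demo_rb *rb, size_t size)
{
	rb->data = malloc(size);
	rb->size = rb->data ? size : 0;
	rb->head = 0;
	return rb->data ? 0 : -1;
}

static void demo_rb_free(struct demo_rb *rb)
{
	free(rb->data);
	rb->data = NULL;
	rb->size = rb->head = 0;
}

/* Reserve room for one event; returns NULL when the buffer is full. */
static struct demo_event *demo_rb_reserve(struct demo_rb *rb,
					  uint16_t event_id, uint16_t length)
{
	size_t need = sizeof(struct demo_event) + length;
	struct demo_event *ev;

	/* Round the record size up to the next multiple of 8 bytes. */
	need = (need + DEMO_RB_ALIGN - 1) & ~(size_t)(DEMO_RB_ALIGN - 1);
	if (need > rb->size - rb->head)
		return NULL;

	ev = (struct demo_event *)(rb->data + rb->head);
	ev->length = length;
	ev->event_id = event_id;
	ev->committed = 0;
	rb->head += need;
	return ev;
}

/* Mark a previously reserved event as complete and visible to readers. */
static void demo_rb_commit(struct demo_event *ev)
{
	ev->committed = 1;
}

/* record_event() expressed as reserve + copy + commit. */
static int demo_rb_record(struct demo_rb *rb, uint16_t event_id,
			  const void *buf, uint16_t length)
{
	struct demo_event *ev = demo_rb_reserve(rb, event_id, length);

	if (!ev)
		return -1;
	memcpy(ev->payload, buf, length);
	demo_rb_commit(ev);
	return 0;
}
```

A caller that can build its payload in place would call demo_rb_reserve(), write directly into ev->payload, and then call demo_rb_commit(), which avoids the intermediate copy that record_event() implies.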
* Re: Unified tracing buffer 2008-09-20 9:03 ` Steven Rostedt @ 2008-09-20 13:55 ` Mathieu Desnoyers 2008-09-20 14:12 ` Arjan van de Ven 2008-09-22 3:09 ` KOSAKI Motohiro 2008-09-22 9:57 ` Peter Zijlstra 1 sibling, 2 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-20 13:55 UTC (permalink / raw) To: Steven Rostedt Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler * Steven Rostedt (rostedt@goodmis.org) wrote: > > > Martin, > > First I like to express my appreciation to you for writing this up. Not > only that, but being the one person from keeping us from killing each > other ;-) > > > On Fri, 19 Sep 2008, Martin Bligh wrote: > > > During kernel summit and Plumbers conference, Linus and others > > expressed a desire for a unified > > tracing buffer system for multiple tracing applications (eg ftrace, > > lttng, systemtap, blktrace, etc) to use. > > This provides several advantages, including the ability to interleave > > data from multiple sources, > > not having to learn 200 different tools, duplicated code/effort, etc. > > > > Several of us got together last night and tried to cut this down to > > the simplest usable system > > we could agree on (and nobody got hurt!). This will form version 1. > > Yes, we kept the chairs on the floor the whole time. > Yes, they were too heavy. ;) > > I've sketched out a few > > enhancements we know that we want, but have agreed to leave these > > until version 2. > > The answer to most questions about the below is "yes we know, we'll > > fix that in version 2" > > (or 3). Simplicity was the rule ... > > > > Sketch of design. Enjoy flaming me. Code will follow shortly. > > > > > > STORAGE > > ------- > > > > We will support multiple buffers for different tracing systems, with > > separate names, event id spaces. > > Event ids are 16 bit, dynamically allocated. 
> > A "one line of text" print function will be provided for each event, > > or use the default (probably hex printf) > > Will provide a "flight data recorder" mode, and a "spool to disk" mode. > > I don't remember talking about the "spool to disk" for version 1. > We still want to do this? I thought we would have overwrite mode (flight > data record), and a "throw all new data away when the producer fills the > buffer before the consumer takes" mode. > Yes, I think the spool to disk mode will be the default mode needed by a big amount people who want to stream data out continuously. The flight recorder is needed mostly for event backlog analysis. I think we have to provide both. > > > > Circular buffer per cpu, protected by per-cpu spinlock_irq > > Word aligned records. > > As stated in another email "8 byte aligned" words should be fine. > It's also easy to be sizeof(void *) aligned, as long as we export sizeof(void *) in the buffer header so we keep portability. But we can keep that for v2. It's also good to write a magic number in the trace header to auto-detect endianness. > > Variable record length, header will start with length record. > > Timestamps in fixed timebase, monotonically increasing (across all CPUs) > > > > > > INPUT_FUNCTIONS > > --------------- > > > > allocate_buffer (name, size) > > return buffer_handle > > > > register_event (buffer_handle, event_id, print_function) > > You can pass in a requested event_id from a fixed set, and > > will be given it, or an error > > 0 means allocate me one dynamically > > returns event_id (or -E_ERROR) > > > > record_event (buffer_handle, event_id, length, *buf) > > I was talking with Thomas about this, and we probably want (and I'm sure > Mathieu and others would agree), a... > > event_handle = reserve_event(buffer_handle, event_id, length) > > as well as a.. > > comit_event(event_handle). 
> How about : trace_mark(ftrace_evname, "size %lu binary %pW", sizeof(mystruct), mystruct); or trace_mark(sched_wakeup, "target_pid %ld", task->pid); Note the namespacing with buffers being "ftrace" and "sched" here. That would encapsulate the whole - Event ID registration - Event type registration - Sending data out - Enabling the event source directly at the source We can then export the markers through a debugfs file and let userland enable them one by one and possibly connect systemtap filters on them (one table of registered filters, one table for the markers, a command file to connect/disconnect filters to/from markers). > > Oh, and all commands should start with the namespace. > > ring_buffer_alloc() > ring_buffer_free() > ring_buffer_record_event() > We could even rename markers if required, I don't really care. e.g. : trace_mark -> ring_buffer_record_event() but note that this would contain all the event ID registration. > etc. > > > > > > > OUTPUT > > ------ > > > > Data will be output via debugfs, and provide the following output streams: > > > > /debugfs/tracing/<name>/buffers/text > > clear text stream (will merge the per-cpu streams via insertion > > sort, and use the print functions) > > > > /debugfs/tracing/<name>/buffers/binary[cpu_number] > > per-cpu binary data > > Ah, I thought we were going to have: > > /debugfs/tracing/buffers/<name>/<buffer crap> > > and each tracer have > > /debugfs/tracing/<name>/<trace command crap> > > This way we can easily see all the buffers in one place that are allocated > without having to see a tracer name first. > > The reason I like the way I propose, is that a utility that needs to read > all the buffers, doesn't need to go into directories that don't even have > buffers. Not all tracers will allocate a buffer. > people can still do ls debugfs/tracing/*/buffers/. But yes, we did agree on having the buffers/ subdir outside of the "trace command crap". 
It makes the buffers easier to see in the directory tree, and makes it clear that those buffers can be used by other users than the actual tracer this controls their input. > > > > > > > CONTROL > > ------- > > > > Sysfs style tree under debugfs > > > > /debugfs/tracing/<name>/buffers/enabed <--- binary value > > > > /debugfs/tracing/<name>/<event1> > > /debugfs/tracing/<name>/<event2> > > etc ... > > I wonder if we should make this another sub dir: > > /debugfs/tracing/buffers/events/<event-name> > Sure. If needed, we could change the markers to take two separate parameters : trace_mark(tracer_name, event_name, "format", args) Mathieu > > > provides a way to enable/disable events, see what's available, and > > what's enabled. > > > > > > KNOWN ISSUES / PLANS > > ------------------- > > > > No way to unregister buffers and events. > > Will provide an unregister_buffer and unregister_event call > > I can see registering events, but shouldn't we "allocate" buffers? > > > > > > > Generating systemwide time is hard on some platforms > > Yes. Time-based output provides a lot of simplicity for the user though > > We won't support these platforms at first, we'll add functionality > > to make it work for them later. > > (plan based on tick-based ms timing, plus counter offset from that > > if needed). > > > > Spinlock_irq is ineffecient, and doesn't support tracing in NMIs > > True. We'll implement a lockless scheme later (see lttng) > > > > Putting a length record in every event is inefficient > > True. Fixed record length with optional extensions is better, but > > more complex. v2. > > > > Putting a full timestamp rather than an offset in every event is inefficient > > See above. True, but v2. > > > > Relayfs already exists! use that! > > People were universally not keen on that idea. Complexity, interface, etc. > > We're also providing some higher level shared functions for time & > > event ids. 
> > > > There's no way to decode the binary data stream > > Code will be shared from the kernel to decode it, so that we can > > get the compact binary > > format and decode it later. That code will be kept in the kernel > > tree (it's a trivial piece of C). > > Version 1.1 ;-) > > > > Sounds good, > > Thanks! > > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
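One way the magic number and pointer-size fields mentioned above might be used by a decoder. This is only a sketch: the header layout, the DEMO_TRACE_MAGIC value and the demo_* names are invented here and do not come from lttng or from any posted patch.

```c
#include <stdint.h>
#include <string.h>

/* Arbitrary value for this sketch; it must differ from its byte-swapped form. */
#define DEMO_TRACE_MAGIC 0x54524342u

/* Hypothetical per-buffer header written once by the producer. */
struct demo_trace_header {
	uint32_t magic;        /* stored in the producer's native byte order */
	uint8_t  pointer_size; /* sizeof(void *) on the producer */
	uint8_t  reserved[3];
};

enum demo_order { DEMO_ORDER_SAME, DEMO_ORDER_SWAPPED, DEMO_ORDER_UNKNOWN };

/* Reverse the four bytes of a 32-bit value. */
static uint32_t demo_swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Producer side: fill in the header in native byte order. */
static void demo_write_header(struct demo_trace_header *h)
{
	memset(h, 0, sizeof(*h));
	h->magic = DEMO_TRACE_MAGIC;
	h->pointer_size = (uint8_t)sizeof(void *);
}

/* Consumer side: compare the magic against both byte orders. */
static enum demo_order demo_detect_order(const struct demo_trace_header *h)
{
	if (h->magic == DEMO_TRACE_MAGIC)
		return DEMO_ORDER_SAME;
	if (demo_swab32(h->magic) == DEMO_TRACE_MAGIC)
		return DEMO_ORDER_SWAPPED;
	return DEMO_ORDER_UNKNOWN;
}
```

A decoder that sees DEMO_ORDER_SWAPPED would byte-swap every multi-byte field it reads afterwards; pointer_size is a single byte, so it can be read before the byte order is known.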
* Re: Unified tracing buffer
2008-09-20 13:55 ` Mathieu Desnoyers
@ 2008-09-20 14:12 ` Arjan van de Ven
2008-09-22 18:52 ` Mathieu Desnoyers
2008-09-22 3:09 ` KOSAKI Motohiro
1 sibling, 1 reply; 122+ messages in thread
From: Arjan van de Ven @ 2008-09-20 14:12 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

On Sat, 20 Sep 2008 09:55:48 -0400
Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:

> How about :
>
> trace_mark(ftrace_evname, "size %lu binary %pW",
>            sizeof(mystruct), mystruct);
> or
> trace_mark(sched_wakeup, "target_pid %ld", task->pid);
>
> Note the namespacing with buffers being "ftrace" and "sched" here.
>
> That would encapsulate the whole
> - Event ID registration
> - Event type registration
> - Sending data out
> - Enabling the event source directly at the source
>
> We can then export the markers through a debugfs file and let userland
> enable them one by one and possibly connect systemtap filters on them
> (one table of registered filters, one table for the markers, a command
> file to connect/disconnect filters to/from markers).

I would like to ask for the following from the start: have a field for
a longer description of the marker that describes its usage and
context. Getting this there from the start is critical, because only
when adding the marker point do people still really remember why/what
(and having to type a good description also helps them to realize if
this is the right point or not). This can then be exposed to the user
so he has a standing chance of knowing what the marker is about.

It also has a standing chance of being updated when the code changes
this way.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

^ permalink raw reply [flat|nested] 122+ messages in thread
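A small sketch of what a per-marker description field could look like. The struct, the table and demo_marker_describe() are hypothetical; the marker name and format string are taken from the trace_mark() example quoted above, while the description text is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical marker record carrying a human-readable description. */
struct demo_marker {
	const char *name;        /* e.g. "sched_wakeup" */
	const char *format;      /* printf-style format of the payload */
	const char *description; /* usage and context, written with the marker */
};

static const struct demo_marker demo_markers[] = {
	{
		"sched_wakeup",
		"target_pid %ld",
		/* Made-up wording, for illustration only. */
		"A task is being woken up; target_pid is the pid of that task.",
	},
};

/* Return the description for a marker name, or NULL if it is unknown. */
static const char *demo_marker_describe(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_markers) / sizeof(demo_markers[0]); i++)
		if (strcmp(demo_markers[i].name, name) == 0)
			return demo_markers[i].description;
	return NULL;
}
```

Keeping the description in the same initializer as the name and format means a patch that changes the marker has the text right in front of it, which is the point made above about it having a chance of staying up to date.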
* Re: Unified tracing buffer
2008-09-20 14:12 ` Arjan van de Ven
@ 2008-09-22 18:52 ` Mathieu Desnoyers
2008-10-02 15:28 ` Jason Baron
0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-22 18:52 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Steven Rostedt, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

* Arjan van de Ven (arjan@infradead.org) wrote:
> On Sat, 20 Sep 2008 09:55:48 -0400
> Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:
>
> > How about :
> >
> > trace_mark(ftrace_evname, "size %lu binary %pW",
> >            sizeof(mystruct), mystruct);
> > or
> > trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> >
> > Note the namespacing with buffers being "ftrace" and "sched" here.
> >
> > That would encapsulate the whole
> > - Event ID registration
> > - Event type registration
> > - Sending data out
> > - Enabling the event source directly at the source
> >
> > We can then export the markers through a debugfs file and let userland
> > enable them one by one and possibly connect systemtap filters on them
> > (one table of registered filters, one table for the markers, a command
> > file to connect/disconnect filters to/from markers).
>
> I would like to ask for the following from the start: have a field for
> a longer description of the marker that describes its usage and
> context. Getting this there from the start is critical, because only
> when adding the marker point do people still really remember why/what
> (and having to type a good description also helps them to realize if
> this is the right point or not). This can then be exposed to the user
> so he has a standing chance of knowing what the marker is about.
>
> It also has a standing chance of being updated when the code changes
> this way
>

I agree, and I think it might be required in both markers and
tracepoints.

Given that tracepoints are declared in a global header
(DECLARE_TRACE()), I would add this kind of description here. Tracepoint
uses within the kernel code (statements like :
trace_sched_switch(prev, next);
added to the scheduler) would therefore be tied to the description
without having to contain it in the core kernel code.

Markers, on the other hand, could become the "event description"
interface which is exported to userspace. Considering that, I guess it's
as important to let a precise description follow the markers.

Mathieu

> --
> Arjan van de Ven Intel Open Source Technology Centre
> For development, discussion and tips for power savings,
> visit http://www.lesswatts.org
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-22 18:52 ` Mathieu Desnoyers @ 2008-10-02 15:28 ` Jason Baron 2008-10-03 16:11 ` Mathieu Desnoyers 0 siblings, 1 reply; 122+ messages in thread From: Jason Baron @ 2008-10-02 15:28 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler On Mon, Sep 22, 2008 at 02:52:09PM -0400, Mathieu Desnoyers wrote: > > On Sat, 20 Sep 2008 09:55:48 -0400 > > Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote: > > > > > How about : > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > sizeof(mystruct), mystruct); > > > or > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > That would encapsulate the whole > > > - Event ID registration > > > - Event type registration > > > - Sending data out > > > - Enabling the event source directly at the source > > > > > > We can then export the markers through a debugfs file and let userland > > > enable them one by one and possibly connect systemtap filters on them > > > (one table of registered filters, one table for the markers, a command > > > file to connect/disconnect filters to/from markers). > > > > I would like to ask for the following from the start: have a field for > > a longer description of the marker that describes it's usage and > > context. Getting this there from the start is critical, because only > > when adding the marker point do people still really remember why/what > > (and having to type a good description also helps them to realize if > > this is the right point or not). This can then be exposed to the user > > so he has a standing chance of knowing what the marker is about. > > > > It also has a standing chance of being updated when the code changes > > this way > > > > I agree, and I think it might be required in both markers and > tracepoints. 
>
> Given that tracepoints are declared in a global header
> (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> uses within the kernel code (statements like :
> trace_sched_switch(prev, next);
> added to the scheduler) would therefore be tied to the description
> without having to contain it in the core kernel code.
>
> Markers, on the other hand, could become the "event description"
> interface which is exported to userspace. Considering that, I guess it's
> as important to let a precise description follow the markers.
>
> Mathieu
>
>

hi,

Tracepoints and markers seem to both have their place, with tracepoints
being integral to kernel users, and markers being important for
userspace. However, it seems to me like there is overlap in the
code and an extra level of indirection when markers are layered on
tracepoints. Could they be merged a bit more?

What if we extended DEFINE_TRACE() to also create a
'set_marker(marker_cb)' function where 'marker_cb' has the function signature:

marker_cb(<tracepoint prototype>, *marker_probe_func);

We then also create a 'register_marker_##name' function in DEFINE_TRACE(),
which allows one to register marker callbacks in the usual way.

The 'marker_cb' function is then called in '__DO_TRACE' if anybody has
registered a marker (which can set the tracepoint.state appropriately).

The 'marker_cb' function then marshalls its arguments and passes them
through to the marker functions that were registered.

I think in this way we can simplify the tracepoints and markers by
combining them to a large extent.

thanks,

-Jason
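A minimal user-space sketch of the proposal in the message above, assuming a single tracepoint (sched_switch) and a fixed-size probe table. Every identifier here (the struct tracepoint layout, register_marker_sched_switch, capture_probe, MAX_MARKERS) is hypothetical and only illustrates the shape of the idea. It is not the kernel's tracepoint or marker code, and it leaves out locking entirely.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* All names below are hypothetical; this is not the kernel's tracepoint code. */

/* A marker probe receives private data, a format string and the values. */
typedef void (*marker_probe_func)(void *priv, const char *fmt, ...);

struct task { int pid; };

#define MAX_MARKERS 4

struct marker_reg {
	marker_probe_func probe;
	void *priv;
};

/* Per-tracepoint state: enabled flag plus the registered marker probes. */
struct tracepoint {
	int state;
	int nr_markers;
	struct marker_reg markers[MAX_MARKERS];
};

struct tracepoint tp_sched_switch;

/* marker_cb: marshals the typed tracepoint arguments into marker form. */
void sched_switch_marker_cb(struct task *prev, struct task *next,
			    marker_probe_func probe, void *priv)
{
	probe(priv, "prev_pid %d next_pid %d", prev->pid, next->pid);
}

/* What a generated register_marker_##name might look like. */
int register_marker_sched_switch(marker_probe_func probe, void *priv)
{
	struct tracepoint *tp = &tp_sched_switch;

	assert(probe != NULL);
	if (tp->nr_markers == MAX_MARKERS)
		return -1;
	tp->markers[tp->nr_markers].probe = probe;
	tp->markers[tp->nr_markers].priv = priv;
	tp->nr_markers++;
	tp->state = 1;	/* enable the call site */
	return 0;
}

/* Call site: does nothing unless a marker has been registered. */
void trace_sched_switch(struct task *prev, struct task *next)
{
	struct tracepoint *tp = &tp_sched_switch;
	int i;

	if (!tp->state)
		return;
	for (i = 0; i < tp->nr_markers; i++)
		sched_switch_marker_cb(prev, next, tp->markers[i].probe,
				       tp->markers[i].priv);
}

/* Example probe: records the two pids instead of parsing the format. */
struct capture { int calls, prev_pid, next_pid; };

void capture_probe(void *priv, const char *fmt, ...)
{
	struct capture *c = priv;
	va_list ap;

	va_start(ap, fmt);
	c->prev_pid = va_arg(ap, int);
	c->next_pid = va_arg(ap, int);
	va_end(ap);
	c->calls++;
}
```

Registering capture_probe and then calling trace_sched_switch() runs the marshalling callback once per registered probe. With nothing registered, the call site returns right after the state check.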
* Re: Unified tracing buffer 2008-10-02 15:28 ` Jason Baron @ 2008-10-03 16:11 ` Mathieu Desnoyers 2008-10-03 18:37 ` Jason Baron 0 siblings, 1 reply; 122+ messages in thread From: Mathieu Desnoyers @ 2008-10-03 16:11 UTC (permalink / raw) To: Jason Baron Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler * Jason Baron (jbaron@redhat.com) wrote: > On Mon, Sep 22, 2008 at 02:52:09PM -0400, Mathieu Desnoyers wrote: > > > On Sat, 20 Sep 2008 09:55:48 -0400 > > > Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote: > > > > > > > How about : > > > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > > sizeof(mystruct), mystruct); > > > > or > > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > > > That would encapsulate the whole > > > > - Event ID registration > > > > - Event type registration > > > > - Sending data out > > > > - Enabling the event source directly at the source > > > > > > > > We can then export the markers through a debugfs file and let userland > > > > enable them one by one and possibly connect systemtap filters on them > > > > (one table of registered filters, one table for the markers, a command > > > > file to connect/disconnect filters to/from markers). > > > > > > I would like to ask for the following from the start: have a field for > > > a longer description of the marker that describes it's usage and > > > context. Getting this there from the start is critical, because only > > > when adding the marker point do people still really remember why/what > > > (and having to type a good description also helps them to realize if > > > this is the right point or not). This can then be exposed to the user > > > so he has a standing chance of knowing what the marker is about. 
> > >
> > > It also has a standing chance of being updated when the code changes
> > > this way
> > >
> >
> > I agree, and I think it might be required in both markers and
> > tracepoints.
> >
> > Given that tracepoints are declared in a global header
> > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > uses within the kernel code (statements like :
> > trace_sched_switch(prev, next);
> > added to the scheduler) would therefore be tied to the description
> > without having to contain it in the core kernel code.
> >
> > Markers, on the other hand, could become the "event description"
> > interface which is exported to userspace. Considering that, I guess it's
> > as important to let a precise description follow the markers.
> >
> > Mathieu
> >
> >
>
> hi,
>
> Tracepoints and markers seem to both have their place, with tracepoints
> being integral to kernel users, and markers being important for
> userspace. However, it seems to me like there is overlap in the
> code and an extra level of indirection when markers are layered on
> tracepoints. Could they be merged a bit more?
>
> What if we extended DEFINE_TRACE() to also create a
> 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
>
> marker_cb(<tracepoint prototype>, *marker_probe_func);
>
> We then also create a 'register_marker_##name' function in DEFINE_TRACE(),
> which allows one to register marker callbacks in the usual way.
>
> The 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> registered a marker (which can set the tracepoint.state appropriately).
>
> The 'marker_cb' function then marshalls its arguments and passes them
> through to the marker functions that were registered.
>
> I think in this way we can simplify the tracepoints and markers by
> combining them to a large extent.
>
> thanks,
>
> -Jason
>

I think what you propose here is already in my LTTng tree in a different
form.
It's a patch to markers to allow declaring a marker which enables
an associated tracepoint when enabled. This way, we can have a marker
(exposed to userspace) connecting itself automatically to a tracepoint
when enabled.

It's here :
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09

Note that locking depends on the psrwlock patch so we can have nested
module list readers. Otherwise locking becomes _really_ messy. :-(

Mathieu

>
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
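The direction of the dependency described in this message (a marker switches on its tracepoint, never the reverse) can be sketched roughly as follows. This is not the patch at the URL above: the structures, the function names and the reference count are all assumptions made for illustration, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structures; the real patch lives in the LTTng tree. */
struct tracepoint {
	const char *name;
	int refcount;	/* assumed: enabled markers relying on this tracepoint */
};

struct marker {
	const char *name;
	int enabled;
	struct tracepoint *tp;	/* associated tracepoint, or NULL */
};

/* The tracepoint knows nothing about markers; it only exposes its state. */
int tracepoint_active(const struct tracepoint *tp)
{
	return tp->refcount > 0;
}

/* Enabling a marker also takes a reference on its tracepoint. */
void marker_enable(struct marker *m)
{
	assert(m != NULL);
	if (m->enabled)
		return;
	m->enabled = 1;
	if (m->tp)
		m->tp->refcount++;
}

/* Disabling drops that reference; the last one turns the tracepoint off. */
void marker_disable(struct marker *m)
{
	assert(m != NULL);
	if (!m->enabled)
		return;
	m->enabled = 0;
	if (m->tp)
		m->tp->refcount--;
}
```

The reference count is one possible way to let several markers share one tracepoint. The message itself does not say how the patch handles that case.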
* Re: Unified tracing buffer 2008-10-03 16:11 ` Mathieu Desnoyers @ 2008-10-03 18:37 ` Jason Baron 2008-10-03 19:10 ` Mathieu Desnoyers 0 siblings, 1 reply; 122+ messages in thread From: Jason Baron @ 2008-10-03 18:37 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote: > > > > > How about : > > > > > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > > > sizeof(mystruct), mystruct); > > > > > or > > > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > > > > > That would encapsulate the whole > > > > > - Event ID registration > > > > > - Event type registration > > > > > - Sending data out > > > > > - Enabling the event source directly at the source > > > > > > > > > > We can then export the markers through a debugfs file and let userland > > > > > enable them one by one and possibly connect systemtap filters on them > > > > > (one table of registered filters, one table for the markers, a command > > > > > file to connect/disconnect filters to/from markers). > > > > > > > > I would like to ask for the following from the start: have a field for > > > > a longer description of the marker that describes it's usage and > > > > context. Getting this there from the start is critical, because only > > > > when adding the marker point do people still really remember why/what > > > > (and having to type a good description also helps them to realize if > > > > this is the right point or not). This can then be exposed to the user > > > > so he has a standing chance of knowing what the marker is about. 
> > > > > > > > It also has a standing chance of being updated when the code changes > > > > this way > > > > > > > > > > I agree, and I think it might be required in both markers and > > > tracepoints. > > > > > > Given that tracepoints are declared in a global header > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint > > > uses within the kernel code (statements like : > > > trace_sched_switch(prev, next); > > > added to the scheduler) would therefore be tied to the description > > > without having to contain it in the core kernel code. > > > > > > Markers, on the other hand, could become the "event description" > > > interface which is exported to userspace. Considering that, I guess it's > > > as important to let a precise description follow the markers. > > > > > > Mathieu > > > > > > > > > > hi, > > > > Tracepoints and markers seem to both have their place, with tracepoints > > being integral to kernel users, and markers being important for > > userspace. However, it seems to me like there is overlap in the > > code and an extra level of indirection when markers are layered on > > tracespoints. could they be merged a bit more? > > > > What if we extended DEFINE_TRACE() to also create a > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature: > > > > marker_cb(<tracepoint prototype>, *marker_probe_func); > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(), > > which allows one to regiser marker callbacks in the usual way. > > > > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has > > registered a marker (which can set the tracepoint.state appropriately). > > > > The 'marker_cb' function then marshalls its arguemnts and passes them > > through to the marker functions that were registered. > > > > I think in this way we can simplify the tracepoints and markers by > > combining them to a large extent. 
> >
> > thanks,
> >
> > -Jason
> >
>
> I think what you propose here is already in my LTTng tree in a different
> form. It's a patch to markers to allow declaring a marker which enables
> an associated tracepoint when enabled. This way, we can have a marker
> (exposed to userspace) connecting itself automatically to a tracepoint
> when enabled.
>
> It's here :
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09
>
> Note that locking depends on the psrwlock patch so we can have nested
> module list readers. Otherwise locking becomes _really_ messy. :-(
>
> Mathieu
>

That patch simplifies using markers with tracepoints and couples
markers and tracepoints much more closely. But I was proposing to make
the coupling tighter...

Couldn't 'marker_probe_register()' register the marker directly with
the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
which references a marker callback function. That function would look
like (very loose C code):

marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
private_data)
{
	probe(private_data, "%arg1 %arg2", arg1->a, arg2->b);
}

The 'marker_blah_callback()' would be invoked from within DO_TRACE() for
each marker that has been registered with the associated tracepoint. In
a similar way to how we iterate over the tracepoint callbacks, we can
iterate over the registered markers and pass them to the
'marker_blah_callback()' function.

By associating the marker_blah_callback() in DEFINE_TRACE(), we only
need to look in one file to understand what is associated with a
particular tracepoint. I think marker.c and tracepoint.c could also be
consolidated at that point.

thanks,

-Jason
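The "very loose C code" above can be turned into something that compiles, under a few stated assumptions. DEFINE_TRACE_SKETCH, the struct foo/struct bar argument types, the fixed-size table and capture_probe are all hypothetical. TPPROTO and TPARGS are defined here as plain pass-through macros for the sketch. The real DEFINE_TRACE()/DO_TRACE() macros are not reproduced, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; names mirror the thread but this is not kernel code. */
typedef void (*marker_probe_func)(void *priv, const char *fmt, ...);

#define TPPROTO(...)	__VA_ARGS__
#define TPARGS(...)	__VA_ARGS__
#define MAX_MARKERS	4

struct marker_reg {
	marker_probe_func probe;
	void *priv;
};

/*
 * Generates, for tracepoint 'name':
 *   register_marker_<name>()  adds a marker probe to the tracepoint
 *   trace_<name>()            calls 'cb' once per registered marker
 */
#define DEFINE_TRACE_SKETCH(name, proto, args, cb)			\
	static struct marker_reg name##_markers[MAX_MARKERS];		\
	static int name##_nr_markers;					\
	int register_marker_##name(marker_probe_func probe, void *priv)	\
	{								\
		if (name##_nr_markers == MAX_MARKERS)			\
			return -1;					\
		name##_markers[name##_nr_markers].probe = probe;	\
		name##_markers[name##_nr_markers].priv = priv;		\
		name##_nr_markers++;					\
		return 0;						\
	}								\
	void trace_##name(proto)					\
	{								\
		int i;							\
		for (i = 0; i < name##_nr_markers; i++)			\
			cb(args, name##_markers[i].probe,		\
			   name##_markers[i].priv);			\
	}

struct foo { int a; };
struct bar { int b; };

/* The per-tracepoint marshalling callback from the message above. */
void marker_blah_callback(struct foo *arg1, struct bar *arg2,
			  marker_probe_func probe, void *private_data)
{
	probe(private_data, "a %d b %d", arg1->a, arg2->b);
}

DEFINE_TRACE_SKETCH(blah, TPPROTO(struct foo *arg1, struct bar *arg2),
		    TPARGS(arg1, arg2), marker_blah_callback)

/* Example marker probe: counts calls and remembers the format string. */
struct capture { int calls; const char *fmt; };

void capture_probe(void *priv, const char *fmt, ...)
{
	struct capture *c = priv;

	assert(c != NULL);
	c->calls++;
	c->fmt = fmt;
}
```

Everything tied to the "blah" tracepoint (prototype, argument list, marshalling callback) appears in the one DEFINE_TRACE_SKETCH() invocation, which is the "only need to look in one file" point made above.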
* Re: Unified tracing buffer 2008-10-03 18:37 ` Jason Baron @ 2008-10-03 19:10 ` Mathieu Desnoyers 2008-10-03 19:25 ` Jason Baron 2008-10-03 21:52 ` Frank Ch. Eigler 0 siblings, 2 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-10-03 19:10 UTC (permalink / raw) To: Jason Baron Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler * Jason Baron (jbaron@redhat.com) wrote: > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote: > > > > > > How about : > > > > > > > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > > > > sizeof(mystruct), mystruct); > > > > > > or > > > > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > > > > > > > That would encapsulate the whole > > > > > > - Event ID registration > > > > > > - Event type registration > > > > > > - Sending data out > > > > > > - Enabling the event source directly at the source > > > > > > > > > > > > We can then export the markers through a debugfs file and let userland > > > > > > enable them one by one and possibly connect systemtap filters on them > > > > > > (one table of registered filters, one table for the markers, a command > > > > > > file to connect/disconnect filters to/from markers). > > > > > > > > > > I would like to ask for the following from the start: have a field for > > > > > a longer description of the marker that describes it's usage and > > > > > context. Getting this there from the start is critical, because only > > > > > when adding the marker point do people still really remember why/what > > > > > (and having to type a good description also helps them to realize if > > > > > this is the right point or not). This can then be exposed to the user > > > > > so he has a standing chance of knowing what the marker is about. 
> > > > > > > > > > It also has a standing chance of being updated when the code changes > > > > > this way > > > > > > > > > > > > > I agree, and I think it might be required in both markers and > > > > tracepoints. > > > > > > > > Given that tracepoints are declared in a global header > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint > > > > uses within the kernel code (statements like : > > > > trace_sched_switch(prev, next); > > > > added to the scheduler) would therefore be tied to the description > > > > without having to contain it in the core kernel code. > > > > > > > > Markers, on the other hand, could become the "event description" > > > > interface which is exported to userspace. Considering that, I guess it's > > > > as important to let a precise description follow the markers. > > > > > > > > Mathieu > > > > > > > > > > > > > > hi, > > > > > > Tracepoints and markers seem to both have their place, with tracepoints > > > being integral to kernel users, and markers being important for > > > userspace. However, it seems to me like there is overlap in the > > > code and an extra level of indirection when markers are layered on > > > tracespoints. could they be merged a bit more? > > > > > > What if we extended DEFINE_TRACE() to also create a > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature: > > > > > > marker_cb(<tracepoint prototype>, *marker_probe_func); > > > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(), > > > which allows one to regiser marker callbacks in the usual way. > > > > > > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has > > > registered a marker (which can set the tracepoint.state appropriately). > > > > > > The 'marker_cb' function then marshalls its arguemnts and passes them > > > through to the marker functions that were registered. 
> > > > > > I think in this way we can simplify the tracepoints and markers by > > > combining them to a large extent. > > > > > > thanks, > > > > > > -Jason > > > > > > > I think what you propose here is already in y LTTng tree in a different > > form. It's a patch to markers to allow declaring a marker which enables > > an associated tracepoint when enabled. This way, we can have a marker > > (exposed to userspace) connecting itself automatically to a tracepoint > > when enabled. > > > > It's here : > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09 > > > > Note that locking depends on the psrwlock patch so we can have nested > > module list readers. Otherwise locking becomes _really_ messy. :-( > > > > Mathieu > > > > That patch simplifies using markers with tracepoints and couples > markers and tracepoints much more closely. But I was proposing to make > the coupling tighter... > > Couldn't 'marker_probe_register()' register the marker directly with > the tracepoint callsite? Have DEFINE_TRACE() take an additional argument > which references a marker callback funtion. That function would look > like (very loose C code): > > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe, I don't want the tracepoints to be coupled with markers (which are a userspace API). The other way around is fine : letting a marker automatically enable a tracepoint makes sense, but the opposite would tie the in-kernel API (tracepoint) to the external marker representation, and I would like to avoid that. And how do you plan to deal with : TPPROTO(arg1, arg2) == void ? C won't let you define stuff like : blah(void, marker_probe_func *probe, void *private_data) The devil is in the details.... 
;) Mathieu > private_data) > { > probe(private_data, "%arg1 %arg2", arg1->a, arg2->b); > } > > The 'marker_blah_callback()' would be invoked from within DO_TRACE() for > each marker that has been registered with the associated tracepoint, in > a similar way to how we iterate over the tracepoint callbacks, we can > iterate over the registered markers and pass them to the > 'marker_blah_callback()' function. > > By associating the marker_blah_callback() in DEFINE_TRACE(), we only > need to look in one file to understand what is associated with a > particular tracepoint. I think marker.c and tracepoint.c could also be > consolidated at that point. > > thanks, > > -Jason > > > > > > > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-10-03 19:10 ` Mathieu Desnoyers @ 2008-10-03 19:25 ` Jason Baron 2008-10-03 19:56 ` Mathieu Desnoyers 2008-10-03 21:52 ` Frank Ch. Eigler 1 sibling, 1 reply; 122+ messages in thread From: Jason Baron @ 2008-10-03 19:25 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote: > * Jason Baron (jbaron@redhat.com) wrote: > > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote: > > > > > > > How about : > > > > > > > > > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > > > > > sizeof(mystruct), mystruct); > > > > > > > or > > > > > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > > > > > > > > > That would encapsulate the whole > > > > > > > - Event ID registration > > > > > > > - Event type registration > > > > > > > - Sending data out > > > > > > > - Enabling the event source directly at the source > > > > > > > > > > > > > > We can then export the markers through a debugfs file and let userland > > > > > > > enable them one by one and possibly connect systemtap filters on them > > > > > > > (one table of registered filters, one table for the markers, a command > > > > > > > file to connect/disconnect filters to/from markers). > > > > > > > > > > > > I would like to ask for the following from the start: have a field for > > > > > > a longer description of the marker that describes it's usage and > > > > > > context. Getting this there from the start is critical, because only > > > > > > when adding the marker point do people still really remember why/what > > > > > > (and having to type a good description also helps them to realize if > > > > > > this is the right point or not). 
This can then be exposed to the user > > > > > > so he has a standing chance of knowing what the marker is about. > > > > > > > > > > > > It also has a standing chance of being updated when the code changes > > > > > > this way > > > > > > > > > > > > > > > > I agree, and I think it might be required in both markers and > > > > > tracepoints. > > > > > > > > > > Given that tracepoints are declared in a global header > > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint > > > > > uses within the kernel code (statements like : > > > > > trace_sched_switch(prev, next); > > > > > added to the scheduler) would therefore be tied to the description > > > > > without having to contain it in the core kernel code. > > > > > > > > > > Markers, on the other hand, could become the "event description" > > > > > interface which is exported to userspace. Considering that, I guess it's > > > > > as important to let a precise description follow the markers. > > > > > > > > > > Mathieu > > > > > > > > > > > > > > > > > > hi, > > > > > > > > Tracepoints and markers seem to both have their place, with tracepoints > > > > being integral to kernel users, and markers being important for > > > > userspace. However, it seems to me like there is overlap in the > > > > code and an extra level of indirection when markers are layered on > > > > tracespoints. could they be merged a bit more? > > > > > > > > What if we extended DEFINE_TRACE() to also create a > > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature: > > > > > > > > marker_cb(<tracepoint prototype>, *marker_probe_func); > > > > > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(), > > > > which allows one to regiser marker callbacks in the usual way. > > > > > > > > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has > > > > registered a marker (which can set the tracepoint.state appropriately). 
> > > > > > > > The 'marker_cb' function then marshalls its arguemnts and passes them > > > > through to the marker functions that were registered. > > > > > > > > I think in this way we can simplify the tracepoints and markers by > > > > combining them to a large extent. > > > > > > > > thanks, > > > > > > > > -Jason > > > > > > > > > > I think what you propose here is already in y LTTng tree in a different > > > form. It's a patch to markers to allow declaring a marker which enables > > > an associated tracepoint when enabled. This way, we can have a marker > > > (exposed to userspace) connecting itself automatically to a tracepoint > > > when enabled. > > > > > > It's here : > > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09 > > > > > > Note that locking depends on the psrwlock patch so we can have nested > > > module list readers. Otherwise locking becomes _really_ messy. :-( > > > > > > Mathieu > > > > > > > That patch simplifies using markers with tracepoints and couples > > markers and tracepoints much more closely. But I was proposing to make > > the coupling tighter... > > > > Couldn't 'marker_probe_register()' register the marker directly with > > the tracepoint callsite? Have DEFINE_TRACE() take an additional argument > > which references a marker callback funtion. That function would look > > like (very loose C code): > > > > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe, > > I don't want the tracepoints to be coupled with markers (which are a > userspace API). The other way around is fine : letting a marker > automatically enable a tracepoint makes sense, but the opposite would > tie the in-kernel API (tracepoint) to the external marker > representation, and I would like to avoid that. > The interface to markers is still marker_probe_register() and marker_probe_unregister(). 
I don't see how that changes with this proposal?

> And how do you plan to deal with :
>
> TPPROTO(arg1, arg2) == void ?
>
> C won't let you define stuff like :
>
> blah(void, marker_probe_func *probe, void *private_data)
>

It'd be simple enough to pass the noargs requirement down as an
extra argument to DO_TRACE(), and then invoke the callback with no arguments.

thanks,

-Jason
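One way the no-argument case might be handled is sketched below. Note the swap: instead of an extra argument to DO_TRACE(), this sketch uses a separate macro variant for argument-less tracepoints, which is an assumption of the example and not what the message specifies. All macro and function names are hypothetical, and each tracepoint holds a single marker slot to keep the code short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the "noargs" case; not the kernel's macros. */
typedef void (*marker_probe_func)(void *priv, const char *fmt, ...);

#define TPPROTO(...)	__VA_ARGS__
#define TPARGS(...)	__VA_ARGS__

struct marker_reg {
	marker_probe_func probe;
	void *priv;
};

/* Shared part: one marker slot per tracepoint plus its setter. */
#define MARKER_SLOT_SKETCH(name)					\
	static struct marker_reg name##_marker;				\
	void set_marker_##name(marker_probe_func probe, void *priv)	\
	{								\
		name##_marker.probe = probe;				\
		name##_marker.priv = priv;				\
	}

/* Tracepoints with arguments: the callback is cb(args..., probe, priv). */
#define DEFINE_TRACE_SKETCH(name, proto, args, cb)			\
	MARKER_SLOT_SKETCH(name)					\
	void trace_##name(proto)					\
	{								\
		if (name##_marker.probe)				\
			cb(args, name##_marker.probe,			\
			   name##_marker.priv);				\
	}

/* Tracepoints without arguments: prototype is (void), cb(probe, priv). */
#define DEFINE_TRACE_NOARGS_SKETCH(name, cb)				\
	MARKER_SLOT_SKETCH(name)					\
	void trace_##name(void)						\
	{								\
		if (name##_marker.probe)				\
			cb(name##_marker.probe, name##_marker.priv);	\
	}

void tick_marker_cb(marker_probe_func probe, void *priv)
{
	probe(priv, "tick");
}

void wake_marker_cb(int pid, marker_probe_func probe, void *priv)
{
	probe(priv, "target_pid %d", pid);
}

DEFINE_TRACE_NOARGS_SKETCH(tick, tick_marker_cb)
DEFINE_TRACE_SKETCH(wake, TPPROTO(int pid), TPARGS(pid), wake_marker_cb)

/* Example probe: counts how often it was invoked. */
void count_probe(void *priv, const char *fmt, ...)
{
	int *count = priv;

	(void)fmt;
	assert(count != NULL);
	(*count)++;
}
```

The separate variant avoids ever expanding to a parameter list such as (void, probe, priv), at the cost of a second macro to maintain.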
* Re: Unified tracing buffer 2008-10-03 19:25 ` Jason Baron @ 2008-10-03 19:56 ` Mathieu Desnoyers 2008-10-03 20:25 ` Jason Baron 0 siblings, 1 reply; 122+ messages in thread From: Mathieu Desnoyers @ 2008-10-03 19:56 UTC (permalink / raw) To: Jason Baron Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler * Jason Baron (jbaron@redhat.com) wrote: > On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote: > > * Jason Baron (jbaron@redhat.com) wrote: > > > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote: > > > > > > > > How about : > > > > > > > > > > > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > > > > > > sizeof(mystruct), mystruct); > > > > > > > > or > > > > > > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > > > > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > > > > > > > > > > > That would encapsulate the whole > > > > > > > > - Event ID registration > > > > > > > > - Event type registration > > > > > > > > - Sending data out > > > > > > > > - Enabling the event source directly at the source > > > > > > > > > > > > > > > > We can then export the markers through a debugfs file and let userland > > > > > > > > enable them one by one and possibly connect systemtap filters on them > > > > > > > > (one table of registered filters, one table for the markers, a command > > > > > > > > file to connect/disconnect filters to/from markers). > > > > > > > > > > > > > > I would like to ask for the following from the start: have a field for > > > > > > > a longer description of the marker that describes it's usage and > > > > > > > context. 
Getting this there from the start is critical, because only > > > > > > > when adding the marker point do people still really remember why/what > > > > > > > (and having to type a good description also helps them to realize if > > > > > > > this is the right point or not). This can then be exposed to the user > > > > > > > so he has a standing chance of knowing what the marker is about. > > > > > > > > > > > > > > It also has a standing chance of being updated when the code changes > > > > > > > this way > > > > > > > > > > > > > > > > > > > I agree, and I think it might be required in both markers and > > > > > > tracepoints. > > > > > > > > > > > > Given that tracepoints are declared in a global header > > > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint > > > > > > uses within the kernel code (statements like : > > > > > > trace_sched_switch(prev, next); > > > > > > added to the scheduler) would therefore be tied to the description > > > > > > without having to contain it in the core kernel code. > > > > > > > > > > > > Markers, on the other hand, could become the "event description" > > > > > > interface which is exported to userspace. Considering that, I guess it's > > > > > > as important to let a precise description follow the markers. > > > > > > > > > > > > Mathieu > > > > > > > > > > > > > > > > > > > > > > hi, > > > > > > > > > > Tracepoints and markers seem to both have their place, with tracepoints > > > > > being integral to kernel users, and markers being important for > > > > > userspace. However, it seems to me like there is overlap in the > > > > > code and an extra level of indirection when markers are layered on > > > > > tracespoints. could they be merged a bit more? 
> > > > > > > > > > What if we extended DEFINE_TRACE() to also create a > > > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature: > > > > > > > > > > marker_cb(<tracepoint prototype>, *marker_probe_func); > > > > > > > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(), > > > > > which allows one to regiser marker callbacks in the usual way. > > > > > > > > > > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has > > > > > registered a marker (which can set the tracepoint.state appropriately). > > > > > > > > > > The 'marker_cb' function then marshalls its arguemnts and passes them > > > > > through to the marker functions that were registered. > > > > > > > > > > I think in this way we can simplify the tracepoints and markers by > > > > > combining them to a large extent. > > > > > > > > > > thanks, > > > > > > > > > > -Jason > > > > > > > > > > > > > I think what you propose here is already in y LTTng tree in a different > > > > form. It's a patch to markers to allow declaring a marker which enables > > > > an associated tracepoint when enabled. This way, we can have a marker > > > > (exposed to userspace) connecting itself automatically to a tracepoint > > > > when enabled. > > > > > > > > It's here : > > > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09 > > > > > > > > Note that locking depends on the psrwlock patch so we can have nested > > > > module list readers. Otherwise locking becomes _really_ messy. :-( > > > > > > > > Mathieu > > > > > > > > > > That patch simplifies using markers with tracepoints and couples > > > markers and tracepoints much more closely. But I was proposing to make > > > the coupling tighter... > > > > > > Couldn't 'marker_probe_register()' register the marker directly with > > > the tracepoint callsite? 
Have DEFINE_TRACE() take an additional argument
> > > which references a marker callback function. That function would look
> > > like (very loose C code):
> > >
> > > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
> >
> > I don't want the tracepoints to be coupled with markers (which are a
> > userspace API). The other way around is fine : letting a marker
> > automatically enable a tracepoint makes sense, but the opposite would
> > tie the in-kernel API (tracepoint) to the external marker
> > representation, and I would like to avoid that.
> >
>
> The interface to markers is still marker_probe_register() and
> marker_probe_unregister(). I don't see how that changes with this
> proposal?
>

"Have DEFINE_TRACE() take an additional argument which references a
marker callback function."

-> it would tie the tracepoint definition to a marker. Or am I
misunderstanding something ?

Mathieu

>
> > And how do you plan to deal with :
> >
> > TPPROTO(arg1, arg2) == void ?
> >
> > C won't let you define stuff like :
> >
> > blah(void, marker_probe_func *probe, void *private_data)
> >
>
> it'd be simple enough to pass the noargs requirement down as an
> extra argument to DO_TRACE(), and then invoke the callback with no arguments.
>
> thanks,
>
> -Jason
>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Unified tracing buffer 2008-10-03 19:56 ` Mathieu Desnoyers @ 2008-10-03 20:25 ` Jason Baron 0 siblings, 0 replies; 122+ messages in thread From: Jason Baron @ 2008-10-03 20:25 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler On Fri, Oct 03, 2008 at 03:56:40PM -0400, Mathieu Desnoyers wrote: > * Jason Baron (jbaron@redhat.com) wrote: > > On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote: > > > * Jason Baron (jbaron@redhat.com) wrote: > > > > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote: > > > > > > > > > How about : > > > > > > > > > > > > > > > > > > trace_mark(ftrace_evname, "size %lu binary %pW", > > > > > > > > > sizeof(mystruct), mystruct); > > > > > > > > > or > > > > > > > > > trace_mark(sched_wakeup, "target_pid %ld", task->pid); > > > > > > > > > > > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here. > > > > > > > > > > > > > > > > > > That would encapsulate the whole > > > > > > > > > - Event ID registration > > > > > > > > > - Event type registration > > > > > > > > > - Sending data out > > > > > > > > > - Enabling the event source directly at the source > > > > > > > > > > > > > > > > > > We can then export the markers through a debugfs file and let userland > > > > > > > > > enable them one by one and possibly connect systemtap filters on them > > > > > > > > > (one table of registered filters, one table for the markers, a command > > > > > > > > > file to connect/disconnect filters to/from markers). > > > > > > > > > > > > > > > > I would like to ask for the following from the start: have a field for > > > > > > > > a longer description of the marker that describes it's usage and > > > > > > > > context. 
Getting this there from the start is critical, because only > > > > > > > > when adding the marker point do people still really remember why/what > > > > > > > > (and having to type a good description also helps them to realize if > > > > > > > > this is the right point or not). This can then be exposed to the user > > > > > > > > so he has a standing chance of knowing what the marker is about. > > > > > > > > > > > > > > > > It also has a standing chance of being updated when the code changes > > > > > > > > this way > > > > > > > > > > > > > > > > > > > > > > I agree, and I think it might be required in both markers and > > > > > > > tracepoints. > > > > > > > > > > > > > > Given that tracepoints are declared in a global header > > > > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint > > > > > > > uses within the kernel code (statements like : > > > > > > > trace_sched_switch(prev, next); > > > > > > > added to the scheduler) would therefore be tied to the description > > > > > > > without having to contain it in the core kernel code. > > > > > > > > > > > > > > Markers, on the other hand, could become the "event description" > > > > > > > interface which is exported to userspace. Considering that, I guess it's > > > > > > > as important to let a precise description follow the markers. > > > > > > > > > > > > > > Mathieu > > > > > > > > > > > > > > > > > > > > > > > > > > hi, > > > > > > > > > > > > Tracepoints and markers seem to both have their place, with tracepoints > > > > > > being integral to kernel users, and markers being important for > > > > > > userspace. However, it seems to me like there is overlap in the > > > > > > code and an extra level of indirection when markers are layered on > > > > > > tracespoints. could they be merged a bit more? 
> > > > > > > > > > > > What if we extended DEFINE_TRACE() to also create a > > > > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature: > > > > > > > > > > > > marker_cb(<tracepoint prototype>, *marker_probe_func); > > > > > > > > > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(), > > > > > > which allows one to regiser marker callbacks in the usual way. > > > > > > > > > > > > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has > > > > > > registered a marker (which can set the tracepoint.state appropriately). > > > > > > > > > > > > The 'marker_cb' function then marshalls its arguemnts and passes them > > > > > > through to the marker functions that were registered. > > > > > > > > > > > > I think in this way we can simplify the tracepoints and markers by > > > > > > combining them to a large extent. > > > > > > > > > > > > thanks, > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > I think what you propose here is already in y LTTng tree in a different > > > > > form. It's a patch to markers to allow declaring a marker which enables > > > > > an associated tracepoint when enabled. This way, we can have a marker > > > > > (exposed to userspace) connecting itself automatically to a tracepoint > > > > > when enabled. > > > > > > > > > > It's here : > > > > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09 > > > > > > > > > > Note that locking depends on the psrwlock patch so we can have nested > > > > > module list readers. Otherwise locking becomes _really_ messy. :-( > > > > > > > > > > Mathieu > > > > > > > > > > > > > That patch simplifies using markers with tracepoints and couples > > > > markers and tracepoints much more closely. But I was proposing to make > > > > the coupling tighter... 
> > > >
> > > > Couldn't 'marker_probe_register()' register the marker directly with
> > > > the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
> > > > which references a marker callback function. That function would look
> > > > like (very loose C code):
> > > >
> > > > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
> > >
> > > I don't want the tracepoints to be coupled with markers (which are a
> > > userspace API). The other way around is fine : letting a marker
> > > automatically enable a tracepoint makes sense, but the opposite would
> > > tie the in-kernel API (tracepoint) to the external marker
> > > representation, and I would like to avoid that.
> > >
> >
> > The interface to markers is still marker_probe_register() and
> > marker_probe_unregister(). I don't see how that changes with this
> > proposal?
> >
>
> "Have DEFINE_TRACE() take an additional argument which references a
> marker callback function." -> it would tie the tracepoint definition to a
> marker. Or am I misunderstanding something ?
>

Not sure. Maybe the confusion is that I am really talking about two
callbacks here. First, there is a tracepoint->marker callback which is
the 'marker_blah_callback()' that I mentioned above, and is the one
which is referenced in DEFINE_TRACE(). There is also the
marker->userspace callback which is registered via something similar to
marker_probe_register(), only it is registered directly with the
tracepoint.

I think this potentially better addresses Arjan's concern b/c it ties
the 'tracepoint->marker' callback directly to the tracepoint. And this
'tracepoint->marker' callback function in essence documents the marker
interface for a tracepoint. And this proposal documents the interfaces
(both tracepoints and markers) all in one place.

If I'm not clear, I can prototype it if you think that would help?

thanks,

-Jason
* Re: Unified tracing buffer
2008-10-03 19:10 ` Mathieu Desnoyers
2008-10-03 19:25 ` Jason Baron
@ 2008-10-03 21:52 ` Frank Ch. Eigler
1 sibling, 0 replies; 122+ messages in thread
From: Frank Ch. Eigler @ 2008-10-03 21:52 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jason Baron, Arjan van de Ven, Steven Rostedt, Martin Bligh,
Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od

Hi -

On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote:
> [...]
> I don't want the tracepoints to be coupled with markers (which are a
> userspace API). [...]

I'm glad the discussion seems to be slowly turning toward the event
layering issues, but ... "markers being a userspace API"? That seems
to be grossly misleading terminology. It's a kernel API like anything
else being discussed here. What's different about it (vs. tracepoints)
is that it could be one of the first clients of the unified ring
buffer widget that concretely addresses the issue of encoding abstract
traceworthy events into serialized data.

(Of course, merging the lower level aspects of the marker/tracepoint
implementations is all well and good.)

- FChE
* Re: Unified tracing buffer
2008-09-20 13:55 ` Mathieu Desnoyers
2008-09-20 14:12 ` Arjan van de Ven
@ 2008-09-22 3:09 ` KOSAKI Motohiro
1 sibling, 0 replies; 122+ messages in thread
From: KOSAKI Motohiro @ 2008-09-22 3:09 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: kosaki.motohiro, Steven Rostedt, Martin Bligh,
Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
Frank Ch. Eigler

> > > CONTROL
> > > -------
> > >
> > > Sysfs style tree under debugfs
> > >
> > > /debugfs/tracing/<name>/buffers/enabled <--- binary value
> > >
> > > /debugfs/tracing/<name>/<event1>
> > > /debugfs/tracing/<name>/<event2>
> > > etc ...
> >
> > I wonder if we should make this another sub dir:
> >
> > /debugfs/tracing/buffers/events/<event-name>
> >
>
> Sure.
>
> If needed, we could change the markers to take two separate parameters :
>
> trace_mark(tracer_name, event_name, "format", args)

Sorry, disagreed.
I believe any marker point should be independent of the tracer.
Otherwise, the number of trace_mark() calls will blow up easily.
* Re: Unified tracing buffer
2008-09-20 9:03 ` Steven Rostedt
2008-09-20 13:55 ` Mathieu Desnoyers
@ 2008-09-22 9:57 ` Peter Zijlstra
2008-09-23 2:36 ` Mathieu Desnoyers
1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-22 9:57 UTC (permalink / raw)
To: Steven Rostedt
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, od, Frank Ch. Eigler

On Sat, 2008-09-20 at 05:03 -0400, Steven Rostedt wrote:
> Oh, and all commands should start with the namespace.
>
> ring_buffer_alloc()
> ring_buffer_free()
> ring_buffer_record_event()

I really think we should separate the ringbuffer management from the
event stuff.
* Re: Unified tracing buffer
2008-09-22 9:57 ` Peter Zijlstra
@ 2008-09-23 2:36 ` Mathieu Desnoyers
0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 2:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Sat, 2008-09-20 at 05:03 -0400, Steven Rostedt wrote:
>
> > Oh, and all commands should start with the namespace.
> >
> > ring_buffer_alloc()
> > ring_buffer_free()
> > ring_buffer_record_event()
>
> I really think we should separate the ringbuffer management from the
> event stuff.
>

Sure, I am strongly in favor of separating those two, given they
represent two different things. However, the requirement I have heard at
KS2008 was to provide

- Unified buffering mechanism
- Timestamps synchronized across all buffers
- Unified event IDs management, so events from various sources could be
  shared between tools.
- As far as I understand, a unified event structure, which can be
  exported to userspace and be shared across different tools.
- Unified buffer control/management mechanism.

These all represent different infrastructure parts, but are all needed
if we want tools to be able to share the data exported through those
buffers. Relay is a good example of having only a _single_ one of these
layers in common : there is currently no way the different relay users
can share the data they collect because they have simply no idea how
others structure their data.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Unified tracing buffer
2008-09-19 21:33 Unified tracing buffer Martin Bligh
` (5 preceding siblings ...)
2008-09-20 9:03 ` Steven Rostedt
@ 2008-09-22 13:57 ` K.Prasad
2008-09-22 19:45 ` Masami Hiramatsu
2008-09-23 3:33 ` Andi Kleen
8 siblings, 0 replies; 122+ messages in thread
From: K.Prasad @ 2008-09-22 13:57 UTC (permalink / raw)
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
Andrew Morton, hch, David Wilder, zanussi

On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
>

With due apologies for pitching in late, I thought I'd bring visibility
to two new interfaces, namely relay_printk() and relay_dump(), which
are now a part of the -mm tree (since 2.6.27-rc5-mm1) and are meant to
address such needs; not completely in their present form, but quite
substantially. (Refer: Documentation/filesystems/relay.txt).

As far as re-usability is concerned, many parts of this interface are
directly adopted from SystemTap's runtime. Blktrace has been made to
work using these interfaces (http://tinyurl.com/4q9d4p), removing about
130 lines of code from the blktrace-related files. With more effort,
additions such as
a) the ability to specify custom names for files
b) the ability to create user-defined control files (in addition to
   what comes by default)
will make it usable along with tracers such as ftrace
(ref: http://tinyurl.com/3ppbwh) (and this is something that I intended
to work upon).

While the relay_printk() interface provides a high-level abstraction
over 'relay' by masking all the setup/tear-down details and offering the
ability to use per-CPU buffers, relay_dump() is its equivalent that
performs binary dumping through the debugfs interface (a requirement for
the unified tracing buffer, as I learn from the email).

Also, the use of default file names and a default debugfs output path
results in a huge reduction of the setup code required by the end-user,
along with the ability to override the defaults if required in a special
case. Examples of the resulting code brevity can be seen at
samples/relay/*.c in the 2.6.27-rc5-mm1 tree.

I am quite sure that with minimal changes to the infrastructure
underlying these two interfaces, we can meet most of the requirements
stated above; and I am open to suggestions.

Kindly let me know what the community thinks about the same.

Thanks,
K.Prasad

> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.
> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
>
> Sketch of design. Enjoy flaming me. Code will follow shortly.
>
>
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
>
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.
> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
>
>
> INPUT_FUNCTIONS
> ---------------
>
> allocate_buffer (name, size)
> return buffer_handle
>
> register_event (buffer_handle, event_id, print_function)
> You can pass in a requested event_id from a fixed set, and
> will be given it, or an error
> 0 means allocate me one dynamically
> returns event_id (or -E_ERROR)
>
> record_event (buffer_handle, event_id, length, *buf)
>
>
> OUTPUT
> ------
>
> Data will be output via debugfs, and provide the following output streams:
>
> /debugfs/tracing/<name>/buffers/text
> clear text stream (will merge the per-cpu streams via insertion
> sort, and use the print functions)
>
> /debugfs/tracing/<name>/buffers/binary[cpu_number]
> per-cpu binary data
>
>
> CONTROL
> -------
>
> Sysfs style tree under debugfs
>
> /debugfs/tracing/<name>/buffers/enabled <--- binary value
>
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
> etc ...
> provides a way to enable/disable events, see what's available, and
> what's enabled.
>
>
> KNOWN ISSUES / PLANS
> -------------------
>
> No way to unregister buffers and events.
> Will provide an unregister_buffer and unregister_event call
>
>
> Generating systemwide time is hard on some platforms
> Yes. Time-based output provides a lot of simplicity for the user though
> We won't support these platforms at first, we'll add functionality
> to make it work for them later.
> (plan based on tick-based ms timing, plus counter offset from that
> if needed).
>
> Spinlock_irq is inefficient, and doesn't support tracing in NMIs
> True. We'll implement a lockless scheme later (see lttng)
>
> Putting a length record in every event is inefficient
> True. Fixed record length with optional extensions is better, but
> more complex. v2.
>
> Putting a full timestamp rather than an offset in every event is inefficient
> See above. True, but v2.
>
> Relayfs already exists! use that!
> People were universally not keen on that idea. Complexity, interface, etc.
> We're also providing some higher level shared functions for time &
> event ids.
>
> There's no way to decode the binary data stream
> Code will be shared from the kernel to decode it, so that we can
> get the compact binary
> format and decode it later. That code will be kept in the kernel
> tree (it's a trivial piece of C).
> Version 1.1 ;-)
* Re: Unified tracing buffer
2008-09-19 21:33 Unified tracing buffer Martin Bligh
` (6 preceding siblings ...)
2008-09-22 13:57 ` K.Prasad
@ 2008-09-22 19:45 ` Masami Hiramatsu
2008-09-22 20:13 ` Martin Bligh
2008-09-23 3:33 ` Andi Kleen
8 siblings, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-22 19:45 UTC (permalink / raw)
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
systemtap-ml

Hi Martin,

Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
>
> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.
> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
>
> Sketch of design. Enjoy flaming me. Code will follow shortly.
>
>
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
>
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.
> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)

I agree to integrate tracing buffer mechanism, but I don't think
your proposal is the simplest one.

To simplify, I think the layered buffering mechanism is desirable.
- The lowest layer just provides named circular buffers(both per-cpu and
  uni-buffer in system) and read/write scheme.
- Next layer provides user/kernel interface including controls.
- Top layer defines packet(and event) formatting utilities.
- Additionally, it would better provide some library routines(timestamp,
  event-id synchronize, and so on).

Since this unified buffer is used from different kind of tracers/loggers,
I don't think all of them(and future tracers) want to be tied down by
"event-id"+"parameter" format.
So, Sorry, I disagree about that the tracing buffer defines its *data format*,
it's just overkill for me.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer
2008-09-22 19:45 ` Masami Hiramatsu
@ 2008-09-22 20:13 ` Martin Bligh
2008-09-22 22:25 ` Masami Hiramatsu
0 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-22 20:13 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
systemtap-ml

> I agree to integrate tracing buffer mechanism, but I don't think
> your proposal is the simplest one.
>
> To simplify, I think the layered buffering mechanism is desirable.
> - The lowest layer just provides named circular buffers(both per-cpu and
> uni-buffer in system) and read/write scheme.
> - Next layer provides user/kernel interface including controls.
> - Top layer defines packet(and event) formatting utilities.
> - Additionally, it would better provide some library routines(timestamp,
> event-id synchronize, and so on).
>
> Since this unified buffer is used from different kind of tracers/loggers,
> I don't think all of them(and future tracers) want to be tied down by
> "event-id"+"parameter" format.
> So, Sorry, I disagree about that the tracing buffer defines its *data format*,
> it's just overkill for me.

I think you're right that we can layer this, and we didn't make a particularly
good job of splitting those things out. I'll try to pull together
another revision
to reflect this and other suggested changes.

One thing that I think is still important is to have a unified timestamp
mechanism on everything, so we can co-ordinate different things back
together in userspace from different trace tools, so I intend to keep
that at a lower level, but I think you're right that the event id, etc can
move up into separate layers.
* Re: Unified tracing buffer
2008-09-22 20:13 ` Martin Bligh
@ 2008-09-22 22:25 ` Masami Hiramatsu
2008-09-22 23:11 ` Darren Hart
` (2 more replies)
0 siblings, 3 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-22 22:25 UTC (permalink / raw)
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
systemtap-ml

Hi Martin,

Martin Bligh wrote:
>> I agree to integrate tracing buffer mechanism, but I don't think
>> your proposal is the simplest one.
>>
>> To simplify, I think the layered buffering mechanism is desirable.
>> - The lowest layer just provides named circular buffers(both per-cpu and
>> uni-buffer in system) and read/write scheme.
>> - Next layer provides user/kernel interface including controls.
>> - Top layer defines packet(and event) formatting utilities.
>> - Additionally, it would better provide some library routines(timestamp,
>> event-id synchronize, and so on).
>>
>> Since this unified buffer is used from different kind of tracers/loggers,
>> I don't think all of them(and future tracers) want to be tied down by
>> "event-id"+"parameter" format.
>> So, Sorry, I disagree about that the tracing buffer defines its *data format*,
>> it's just overkill for me.
>
> I think you're right that we can layer this, and we didn't make a particularly
> good job of splitting those things out. I'll try to pull together
> another revision
> to reflect this and other suggested changes.

I'm happy to hear that. :-)

> One thing that I think is still important is to have a unified timestamp
> mechanism on everything, so we can co-ordinate different things back
> together in userspace from different trace tools, so I intend to keep
> that at a lower level, but I think you're right that the event id, etc can
> move up into separate layers.

I'm not so sure that the unified 'timestamp' must be required by all tracers.
If you just need to merge and sort per-cpu data, you can use an atomic
sequential number for it.
IMHO, the unified 'timestamp' would better be an option, because some
architectures can't support it. I think preparing timestamp-callback
function will help us.

By the way, systemtap uses two modes;

- single-channel mode
  In this mode, all cpus share one buffer channel to write and read.
  Each writer takes a spinlock and writes probe-local data to the buffer.

- per-cpu buffer mode
  In this mode, we use an atomic sequential number for ordering data. If
  the user doesn't need it (because they have their own timestamps), they
  can choose not to use that seq-number.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com
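As a rough illustration of the ordering-by-sequence-number idea in the message above (not code posted in this thread): if every record carries a global sequence number, per-cpu streams can be merged by repeatedly taking the record with the smallest number. The `record`, `stream` and `merge_next` names are hypothetical, and the sketch assumes each per-cpu stream is already ordered by `seq`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout: a global sequence number plus the source cpu. */
struct record {
	uint64_t seq;
	int cpu;
};

/* One per-cpu stream, assumed to be already sorted by seq. */
struct stream {
	const struct record *recs;
	size_t len;
	size_t pos;	/* index of the next unread record */
};

/*
 * Return the unread record with the smallest seq across all streams and
 * advance that stream, or NULL once every stream has been drained.
 */
static const struct record *merge_next(struct stream *s, size_t nstreams)
{
	struct stream *best = NULL;
	size_t i;

	for (i = 0; i < nstreams; i++) {
		if (s[i].pos == s[i].len)
			continue;	/* this stream is empty */
		if (!best || s[i].recs[s[i].pos].seq < best->recs[best->pos].seq)
			best = &s[i];
	}
	return best ? &best->recs[best->pos++] : NULL;
}
```

Calling merge_next() in a loop until it returns NULL yields the records in global order; the linear scan per record is fine for a handful of cpus, while a heap would be the usual choice for many streams.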
* Re: Unified tracing buffer
2008-09-22 22:25 ` Masami Hiramatsu
@ 2008-09-22 23:11 ` Darren Hart
2008-09-23 0:04 ` Masami Hiramatsu
2008-09-22 23:16 ` Martin Bligh
2008-09-23 14:36 ` KOSAKI Motohiro
2 siblings, 1 reply; 122+ messages in thread
From: Darren Hart @ 2008-09-22 23:11 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, systemtap-ml

On Mon, Sep 22, 2008 at 3:25 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:
>> One thing that I think is still important is to have a unified timestamp
>> mechanism on everything, so we can co-ordinate different things back
>> together in userspace from different trace tools, so I intend to keep
>> that at a lower level, but I think you're right that the event id, etc can
>> move up into separate layers.
>
> I'm not so sure that the unified 'timestamp' must be required by all tracers.
> If you just need to merge and sort per-cpu data, you can use an atomic
> sequential number for it.
> IMHO, the unified 'timestamp' would better be an option, because some
> architectures can't support it. I think preparing timestamp-callback
> function will help us.
>

There have been several posts on the timestamp for the events. From a
real-time perspective, this timestamp will be a very important datapoint for
each event, and the more accurate/higher resolution the better. Some thoughts.

o pretty print resolution should definitely be nanosecond (IMHO)

o internal storage should be "whatever is fastest" with the transformation to
  ns data stored in the trace header (as I believe Mathieu mentioned).

o for archs where the clock isn't synchronized across CPUs, perhaps for now it
  would be adequate to record the per cpu timestamps in the trace header and
  include the cpu id for each event as well. This is in keeping with the
  previous suggestion to collect the most primitive data available without
  doing any sort of transformation at trace time.

--
Darren Hart
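One way to read the suggestions above, sketched under stated assumptions: events store only the raw counter value and the cpu id, while the trace header carries what a reader needs to turn them into nanoseconds afterwards. The struct layouts, the `MAX_CPUS` limit and the mult/shift scaling pair are all assumptions for illustration, not a format proposed in this thread.

```c
#include <stdint.h>

#define MAX_CPUS 4	/* hypothetical limit, for illustration only */

/* Written once per trace: everything needed to decode raw stamps later. */
struct trace_header {
	uint32_t mult;			/* ns = (ticks * mult) >> shift */
	uint32_t shift;
	uint64_t base[MAX_CPUS];	/* raw counter of each cpu at trace start */
};

/* Per-event data: the cheapest thing to record at trace time. */
struct event {
	uint64_t raw;	/* unconverted counter value */
	uint16_t cpu;	/* must be below MAX_CPUS */
};

/*
 * Post-processing step: nanoseconds since trace start for one event.
 * Note that ticks * mult can overflow 64 bits for very long traces;
 * a real decoder would have to split the multiplication.
 */
static uint64_t event_ns(const struct trace_header *h, const struct event *e)
{
	uint64_t ticks = e->raw - h->base[e->cpu];

	return (ticks * h->mult) >> h->shift;
}
```

For a counter running at 2 GHz, for example, mult = 1 and shift = 1 converts two ticks into one nanosecond. Subtracting a per-cpu base only removes a constant offset between cpus; it does not correct for counters that drift relative to each other.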
* Re: Unified tracing buffer
2008-09-22 23:11 ` Darren Hart
@ 2008-09-23 0:04 ` Masami Hiramatsu
0 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 0:04 UTC (permalink / raw)
To: Darren Hart
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, systemtap-ml

Hi Darren,

Darren Hart wrote:
> On Mon, Sep 22, 2008 at 3:25 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:
>
>>> One thing that I think is still important is to have a unified timestamp
>>> mechanism on everything, so we can co-ordinate different things back
>>> together in userspace from different trace tools, so I intend to keep
>>> that at a lower level, but I think you're right that the event id, etc can
>>> move up into separate layers.
>> I'm not so sure that the unified 'timestamp' must be required by all tracers.
>> If you just need to merge and sort per-cpu data, you can use an atomic
>> sequential number for it.
>> IMHO, the unified 'timestamp' would better be an option, because some
>> architectures can't support it. I think preparing timestamp-callback
>> function will help us.
>>
>
> There have been several posts on the timestamp for the events. From a
> real-time perspective, this timestamp will be a very important datapoint for
> each event, and the more accurate/higher resolution the better. Some thoughts.

Sure, I know the precise timestamp is required for real-time sensitive
tracers. However, there are some other cases. For example, when debugging,
we don't need timestamps, but just want to know the order of events. :-)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer
2008-09-22 22:25 ` Masami Hiramatsu
2008-09-22 23:11 ` Darren Hart
@ 2008-09-22 23:16 ` Martin Bligh
2008-09-23 0:05 ` Masami Hiramatsu
2008-09-23 14:36 ` KOSAKI Motohiro
2 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-22 23:16 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
systemtap-ml

>> One thing that I think is still important is to have a unified timestamp
>> mechanism on everything, so we can co-ordinate different things back
>> together in userspace from different trace tools, so I intend to keep
>> that at a lower level, but I think you're right that the event id, etc can
>> move up into separate layers.
>
> I'm not so sure that the unified 'timestamp' must be required by all tracers.
> If you just need to merge and sort per-cpu data, you can use an atomic
> sequential number for it.
> IMHO, the unified 'timestamp' would better be an option, because some
> architectures can't support it. I think preparing timestamp-callback
> function will help us.

An atomic sequential number is:

(a) far less meaningful than a timestamp for the user
(b) more expensive to compute in many cases.

I think we came up with a way to approximate this, using a callback every
ms or so as the higher order bits, and a sequential counter in the lower
order for those broken platforms.

But perhaps it would be better if we started with a discussion of which
platforms can't do global timestamps, and why not? I know some of them
are fixable, but perhaps not all.
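The approximation described above (a periodic tick supplying the high-order bits, a sequential counter supplying the low-order bits) could look roughly like the following. This is a single-threaded sketch with made-up names (`hybrid_clock`, `hybrid_tick`, `hybrid_stamp`) and an arbitrary 20-bit counter width; a real kernel version would need atomic updates and a shared clock, which are omitted here.

```c
#include <stdint.h>

#define SEQ_BITS 20	/* arbitrary: room for about a million events per tick */
#define SEQ_MASK ((UINT64_C(1) << SEQ_BITS) - 1)

struct hybrid_clock {
	uint64_t tick;	/* advanced by a periodic (roughly 1 ms) callback */
	uint64_t seq;	/* events stamped since the last tick */
};

/* Called from the periodic callback: bump the coarse time, restart the counter. */
static void hybrid_tick(struct hybrid_clock *c)
{
	c->tick++;
	c->seq = 0;
}

/*
 * Stamp one event: coarse time in the high bits, arrival order in the low
 * bits. If the counter saturates within one tick, later stamps repeat the
 * maximum value, so ordering degrades to "not decreasing".
 */
static uint64_t hybrid_stamp(struct hybrid_clock *c)
{
	uint64_t seq = c->seq < SEQ_MASK ? c->seq++ : SEQ_MASK;

	return (c->tick << SEQ_BITS) | seq;
}
```

Stamps taken within the same tick differ only in the low bits, so they preserve ordering while the time itself is only known to tick granularity.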
* Re: Unified tracing buffer 2008-09-22 23:16 ` Martin Bligh @ 2008-09-23 0:05 ` Masami Hiramatsu 2008-09-23 0:12 ` Martin Bligh 2008-09-23 0:39 ` Linus Torvalds 0 siblings, 2 replies; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 0:05 UTC (permalink / raw) To: Martin Bligh Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Martin, Martin Bligh wrote: >>> One thing that I think is still important is to have a unified timestamp >>> mechanism on everything, so we can co-ordinate different things back >>> together in userspace from different trace tools, so I intend to keep >>> that at a lower level, but I think you're right that the event id, etc can >>> move up into separate layers. >> I'm not so sure that the unified 'timestamp' must be required by all tracers. >> If you just need to merge and sort per-cpu data, you can use an atomic >> sequential number for it. >> IMHO, the unified 'timestamp' would better be an option, because some >> architectures can't support it. I think preparing timestamp-callback >> function will help us. > > An atomic sequential number is: > > (a) far less meaningful than a timestamp for the user > (b) more expensive to compute in many cases. Sure, atomic counter might be more expensive but accurate for ordering. (and it can use on almost all architectures) The cost depends on the architecture and system configuration. So, I think it is preferable user to choose their timestamp rather than fix it. For example, calling callback when writing a log entry as following; write_log(struct buffer *buffer, char *data, int len) { /* reserve a logging space */ char *buf = reserve(buffer, len + buffer->timestamp.len); /* write a timestamp */ buf = buffer->timestamp.write(buf); /* write a body */ memcpy(buf, data, len); } And unified buffer prepares default timestamp.write callbacks. 
char * timestamp_write(char * buf); // write arch-specific timestamp
char * seqnum_write(char * buf); // write a sequence number

What would you think about it?

> I think we came up with a way to approximate this, using a callback every
> ms or so as the higher order bits, and a sequential counter in the lower
> order for those broken platforms.

Sure, that will work.

> But perhaps it would be better if we started with a discussion of which
> platforms can't do global timestamps, and why not? I know some of them
> are fixable, but perhaps not all.

For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-(

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply [flat|nested] 122+ messages in thread
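The per-buffer timestamp callback proposed in the message above could be sketched roughly as follows. This is only a minimal, single-threaded illustration, not code from any posted patch: `struct trace_buffer`, `struct ts_ops`, the flat (non-circular) `reserve()` and the extra buffer argument passed to the callback are all assumptions made so the example compiles on its own, and the sequence counter here is a plain integer rather than the atomic counter discussed in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct trace_buffer;

/* Hypothetical timestamp hook: writes a stamp at buf, returns the byte after it. */
struct ts_ops {
	size_t len;                                        /* bytes the stamp occupies */
	char *(*write)(struct trace_buffer *b, char *buf);
};

/* Hypothetical buffer: a flat array with no wrap-around and no locking. */
struct trace_buffer {
	char *data;
	size_t size;
	size_t head;              /* next free byte */
	struct ts_ops timestamp;  /* chosen by the tracer that owns the buffer */
	uint64_t seq;             /* used only by seqnum_write() */
};

/* One possible default callback: a per-buffer sequence number. */
static char *seqnum_write(struct trace_buffer *b, char *buf)
{
	uint64_t v = b->seq++;

	memcpy(buf, &v, sizeof(v));
	return buf + sizeof(v);
}

/* Reserve len bytes, or return NULL if the buffer is full. */
static char *reserve(struct trace_buffer *b, size_t len)
{
	char *p;

	if (len > b->size - b->head)
		return NULL;
	p = b->data + b->head;
	b->head += len;
	return p;
}

/* Same three steps as the write_log() in the mail: reserve, stamp, copy. */
static int write_log(struct trace_buffer *b, const char *data, size_t len)
{
	char *buf;

	assert(b->timestamp.write != NULL);
	buf = reserve(b, len + b->timestamp.len);
	if (!buf)
		return -1;
	buf = b->timestamp.write(b, buf);
	memcpy(buf, data, len);
	return 0;
}
```

A tracer that prefers an architecture timestamp would fill in `timestamp.write` with its own function and set `timestamp.len` to match; the record layout otherwise stays the same.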
* Re: Unified tracing buffer
2008-09-23 0:05 ` Masami Hiramatsu
@ 2008-09-23 0:12 ` Martin Bligh
2008-09-23 14:49 ` Masami Hiramatsu
2008-09-23 0:39 ` Linus Torvalds
1 sibling, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 0:12 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

>> I think we came up with a way to approximate this, using a callback every
>> ms or so as the higher order bits, and a sequential counter in the lower
>> order for those broken platforms.
>
> Sure, that will work.

OK, that'd fix 99% of it, even if we only get time within 1ms or so (but still
good ordering).

>> But perhaps it would be better if we started with a discussion of which
>> platforms can't do global timestamps, and why not? I know some of them
>> are fixable, but perhaps not all.
>
> For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-(

Can you define incorrect for me (in this case)?

We had similar problems with some AMD platforms that we can fix by syncing
the TSCs on exit_idle, etc.

^ permalink raw reply [flat|nested] 122+ messages in thread
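The fallback described above (a coarse value advanced by a periodic callback in the high-order bits, a sequential counter in the low-order bits) might look something like the sketch below. The names `hybrid_clock`, `hybrid_tick`, `hybrid_stamp` and the 20-bit split are assumptions chosen for illustration, and the code is deliberately single-threaded; a real implementation would need atomic or per-cpu handling that is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed split: low 20 bits for the counter, the rest for the coarse tick. */
#define SEQ_BITS 20u
#define SEQ_MASK ((UINT64_C(1) << SEQ_BITS) - 1)

struct hybrid_clock {
	uint64_t coarse_ms;  /* advanced by a periodic (roughly 1 ms) callback */
	uint64_t seq;        /* events stamped since the last tick */
};

/* Called from the periodic callback: bump the coarse part, restart the counter. */
static void hybrid_tick(struct hybrid_clock *c)
{
	c->coarse_ms++;
	c->seq = 0;
}

/* Stamp one event: coarse time in the high bits, sequence number in the low bits. */
static uint64_t hybrid_stamp(struct hybrid_clock *c)
{
	assert(c->seq <= SEQ_MASK);  /* more events per tick than the field can hold */
	return (c->coarse_ms << SEQ_BITS) | c->seq++;
}
```

Stamps produced this way stay monotonically increasing, but the time they carry is only as precise as the tick period, which matches the "time within 1ms or so (but still good ordering)" trade-off mentioned in the message.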
* Re: Unified tracing buffer 2008-09-23 0:12 ` Martin Bligh @ 2008-09-23 14:49 ` Masami Hiramatsu 2008-09-23 15:04 ` Mathieu Desnoyers 2008-09-23 15:46 ` Linus Torvalds 0 siblings, 2 replies; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 14:49 UTC (permalink / raw) To: Martin Bligh Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Martin, Martin Bligh wrote: >>> But perhaps it would be better if we started with a discussion of which >>> platforms can't do global timestamps, and why not? I know some of them >>> are fixable, but perhaps not all. >> For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-( > > Can you define incorrect for me (in this case)? On my laptop, TSC is disabled at boot time. $ dmesg | grep TSC checking TSC synchronization [CPU#0 -> CPU#1]: Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock. Marking TSC unstable due to: check_tsc_sync_source failed. 
$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz stepping : 6 cpu MHz : 1000.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 3998.45 clflush size : 64 power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz stepping : 6 cpu MHz : 1000.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 3994.43 clflush size : 64 power management: Actually, I had measured TSC drifting and reported to systemtap-bugzilla http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19 Curiously, I've tested on another Core2Duo laptop, which cpu is same model and same stepping, but on that laptop I couldn't see TSC drifting. So I think this might be a product level issue and a rare case... > We had similar problems with some AMD platforms that we can fix by syncing > the TSCs on exit_idle, etc. Hmm, very interesting. :-) Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:49 ` Masami Hiramatsu @ 2008-09-23 15:04 ` Mathieu Desnoyers 2008-09-23 15:30 ` Masami Hiramatsu 2008-09-23 15:46 ` Linus Torvalds 1 sibling, 1 reply; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 15:04 UTC (permalink / raw) To: Masami Hiramatsu Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Masami Hiramatsu (mhiramat@redhat.com) wrote: [...] > $ cat /proc/cpuinfo > processor : 0 > vendor_id : GenuineIntel > cpu family : 6 > model : 15 > model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz > stepping : 6 > cpu MHz : 1000.000 [...] ... @ 2.00GHz and cpu MHz : 1000.000 ; isn't that a bit odd ? (same for both cpus) Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 15:04 ` Mathieu Desnoyers @ 2008-09-23 15:30 ` Masami Hiramatsu 2008-09-23 16:01 ` Linus Torvalds 0 siblings, 1 reply; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 15:30 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Mathieu Desnoyers wrote: > * Masami Hiramatsu (mhiramat@redhat.com) wrote: > [...] >> $ cat /proc/cpuinfo >> processor : 0 >> vendor_id : GenuineIntel >> cpu family : 6 >> model : 15 >> model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz >> stepping : 6 >> cpu MHz : 1000.000 > [...] > > ... @ 2.00GHz and cpu MHz : 1000.000 ; isn't that a bit odd ? (same for > both cpus) 2.00GHz is the maximum(model) frequency. And 'cpu MHz' means current frequency. (yep, now I'm using cpufreq) Anyway, when I measured TSC drift, I killed cpuspeed service and fixed freq to 2000. ;-) Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 15:30 ` Masami Hiramatsu
@ 2008-09-23 16:01 ` Linus Torvalds
2008-09-23 17:04 ` Masami Hiramatsu
0 siblings, 1 reply; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 16:01 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>
> 2.00GHz is the maximum(model) frequency. And 'cpu MHz' means
> current frequency. (yep, now I'm using cpufreq)
> Anyway, when I measured TSC drift, I killed cpuspeed service and
> fixed freq to 2000. ;-)

Ahh. I have an idea..

Maybe that thing does thermal throttling?

Fixing the frequency at the highest setting is actually one of the worst
things you can do, because if the device is thermally limited, it will
still do the whole throttling thing, but now it won't do it by changing
the frequency any more, it will do it by essentially forcing the external
frequency down.

And that is going to be *very* inefficient. You really really don't want
that. Your performance will actually be _worse_ than if the CPU went to a
lower frequency. And it might explain the unreliable TSC too, because I
suspect constant TSC is really constant only wrt the bus clock to the CPU.

The thermal throttling thing is a "protect the CPU from overheating" last
ditch effort, and because it doesn't lower voltage, it isn't actually at
all as efficient at saving power (and thus cooling the CPU) as a real
frequency change event would be.

And fixing the frequency to the highest frequency in a tight laptop
enclosure is the best way to force that behavior (in contrast - in a
desktop system with sufficient cooling, it's usually not a problem at all
to just say "run at highest frequency").

And btw, that also explains why you had so *big* changes in frequency: the
throttling I think happens with a 1/8 duty cycle thing, iirc.

It's supposed to be very rare with Core 2.
Thermal throttling was quite
common with the P4 one, and was the main overheating protection initially.
These days, you should only see it for really badly designed devices that
simply don't have enough thermal cooling, but if the design calls for
mostly running at low frequency because it's some thin-and-light notebook
with insufficient cooling (or some thick-and-heavy thing that is just
total crap), and you force it to always run at full speed, I can imagine
it kicking in to protect the CPU.

It's obviously also going to be much easier to see if the ambient
temperature is high. If you want to get best performance, take one of those
compressed-air spray-cans, and spray on the laptop with the can held
upside down (the can will generally tell you _not_ to do that, because
then you'll get the liquid itself rather than gas, but that's what you
want for cooling).

So if you can test this, try it with
(a) cpufreq at a fixed _low_ value (to not cause overheating)
(b) with the spray-can cooling the thing and cpufreq at a fixed high
value
and see if the TSC is constant then.

Linus

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 16:01 ` Linus Torvalds @ 2008-09-23 17:04 ` Masami Hiramatsu 2008-09-23 17:30 ` Thomas Gleixner 0 siblings, 1 reply; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 17:04 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Linus Torvalds wrote: > > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> 2.00GHz is the maximum(model) frequency. And 'cpu MHz' means >> current frequency. (yep, now I'm using cpufreq) >> Anyway, when I measured TSC drift, I killed cpuspeed service and >> fixed freq to 2000. ;-) > > Ahh. I have an idea.. > > Maybe that thing does thermal throttling? > > Fixing the frequency at the highest setting is actually one of the worst > things you can do, because if the device is thermally limited, it will > still do the whole throttling thing, but now it won't do it by changing > the frequency any more, it will do it by essentially forxing the external > frequency down. > > And that is going to be *very* inefficient. You really really don't want > that. Your performance will actually be _worse_ than if the CPU went to a > lower frequency. And it might explain the unreliable TSC too, because I > suspect constant TSC is really constant only wrt the bus clock to the CPU. > > The termal throttling thing is a "protect the CPU from overheating" last > ditch effort, and because it doesn't lower voltage, it isn't actually at > all as efficient at saving power (and thus cooling the CPU) as a real > frequency change event would be. > > And fixing the frequency to the highest frequency in a tight laptop > enclosure is the best way to force that behavior (in contrast - in a > desktop system with sufficient cooling, it's usually not a problem at all > to just say "run at highest frequency"). 
> > And btw, that also explains why you had so *big* changes in frequency: the > throttling I think happens with a 1/8 duty cycle thing, iirc. > > It's supposed to be very rare with Core 2. Thermal throttling was quite > common with the P4 one, and was the main overheating protection initially. > These days, you should only see it for really badly designed devices that > simply don't have enough thermal cooling, but if the design calls for > mostly running at low frequency because it's some thing-and-light notebook > with insufficient cooling (or some thick-and-heavy thing that is just > total crap), and you force it to always run at full speed, I can imagine > it kicking in to protect the CPU. > > It's obviously also going to be much easier to see if the ambient > temperature is high. If you want to get best peformance, take one of those > compressed-air spray-cans, and spray on the laptop with the can held > upside down (the can will generally tell you _not_ to do that, because > then you'll get the liquid itself rather than gas, but that's what you > want for cooling). > > So if you can test this, try it with > (a) cpufreq at a fixed _low_ value (to not cause overheating) > (b) with the spray-can cooling the thing and cpufreq at a fixed high > value > and see if the TSC is constant then. Hi Linus, Thank you for your advice. I tested it again according your advice, I did: - service cpuspeed stop - echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed and checked /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is 1000000. - echo 1 > /proc/acpi/thermal_zone/THM/polling_frequency - cooling with spray-can :) - cat /proc/acpi/thermal_zone/THM/temperature temperature: 39 C and ran the test. 
---
p0: c:1107576, ns:990280 ratio:111
p0: c:1805640, ns:1008787 ratio:178
p0: c:1998324, ns:1000127 ratio:199
p0: c:946380, ns:990280 ratio:95
p0: c:871728, ns:1000267 ratio:87
p0: c:1807380, ns:1007949 ratio:179
p0: c:1784808, ns:1000127 ratio:178
p0: c:1768488, ns:991676 ratio:178
p0: c:1802292, ns:1008299 ratio:178
p0: c:1787088, ns:1000406 ratio:178
p0: c:1999176, ns:1000896 ratio:199
p0: c:881364, ns:991956 ratio:88
p0: c:1802712, ns:1008019 ratio:178
p0: c:1787088, ns:998590 ratio:178
---
this seems not so stable yet. :-(

After test I checked temperature again.
# cat /proc/acpi/thermal_zone/THM/temperature
temperature: 39 C

Hmm, 39 C is not so high. I wouldn't be surprised even if this
is an individual product bug. Anyway, currently, Linux itself
works well on this laptop with hpet.:-)

Thank you,

>
> Linus

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply [flat|nested] 122+ messages in thread
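The test program that produced the samples above is not shown in the thread, but every `ratio` value listed is consistent with TSC cycles per nanosecond scaled by 100 and truncated to an integer. A small sketch of that arithmetic, with `tsc_ratio_x100` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Cycles counted per elapsed nanosecond, scaled by 100 and truncated.
 * This is an inferred reading of the "ratio" column, not the original tool.
 */
static uint64_t tsc_ratio_x100(uint64_t cycles, uint64_t ns)
{
	assert(ns != 0);
	return cycles * 100 / ns;
}
```

Under this reading, a TSC ticking steadily at the nominal 2.00GHz would report a value close to 200 for every sample, so the spread between 87 and 199 over roughly equal 1 ms intervals is what makes the counter look unstable.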
* Re: Unified tracing buffer 2008-09-23 17:04 ` Masami Hiramatsu @ 2008-09-23 17:30 ` Thomas Gleixner 2008-09-23 18:59 ` Masami Hiramatsu 0 siblings, 1 reply; 122+ messages in thread From: Thomas Gleixner @ 2008-09-23 17:30 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > > So if you can test this, try it with > > (a) cpufreq at a fixed _low_ value (to not cause overheating) > > (b) with the spray-can cooling the thing and cpufreq at a fixed high > > value > > and see if the TSC is constant then. > > Hi Linus, > > Thank you for your advice. I tested it again according your advice, > I did: > - service cpuspeed stop > - echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed > and checked /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is > 1000000. > - echo 1 > /proc/acpi/thermal_zone/THM/polling_frequency > - cooling with spray-can :) > - cat /proc/acpi/thermal_zone/THM/temperature > temperature: 39 C > > and ran the test. > --- > p0: c:1107576, ns:990280 ratio:111 > p0: c:1805640, ns:1008787 ratio:178 > p0: c:1998324, ns:1000127 ratio:199 > p0: c:946380, ns:990280 ratio:95 > p0: c:871728, ns:1000267 ratio:87 > p0: c:1807380, ns:1007949 ratio:179 > p0: c:1784808, ns:1000127 ratio:178 > p0: c:1768488, ns:991676 ratio:178 > p0: c:1802292, ns:1008299 ratio:178 > p0: c:1787088, ns:1000406 ratio:178 > p0: c:1999176, ns:1000896 ratio:199 > p0: c:881364, ns:991956 ratio:88 > p0: c:1802712, ns:1008019 ratio:178 > p0: c:1787088, ns:998590 ratio:178 > --- > this seems not so stable yet. :-( > > After test I checked temperature again. > # cat /proc/acpi/thermal_zone/THM/temperature > temperature: 39 C > > Hmm, 39 C is not so high. I wouldn't be surprised even if this > is an individual product bug. 
Anyway, currently, Linux itself
> works well on this laptop with hpet.:-)

Do you have C-States enabled on that machine ?

ls /sys/devices/system/cpu/cpu0/cpuidle/

has it more than a state0 entry ?

If yes, please do:

cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage

where X is the highest number in there.

cat /proc/acpi/processor/CPU0/power

might be useful as well.

Thanks,

tglx

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 17:30 ` Thomas Gleixner @ 2008-09-23 18:59 ` Masami Hiramatsu 2008-09-23 19:36 ` Thomas Gleixner 0 siblings, 1 reply; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 18:59 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Thomas, Thomas Gleixner wrote: > > Do you have C-States enabled on that machine ? > > ls /sys/devices/system/cpu/cpu0/cpuidle/ > > has it more than a state0 entry ? Yes, there are state0 to state3 # ls /sys/devices/system/cpu/cpu0/cpuidle/ state0 state1 state2 state3 > If yes, please do: > > cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage > > where X is the highest number in there. # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage 171210 > > cat /proc/acpi/processor/CPU0/power > > might be useful as well. # cat /proc/acpi/processor/CPU0/power active state: C0 max_cstate: C8 bus master activity: 00000000 maximum allowed latency: 2000000000 usec states: C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] Could these help you? Thank you, > > Thanks, > > tglx -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 18:59 ` Masami Hiramatsu
@ 2008-09-23 19:36 ` Thomas Gleixner
2008-09-23 19:38 ` Martin Bligh
2008-09-23 20:03 ` Masami Hiramatsu
0 siblings, 2 replies; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 19:36 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
> 171210

C3 stops the TSC. So depending on how many C3 entries you have on the
different cores, your TSCs will drift apart. Some BIOSes do even a
lousy job trying to fixup the TSCs on exit from C3, which makes things
even worse.

> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
>
> Could these help you?

Yup, explains your TSC observation. Nothing we can do about. Broken by
system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !

Thanks,

tglx

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:36 ` Thomas Gleixner @ 2008-09-23 19:38 ` Martin Bligh 2008-09-23 19:41 ` Thomas Gleixner 2008-09-23 20:03 ` Masami Hiramatsu 1 sibling, 1 reply; 122+ messages in thread From: Martin Bligh @ 2008-09-23 19:38 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage >> 171210 > > C3 stops the TSC. So depending on how many C3 entries you have on the > different cores, your TSCs will drift apart. Some BIOSes do even a > lousy job trying to fixup the TSCs on exit from C3, which makes things > even worse. > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] >> >> Could these help you? > > Yup, explains your TSC observation. Nothing we can do about. Broken by > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! We have linux patches that sync the TSC on exit_idle. I'll see if I can get Michael to send them out. M. ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 19:38 ` Martin Bligh
@ 2008-09-23 19:41 ` Thomas Gleixner
2008-09-23 19:50 ` Martin Bligh
0 siblings, 1 reply; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 19:41 UTC (permalink / raw)
To: Martin Bligh
Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

On Tue, 23 Sep 2008, Martin Bligh wrote:
> On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
> >> 171210
> >
> > C3 stops the TSC. So depending on how many C3 entries you have on the
> > different cores, your TSCs will drift apart. Some BIOSes do even a
> > lousy job trying to fixup the TSCs on exit from C3, which makes things
> > even worse.
> >
> >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
> >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
> >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
> >>
> >> Could these help you?
> >
> > Yup, explains your TSC observation. Nothing we can do about. Broken by
> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
>
> We have linux patches that sync the TSC on exit_idle. I'll see if I can get
> Michael to send them out.

Are you sure that they sync it precisely enough that there is no user
space observable way of time going backwards between cores ?

Thanks,

tglx

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 19:41 ` Thomas Gleixner
@ 2008-09-23 19:50 ` Martin Bligh
2008-09-23 20:03 ` Thomas Gleixner
0 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 19:50 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

>> > Yup, explains your TSC observation. Nothing we can do about. Broken by
>> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
>>
>> We have linux patches that sync the TSC on exit_idle. I'll see if I can get
>> Michael to send them out.
>
> Are you sure that they sync it precisely enough that there is no user
> space observable way of time going backwards between cores ?

I think the tolerance is about 500 cycles. If that's not sufficient, I guess
we'll just have to live with some slight misordering (which people have
pointed out is kind of inevitable anyway) on these broken machines.
It was sufficient for what we were using it for, but maybe not for everyone.

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:50 ` Martin Bligh @ 2008-09-23 20:03 ` Thomas Gleixner 2008-09-23 21:02 ` Martin Bligh 0 siblings, 1 reply; 122+ messages in thread From: Thomas Gleixner @ 2008-09-23 20:03 UTC (permalink / raw) To: Martin Bligh Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Martin Bligh wrote: > >> > Yup, explains your TSC observation. Nothing we can do about. Broken by > >> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > >> > >> We have linux patches that sync the TSC on exit_idle. I'll see if I can get > >> Michael to send them out. > > > > Are you sure that they sync it precicely enough that there is no user > > space observable way of time going backwards between cores ? > > I think the tolerance is about 500 cycles. If that's not sufficient, I guess > we'll have to either live with some slight misordering (which people have > pointed out is kind of inevitable anyway) on these broken machines? > It was sufficient for what we were using it for, but maybe not for everyone. Well, I dont care about the trace reordering at all. I care about user space visible time going backwards issues observed via the gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can migrate in less than that :) I guess you try this only for machines where the TSC runs with constant frequency, right ? Looking forward to your patches. Thanks, tglx ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 20:03 ` Thomas Gleixner @ 2008-09-23 21:02 ` Martin Bligh 0 siblings, 0 replies; 122+ messages in thread From: Martin Bligh @ 2008-09-23 21:02 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml >> I think the tolerance is about 500 cycles. If that's not sufficient, I guess >> we'll have to either live with some slight misordering (which people have >> pointed out is kind of inevitable anyway) on these broken machines? >> It was sufficient for what we were using it for, but maybe not for everyone. > > Well, I dont care about the trace reordering at all. I care about user > space visible time going backwards issues observed via the > gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can > migrate in less than that :) Right, that's what we were interested in. > I guess you try this only for machines where the TSC runs with > constant frequency, right ? We don't do DVFS on the 'broken' TSC machines, just halt. But yes, it's selective ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:36 ` Thomas Gleixner 2008-09-23 19:38 ` Martin Bligh @ 2008-09-23 20:03 ` Masami Hiramatsu 2008-09-23 20:08 ` Thomas Gleixner 1 sibling, 1 reply; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 20:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Thomas Gleixner wrote: > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage >> 171210 > > C3 stops the TSC. So depending on how many C3 entries you have on the > different cores, your TSCs will drift apart. Some BIOSes do even a > lousy job trying to fixup the TSCs on exit from C3, which makes things > even worse. > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] >> >> Could these help you? > > Yup, explains your TSC observation. Nothing we can do about. Broken by > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! Thank you for analyzing! :-) Hmm, then could I fix that by fixing my dsdt...? Thanks again, > > Thanks, > > tglx -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 20:03 ` Masami Hiramatsu
@ 2008-09-23 20:08 ` Thomas Gleixner
0 siblings, 0 replies; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 20:08 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> Thomas Gleixner wrote:
> > On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
> >> 171210
> >
> > C3 stops the TSC. So depending on how many C3 entries you have on the
> > different cores, your TSCs will drift apart. Some BIOSes do even a
> > lousy job trying to fixup the TSCs on exit from C3, which makes things
> > even worse.
> >
> >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
> >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
> >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
> >>
> >> Could these help you?
> >
> > Yup, explains your TSC observation. Nothing we can do about. Broken by
> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
>
> Thank you for analyzing! :-)
> Hmm, then could I fix that by fixing my dsdt...?

You can limit c-states so you don't go down to the C3 state, but there
is a trade off vs. power saving. Let's wait for Martin's magic TSC
patches first :)

tglx

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 14:49 ` Masami Hiramatsu
2008-09-23 15:04 ` Mathieu Desnoyers
@ 2008-09-23 15:46 ` Linus Torvalds
1 sibling, 0 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 15:46 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>
> $ dmesg | grep TSC
> checking TSC synchronization [CPU#0 -> CPU#1]:
> Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock.
> Marking TSC unstable due to: check_tsc_sync_source failed.

Hmm.. Very interesting. It smells of a non-stable TSC, but your Core2 Cpu
shouldn't have that issue:

>
> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 15
> model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
> stepping : 6
> cpu MHz : 1000.000
> cache size : 4096 KB
> flags : ... constant_tsc ...
>
> Actually, I had measured TSC drifting and reported to systemtap-bugzilla
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19

I'd have assumed it was just some initial offset issue, but your
bug-report seems to say that it really does change the TSC frequency when
the CPU frequency changes. That should _not_ happen on a core2 CPU, afaik.
I didn't even know it could be a setup issue, but it does really smell
like your TSC frequency changes.

Now, unstable TSC's are not uncommon per se, and most older Intel CPU's
will do it, it's just that I thought it was fixed in Core2 (and later P4's
for that matter).

The rule *should* be that:

 - family = 15 (netburst), model 3 or higher has constant TSC

 - family = 6 (PPro), model 14 or higher (Core, Core 2) have constant
   TSCs.

This is quite clearly documented: see Intel ia docs, vol 3B, 18.10
"Time-stamp counter".

Very odd. I wonder what your laptop does to screw this up.

I also suspect that since we already _noticed_ that the TSC isn't stable,
we should also have then cleared the "constant TSC" bit. And we apparently
didn't.

Btw, your CPU looks quite odd in other respects too. Judging by your
bugzilla entry, the TSC sometimes ticks at 2GHz (fine), sometimes at 1GHz
(also fine), and sometimes at 667/500MHz judging by the ratios you show
for TSC/timer tick.

And that last one is really odd, afaik most 2GHz Core 2 duo's will have a
lower frequency of 1GHz. Is that some special low-power version, perhaps?

Or maybe it isn't a speedstep-able CPU at all, and the system actually
changes the *bus* frequency (and then the CPU frequency is some constant
factor of that). If so, the system messes with the CPU in bad ways.

And btw, I'm almost certain that what you see isn't actually any "CPU
drift" in the sense that I strongly suspect that the TSC's for both cores
will change frequency together. So because the TSC isn't stable, it's not
a good time-source, but despite that it's not necessarily a bad way to
compare events across cores.

To actually have different CPU's TSC drift wrt each other, you almost have
to have them in different clock domains. And that is *very* rare. It
happens when the CPU's are on different boards, and sure, it happens if
the CPU's have non-constant TSCs with different frequencies, but neither
of those should be very common at all. The latter is uncommon because it's
almost unheard of to have multi-socket devices with old CPU's that also do
frequency changes.

Older multi-core CPU's tend to do frequency changes the whole chip at a
time, and newer multi-core CPU's should all basically have a fixed TSC so
even when they do frequency changes independently, the TSC should still be
off the same clock on the die.

Linus

^ permalink raw reply [flat|nested] 122+ messages in thread
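[Editorial sketch, not part of the original message.] The family/model rule
quoted above can be written down as a small predicate. This is only an
illustration of the rule as stated in the mail; the function names are
hypothetical and this is not how the kernel decides whether to set
constant_tsc. The self-check uses the family 6, model 15 values from the
/proc/cpuinfo output quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel API): returns true when the rule
 * quoted in the message says an Intel CPU should have a constant TSC.
 */
bool intel_tsc_expected_constant(unsigned int family, unsigned int model)
{
	if (family == 15)		/* NetBurst (P4) */
		return model >= 3;
	if (family == 6)		/* PPro lineage: Core, Core 2 */
		return model >= 14;
	return false;			/* the rule does not cover other families */
}

/* The reported T7200 is family 6, model 15, so the rule expects "constant". */
void check_reported_cpu(void)
{
	assert(intel_tsc_expected_constant(6, 15));
}
```

By this rule the reported laptop should have a constant TSC, which is why
the measured warp is surprising.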
* Re: Unified tracing buffer
2008-09-23 0:05 ` Masami Hiramatsu
2008-09-23 0:12 ` Martin Bligh
@ 2008-09-23 0:39 ` Linus Torvalds
2008-09-23 1:26 ` Roland Dreier
` (2 more replies)
1 sibling, 3 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 0:39 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
systemtap-ml

On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
>
> Sure, atomic counter might be more expensive but accurate for ordering.

Don't be silly.

An atomic counter is no more accurate for ordering than anything else.

Why?

Because all it tells you is the ordering of the atomic increment, not of
the caller. The atomic increment is not related to all the other ops that
the code that you trace actually does in any shape or form, and so the
ordering of the trace doesn't actually imply anything for the ordering of
the operations you are tracing!

Except for a single CPU, of course, but for that case you don't need a
sequence number either, since the ordering is entirely determined by the
ring buffer itself.

So the counter will be more expensive (cross-cpu cache bouncing for EVERY
SINGLE EVENT), less useful (no real meaning for people who DO want to have
a timestamp), and it's really no more "ordered" than anything that bases
itself on a TSC.

The fact is, you cannot order operations based on log messages unless you
have a lock around the whole caller - absolutely _no_ amount of locking or
atomic accesses in the log itself will guarantee ordering of the upper
layers.

And sure, if you have locking at a higher layer, then a sequence number is
sufficient, but on the other hand, so is a well-synchronized TSC.

So personally, I think that the optimal solution is:

 - let each ring buffer be associated with a "gettimestamp()" function, so
   that everybody _can_ set it to something of their own. But default to
   something sane, namely a raw TSC thing.

 - Add synchronization events to the ring buffer often enough that you can
   make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply
   noticing when the upper 32 bits change, although you could possibly do
   it with a heartbeat too.

 - Similarly, add a synchronization event when the TSC frequency changes.

 - Make the synchronization packet contain the full 64-bit TSC base, in
   addition to TSC frequency info _and_ the timebase.

 - From those synchronization events, you should be able to get a very
   accurate timestamp *after* the fact from the raw TSC numbers (ie do all
   the scaling not when you gather the info, but when you present it),
   even if you only spent 32 bits of TSC info on 99% of all events (and
   just had an overflow log occasionally to get the rest of the info)

 - Most people will be _way_ happier with a timestamp that has enough
   precision to also show ordering (assuming that the caller holds a lock
   over the operation _including_ the tracing) than they would ever be
   with a sequence number.

 - people who really want to can consider the incrementing counter a TSC,
   but it will suck in so many ways that I bet it will not be very popular
   at all. But having the option to set a special timestamp function will
   give people the option (on a per-buffer level) to make the "TSC" be a
   simple incrementing 32-bit counter using xaddl and the upper bits
   incrementing from a timer, but keep that as a "ok, the TSC is really
   broken, or this architecture doesn't support any fast cycle counters at
   all, or I really don't care about time, just sequence, and I guarantee
   I have a single lock in all callers that makes things unambiguous"

Note the "single lock" part. It's not enough that you make any trace thing
under a lock. They must be under the _same_ lock for all relevant events
for you to be able to say anything about ordering. And that's actually
pretty rare for any complex behavior.

The timestamping, btw, is likely the most important part of the whole
logging thing. So we need to get it right. But by "right" I mean really
really low-latency so that it's acceptable to everybody, real-time enough
that you can tell how far apart events were, and precise enough that you
really _can_ see ordering.

The "raw TSC value with correction information" should be able to give you
all of that. At least on x86. On some platforms, the TSC may not give you
enough resolution to get reasonable guesses on event ordering.

Linus

^ permalink raw reply [flat|nested] 122+ messages in thread
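[Editorial sketch, not part of the original message.] The "raw 32-bit
timestamp plus synchronization events" scheme proposed above can be
illustrated with a small post-processing decoder. Everything here is an
assumption made for illustration: the record layout, the field names
(tsc_lo, tsc_base, ns_base, tsc_khz) and decode_ns() are hypothetical and
are not taken from any posted patch. The sketch only shows the idea of
doing the scaling at presentation time, and it assumes a sync record is
logged at least once every 2^32 cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record layout, for illustration only. */
enum rec_type { REC_SYNC, REC_EVENT };

struct trace_rec {
	enum rec_type type;
	uint32_t tsc_lo;	/* low 32 bits of the raw TSC (every record) */
	uint64_t tsc_base;	/* REC_SYNC only: full 64-bit raw TSC */
	uint64_t ns_base;	/* REC_SYNC only: timebase in nanoseconds */
	uint32_t tsc_khz;	/* REC_SYNC only: TSC frequency from here on */
};

/* Decoder state: contents of the most recent sync record. */
struct decode_state {
	uint64_t tsc_base;
	uint64_t ns_base;
	uint32_t tsc_khz;
};

/*
 * Return the timestamp of one record in nanoseconds. Sync records update
 * the state; ordinary events are scaled against the last sync record.
 */
uint64_t decode_ns(struct decode_state *st, const struct trace_rec *rec)
{
	uint32_t delta;

	if (rec->type == REC_SYNC) {
		st->tsc_base = rec->tsc_base;
		st->ns_base = rec->ns_base;
		st->tsc_khz = rec->tsc_khz;
		return st->ns_base;
	}

	assert(st->tsc_khz != 0);	/* an event before any sync is undecodable */

	/*
	 * Unsigned 32-bit subtraction wraps modulo 2^32, so the delta is
	 * correct even if the low word overflowed since the sync record.
	 */
	delta = rec->tsc_lo - (uint32_t)st->tsc_base;

	/* cycles * 10^6 / kHz = nanoseconds; delta < 2^32 so no overflow */
	return st->ns_base + (uint64_t)delta * 1000000ULL / st->tsc_khz;
}
```

A frequency change is handled the same way as an overflow: a new sync
record carries a new base and a new tsc_khz, and later events are scaled
against it.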
* Re: Unified tracing buffer
2008-09-23 0:39 ` Linus Torvalds
@ 2008-09-23 1:26 ` Roland Dreier
2008-09-23 1:39 ` Steven Rostedt
` (2 more replies)
2008-09-23 2:30 ` Mathieu Desnoyers
2008-09-23 3:06 ` Masami Hiramatsu
2 siblings, 3 replies; 122+ messages in thread
From: Roland Dreier @ 2008-09-23 1:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren,
Frank Ch. Eigler, systemtap-ml

> Because all it tells you is the ordering of the atomic increment, not of
> the caller. The atomic increment is not related to all the other ops that
> the code that you trace actually does in any shape or form, and so the
> ordering of the trace doesn't actually imply anything for the ordering of
> the operations you are tracing!

This reminds me of a naive question that occurred to me while we were
discussing this at KS. Namely, what does "ordering" mean for events?

An example I'm all too familiar with is the lack of ordering of MMIO on
big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
spinlock and doing writel() inside the spinlock and then dropping the
spinlock (which should be enough to "order" things) might see the
writel() reach the final device "out of order" because the write has to
travel through a routed system fabric.

Just like Einstein said, it really seems to me that the order of things
depends on your frame of reference.

- R.

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 1:26 ` Roland Dreier
@ 2008-09-23 1:39 ` Steven Rostedt
2008-09-23 2:02 ` Mathieu Desnoyers
2008-09-23 3:26 ` Linus Torvalds
2 siblings, 0 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-23 1:39 UTC (permalink / raw)
To: Roland Dreier
Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Roland Dreier wrote:

> > Because all it tells you is the ordering of the atomic increment, not of
> > the caller. The atomic increment is not related to all the other ops that
> > the code that you trace actually does in any shape or form, and so the
> > ordering of the trace doesn't actually imply anything for the ordering of
> > the operations you are tracing!
>
> This reminds me of a naive question that occurred to me while we were
> discussing this at KS. Namely, what does "ordering" mean for events?
>
> An example I'm all too familiar with is the lack of ordering of MMIO on
> big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
> spinlock and doing writel() inside the spinlock and then dropping the
> spinlock (which should be enough to "order" things) might see the
> writel() reach the final device "out of order" because the write has to
> travel through a routed system fabric.
>
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.

In my logdev tracer (see http://rostedt.homelinux.com/logdev) I used an
atomic counter to keep "order". But what I would tell people about what
this order means is that it is an order among multiple traces across
multiple CPUs. That is, if you have:

    CPU 1              CPU 2
    trace_point_a      trace_point_c
    trace_point_b      trace_point_d

If you see in the trace:

    trace_point_a
    trace_point_c

You really do not know which happened first. Simply because trace_point_c
could have been hit first, but for interrupts and nmis and what not,
trace_point_a could have easily been recorded first. But to me,
trace_points are more like memory barriers. If I see:

    trace_point_c
    trace_point_a
    trace_point_b
    trace_point_d

I can assume that everything before trace_point_c happened before
everything after trace_point_a, and that all before trace_point_b
happened before trace_point_d.

One can not assume that the trace points themselves are in order. But you
can assume that the things outside the trace points are, like memory
barriers.

I have found lots of race conditions with my logdev, and it was due to
this "memory barrier" likeness to be able to see the races.

Unfortunately, if you are using an out of sync TSC, you lose even the
memory barrier characteristic of the trace.

-- Steve

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 1:26 ` Roland Dreier
2008-09-23 1:39 ` Steven Rostedt
@ 2008-09-23 2:02 ` Mathieu Desnoyers
2008-09-23 2:26 ` Darren Hart
2008-09-23 3:26 ` Linus Torvalds
2 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 2:02 UTC (permalink / raw)
To: Roland Dreier
Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren,
Frank Ch. Eigler, systemtap-ml

* Roland Dreier (rdreier@cisco.com) wrote:
> > Because all it tells you is the ordering of the atomic increment, not of
> > the caller. The atomic increment is not related to all the other ops that
> > the code that you trace actually does in any shape or form, and so the
> > ordering of the trace doesn't actually imply anything for the ordering of
> > the operations you are tracing!
>
> This reminds me of a naive question that occurred to me while we were
> discussing this at KS. Namely, what does "ordering" mean for events?
>
> An example I'm all too familiar with is the lack of ordering of MMIO on
> big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
> spinlock and doing writel() inside the spinlock and then dropping the
> spinlock (which should be enough to "order" things) might see the
> writel() reach the final device "out of order" because the write has to
> travel through a routed system fabric.
>
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.
>
> - R.
>

Exactly as Linus said, event ordering comes down to this: a choice
between heavy locking around the real operation traced and the tracing
statement itself (irq disable/spinlock) or the acknowledgement that the
ordering is only insured across the actual tracing _instrumentation_.

A worst-case scenario would be to get an interrupt between the "real"
operation (e.g. a memory or mmio write) and the tracing statement, be
scheduled out, which would let a lot of stuff happen between the actual
impact of the operation on kernel memory and the tracing statement
itself. If we want to be _sure_ such a thing never happens, we would then
have to pay the price of heavy locking and that would not be pretty,
especially when complex data structure modifications come into play. I
don't really think anyone with a half-sane mind would want to slow down
such critical kernel operations for the benefit of totally ordered
tracing.

However, in many cases where ordering matters, e.g. to instrument
spinlocks themselves, if we put the instrumentation within the critical
section rather than outside of it, then we benefit from the existing
kernel locking (but only for events related to this specific spinlock).
This is the same for many synchronization primitives, except for atomic
operations, where we have to accept that the order will be imperfect.

So only in the specific case of instrumentation of things like locking,
where it is possible to insure that instrumentation is synchronized with
the instrumented operation, does it make a difference to choose the TSC
(which implies a slight delta between the TSCs due to cache line delays
at synchronization and delay due to TSCs drifts caused by temperature)
over an atomic increment.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 2:02 ` Mathieu Desnoyers
@ 2008-09-23 2:26 ` Darren Hart
2008-09-23 2:31 ` Mathieu Desnoyers
0 siblings, 1 reply; 122+ messages in thread
From: Darren Hart @ 2008-09-23 2:26 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
Frank Ch. Eigler, systemtap-ml

On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org>
> So only in the specific case of instrumentation of things like locking,
> where it is possible to insure that instrumentation is synchronized with
> the instrumented operation, does it make a difference to choose the TSC
> (which implies a slight delta between the TSCs due to cache line delays
> at synchronization and delay due to TSCs drifts caused by temperature)
> over an atomic increment.
>

Hrm, I think that overlooks the other reason to use a time based counter
over an atomic increment: you might care about time. Perhaps one might be
less concerned with the actual order of tightly grouped events and more
concerned with the actual time delta between more temporally distant
events. In that case, using a clocksource would still be valuable.
Although admittedly the caller could embed that in their payload, but
since we seem to agree we need some kind of counter, the time-based
counter appears to be the most flexible.

Thanks,

-- 
Darren Hart

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 2:26 ` Darren Hart
@ 2008-09-23 2:31 ` Mathieu Desnoyers
0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 2:31 UTC (permalink / raw)
To: Darren Hart
Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
Frank Ch. Eigler, systemtap-ml

* Darren Hart (darren@dvhart.com) wrote:
> On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org>
> > So only in the specific case of instrumentation of things like locking,
> > where it is possible to insure that instrumentation is synchronized with
> > the instrumented operation, does it make a difference to choose the TSC
> > (which implies a slight delta between the TSCs due to cache line delays
> > at synchronization and delay due to TSCs drifts caused by temperature)
> > over an atomic increment.
> >
>
> Hrm, I think that overlooks the other reason to use a time based counter over
> an atomic increment: you might care about time. Perhaps one might be less
> concerned with the actual order of tightly grouped events and more concerned
> with the actual time delta between more temporally distant events. In that
> case, using a clocksource would still be valuable. Although admittedly the
> caller could embed that in their payload, but since we seem to agree we need
> some kind of counter, the time-based counter appears to be the most flexible.
>
> Thanks,
>

See my answer to Linus for a proposal on how to do both :)

Mathieu

> --
> Darren Hart
>

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 1:26 ` Roland Dreier
2008-09-23 1:39 ` Steven Rostedt
2008-09-23 2:02 ` Mathieu Desnoyers
@ 2008-09-23 3:26 ` Linus Torvalds
2008-09-23 3:36 ` Mathieu Desnoyers
2008-09-23 3:43 ` Steven Rostedt
2 siblings, 2 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 3:26 UTC (permalink / raw)
To: Roland Dreier
Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren,
Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Roland Dreier wrote:
>
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.

Heh. Yes. In general, there is no single ordering unless you actually use
a serializing lock on all CPU's involved.

And exactly as in the theory of relativity, two people on different CPU's
can actually validly _disagree_ about the ordering of the same event.
There are things that act as "light-cones" and are borders for what
everybody can agree on, but basically, in the absence of explicit locks,
it is very possible that no such thing as "ordering" may even exist.

Now, an atomic increment on a single counter obviously does imply *one*
certain ordering, but it really only defines the ordering of that counter
itself. It does not at all necessarily imply any ordering on the events
that go on around the counter in other unrelated cachelines.

Which is exactly why even a global counter in no way orders "events" in
general, unless those events have something else that does so.

Linus

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 3:26 ` Linus Torvalds
@ 2008-09-23 3:36 ` Mathieu Desnoyers
2008-09-23 4:05 ` Linus Torvalds
2008-09-23 3:43 ` Steven Rostedt
1 sibling, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 3:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren,
Frank Ch. Eigler, systemtap-ml

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
>
>
> On Mon, 22 Sep 2008, Roland Dreier wrote:
> >
> > Just like Einstein said, it really seems to me that the order of things
> > depends on your frame of reference.
>
> Heh. Yes. In general, there is no single ordering unless you actually use
> a serializing lock on all CPU's involved.
>
> And exactly as in the theory of relativity, two people on different CPU's
> can actually validly _disagree_ about the ordering of the same event.
> There are things that act as "light-cones" and are borders for what
> everybody can agree on, but basically, in the absence of explicit locks,
> it is very possible that no such thing as "ordering" may even exist.
>
> Now, an atomic increment on a single counter obviously does imply *one*
> certain ordering, but it really only defines the ordering of that counter
> itself. It does not at all necessarily imply any ordering on the events
> that go on around the counter in other unrelated cachelines.
>
> Which is exactly why even a global counter in no way orders "events" in
> general, unless those events have something else that does so.
>
> Linus
>

Unless I am missing something, in the case we use an atomic operation
which implies memory barriers (cmpxchg and atomic_add_return does), one
can be sure that all memory operations done before the barrier are
completed at the barrier and that all memory ops following the barrier
will happen after.

Did you have something else in mind ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 3:36 ` Mathieu Desnoyers
@ 2008-09-23 4:05 ` Linus Torvalds
0 siblings, 0 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 4:05 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren,
Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Mathieu Desnoyers wrote:
>
> Unless I am missing something, in the case we use an atomic operation
> which implies memory barriers (cmpxchg and atomic_add_return does), one
> can be sure that all memory operations done before the barrier are
> completed at the barrier and that all memory ops following the barrier
> will happen after.

Sure (if you have a barrier - not all architectures will imply that for an
increment).

But that still doesn't mean a thing.

You have two events (a) and (b), and you put trace-points on each. In your
trace, you see (a) before (b) by comparing the numbers. But what does that
mean?

The actual event that you traced is not the trace-point - the trace-point
is more like a fancy "printk". And the fact that one showed up before
another in the trace buffer, doesn't mean that the events _around_ the
trace happened in the same order.

You can use the barriers to make a partial ordering, and if you have a
separate tracepoint for entry into a region and exit, you can perhaps show
that they were totally disjoint. Or maybe they were partially overlapping,
and you'll never know exactly how they overlapped.

Example:

	trace(..);
	do_X();

being executed on two different CPU's. In the trace, CPU#1 was before
CPU#2. Does that mean that "do_X()" happened first on CPU#1?

No.

The only way to show that would be to put a lock around the whole trace
_and_ operation X, ie

	spin_lock(..);
	trace(..);
	do_X();
	spin_unlock(..);

and now, if CPU#1 shows up in the trace first, then you know that do_X()
really did happen first on CPU#1. Otherwise you basically know *nothing*,
and the ordering of the trace events was totally and utterly meaningless.

See? Trace events themselves may be ordered, but the point of the trace
event is never to know the ordering of the trace itself - it's to know the
ordering of the code we're interested in tracing. The ordering of the
trace events themselves is irrelevant and not useful.

And I'd rather see people _understand_ that, than if they think the
ordering is somehow something they can trust.

Btw, if you _do_ have locking, then you can also know that the "do_X()"
operations will be essentially as far apart in some theoretical notion of
"time" (let's imagine that we do have global time, even if we don't) as
the cost of the trace operation and do_X() itself.

So if we _do_ have locking (and thus a valid ordering that actually can
matter), then the TSC doesn't even have to be synchronized on a cycle
basis across CPU's - it just needs to be close enough that you can tell
which one happened first (and with ordering, that's a valid thing to do).

So you don't even need "perfect" synchronization, you just need something
reasonably close, and you'll be able to see ordering from TSC counts
without having that horrible bouncing cross-CPU thing that will impact
performance a lot.

Quite frankly, I suspect that anybody who wants to have a global counter
might as well almost just have a global ring-buffer. The trace events
aren't going to be CPU-local anyway if you need to always update a shared
cacheline - and you might as well make the shared cacheline be the ring
buffer head with a spinlock in it.

That may not be _quite_ true, but it's probably close enough.

Linus

^ permalink raw reply [flat|nested] 122+ messages in thread
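[Editorial sketch, not part of the original message.] The
spin_lock/trace/do_X/spin_unlock pattern from the message can be mimicked
in userspace C to make the point concrete. All names below (big_lock,
trace_log, op_log, traced_do_X) are hypothetical stand-ins, and the
test-and-set loop built on C11 <stdatomic.h> is only a toy replacement for
a kernel spinlock. Because the trace entry and the operation are both
performed under the same lock, the order recorded in the trace is
necessarily the order in which the operations ran.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOG_MAX 16

/* Toy test-and-set lock standing in for the single shared spinlock. */
atomic_flag big_lock = ATOMIC_FLAG_INIT;

int trace_log[LOG_MAX];	/* CPU ids in the order they were traced */
int op_log[LOG_MAX];	/* CPU ids in the order do_X() actually ran */
size_t trace_len, op_len;

/* Stand-in for the trace point: append the caller's id to the trace. */
void trace(int cpu)
{
	assert(trace_len < LOG_MAX);
	trace_log[trace_len++] = cpu;
}

/* Stand-in for the traced operation: record who really ran, and when. */
void do_X(int cpu)
{
	assert(op_len < LOG_MAX);
	op_log[op_len++] = cpu;
}

/* Trace point and operation under the _same_ lock, as in the message. */
void traced_do_X(int cpu)
{
	while (atomic_flag_test_and_set_explicit(&big_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	trace(cpu);
	do_X(cpu);
	atomic_flag_clear_explicit(&big_lock, memory_order_release);
}
```

Without the lock, another caller could run between trace() and do_X(), and
the two logs could then disagree, which is exactly the case where the trace
order tells you nothing about the order of the operations.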
* Re: Unified tracing buffer
2008-09-23 3:26 ` Linus Torvalds
2008-09-23 3:36 ` Mathieu Desnoyers
@ 2008-09-23 3:43 ` Steven Rostedt
2008-09-23 4:10 ` Masami Hiramatsu
2008-09-23 4:19 ` Linus Torvalds
1 sibling, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-23 3:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Linus Torvalds wrote:

>
>
> On Mon, 22 Sep 2008, Roland Dreier wrote:
> >
> > Just like Einstein said, it really seems to me that the order of things
> > depends on your frame of reference.
>
> Heh. Yes. In general, there is no single ordering unless you actually use
> a serializing lock on all CPU's involved.
>
> And exactly as in the theory of relativity, two people on different CPU's
> can actually validly _disagree_ about the ordering of the same event.
> There are things that act as "light-cones" and are borders for what
> everybody can agree on, but basically, in the absence of explicit locks,
> it is very possible that no such thing as "ordering" may even exist.
>
> Now, an atomic increment on a single counter obviously does imply *one*
> certain ordering, but it really only defines the ordering of that counter
> itself. It does not at all necessarily imply any ordering on the events
> that go on around the counter in other unrelated cachelines.
>
> Which is exactly why even a global counter in no way orders "events" in
> general, unless those events have something else that does so.

Hmm, I've been pretty spoiled by x86 mostly ordering things correctly, and
the non-x86 boxes I've used have mostly been UP.

But, with that, with a global atomic counter, and the following trace:

    cpu 0: trace_point_a
    cpu 1: trace_point_c
    cpu 0: trace_point_b
    cpu 1: trace_point_d

Could the event a really come after event d, even though we already hit
event b?

But I guess you are stating the fact that what the computer does
internally, no one really knows. Without the help of real memory barriers,
ordering of memory accesses is mostly determined by tarot cards.

But basically, the perceived result of assembly commands is supposed to be
accurate at the user level. The traces that I've used not only show the
order (or perceived order) of events, but also the output of the corrupted
data when the race happens. Usually with the two together, you can pretty
much guarantee that the traced events actually did occur in the order
presented.

But without some perceived accurate ordering, even when you do see the
corrupted data, the events can easily be misleading, even on an arch that
usually orders the events as seen by the user.

-- Steve

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 3:43 ` Steven Rostedt
@ 2008-09-23 4:10 ` Masami Hiramatsu
2008-09-23 4:17 ` Martin Bligh
2008-09-23 10:53 ` Steven Rostedt
2008-09-23 4:19 ` Linus Torvalds
1 sibling, 2 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 4:10 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Roland Dreier, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

Steven Rostedt wrote:
> But, with that, with a global atomic counter, and the following trace:
>
> cpu 0: trace_point_a
> cpu 1: trace_point_c
> cpu 0: trace_point_b
> cpu 1: trace_point_d
>
>
> Could the event a really come after event d, even though we already hit
> event b?

yes, if event c is an interrupt event :-).

    cpu 0              cpu 1
                       hit event d
    hit event a
    log event a
                       irq event c
                       log event c
    hit event b
    log event b
                       log event d

so, I think if we really need to order events, we have to stop
irq right after hitting an event.

Anyway, in most cases, I think it works, but as accurate as
synchronized-TSC if hardware supports it.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 4:10 ` Masami Hiramatsu
@ 2008-09-23 4:17 ` Martin Bligh
2008-09-23 15:23 ` Masami Hiramatsu
2008-09-23 10:53 ` Steven Rostedt
1 sibling, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 4:17 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Steven Rostedt, Linus Torvalds, Roland Dreier,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

> yes, if event c is an interrupt event :-).
>
>     cpu 0              cpu 1
>                        hit event d
>     hit event a
>     log event a
>                        irq event c
>                        log event c
>     hit event b
>     log event b
>                        log event d
>
> so, I think if we really need to order events, we have to stop
> irq right after hitting an event.

How could you fix that in any practical way?

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 4:17 ` Martin Bligh
@ 2008-09-23 15:23 ` Masami Hiramatsu
0 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 15:23 UTC (permalink / raw)
To: Martin Bligh
Cc: Steven Rostedt, Linus Torvalds, Roland Dreier,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

Hi Martin,

Martin Bligh wrote:
>>                        log event c
>>     hit event b
>>     log event b
>>                        log event d
>>
>> so, I think if we really need to order events, we have to stop
>> irq right after hitting an event.
>
> How could you fix that in any practical way?

In that case, I think it's easy to know that event c is an irq related
event by checking the source code. :-)
And practically, in most cases, I think we can presume what
happened from the event arguments and subsequent events.

Anyway, since there must be a delay between hitting an event (and, as
Linus said, the event happening) and logging it, I think the buffering
mechanism itself should focus on ensuring only writing and reading
order.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 4:10 ` Masami Hiramatsu
2008-09-23 4:17 ` Martin Bligh
@ 2008-09-23 10:53 ` Steven Rostedt
1 sibling, 0 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-23 10:53 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Linus Torvalds, Roland Dreier, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:

> Steven Rostedt wrote:
> > But, with that, with a global atomic counter, and the following trace:
> >
> > cpu 0: trace_point_a
> > cpu 1: trace_point_c
> > cpu 0: trace_point_b
> > cpu 1: trace_point_d
> >
> >
> > Could the event a really come after event d, even though we already hit
> > event b?
>
> yes, if event c is an interrupt event :-).
>
>     cpu 0              cpu 1
>                        hit event d
>     hit event a
>     log event a
>                        irq event c
>                        log event c

heh, This is assuming that event c is in an IRQ handler.

Since I control where event c is, I can prevent that. I'm talking about
the CPU doing something funny that would have c come after d.

But I didn't specify exactly what the events were, so I'll accept that
explanation ;-)

-- Steve

>     hit event b
>     log event b
>                        log event d
>
> so, I think if we really need to order events, we have to stop
> irq right after hitting an event.
>
> Anyway, in most case, I think it works, but as accurate as
> synchronized-TSC if hardware supports it.

^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer
2008-09-23 3:43 ` Steven Rostedt
2008-09-23 4:10 ` Masami Hiramatsu
@ 2008-09-23 4:19 ` Linus Torvalds
2008-09-23 14:12 ` Mathieu Desnoyers
1 sibling, 1 reply; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 4:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Steven Rostedt wrote:
>
> But, with that, with a global atomic counter, and the following trace:
>
> cpu 0: trace_point_a
> cpu 1: trace_point_c
> cpu 0: trace_point_b
> cpu 1: trace_point_d
>
> Could the event a really come after event d, even though we already hit
> event b?

Each tracepoint will basically give a partial ordering (if you make it so,
of course - and on x86 it's hard to avoid it).

And with many trace-points, you can narrow down ordering if you're lucky.

But say that you have code like

    CPU#1           CPU#2

    trace_a         trace_c
    ..              ..
    trace_b         trace_d

and since each CPU itself is obviously strictly ordered, you a priori know
that a < b, and c < d. But your trace buffer can look many different ways:

 - a -> b -> c -> d
   c -> d -> a -> b

   Now you do know that what happened between c and d must all have
   happened entirely after/before the things that happened between
   a and b, and there is no overlap.

   This is only assuming the x86 full memory barrier from a "lock xadd" of
   course, but those are the semantics you'd get on x86. On others, the
   ordering might not be that strong.

 - a -> c -> b -> d
   a -> c -> d -> b

   With these trace point orderings, you really don't know anything at all
   about the order of any access that happened in between. CPU#1 might
   have gone first. Or not. Or partially. You simply do not know.

> But I guess you are stating the fact that what the computer does
> internally, no one really knows. Without the help of real memory barriers,
> ordering of memory accesses is mostly determined by tarot cards.
Well, x86 defines a memory order. But what I'm trying to explain is that memory order still doesn't actually specify what happens to the code that actually does tracing! The trace is only going to show the order of the tracepoints, not the _other_ memory accesses. So you'll have *some* information, but it's very partial. And the thing is, all those other memory accesses are the ones that do all the real work. You'll know they happened _somewhere_ between two tracepoints, but not much more than that. This is why timestamps aren't really any worse than sequence numbers in all practical matters. They'll get you close enough that you can consider them equivalent to a cache-coherent counter, just one that you don't have to take a cache miss for, and that increments on its own! Quite a lot of CPU's have nice, dependable, TSC's that run at constant frequency. And quite a lot of traces care a _lot_ about real time. When you do IO tracing, the problem is almost never about lock ordering or anything like that. You want to see how long a request took. You don't care AT ALL how many tracepoints were in between the beginning and end, you care about how many microseconds there were! Linus ^ permalink raw reply [flat|nested] 122+ messages in thread
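The two cases in the message above can be sketched as a small userspace C helper. This is only an illustration, not code from the thread: the log is modelled as a string of event letters, and `classify` and `enum interval_order` are hypothetical names. It assumes, as the message does, that each tracepoint acts as a full barrier, so a non-overlapping log really does order the work in between.

```c
#include <string.h>

/* Relationship between interval a..b (CPU#1) and c..d (CPU#2) in the log. */
enum interval_order {
    AB_BEFORE_CD,   /* a -> b -> c -> d: no overlap */
    CD_BEFORE_AB,   /* c -> d -> a -> b: no overlap */
    ORDER_UNKNOWN,  /* intervals overlap: nothing can be concluded */
    BAD_TRACE       /* event missing, or per-CPU order violated */
};

/* Index of event `ev` in the log string, or -1 if it is absent. */
static int pos_of(const char *log, char ev)
{
    const char *p = strchr(log, ev);
    return p ? (int)(p - log) : -1;
}

/* Classify a log such as "acbd" (hypothetical helper for illustration). */
enum interval_order classify(const char *log)
{
    int a = pos_of(log, 'a'), b = pos_of(log, 'b');
    int c = pos_of(log, 'c'), d = pos_of(log, 'd');

    /* Each CPU is strictly ordered, so a < b and c < d must hold. */
    if (a < 0 || b < 0 || c < 0 || d < 0 || a > b || c > d)
        return BAD_TRACE;
    if (b < c)
        return AB_BEFORE_CD;
    if (d < a)
        return CD_BEFORE_AB;
    return ORDER_UNKNOWN;
}
```

For the log "acbd" the helper reports ORDER_UNKNOWN, which is the point of the message: the tracepoints interleave, so the trace says nothing about the accesses made between them.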
* Re: Unified tracing buffer 2008-09-23 4:19 ` Linus Torvalds @ 2008-09-23 14:12 ` Mathieu Desnoyers 0 siblings, 0 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 14:12 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Steven Rostedt wrote: > > > > But, with that, with a global atomic counter, and the following trace: > > > > cpu 0: trace_point_a > > cpu 1: trace_point_c > > cpu 0: trace_point_b > > cpu 1: trace_point_d > > > > Could the event a really come after event d, even though we already hit > > event b? > > Each tracepoint will basically give a partial ordering (if you make it so, > of course - and on x86 it's hard to avoid it). > > And with many trace-points, you can narrow down ordering if you're lucky. > > But say that you have code like > > CPU#1 CPU#2 > > trace_a trace_c > .. .. > trace_b trace_d > > and since each CPU itself is obviously strictly ordered, you a priori know > that a < b, and c < d. But your trace buffer can look many different ways: > > - a -> b -> c -> d > c -> d -> a -> b > > Now you do know that what happened between c and d must all have > happened entirely after/before the things that happened between > a and b, and there is no overlap. > > This is only assuming the x86 full memory barrier from a "lock xadd" of > course, but those are the semantics you'd get on x86. On others, the > ordering might not be that strong. > Hrm, Documentation/atomic_ops.txt states that : "Unlike the above routines, it is required that explicit memory barriers are performed before and after the operation. It must be done such that all memory operations before and after the atomic operation calls are strongly ordered with respect to the atomic operation itself." 
So on architectures with weaker ordering, the kernel atomic ops already
require that explicit smp_mb() calls are inserted before and after the
atomic increment. The same applies to cmpxchg. Therefore I think it's ok,
given the semantics provided by these two atomic operations, to assume
they imply an smp_mb() on any given architecture. If not, then the
architecture-specific implementation would be broken wrt the semantics.

>  - a -> c -> b -> d
>    a -> c -> d -> b
>
>    With these trace point orderings, you really don't know anything at all
>    about the order of any access that happened in between. CPU#1 might
>    have gone first. Or not. Or partially. You simply do not know.
>

Yep. If two "real kernel" events happen to belong to the same
overlapping time window, there is not much we can know about their
order. Adding tracing statements before and after traced kernel
operations could help to make this window as small as possible, but I
doubt it's worth the performance penalty and event duplication (and
increased trace size).

Mathieu

> > But I guess you are stating the fact that what the computer does
> > internally, no one really knows. Without the help of real memory barriers,
> > ordering of memory accesses is mostly determined by tarot cards.
>
> Well, x86 defines a memory order. But what I'm trying to explain is that
> memory order still doesn't actually specify what happens to the code that
> actually does tracing! The trace is only going to show the order of the
> tracepoints, not the _other_ memory accesses. So you'll have *some*
> information, but it's very partial.
>
> And the thing is, all those other memory accesses are the ones that do all
> the real work. You'll know they happened _somewhere_ between two
> tracepoints, but not much more than that.
>
> This is why timestamps aren't really any worse than sequence numbers in
> all practical matters.
They'll get you close enough that you can consider > them equivalent to a cache-coherent counter, just one that you don't have > to take a cache miss for, and that increments on its own! > > Quite a lot of CPU's have nice, dependable, TSC's that run at constant > frequency. > > And quite a lot of traces care a _lot_ about real time. When you do IO > tracing, the problem is almost never about lock ordering or anything like > that. You want to see how long a request took. You don't care AT ALL how > many tracepoints were in between the beginning and end, you care about how > many microseconds there were! > > Linus > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:39 ` Linus Torvalds 2008-09-23 1:26 ` Roland Dreier @ 2008-09-23 2:30 ` Mathieu Desnoyers 2008-09-23 3:06 ` Masami Hiramatsu 2 siblings, 0 replies; 122+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:30 UTC (permalink / raw) To: Linus Torvalds Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Masami Hiramatsu wrote: > > > > Sure, atomic counter might be more expensive but accurate for ordering. > > Don't be silly. > > An atomic counter is no more accurate for ordering than anything else. > > Why? > > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! > > Except for a single CPU, of course, but for that case you don't need a > sequence number either, since the ordering is entirely determined by the > ring buffer itself. > > So the counter will be more expensive (cross-cpu cache bouncing for EVERY > SINGLE EVENT), less useful (no real meaning for people who DO want to have > a timestamp), and it's really no more "ordered" than anything that bases > itself on a TSC. > > The fact is, you cannot order operations based on log messages unless you > have a lock around the whole caller - absolutely _no_ amount of locking or > atomic accesses in the log itself will guarantee ordering of the upper > layers. > > And sure, if you have locking at a higher layer, then a sequence number is > sufficient, but on the other hand, so is a well-synchronized TSC. 
>
> So personally, I think that the optimal solution is:
>
>  - let each ring buffer be associated with a "gettimestamp()" function, so
>    that everybody _can_ set it to something of their own. But default to
>    something sane, namely a raw TSC thing.
>
>  - Add synchronization events to the ring buffer often enough that you can
>    make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply
>    noticing when the upper 32 bits change, although you could possibly do
>    it with a heartbeat too.
>

Hi Linus,

I totally agree with all of the above,

>  - Similarly, add a synchronization event when the TSC frequency changes.
>
>  - Make the synchronization packet contain the full 64-bit TSC base, in
>    addition to TSC frequency info _and_ the timebase.
>

If possible, I would recommend staying as far away as possible from
using frequency change events to support broken frequency scaling.
First, some architectures, like AMD, do not trigger an event for every
frequency change on the CPU (if my memory is right, I think the
southbridge temperature throttling does not generate any event), and we
also have to consider the delay between the frequency change and the
moment the instrumentation would be called. Each frequency change event
would therefore increase the overall TSC imprecision, and I doubt we
really want to rely on this for our time base. I welcome people to
prove me wrong or paranoid by providing results proving the correct
synchronicity of such an approach over time. :-)

However, I explain below how LTTng deals with such architectures, which
I consider to fall into the "broken" category.

>  - From those synchronization events, you should be able to get a very
>    accurate timestamp *after* the fact from the raw TSC numbers (ie do all
>    the scaling not when you gather the info, but when you present it),
>    even if you only spent 32 bits of TSC info on 99% of all events (and
>    just had an overflow log occasionally to get the rest of the info)
>
>  - Most people will be _way_ happier with a timestamp that has enough
>    precision to also show ordering (assuming that the caller holds a
>    lock over the operation _including_ the tracing) than they would ever
>    be with a sequence number.
>

One gettimestamp() we could think of, to satisfy both people expecting
tracing to perform a memory barrier and people expecting a timestamp,
would be to write timestamps taken from synchronized TSCs into a
variable with a cmpxchg, which would succeed only if the TSC value is
higher than the value present in this variable. That would give both
the memory barrier behavior and the timestamping. Given the cache-line
bouncing implied, I would only recommend activating this when really
needed, e.g. when debugging race issues.

I currently use this scheme in LTTng to deal with broken x86
architectures with non-synchronized TSCs. Basically, every CPU lagging
behind the others takes the last cycle counter value written to memory
by the previous time base read and adds 200 cycles (or however many
cycles it takes to actually read the TSC) to it.

I also plan to execute a heartbeat on every CPU someday to give an
upper bound to the imprecision of such TSCs, so they could be used for
some sort of performance measurement.

>  - people who really want to can consider the incrementing counter a TSC,
>    but it will suck in so many ways that I bet it will not be very popular
>    at all.
But having the option to set a special timestamp function will > give people the option (on a per-buffer level) to make the "TSC" be a > simple incrementing 32-bit counter using xaddl and the upper bits > incrementing from a timer, but keep that as a "ok, the TSC is really > broken, or this architecture doesn't support any fast cycle counters at > all, or I really don't care about time, just sequence, and I guarantee > I have a single lock in all callers that makes things unambiguous" > I currently use the scheme you propose here on architectures lacking TSC. I hook in the timer interrupt to increment the MSBs. I doubt anyone will ever go the way of locking every caller, but yeah, why not. Also note that getting only 32-bits TSCs on MIPS makes things a bit less simple, but I did an RCU-style adaptation layer which generates a full 64-bits TSC from a 32-bits time base. It's currently in the -lttng tree. > Note the "single lock" part. It's not enough that you make any trace thing > under a lock. They must be under the _same_ lock for all relevant events > for you to be able to say anything about ordering. And that's actually > pretty rare for any complex behavior. > > The timestamping, btw, is likely the most important part of the whole > logging thing. So we need to get it right. But by "right" I mean really > really low-latency so that it's acceptable to everybody, real-time enough > that you can tell how far apart events were, and precise enough that you > really _can_ see ordering. > > The "raw TSC value with correction information" should be able to give you > all of that. At least on x86. On some platforms, the TSC may not give you > enough resolution to get reasonable guesses on event ordering. > Mathieu > Linus > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 122+ messages in thread
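The cmpxchg-based gettimestamp() described in the message above can be sketched in userspace with C11 atomics. This is a minimal illustration under stated assumptions, not LTTng's code: `last_ts`, `TS_STEP` and `monotonic_timestamp` are hypothetical names, the `raw` argument stands in for a per-CPU TSC read, and the step of 200 cycles is taken from the message.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed cost of one time base read, in cycles (figure from the message). */
#define TS_STEP 200u

/* Last timestamp handed out to any CPU (hypothetical shared variable). */
static _Atomic uint64_t last_ts;

/*
 * Return a timestamp that never goes backwards across callers.
 * `raw` stands in for the TSC value read on the calling CPU.
 */
uint64_t monotonic_timestamp(uint64_t raw)
{
    uint64_t prev = atomic_load(&last_ts);

    for (;;) {
        /* A CPU whose counter lags reuses the shared value plus a step. */
        uint64_t next = raw > prev ? raw : prev + TS_STEP;

        /* On failure, prev is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&last_ts, &prev, next))
            return next;
    }
}
```

The default sequentially consistent compare-exchange also orders surrounding memory accesses, which is the "memory barrier behavior" the message mentions. Every caller touches the same cache line, so, as the message says, this is something to enable only when it is really needed.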
* Re: Unified tracing buffer 2008-09-23 0:39 ` Linus Torvalds 2008-09-23 1:26 ` Roland Dreier 2008-09-23 2:30 ` Mathieu Desnoyers @ 2008-09-23 3:06 ` Masami Hiramatsu 2 siblings, 0 replies; 122+ messages in thread From: Masami Hiramatsu @ 2008-09-23 3:06 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Linus, Linus Torvalds wrote: > > On Mon, 22 Sep 2008, Masami Hiramatsu wrote: >> Sure, atomic counter might be more expensive but accurate for ordering. > > Don't be silly. > > An atomic counter is no more accurate for ordering than anything else. > > Why? > > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! > > Except for a single CPU, of course, but for that case you don't need a > sequence number either, since the ordering is entirely determined by the > ring buffer itself. > > So the counter will be more expensive (cross-cpu cache bouncing for EVERY > SINGLE EVENT), less useful (no real meaning for people who DO want to have > a timestamp), and it's really no more "ordered" than anything that bases > itself on a TSC. > > The fact is, you cannot order operations based on log messages unless you > have a lock around the whole caller - absolutely _no_ amount of locking or > atomic accesses in the log itself will guarantee ordering of the upper > layers. Indeed. If TSC(or similar time counter) can provide synchronized-time, I don't have any comment on that(AFAIK, latest x86 and ia64 can provide it). # I might be a bit nervous about Broken TSC... 
> And sure, if you have locking at a higher layer, then a sequence number is > sufficient, but on the other hand, so is a well-synchronized TSC. > > So personally, I think that the optimal solution is: > > - let each ring buffer be associated with a "gettimestamp()" function, so > that everybody _can_ set it to something of their own. But default to > something sane, namely a raw TSC thing. I agree, default to TSC is enough. > - Add synchronization events to the ring buffer often enough that you can > make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply > noticing when the upper 32 bits change, although you could possibly do > it with a heartbeat too. > > - Similarly, add a synchronization event when the TSC frequency changes. > > - Make the synchronization packet contain the full 64-bit TSC base, in > addition to TSC frequency info _and_ the timebase. > > - From those synchronization events, you should be able to get a very > accurate timestamp *after* the fact from the raw TSC numbers (ie do all > the scaling not when you gather the info, but when you present it), > even if you only spent 32 bits of TSC info on 99% of all events (an > just had a overflow log occasionally to get the rest of the info) > > - Most people will be _way_ happier with a timestamp that has enough > precision to also show ordering (assuming that the caller holds a > lock over the operation _including_ the tracing) than they would ever > be with a sequence number. > > - people who really want to can consider the incrementing counter a TSC, > but it will suck in so many ways that I bet it will not be very popular > at all. 
But having the option to set a special timestamp function will > give people the option (on a per-buffer level) to make the "TSC" be a > simple incrementing 32-bit counter using xaddl and the upper bits > incrementing from a timer, but keep that as a "ok, the TSC is really > broken, or this architecture doesn't support any fast cycle counters at > all, or I really don't care about time, just sequence, and I guarantee > I have a single lock in all callers that makes things unambiguous" Thank you very much for giving me a good idea! I agree with you. > Note the "single lock" part. It's not enough that you make any trace thing > under a lock. They must be under the _same_ lock for all relevant events > for you to be able to say anything about ordering. And that's actually > pretty rare for any complex behavior. > > The timestamping, btw, is likely the most important part of the whole > logging thing. So we need to get it right. But by "right" I mean really > really low-latency so that it's acceptable to everybody, real-time enough > that you can tell how far apart events were, and precise enough that you > really _can_ see ordering. > > The "raw TSC value with correction information" should be able to give you > all of that. At least on x86. On some platforms, the TSC may not give you > enough resolution to get reasonable guesses on event ordering. > > Linus -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 122+ messages in thread
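The quoted proposal (raw 32-bit stamps on ordinary events, plus occasional synchronization records carrying the full 64-bit TSC) can be sketched from the reader's side as follows. This is an illustration only: `struct rec`, `REC_SYNC`, `REC_EVENT` and `expand_timestamps` are hypothetical names, not an interface from the thread. It assumes that sync records are logged often enough that every event lies less than 2^32 cycles after the preceding sync, and it leaves out frequency scaling.

```c
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_SYNC, REC_EVENT };

/* One record in a per-CPU buffer (hypothetical layout for illustration). */
struct rec {
    enum rec_type type;
    uint64_t full_tsc;  /* valid for REC_SYNC: full 64-bit TSC base */
    uint32_t tsc_lo;    /* valid for REC_EVENT: low 32 bits of the TSC */
};

/*
 * Rebuild 64-bit timestamps for the events in recs[0..n-1] and store them
 * in out[]. Returns the number of timestamps written. Events seen before
 * the first sync record cannot be placed and are skipped.
 */
size_t expand_timestamps(const struct rec *recs, size_t n, uint64_t *out)
{
    uint64_t base = 0;
    int have_base = 0;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].type == REC_SYNC) {
            base = recs[i].full_tsc;
            have_base = 1;
            continue;
        }
        if (!have_base)
            continue;

        /* Cycles elapsed since the sync, computed modulo 2^32. */
        uint32_t delta = recs[i].tsc_lo - (uint32_t)base;
        out[count++] = base + delta;
    }
    return count;
}
```

Doing the expansion at read time keeps the write path to a 32-bit store per event, which matches the "do all the scaling not when you gather the info, but when you present it" point in the quoted text.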
* Re: Unified tracing buffer
2008-09-22 22:25 ` Masami Hiramatsu
2008-09-22 23:11 ` Darren Hart
2008-09-22 23:16 ` Martin Bligh
@ 2008-09-23 14:36 ` KOSAKI Motohiro
2008-09-23 15:02 ` Frank Ch. Eigler
2008-09-23 15:21 ` Masami Hiramatsu
2 siblings, 2 replies; 122+ messages in thread
From: KOSAKI Motohiro @ 2008-09-23 14:36 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
od, Frank Ch. Eigler, systemtap-ml

> By the way, systemtap uses two modes;
>
> - single-channel mode
>   In this mode, all cpus share one buffer channel to write and read.
>   each writer locks spinlock and write a probe-local data to buffer.
>
> - per-cpu buffer mode
>   In this mode, we use an atomic sequential number for ordering data. If
>   user doesn't need it(because they have their own timestamps), they can
>   choose not to use that seq-number.

I can't imagine what the merit of the single-channel mode is.
Could you please explain it?

Is it because some architectures don't have fine-grained timestamps?
If so, could you explain which architectures lack them?
* Re: Unified tracing buffer
2008-09-23 14:36 ` KOSAKI Motohiro
@ 2008-09-23 15:02 ` Frank Ch. Eigler
2008-09-23 15:21 ` Masami Hiramatsu
1 sibling, 0 replies; 122+ messages in thread
From: Frank Ch. Eigler @ 2008-09-23 15:02 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
od, systemtap-ml

Hi -

On Tue, Sep 23, 2008 at 11:36:26PM +0900, KOSAKI Motohiro wrote:
> > By the way, systemtap uses two modes;
> >
> > - single-channel mode
> >   In this mode, all cpus share one buffer channel to write and read.
> >   each writer locks spinlock and write a probe-local data to buffer.
> > - per-cpu buffer mode [...]
>
> I can't imagine a merit of the single-channel mode.
> Could you please explain it?

It could be a way of saving some memory and merging hassle for
low-throughput data. (Remember that systemtap enables in-situ
analysis of events, so that often only brief final results need be
sent out.) If timestampwise cross-cpu merging can be done on demand
by the hypothetical future buffer widget, then little reason remains
not to use it.

- FChE
* Re: Unified tracing buffer
2008-09-23 14:36 ` KOSAKI Motohiro
2008-09-23 15:02 ` Frank Ch. Eigler
@ 2008-09-23 15:21 ` Masami Hiramatsu
2008-09-23 17:59 ` KOSAKI Motohiro
1 sibling, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 15:21 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, systemtap-ml

Hi Motohiro,

KOSAKI Motohiro wrote:
>> By the way, systemtap uses two modes;
>>
>> - single-channel mode
>>   In this mode, all cpus share one buffer channel to write and read.
>>   each writer locks spinlock and write a probe-local data to buffer.
>>
>> - per-cpu buffer mode
>>   In this mode, we use an atomic sequential number for ordering data. If
>>   user doesn't need it(because they have their own timestamps), they can
>>   choose not to use that seq-number.
>
> I can't imagine a merit of the single-channel mode.
> Could you please explain it?

Actually, single-channel mode is for tracing infrequent events.
At least in the systemtap case, sometimes we just want to collect data
and watch it periodically (like 'top'), or just monitor errors as an
additional printk. In these cases, overhead is not so important.

I think the main reason for using single-channel mode is the simplicity
of the userspace reader. We can use 'cat' or 'tail' to read the buffer
on-line. I'm not sure how much overhead an ftrace-like buffer merging
routine has, but if the kernel provides an interface which gives us a
single merged buffer image (like the ftrace buffer), we are glad to use
it. :-)

> Because some architecture don't have fine grained timestamp?
> if so, could you explain which architecture don't have it?

I heard that get_cycles always returns 0 on some CPUs (ARM, MIPS, etc)...
(I think some of their platforms support variants of get_cycles)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer
2008-09-23 15:21 ` Masami Hiramatsu
@ 2008-09-23 17:59 ` KOSAKI Motohiro
2008-09-23 18:28 ` Martin Bligh
0 siblings, 1 reply; 122+ messages in thread
From: KOSAKI Motohiro @ 2008-09-23 17:59 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List,
Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
od, Frank Ch. Eigler, systemtap-ml

> > I can't imagine a merit of the single-channel mode.
> > Could you please explain it?
>
> Actually, single-channel mode is for not-frequently event tracing.
> At least systemtap case, sometimes we just want to collect data
> and watch it periodically(as like as 'top'). Or, just monitoring
> errors as additional printk. in these cases, overhead is not so
> important.
>
> I think the main reason of using single-channel mode is simplicity of
> userspace reader. We can use 'cat' or 'tail' to read the buffer on-line.
> I'm not sure how much overhead ftrace-like buffer merging routine has,
> but if kernel provides an interface which gives us single-merged buffer
> image(like ftrace buffer), we are glad to use it. :-)

Yup, I also think it is better.

Thanks.
* Re: Unified tracing buffer
2008-09-23 17:59 ` KOSAKI Motohiro
@ 2008-09-23 18:28 ` Martin Bligh
0 siblings, 0 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 18:28 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Masami Hiramatsu, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler, systemtap-ml

>> I think the main reason of using single-channel mode is simplicity of
>> userspace reader. We can use 'cat' or 'tail' to read the buffer on-line.
>> I'm not sure how much overhead ftrace-like buffer merging routine has,
>> but if kernel provides an interface which gives us single-merged buffer
>> image(like ftrace buffer), we are glad to use it. :-)
>
> Yup, I also think it is better.

That was the plan, yes. Merge sort is cheap ;-)
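The merge mentioned above can be sketched as a k-way merge over per-CPU buffers that are each already sorted by timestamp. This is a userspace illustration under stated assumptions, not the planned kernel interface: `struct event`, `struct cpu_buf` and `merge_cpu_buffers` are hypothetical names, and timestamps are assumed comparable across CPUs (for example a synchronized TSC).

```c
#include <stddef.h>
#include <stdint.h>

/* One trace event (hypothetical layout for illustration). */
struct event {
    uint64_t ts;   /* timestamp, comparable across CPUs */
    int cpu;       /* CPU that logged the event */
    int id;        /* event id */
};

/* Read cursor over one per-CPU buffer, sorted by ts. */
struct cpu_buf {
    const struct event *ev;
    size_t len;
    size_t pos;
};

/*
 * Merge all per-CPU buffers into out[] in timestamp order.
 * Returns the number of events written (at most out_cap).
 * On equal timestamps the lower-numbered buffer goes first.
 */
size_t merge_cpu_buffers(struct cpu_buf *bufs, size_t ncpus,
                         struct event *out, size_t out_cap)
{
    size_t n = 0;

    while (n < out_cap) {
        size_t best = ncpus;  /* ncpus means "no candidate yet" */

        for (size_t c = 0; c < ncpus; c++) {
            if (bufs[c].pos == bufs[c].len)
                continue;  /* this buffer is drained */
            if (best == ncpus ||
                bufs[c].ev[bufs[c].pos].ts < bufs[best].ev[bufs[best].pos].ts)
                best = c;
        }
        if (best == ncpus)
            break;  /* every buffer is drained */

        out[n++] = bufs[best].ev[bufs[best].pos++];
    }
    return n;
}
```

The linear scan costs O(ncpus) per emitted event, which is fine for a handful of CPUs; a min-heap keyed on the head timestamp would be the usual choice for large CPU counts.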
* Re: Unified tracing buffer
2008-09-19 21:33 Unified tracing buffer Martin Bligh
` (7 preceding siblings ...)
2008-09-22 19:45 ` Masami Hiramatsu
@ 2008-09-23 3:33 ` Andi Kleen
2008-09-23 3:47 ` Martin Bligh
8 siblings, 1 reply; 122+ messages in thread
From: Andi Kleen @ 2008-09-23 3:33 UTC (permalink / raw)
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

"Martin Bligh" <mbligh@google.com> writes:

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.

This is what relayfs was always promised to be, but apparently
never quite became. But before adding a new one I would recommend
removing relayfs first.

-Andi
* Re: Unified tracing buffer
2008-09-23 3:33 ` Andi Kleen
@ 2008-09-23 3:47 ` Martin Bligh
2008-09-23 5:04 ` Andi Kleen
0 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 3:47 UTC (permalink / raw)
To: Andi Kleen
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

>> During kernel summit and Plumbers conference, Linus and others
>> expressed a desire for a unified
>> tracing buffer system for multiple tracing applications (eg ftrace,
>> lttng, systemtap, blktrace, etc) to use.
>
> This is what relayfs always was promised to be, but apparently
> never quite became. But before adding a new one I would recommend
> to remove relayfs first.

It's a little different, though similar. relayfs is an unstructured
buffer. This would be a sequence of events with a common timestamp
format, and hopefully other commonalities too.

I agree that the underlying buffer structure could be shared, as has
been pointed out (buried in this long thread). However, in another
buried comment, it was pointed out that relayfs would have no users
once this was done, so ...

I don't think we can remove relayfs before adding this and switching over
the users though (possibly it could all be done at the same time, but messy)
* Re: Unified tracing buffer
2008-09-23 3:47 ` Martin Bligh
@ 2008-09-23 5:04 ` Andi Kleen
0 siblings, 0 replies; 122+ messages in thread
From: Andi Kleen @ 2008-09-23 5:04 UTC (permalink / raw)
To: Martin Bligh
Cc: Andi Kleen, Linux Kernel Mailing List, Linus Torvalds,
Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
Frank Ch. Eigler

> I don't think we can remove relayfs before adding this and switching over
> the users though (possibly it could all be done at the same time, but messy)

There are only two in-tree users (+ systemtap), so it shouldn't be a big
hassle. And they're all tracers, so presumably they all pass time stamps.

But please don't add a multitude of new methods to do this.

BTW I've been using something like that since early 2000. It was called
ktrace and was based on an old SGI patch.

-Andi

--
ak@linux.intel.com