Unified tracing buffer

public inbox for linux-kernel@vger.kernel.org

* Unified tracing buffer
@ 2008-09-19 21:33 Martin Bligh
  2008-09-19 21:42 ` Randy Dunlap
                   ` (8 more replies)
  0 siblings, 9 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-19 21:33 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, od, Frank Ch. Eigler

During kernel summit and Plumbers conference, Linus and others expressed
a desire for a unified tracing buffer system for multiple tracing
applications (eg ftrace, lttng, systemtap, blktrace, etc) to use.
This provides several advantages, including the ability to interleave
data from multiple sources, not having to learn 200 different tools,
avoiding duplicated code/effort, etc.

Several of us got together last night and tried to cut this down to the
simplest usable system we could agree on (and nobody got hurt!). This
will form version 1. I've sketched out a few enhancements we know that
we want, but have agreed to leave these until version 2. The answer to
most questions about the below is "yes we know, we'll fix that in
version 2" (or 3). Simplicity was the rule ...

Sketch of design.  Enjoy flaming me. Code will follow shortly.

STORAGE
-------

We will support multiple buffers for different tracing systems, with
separate names and event id spaces.
Event ids are 16 bit, dynamically allocated.
A "one line of text" print function will be provided for each event,
or use the default (probably hex printf).
Will provide a "flight data recorder" mode, and a "spool to disk" mode.

Circular buffer per cpu, protected by per-cpu spinlock_irq
Word aligned records.
Variable record length, header will start with length record.
Timestamps in fixed timebase, monotonically increasing (across all CPUs)
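
As a rough illustration of the record layout described above, here is one
way a variable-length, aligned record could look in C. Only the 16-bit
event id and "length comes first" are from the proposal; the field names,
the other field widths, and the 8-byte alignment (suggested later in the
thread) are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record header. The proposal only says the header starts
 * with a length and that event ids are 16 bit; everything else here is
 * an assumption for illustration. */
struct trace_record_hdr {
	uint16_t length;     /* payload length in bytes */
	uint16_t event_id;   /* id within this buffer's event id space */
	uint32_t reserved;   /* keeps the timestamp 8-byte aligned */
	uint64_t timestamp;  /* fixed timebase, monotonically increasing */
};

#define TRACE_ALIGN 8u       /* assumed record alignment */

/* Round n up to the next multiple of TRACE_ALIGN. */
static size_t trace_align(size_t n)
{
	return (n + TRACE_ALIGN - 1) & ~(size_t)(TRACE_ALIGN - 1);
}

/* Bytes a record with payload_len bytes of data occupies in the buffer. */
static size_t trace_record_size(uint16_t payload_len)
{
	return trace_align(sizeof(struct trace_record_hdr) + payload_len);
}
```

With this layout a reader can walk a per-cpu buffer by reading the length,
computing trace_record_size(length), and skipping ahead by that amount.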

INPUT_FUNCTIONS
---------------

allocate_buffer (name, size)
        return buffer_handle

register_event (buffer_handle, event_id, print_function)
        You can pass in a requested event_id from a fixed set, and will
        be given it, or an error
        0 means allocate me one dynamically
        returns event_id     (or -E_ERROR)

record_event (buffer_handle, event_id, length, *buf)
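
The register_event semantics above (a requested id from a fixed set, or 0
for dynamic allocation) might be sketched as follows. This is a userspace
mock, not kernel code: the table size, the split between fixed and dynamic
ids, the struct layout, and the use of negative errno values for
"-E_ERROR" are all assumptions.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_EVENTS   64   /* illustrative table size, not from the proposal */
#define FIRST_DYN_ID 32   /* assumed split: 1..31 fixed set, 32..63 dynamic */

typedef void (*print_fn)(const void *buf, uint16_t length);

/* Hypothetical buffer handle holding a per-buffer event id space. */
struct trace_buffer {
	const char *name;
	uint8_t     used[MAX_EVENTS];
	print_fn    print[MAX_EVENTS];   /* NULL means "use the default printer" */
};

/* Returns the granted event id, or a negative errno value on failure. */
static int register_event(struct trace_buffer *b, uint16_t event_id, print_fn fn)
{
	uint16_t id = event_id;

	if (event_id == 0) {
		/* Dynamic allocation: take the first free id in the dynamic range. */
		for (id = FIRST_DYN_ID; id < MAX_EVENTS && b->used[id]; id++)
			;
		if (id == MAX_EVENTS)
			return -ENOSPC;
	} else if (event_id >= FIRST_DYN_ID) {
		return -EINVAL;          /* requested id is not in the fixed set */
	} else if (b->used[event_id]) {
		return -EBUSY;           /* fixed id already taken */
	}
	b->used[id] = 1;
	b->print[id] = fn;
	return id;
}
```

Because each buffer carries its own table, two tracers can both register
id 3 without clashing, which matches the per-buffer id spaces described
in STORAGE.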

OUTPUT
------

Data will be output via debugfs, and provide the following output streams:

/debugfs/tracing/<name>/buffers/text
    clear text stream (will merge the per-cpu streams via insertion
    sort, and use the print functions)

/debugfs/tracing/<name>/buffers/binary[cpu_number]
    per-cpu binary data
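
To make the text-stream merge concrete, here is a small sketch of
combining per-cpu streams into one timestamp-ordered stream. Note that it
uses a simple k-way merge (repeatedly taking the stream whose next record
has the smallest timestamp) rather than the insertion sort mentioned
above; the struct and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded record: just a timestamp and the originating cpu. */
struct rec {
	uint64_t ts;
	int cpu;
};

/* One per-cpu stream, already ordered by timestamp within that cpu. */
struct cpu_stream {
	const struct rec *recs;
	size_t len;
	size_t pos;   /* index of the next unread record */
};

/* Merge ncpu streams into out (capacity out_cap); returns records written. */
static size_t merge_streams(struct cpu_stream *s, size_t ncpu,
			    struct rec *out, size_t out_cap)
{
	size_t n = 0;

	while (n < out_cap) {
		size_t best = ncpu;   /* ncpu means "no stream has data left" */

		for (size_t i = 0; i < ncpu; i++) {
			if (s[i].pos == s[i].len)
				continue;
			if (best == ncpu ||
			    s[i].recs[s[i].pos].ts < s[best].recs[s[best].pos].ts)
				best = i;
		}
		if (best == ncpu)
			break;
		out[n++] = s[best].recs[s[best].pos++];
	}
	return n;
}
```

This only produces a correct global order if timestamps really are
monotonic across all CPUs, which is why the fixed timebase in STORAGE
matters for the text output.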

CONTROL
-------

Sysfs style tree under debugfs

/debugfs/tracing/<name>/buffers/enabled         <--- binary value

/debugfs/tracing/<name>/<event1>
/debugfs/tracing/<name>/<event2>
    etc ...
    provides a way to enable/disable events, see what's available, and
    what's enabled.

KNOWN ISSUES / PLANS
--------------------

No way to unregister buffers and events.
    Will provide an unregister_buffer and unregister_event call

Generating systemwide time is hard on some platforms
    Yes. Time-based output provides a lot of simplicity for the user though
    We won't support these platforms at first; we'll add functionality
    to make it work for them later
    (plan based on tick-based ms timing, plus counter offset from that
    if needed).

Spinlock_irq is inefficient, and doesn't support tracing in NMIs
    True. We'll implement a lockless scheme later (see lttng)

Putting a length record in every event is inefficient
    True. Fixed record length with optional extensions is better, but
    more complex. v2.

Putting a full timestamp rather than an offset in every event is inefficient
    See above. True, but v2.

Relayfs already exists! use that!
    People were universally not keen on that idea. Complexity, interface, etc.
    We're also providing some higher level shared functions for time &
    event ids.

There's no way to decode the binary data stream
    Code will be shared from the kernel to decode it, so that we can
    get the compact binary format and decode it later. That code will
    be kept in the kernel tree (it's a trivial piece of C).
    Version 1.1 ;-)

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
@ 2008-09-19 21:42 ` Randy Dunlap
  2008-09-19 21:57   ` Martin Bligh
  2008-09-19 22:28 ` Olaf Dabrunz
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 122+ messages in thread
From: Randy Dunlap @ 2008-09-19 21:42 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

On Fri, 19 Sep 2008 14:33:42 -0700 Martin Bligh wrote:

> STORAGE
> -------
> 
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.

What are these (like)?

> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
> 
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.

Arch-specific "word"?
or some fixed-size-for-all-systems (so that trace buffers can be
shared/used on other systems?)  Preferably the latter.


> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)

what timestamp resolution?


---
~Randy

* Re: Unified tracing buffer
  2008-09-19 21:42 ` Randy Dunlap
@ 2008-09-19 21:57   ` Martin Bligh
  2008-09-19 22:41     ` Olaf Dabrunz
  2008-09-20  8:26     ` Steven Rostedt
  0 siblings, 2 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-19 21:57 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

>> Event ids are 16 bit, dynamically allocated.
>
> What are these (like)?

u16

Sorry, probably lots of implicit assumptions in there that I forgot to explain

> Arch-specific "word"?
> or some fixed-size-for-all-systems (so that trace buffers can be
> shared/used on other systems?)  Preferably the latter.

Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
Steven - this was your thing?

>> Variable record length, header will start with length record.
>> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
>
> what timestamp resolution?

ns is probably sufficient for output, but may need to be higher
internally to get correct ordering of events across CPUs. So, as long
as we record this in the buffer header, the internal resolution
shouldn't be critical. The text print output ... I'd say ns? We can put
it relative to wall time in there, as long as we record it in the
buffer header at trace start. I guess we should document the buffer
header ;-)

* Re: Unified tracing buffer
  2008-09-19 21:57   ` Martin Bligh
@ 2008-09-19 22:41     ` Olaf Dabrunz
  2008-09-19 22:19       ` Martin Bligh
  2008-09-20  8:26     ` Steven Rostedt
  1 sibling, 1 reply; 122+ messages in thread
From: Olaf Dabrunz @ 2008-09-19 22:41 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
	od Frank Ch. Eigler

On 19-Sep-08, Martin Bligh wrote:
> >> Event ids are 16 bit, dynamically allocated.
> >
> > What are these (like)?
> 
> u16
>
> Sorry, probably lots of implicit assumptions in there that I forgot to explain

Ids for event types. Either allocated dynamically, if the tracer needs
new ids on each use, or statically assigned for others (like my fctrace
or Steven's ftrace, I believe). Should we have a reserved range / registry
for static allocation, maybe something like a very simple version of
devices.txt?

> > Arch-specific "word"?
> > or some fixed-size-for-all-systems (so that trace buffers can be
> > shared/used on other systems?)  Preferably the latter.
> 
> Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
> Steven - this was your thing?

Unaligned can be much slower. I guess some very quick tracers can
benefit from alignment.

> as long as we record it in the buffer header at trace start. I guess we should
> document the buffer header ;-)

Yes. :)

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg


* Re: Unified tracing buffer
  2008-09-19 22:41     ` Olaf Dabrunz
@ 2008-09-19 22:19       ` Martin Bligh
  2008-09-20  8:10         ` Olaf Dabrunz
  2008-09-20  8:29         ` Steven Rostedt
  0 siblings, 2 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-19 22:19 UTC (permalink / raw)
  To: Martin Bligh, Randy Dunlap, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, od Frank Ch. Eigler

>> Sorry, probably lots of implicit assumptions in there that I forgot to explain
>
> Ids for event types. Either allocated dynamically, if the tracer needs
> new ids on each use, or statically assigned for others (like my fctrace
> or Steven's ftrace, I believe). Should we have a reserved range / registry
> for static allocation, maybe something like a very simple version of
> devices.txt?

Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
would have a different event-id namespace from blktrace)

* Re: Unified tracing buffer
  2008-09-19 22:19       ` Martin Bligh
@ 2008-09-20  8:10         ` Olaf Dabrunz
  2008-09-20  8:29         ` Steven Rostedt
  1 sibling, 0 replies; 122+ messages in thread
From: Olaf Dabrunz @ 2008-09-20  8:10 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
	od Frank Ch. Eigler

On 19-Sep-08, Martin Bligh wrote:
> >> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> >
> > Ids for event types. Either allocated dynamically, if the tracer needs
> > new ids on each use, or statically assigned for others (like my fctrace
> > or Steven's ftrace, I believe). Should we have a reserved range / registry
> > for static allocation, maybe something like a very simple version of
> > devices.txt?
> 
> Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
> would have a different event-id namespace from blktrace)

Ah, that is right. We can distinguish them.

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg


* Re: Unified tracing buffer
  2008-09-19 22:19       ` Martin Bligh
  2008-09-20  8:10         ` Olaf Dabrunz
@ 2008-09-20  8:29         ` Steven Rostedt
  2008-09-20 11:40           ` Mathieu Desnoyers
  1 sibling, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20  8:29 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, od Frank Ch. Eigler

On Fri, 19 Sep 2008, Martin Bligh wrote:

> >> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> >
> > Ids for event types. Either allocated dynamically, if the tracer needs
> > new ids on each use, or statically assigned for others (like my fctrace
> > or Steven's ftrace, I believe). Should we have a reserved range / registry
> > for static allocation, maybe something like a very simple version of
> > devices.txt?
> 
> Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
> would have a different event-id namespace from blktrace)
> 

Right!

 We stated in our little meeting that the true event id association is 
buffer id / event id tuple. We will not be assigning ranges for events 
for specific tracers. Ftrace will not have its own range. The static ids 
are reserved for the static trace points and some various static trace 
types that the average kernel developer may use.

Think "string event type" for an event type that will simply hold an ASCII
string.

-- Steve

* Re: Unified tracing buffer
  2008-09-20  8:29         ` Steven Rostedt
@ 2008-09-20 11:40           ` Mathieu Desnoyers
  0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-20 11:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Martin Bligh, Randy Dunlap, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Frank Ch. Eigler

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 19 Sep 2008, Martin Bligh wrote:
> 
> > >> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> > >
> > > Ids for event types. Either allocated dynamically, if the tracer needs
> > > new ids on each use, or statically assigned for others (like my fctrace
> > > or Steven's ftrace, I believe). Should we have a reserved range / registry
> > > for static allocation, maybe something like a very simple version of
> > > devices.txt?
> > 
> > Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
> > would have a different event-id namespace from blktrace)
> > 
> 
> Right!
> 
>  We stated in our little meeting that the true event id association is 
> buffer id / event id tuple. We will not be assigning ranges for events 
> for specific tracers. Ftrace will not have its own range. The static ids 
> are reserved for the static trace points and some various static trace 
> types that the average kernel developer may use.
> 

Just to be sure of the meaning of "that the average kernel developer may
use": I would recommend keeping that static ID range only for internal
buffering mechanism events. E.g., if we need to add a periodic event in
every stream so we can detect timestamp wrap-around, that would be part
of the buffering infrastructure itself, and would thus reserve an event
ID.

core_heartbeat (u64 timestamp)

Same thing if we want to export the table that maps:

  event name <-> event ID <-> event typing (includes event size info)

That table can be presented in the buffers (possibly a single metadata
buffer) in the form of two static event IDs:

core_id_name (u16 id, const char *name)
core_id_type (u16 id, const char *type)

Another assumption here, for portability, is to declare the event type
as a (possibly extended) format string. Alternative suggestions are
welcome.
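
A minimal sketch of how one of these metadata events could be serialized
into a buffer, assuming a payload of the 16-bit id in host byte order
followed by the NUL-terminated name. The enum values, the function name,
and the payload layout are all hypothetical; the thread only names the
events and their arguments.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder numeric values for the reserved static ids. */
enum core_event_id {
	CORE_HEARTBEAT = 1,   /* core_heartbeat (u64 timestamp) */
	CORE_ID_NAME   = 2,   /* core_id_name (u16 id, const char *name) */
	CORE_ID_TYPE   = 3,   /* core_id_type (u16 id, const char *type) */
};

/* Pack a core_id_name payload into out: the id (2 bytes, host order),
 * then the name including its terminating NUL.
 * Returns the number of bytes written, or 0 if out is too small. */
static size_t encode_core_id_name(uint8_t *out, size_t cap,
				  uint16_t id, const char *name)
{
	size_t name_len = strlen(name) + 1;

	if (cap < sizeof(id) + name_len)
		return 0;
	memcpy(out, &id, sizeof(id));
	memcpy(out + sizeof(id), name, name_len);
	return sizeof(id) + name_len;
}
```

A decoder reading the metadata buffer would rebuild the id-to-name table
from these records before interpreting the other streams.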

Mathieu

> Think "string event type" for a event type that will simply hold an ASCII 
> string.
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Unified tracing buffer
  2008-09-19 21:57   ` Martin Bligh
  2008-09-19 22:41     ` Olaf Dabrunz
@ 2008-09-20  8:26     ` Steven Rostedt
  2008-09-20 11:44       ` Mathieu Desnoyers
  1 sibling, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20  8:26 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Randy Dunlap, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, od, Frank Ch. Eigler



On Fri, 19 Sep 2008, Martin Bligh wrote:

> >> Event ids are 16 bit, dynamically allocated.
> >
> > What are these (like)?
> 
> u16
> 
> Sorry, probably lots of implicit assumptions in there that I forgot to explain
> 
> > Arch-specific "word"?
> > or some fixed-size-for-all-systems (so that trace buffers can be
> > shared/used on other systems?)  Preferably the latter.
> 
> Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
> Steven - this was your thing?

I'm fine with 8 byte aligned for all.

> 
> >> Variable record length, header will start with length record.
> >> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> >
> > what timestamp resolution?
> 
> ns is probably sufficient for output, but may need to be higher
> internally to get
> correct ordering of events across CPUs. So, as long as we record this in
> the buffer header, the internal resolution shouldn't be critical. The text
> print output ... I'd say ns? We can put it relative to wall time in there,
> as long as we record it in the buffer header at trace start. I guess we should
> document the buffer header ;-)
> 

ftrace has two outputs. One is ns from start of boot, the other is ns
from start of trace. Not sure we need to put wall time in there, unless
we add it in the trace header (as you stated). Probably best for
Version #2.

-- Steve


* Re: Unified tracing buffer
  2008-09-20  8:26     ` Steven Rostedt
@ 2008-09-20 11:44       ` Mathieu Desnoyers
  0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-20 11:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Martin Bligh, Randy Dunlap, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> 
> On Fri, 19 Sep 2008, Martin Bligh wrote:
> 
> > >> Event ids are 16 bit, dynamically allocated.
> > >
> > > What are these (like)?
> > 
> > u16
> > 
> > Sorry, probably lots of implicit assumptions in there that I forgot to explain
> > 
> > > Arch-specific "word"?
> > > or some fixed-size-for-all-systems (so that trace buffers can be
> > > shared/used on other systems?)  Preferably the latter.
> > 
> > Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.
> > Steven - this was your thing?
> 
> I'm fine with 8 byte aligned for all.
> 
> > 
> > >> Variable record length, header will start with length record.
> > >> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> > >
> > > what timestamp resolution?
> > 
> > ns is probably sufficient for output, but may need to be higher
> > internally to get
> > correct ordering of events across CPUs. So, as long as we record this in
> > the buffer header, the internal resolution shouldn't be critical. The text
> > print output ... I'd say ns? We can put it relative to wall time in there,
> > as long as we record it in the buffer header at trace start. I guess we should
> > document the buffer header ;-)
> > 
> 
> ftrace has two outputs. One is ns from start of boot, the other is ns from 
> start of trace. Not sure we need to make wall time in there, unless we add 
> it in the trace header (as you stated). Probably best for Version #2.
> 
> -- Steve
> 

For simplicity and efficiency, I would try to keep the timestamp
recorded for every event as close as possible to what we read from the
hardware, without manipulation. We can save the timestamp value at the
beginning of the trace (in the buffer header) and we can also save the
scaling value in the same header so we can transform stuff like cycle
counts into ns or ps from boot time or from trace start when we
pretty-print.
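
As a sketch of that idea: if the buffer header stores the raw counter
value at trace start plus a multiplier/shift pair, the pretty-printer can
convert raw cycle counts to ns only at output time. The struct and field
names here are hypothetical, and the mult/shift form is just one common
way to express a scaling value.

```c
#include <stdint.h>

/* Hypothetical buffer header fields; names are not from the thread. */
struct trace_buf_hdr {
	uint64_t start_cycles;   /* raw counter value at trace start */
	uint32_t mult;           /* ns = (delta_cycles * mult) >> shift */
	uint32_t shift;
};

/* Convert a raw per-event cycle count to ns since trace start.
 * Large deltas can overflow the 64-bit multiply; a real implementation
 * would have to handle that. */
static uint64_t trace_cycles_to_ns(const struct trace_buf_hdr *h,
				   uint64_t cycles)
{
	return ((cycles - h->start_cycles) * h->mult) >> h->shift;
}
```

For example, with mult = 512 and shift = 10 (a 2 GHz counter, so one
cycle is half a nanosecond), an event recorded 2000 cycles after trace
start prints as 1000 ns.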

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
  2008-09-19 21:42 ` Randy Dunlap
@ 2008-09-19 22:28 ` Olaf Dabrunz
  2008-09-19 22:09   ` Martin Bligh
  2008-09-19 23:18 ` Frank Ch. Eigler
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 122+ messages in thread
From: Olaf Dabrunz @ 2008-09-19 22:28 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, Olaf Dabrunz, Frank Ch. Eigler

Hi Martin and everyone,

thanks for writing this up.

Just one comment:

> /debugfs/tracing/<name>/buffers/enabed         <--- binary value

ASCII "0" and "1"?

And my e-mail is odabrunz@novell.com or preferably od@suse.de, as in
this mail.

Thanks, :)

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg


* Re: Unified tracing buffer
  2008-09-19 22:28 ` Olaf Dabrunz
@ 2008-09-19 22:09   ` Martin Bligh
  0 siblings, 0 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-19 22:09 UTC (permalink / raw)
  To: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt,
	Frank Ch. Eigler
  Cc: Olaf Dabrunz

>> /debugfs/tracing/<name>/buffers/enabed         <--- binary value
>
> ASCII "0" and "1"?

Oops. s/binary/boolean/ and yes, in ASCII.
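
For illustration, parsing such an ASCII boolean control value could look
like the sketch below. The function name and the decision to tolerate a
single trailing newline (as written by "echo 1 > file") are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Parse an ASCII "0" or "1", optionally followed by one newline.
 * Returns 0 and sets *val on success, -1 on malformed input. */
static int parse_enabled(const char *buf, size_t len, bool *val)
{
	if (len >= 1 && buf[len - 1] == '\n')
		len--;   /* drop the newline that echo appends */
	if (len != 1 || (buf[0] != '0' && buf[0] != '1'))
		return -1;
	*val = (buf[0] == '1');
	return 0;
}
```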

* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
  2008-09-19 21:42 ` Randy Dunlap
  2008-09-19 22:28 ` Olaf Dabrunz
@ 2008-09-19 23:18 ` Frank Ch. Eigler
  2008-09-20  8:50   ` Steven Rostedt
  2008-09-20  0:07 ` Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 122+ messages in thread
From: Frank Ch. Eigler @ 2008-09-19 23:18 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od

Hi -

On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified tracing buffer system for multiple
> tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to
> use.

OK.

> [...]
> STORAGE
> -------
> 
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces. [...]

OK.  This is completely orthogonal to ...

> INPUT_FUNCTIONS
> ---------------
> 
> allocate_buffer (name, size)
>         return buffer_handle
> 
> register_event (buffer_handle, event_id, print_function)
>         You can pass in a requested event_id from a fixed set, and
> will be given it, or an error
>         0 means allocate me one dynamically
>         returns event_id     (or -E_ERROR)
> 
> record_event (buffer_handle, event_id, length, *buf)

How do you imagine record_event() being used from the point of view of
the instrumented module?  Is it to be protected by some sort of test
of the control variable?  Is the little binary event buffer supposed
to be constructed unconditionally?  (On the stack?)

You should compare this to markers and tracepoints.  It sounds to me like
this is not that different from

     trace_mark (event_name, "%*b", length, buf);

where the goofy "%*b" could be some magic to identify the proposed
"everything is a short blob" event record type.

By the way, systemtap supports formatted printing that generates
binary records via directives like "%4b" for 4-byte ints.  I wonder if
that would be a suitable facility for this and/or markers to allow
instrumentation code to *generate* those binary event records.

Do you believe that fans of tracepoints would support a single
void*/length struct parametrization?

> Data will be output via debugfs, and provide the following output streams:
> 
> /debugfs/tracing/<name>/buffers/text
>     clear text stream (will merge the per-cpu streams via insertion
> sort, and use the print functions)

Can you spell out this part a little more?  I wonder because at the
tracing miniconf on Wednesday we talked about systemtap's likely need
to *consume* these trace events as they are being generated.

If systemtap can only see them as a binary blob or a rendered ascii
string, they would not be as useful as if the record was decomposable
in kernel.  Perhaps the event-type-registration call can declare the
binary struct, like a perl pack directive ... or a marker (binary)
format string.

> CONTROL
> 
> Sysfs style tree under debugfs
> 
> /debugfs/tracing/<name>/buffers/enabed         <--- binary value
> 
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
>     etc ...
>     provides a way to enable/disable events, see what's available, and
> what's enabled.

This sort of control is (or should be) already available for markers.

- FChE

* Re: Unified tracing buffer
  2008-09-19 23:18 ` Frank Ch. Eigler
@ 2008-09-20  8:50   ` Steven Rostedt
  2008-09-20 13:37     ` Mathieu Desnoyers
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20  8:50 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, od

[ Note, It is very late for me and I probably should be finishing up 
packing for my trip home, but I'm stupid and decided to respond now 
instead ]

On Fri, 19 Sep 2008, Frank Ch. Eigler wrote:

> Hi -
> 
> 
> On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> 
> > During kernel summit and Plumbers conference, Linus and others
> > expressed a desire for a unified tracing buffer system for multiple
> > tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to
> > use.
> 
> OK.
> 
> 
> > [...]
> > STORAGE
> > -------
> > 
> > We will support multiple buffers for different tracing systems, with
> > separate names, event id spaces. [...]
> 
> OK.  This is completely orthogonal to ...
> 
> 
> > INPUT_FUNCTIONS
> > ---------------
> > 
> > allocate_buffer (name, size)
> >         return buffer_handle
> > 
> > register_event (buffer_handle, event_id, print_function)
> >         You can pass in a requested event_id from a fixed set, and
> > will be given it, or an error
> >         0 means allocate me one dynamically
> >         returns event_id     (or -E_ERROR)
> > 
> > record_event (buffer_handle, event_id, length, *buf)
> 
> How do you imagine record_event() being used from the point of view of
> the instrumented module?  Is it to be protected by some sort of test
> of the control variable?  Is the little binary event buffer supposed
> to be constructed unconditionally?  (On the stack?)

The event buffer is allocated when you create the buffer. The tracer will 
do that on initialization.

> 
> 
> You should compare this to markers and tracepoints.  It sounds to me like
> this is not that different from
> 
>      trace_mark (event_name, "%*b", length, buf);
> 
> where the goofy "%*b" could be some magic to identify the proposed
> "everything is a short blob" event record type.

This is completely separate from the trace buffer itself. The tracepoints
and trace markers simply write whatever they want into the trace buffer.
The tracepoints and trace markers are a way to hook points in the code to
do some type of tracing. The trace buffer's record_event is how a tracer
will record the data, if it chooses to do so.

> 
> By the way, systemtap supports formatted printing that generates
> binary records via directives like "%4b" for 4-byte ints.  I wonder if
> that would be a suitable facility for this and/or markers to allow
> instrumentation code to *generate* those binary event records.
> 

The tracepoints and markers can generate anything they want.

> 
> Do you believe that fans of tracepoints would support a single
> void*/length struct parametrization?

The "record_event" would not (I repeat, "not") be in general code. It will
be used by the different tracers. The markers/tracepoints will be in the
code that can be hooked to do things like profiling, or if you want, record
the data into the trace buffer via record_event.

> 
> 
> 
> > Data will be output via debugfs, and provide the following output streams:
> > 
> > /debugfs/tracing/<name>/buffers/text
> >     clear text stream (will merge the per-cpu streams via insertion
> > sort, and use the print functions)
> 
> Can you spell out this part a little more?  I wonder because at the
> tracing miniconf on Wednesday we talked about systemtap's likely need
> to *consume* these trace events as they are being generated.

What does this mean exactly? I should have asked for more details about
this but I was too worried about time constraints (since Linus was
speaking next) to think about it at the time.

That is, when you read a marker, who and when does that data get consumed?
Where does that data go? Do you even need to store it in this ring buffer?

> 
> If systemtap can only see them as a binary blob or a rendered ascii
> string, they would not be as useful as if the record was decomposable
> in kernel.  Perhaps the event-type-registration call can declare the
> binary struct, like a perl pack directive ... or a marker (binary)
> format string.

I don't understand the above. I'm also thinking that we are 
miscommunicating a bit here.

Let me make an example with something that I know, ftrace.

We hit the tracepoint in, let's say, the scheduler. Previously, at
initialization time, ftrace would have registered with this tracepoint
and would have allocated a buffer. When the tracepoint is actually hit, it
jumps to the function that ftrace registered. Then this function would
record the event into the buffer using 'record_event'. At a later time,
the user could read the buffer from the filesystem using either the pretty
print format or raw binary format.
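
The separation between the hook and the buffer in this example can be
sketched in plain userspace C. Everything below is a toy stand-in: the
"tracepoint" is just a function pointer, the buffer is a small byte ring
without locking or timestamps, and all names except record_event are
invented for the illustration.

```c
#include <stdint.h>
#include <string.h>

#define BUF_BYTES 256   /* illustrative size */

/* Toy ring buffer: old data is overwritten once head wraps. */
struct trace_buffer {
	uint8_t data[BUF_BYTES];
	size_t head;            /* next write offset */
	unsigned long events;   /* number of records written */
};

static void put_bytes(struct trace_buffer *b, const void *src, size_t n)
{
	const uint8_t *p = src;

	while (n--) {
		b->data[b->head] = *p++;
		b->head = (b->head + 1) % BUF_BYTES;
	}
}

/* Stand-in for record_event (buffer_handle, event_id, length, *buf). */
static void record_event(struct trace_buffer *b, uint16_t event_id,
			 uint16_t length, const void *buf)
{
	put_bytes(b, &length, sizeof(length));
	put_bytes(b, &event_id, sizeof(event_id));
	put_bytes(b, buf, length);
	b->events++;
}

/* Toy tracepoint: the hook knows nothing about buffers. */
typedef void (*sched_probe_fn)(void *priv, int prev_pid, int next_pid);
static sched_probe_fn sched_probe;
static void *sched_probe_priv;

static void register_sched_probe(sched_probe_fn fn, void *priv)
{
	sched_probe = fn;
	sched_probe_priv = priv;
}

/* Called from the (imaginary) scheduler at the instrumented point. */
static void trace_sched_switch(int prev_pid, int next_pid)
{
	if (sched_probe)
		sched_probe(sched_probe_priv, prev_pid, next_pid);
}

#define EV_SCHED_SWITCH 1   /* placeholder event id */

/* The tracer's probe: this is the only place that calls record_event. */
static void tracer_sched_probe(void *priv, int prev_pid, int next_pid)
{
	int32_t payload[2] = { prev_pid, next_pid };

	record_event(priv, EV_SCHED_SWITCH, sizeof(payload), payload);
}
```

A tracer would call register_sched_probe(tracer_sched_probe, &buffer) at
initialization; until then trace_sched_switch() does nothing, and a
consumer that does not need buffering could register a probe that never
calls record_event at all.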

This code is not replacing tracepoints or markers. When you say you will
consume at reading, it sounds like you don't even need to use the buffer
mechanism and the tracepoints/markers are good enough for you. The
tracepoints and markers are not something that is being replaced. The
trace buffers are just something that we can record data to and
retrieve at a later time.

> 
> 
> > CONTROL
> > 
> > Sysfs style tree under debugfs
> > 
> > /debugfs/tracing/<name>/buffers/enabed         <--- binary value
> > 
> > /debugfs/tracing/<name>/<event1>
> > /debugfs/tracing/<name>/<event2>
> >     etc ...
> >     provides a way to enable/disable events, see what's available, and
> > what's enabled.
> 
> This sort of control is (or should be) already available for markers.

Markers and the buffers are two separate things. Perhaps I'm just tired, 
but I'm thinking that you are thinking we are going to remove markers and 
trace points.

This code is only to give the kernel a ring buffer to use. Not a way to 
put hooks into kernel code. We have tracepoints and markers for that.

-- Steve

* Re: Unified tracing buffer
  2008-09-20  8:50   ` Steven Rostedt
@ 2008-09-20 13:37     ` Mathieu Desnoyers
  2008-09-20 13:51       ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-20 13:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> [ Note, It is very late for me and I probably should be finishing up 
> packing for my trip home, but I'm stupid and decided to respond now 
> instead ]
> 
> 
> On Fri, 19 Sep 2008, Frank Ch. Eigler wrote:
> 
> > Hi -
> > 
> > 
> > On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> > 
> > > During kernel summit and Plumbers conference, Linus and others
> > > expressed a desire for a unified tracing buffer system for multiple
> > > tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to
> > > use.
> > 
> > OK.
> > 
> > 
> > > [...]
> > > STORAGE
> > > -------
> > > 
> > > We will support multiple buffers for different tracing systems, with
> > > separate names, event id spaces. [...]
> > 
> > OK.  This is completely orthogonal to ...
> > 
> > 
> > > INPUT_FUNCTIONS
> > > ---------------
> > > 
> > > allocate_buffer (name, size)
> > >         return buffer_handle
> > > 
> > > register_event (buffer_handle, event_id, print_function)
> > >         You can pass in a requested event_id from a fixed set, and
> > > will be given it, or an error
> > >         0 means allocate me one dynamically
> > >         returns event_id     (or -E_ERROR)
> > > 
> > > record_event (buffer_handle, event_id, length, *buf)
> > 
> > How do you imagine record_event() being used from the point of view of
> > the instrumented module?  Is it to be protected by some sort of test
> > of the control variable?  Is the little binary event buffer supposed
> > to be constructed unconditionally?  (On the stack?)
> 
> The event buffer is allocated when you create the buffer. The tracer will 
> do that on initialization.
> 

I would propose that a disabled event source be cut off as early as
possible, without any conditional test if possible. This is actually
what the Kernel Markers do with the help of Immediate Values. It should
also come with finer-grained filtering based on a global "enable"
variable and per-buffer "enable" variables so a tracer can atomically
start collecting all of its event types. We could then require both
global tracing and per-buffer tracing to be enabled to write into the
buffer.

I think I see where Frank is going, and I agree that this is more or
less what I have had in mind : the Markers could be used as a global
event ID registry and could hold the event name, ID, types (format
string) table. Therefore, the markers would simply become the "Write to
buffer" interface, which would associate IDs automatically and keep the
format strings in a table that is dumped into a metainformation trace
buffer at trace start and whenever IDs are dynamically registered while
the trace is active.
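
A minimal userspace sketch of the registry idea described above, assuming
a fixed-size table and 1-based IDs (0 stays free for "allocate me one
dynamically"). The names event_registry, registry_add() and
registry_dump() are hypothetical and are not part of the markers code;
the dump stands in for what would be written to the metainformation
buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 64	/* arbitrary limit for the sketch */

/* One row of the hypothetical name / format-string table. */
struct event_desc {
	const char *name;
	const char *format;
};

struct event_registry {
	struct event_desc events[MAX_EVENTS];
	uint16_t nr_events;
};

/*
 * Register an event by name and return its ID (1-based), or -1 when the
 * table is full. Registering a known name again returns the existing ID.
 */
static int registry_add(struct event_registry *reg, const char *name,
			const char *format)
{
	assert(reg && name && format);
	for (uint16_t i = 0; i < reg->nr_events; i++)
		if (strcmp(reg->events[i].name, name) == 0)
			return i + 1;
	if (reg->nr_events == MAX_EVENTS)
		return -1;
	reg->events[reg->nr_events].name = name;
	reg->events[reg->nr_events].format = format;
	reg->nr_events++;
	return reg->nr_events;
}

/*
 * Render the table as "id name format" lines, the kind of metadata a
 * userspace consumer would need to decode records. Returns the number of
 * characters written, or -1 if the output buffer is too small.
 */
static int registry_dump(const struct event_registry *reg, char *out,
			 size_t size)
{
	size_t off = 0;

	assert(reg && out && size > 0);
	out[0] = '\0';
	for (uint16_t i = 0; i < reg->nr_events; i++) {
		int n = snprintf(out + off, size - off, "%u %s %s\n",
				 (unsigned int)(i + 1), reg->events[i].name,
				 reg->events[i].format);
		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

A tracer would call registry_add() once per event type at init time and
registry_dump() at trace start, then again whenever a new ID shows up
while the trace is running.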

> > 
> > 
> > You should compare this to markers and tracepoints.  It sounds to me like
> > this is not that different from
> > 
> >      trace_mark (event_name, "%*b", length, buf);
> > 
> > where the goofy "%*b" could be some magic to identify the proposed
> > "everything is a short blob" event record type.
> 
> This is completely separate from the trace buffer itself. The trace points 
> and trace markers simply write whatever they want into the trace buffer.
> The tracepoints and trace markers are something to hook points of code to 
> do some type of tracing. The trace buffer record_event is how it will do 
> it if it chooses to do so.
> 

As I said above, the tracepoints are meant to be an in-kernel API which
instruments the kernel code. That leaves the markers, which are meant to
be exposed to userspace anyway, for such record_event use. They would
actually accomplish two things : they would register the event (just
declaring the marker puts the event in a special section, which is our
mapping table) and would also record the event when enabled and
executed.

> > 
> > By the way, systemtap supports formatted printing that generates
> > binary records via directives like "%4b" for 4-byte ints.  I wonder if
> > that would be a suitable facility for this and/or markers to allow
> > instrumentation code to *generate* those binary event records.
> > 
> 
> The tracepoints and markers can generate anything they want.
> 
> > 
> > Do you believe that fans of tracepoints would support a single
> > void*/length struct parametrization?
> 
> The "record_event" would not (I repeat, "not") be in general code. It will 
> be used by the different tracers. The markers/tracepoints will be in the 
> code that can hooked to do things like profiling, or if you want, record 
> the data into the trace buffer via the record event.
> 
> > 
> > 
> > 
> > > Data will be output via debugfs, and provide the following output streams:
> > > 
> > > /debugfs/tracing/<name>/buffers/text
> > >     clear text stream (will merge the per-cpu streams via insertion
> > > sort, and use the print functions)
> > 
> > Can you spell out this part a little more?  I wonder because at the
> > tracing miniconf on Wednesday we talked about systemtap's likely need
> > to *consume* these trace events as they are being generated.
> 
> What does this mean exactly. I should have asked more details about this
> but I was too worried about time constraints (since Linus was speaking 
> next to think about it at the time.
> 
> That is, when you read a marker, who and when does that data get consumed?
> Where does that data go? Do you even need to store it in this ring buffer?
> 

Given that systemtap may need to access the kernel state at the moment
the instrumentation is reached, I think it implies it has to be
called _before_ the data is written to the buffers. We can thus see
SystemTAP as a very powerful filtering mechanism. SystemTAP could also
choose to directly use the instrumentation available (kprobes,
tracepoints, ...) when it does not need to act as a filter, but more
like a statistic gathering module. So I think we should simply provide a
callback filter registration mechanism in the filtering chain, thus
having a filtering pseudo-code looking like :

if (unlikely(marker_enabled))
  if (likely(global_tracing_enabled))
    if (likely(buffer_tracing_enabled))
      if (likely(call_filter_chain()))
        write to buffer
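
The check order in the pseudo-code above can be sketched in plain
userspace C as follows. Everything here is an assumption made for
illustration: the trace_buffer layout, register_filter(), the fixed
chain length, and the rule that every filter must accept an event. The
likely()/unlikely() hints are dropped because they are kernel macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILTERS 4	/* arbitrary limit for the sketch */

/* Hypothetical filter callback: return true to let the event through. */
typedef bool (*trace_filter_fn)(int event_id, const void *data, size_t len);

struct trace_buffer {
	bool enabled;			/* per-buffer "enable" variable */
	trace_filter_fn filters[MAX_FILTERS];
	size_t nr_filters;
	size_t nr_written;		/* stands in for the real write */
};

static bool global_tracing_enabled;	/* global "enable" variable */

/* Append a callback to the chain; returns 0, or -1 if the chain is full. */
static int register_filter(struct trace_buffer *buf, trace_filter_fn fn)
{
	assert(buf && fn);
	if (buf->nr_filters == MAX_FILTERS)
		return -1;
	buf->filters[buf->nr_filters++] = fn;
	return 0;
}

/* An event is kept only if every registered filter accepts it. */
static bool call_filter_chain(const struct trace_buffer *buf, int event_id,
			      const void *data, size_t len)
{
	for (size_t i = 0; i < buf->nr_filters; i++)
		if (!buf->filters[i](event_id, data, len))
			return false;
	return true;
}

/* Same order as the pseudo-code: marker, global, buffer, filter chain. */
static bool trace_event(struct trace_buffer *buf, bool marker_enabled,
			int event_id, const void *data, size_t len)
{
	if (!marker_enabled)
		return false;
	if (!global_tracing_enabled)
		return false;
	if (!buf->enabled)
		return false;
	if (!call_filter_chain(buf, event_id, data, len))
		return false;
	buf->nr_written++;		/* "write to buffer" */
	return true;
}

/* Example filter: only let even event IDs through. */
static bool even_ids_only(int event_id, const void *data, size_t len)
{
	(void)data;
	(void)len;
	return (event_id % 2) == 0;
}
```

A SystemTAP-like consumer would sit in the chain as one such callback
and see each event before anything is written.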

> > 
> > If systemtap can only see them as a binary blob or a rendered ascii
> > string, they would not be as useful as if the record was decomposable
> > in kernel.  Perhaps the event-type-registration call can declare the
> > binary struct, like a perl pack directive ... or a marker (binary)
> > format string.
> 
> I don't understand the above. I'm also thinking that we are 
> miscommunicating a bit here.
> 
> Let make make an example with something that I know, ftrace.
> 
> We hit the tracepoint in, lets say, scheduler. Previously at 
> initialization time, ftrace would have registered with this trace point 
> and would have allocated a buffer. When the tracepoint is actually hit, it 
> jumps to the function that ftrace registered. Then this function would 
> record the event into the buffer using 'record_event'. At a later time, 
> the user could read the buffer from the filesystem using either the pretty 
> print format or raw binary format.
> 
> This code is not replacing tracepoints or markers. When you say you will 
> consume at reading, it sounds like you don't even need to use the buffer 
> mechanism and the trace points/markers is good enough for you. The 
> tracepoints and markers are not something that is being replaced. The 
> trace buffers are just something to use that we can record data to that we 
> can retrieve at a later time.
> 

If we export the data to record through markers, we would be able to let
SystemTAP (registered in our filter chain) look at the event types being
passed to the buffers and thus perform clever filtering on the
information.

> > 
> > 
> > > CONTROL
> > > 
> > > Sysfs style tree under debugfs
> > > 
> > > /debugfs/tracing/<name>/buffers/enabed         <--- binary value
> > > 
> > > /debugfs/tracing/<name>/<event1>
> > > /debugfs/tracing/<name>/<event2>
> > >     etc ...
> > >     provides a way to enable/disable events, see what's available, and
> > > what's enabled.
> > 
> > This sort of control is (or should be) already available for markers.
> 
> Markers and the buffers are two separate things. Perhaps I'm just tired, 
> but I'm thinking that you are thinking we are going to remove markers and 
> trace points.
> 
> This code is only to give the kernel a ring buffer to use. Not a way to 
> put hooks into kernel code. We have tracepoints and markers for that.
> 

I think what Frank tries to express is that we would not lose any
flexibility, but make life much easier for everyone, if we use the
markers as the API to register event ids, keep their type table and to
export the data at runtime.

One addition that would be needed to the markers is to create a "binary
blob" type (size %lu binary %pW ? W would be for "Write") which would
express that this data is actually only parsable by specialized
functions which know the types embedded within the binary blob. If
necessary, systemtap could then incrementally develop specialized
functions to deal with these blobs, only if required. All the other basic
types would be easy to print or filter with a vsnprintf-like function.

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-20 13:37     ` Mathieu Desnoyers
@ 2008-09-20 13:51       ` Steven Rostedt
  2008-09-20 14:54         ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20 13:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od



On Sat, 20 Sep 2008, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > [ Note, It is very late for me and I probably should be finishing up 
> > packing for my trip home, but I'm stupid and decided to respond now 
> > instead ]

[ Note, I'm now at the airport sipping on coffee, and I had 2 hours
of horrible sleep. ]

> > 
> > 
> > On Fri, 19 Sep 2008, Frank Ch. Eigler wrote:
> > 
> > > Hi -
> > > 
> > > 
> > > On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> > > 
> > > > During kernel summit and Plumbers conference, Linus and others
> > > > expressed a desire for a unified tracing buffer system for multiple
> > > > tracing applications (eg ftrace, lttng, systemtap, blktrace, etc) to
> > > > use.
> > > 
> > > OK.
> > > 
> > > 
> > > > [...]
> > > > STORAGE
> > > > -------
> > > > 
> > > > We will support multiple buffers for different tracing systems, with
> > > > separate names, event id spaces. [...]
> > > 
> > > OK.  This is completely orthogonal to ...
> > > 
> > > 
> > > > INPUT_FUNCTIONS
> > > > ---------------
> > > > 
> > > > allocate_buffer (name, size)
> > > >         return buffer_handle
> > > > 
> > > > register_event (buffer_handle, event_id, print_function)
> > > >         You can pass in a requested event_id from a fixed set, and
> > > > will be given it, or an error
> > > >         0 means allocate me one dynamically
> > > >         returns event_id     (or -E_ERROR)
> > > > 
> > > > record_event (buffer_handle, event_id, length, *buf)
> > > 
> > > How do you imagine record_event() being used from the point of view of
> > > the instrumented module?  Is it to be protected by some sort of test
> > > of the control variable?  Is the little binary event buffer supposed
> > > to be constructed unconditionally?  (On the stack?)
> > 
> > The event buffer is allocated when you create the buffer. The tracer will 
> > do that on initialization.
> > 
> 
> I would propose that a disabled event source be cut off as early as
> possible, without any conditional test if possible. This is actually
> what the Kernel Markers do with the help of Immediate Values. It should
> also come with finer-grained filtering based on a global "enable"
> variable and per-buffer "enable" variables so a tracer can atomically
> start collecting all of its event types. We could then require both
> global tracing and per-buffer tracing to be enabled to write into the
> buffer.
> 
> I think I see where Frank is going, and I agree that this is more or
> less what I have had in mind : the Markers could be used as a global
> event ID registry and could hold the event name, ID, types (format
> string) table. Therefore, the markers would simply become the "Write to
> buffer" interface, which would associate IDs automatically and keep the
> format strings in a table that is dumped into a metainformation trace
> buffer at trace start and whenever IDs are dynamically registered while
> the trace is active.
> 
> > > 
> > > 
> > > You should compare this to markers and tracepoints.  It sounds to me like
> > > this is not that different from
> > > 
> > >      trace_mark (event_name, "%*b", length, buf);
> > > 
> > > where the goofy "%*b" could be some magic to identify the proposed
> > > "everything is a short blob" event record type.
> > 
> > This is completely separate from the trace buffer itself. The trace points 
> > and trace markers simply write whatever they want into the trace buffer.
> > The tracepoints and trace markers are something to hook points of code to 
> > do some type of tracing. The trace buffer record_event is how it will do 
> > it if it chooses to do so.
> > 
> 
> As I said above, the tracepoints are meant to be an in-kernel API which
> instruments the kernel code. That leaves the markers, which are meant to
> be exposed to userspace anyway, for such record_event use. They would
> actually accomplish two things : they would register the event (just
> declaring the marker puts the event in a special section, which is our
> mapping table) and would also record the event when enabled and
> executed.
> 
> 
> > > 
> > > By the way, systemtap supports formatted printing that generates
> > > binary records via directives like "%4b" for 4-byte ints.  I wonder if
> > > that would be a suitable facility for this and/or markers to allow
> > > instrumentation code to *generate* those binary event records.
> > > 
> > 
> > The tracepoints and markers can generate anything they want.
> > 
> > > 
> > > Do you believe that fans of tracepoints would support a single
> > > void*/length struct parametrization?
> > 
> > The "record_event" would not (I repeat, "not") be in general code. It will 
> > be used by the different tracers. The markers/tracepoints will be in the 
> > code that can hooked to do things like profiling, or if you want, record 
> > the data into the trace buffer via the record event.
> > 
> > > 
> > > 
> > > 
> > > > Data will be output via debugfs, and provide the following output streams:
> > > > 
> > > > /debugfs/tracing/<name>/buffers/text
> > > >     clear text stream (will merge the per-cpu streams via insertion
> > > > sort, and use the print functions)
> > > 
> > > Can you spell out this part a little more?  I wonder because at the
> > > tracing miniconf on Wednesday we talked about systemtap's likely need
> > > to *consume* these trace events as they are being generated.
> > 
> > What does this mean exactly. I should have asked more details about this
> > but I was too worried about time constraints (since Linus was speaking 
> > next to think about it at the time.
> > 
> > That is, when you read a marker, who and when does that data get consumed?
> > Where does that data go? Do you even need to store it in this ring buffer?
> > 
> 
> Given that systemtap may need to access the kernel state at the moment
> the instrumentation is reached, I think it implies it has to be
> called _before_ the data is written to the buffers. We can thus see
> SystemTAP as a very powerful filtering mechanism. SystemTAP could also
> choose to directly use the instrumentation available (kprobes,
> tracepoints, ...) when it does not need to act as a filter, but more
> like a statistic gathering module. So I think we should simply provide a
> callback filter registration mechanism in the filtering chain, thus
> having a filtering pseudo-code looking like :
> 
> if (unlikely(marker_enabled))
>   if (likely(global_tracing_enabled))
>     if (likely(buffer_tracing_enabled))
>       if (likely(call_filter_chain()))
>         write to buffer
> 
> > > 
> > > If systemtap can only see them as a binary blob or a rendered ascii
> > > string, they would not be as useful as if the record was decomposable
> > > in kernel.  Perhaps the event-type-registration call can declare the
> > > binary struct, like a perl pack directive ... or a marker (binary)
> > > format string.
> > 
> > I don't understand the above. I'm also thinking that we are 
> > miscommunicating a bit here.
> > 
> > Let make make an example with something that I know, ftrace.
> > 
> > We hit the tracepoint in, lets say, scheduler. Previously at 
> > initialization time, ftrace would have registered with this trace point 
> > and would have allocated a buffer. When the tracepoint is actually hit, it 
> > jumps to the function that ftrace registered. Then this function would 
> > record the event into the buffer using 'record_event'. At a later time, 
> > the user could read the buffer from the filesystem using either the pretty 
> > print format or raw binary format.
> > 
> > This code is not replacing tracepoints or markers. When you say you will 
> > consume at reading, it sounds like you don't even need to use the buffer 
> > mechanism and the trace points/markers is good enough for you. The 
> > tracepoints and markers are not something that is being replaced. The 
> > trace buffers are just something to use that we can record data to that we 
> > can retrieve at a later time.
> > 
> 
> If we export the data to record through markers, we would be able to let
> SystemTAP (registered in our filter chain) look at the event types being
> passed to the buffers and thus perform clever filtering on the
> information.
> 
> > > 
> > > 
> > > > CONTROL
> > > > 
> > > > Sysfs style tree under debugfs
> > > > 
> > > > /debugfs/tracing/<name>/buffers/enabed         <--- binary value
> > > > 
> > > > /debugfs/tracing/<name>/<event1>
> > > > /debugfs/tracing/<name>/<event2>
> > > >     etc ...
> > > >     provides a way to enable/disable events, see what's available, and
> > > > what's enabled.
> > > 
> > > This sort of control is (or should be) already available for markers.
> > 
> > Markers and the buffers are two separate things. Perhaps I'm just tired, 
> > but I'm thinking that you are thinking we are going to remove markers and 
> > trace points.
> > 
> > This code is only to give the kernel a ring buffer to use. Not a way to 
> > put hooks into kernel code. We have tracepoints and markers for that.
> > 
> 
> I think what Frank tries to express is that we would not lose any
> flexibility, but make life much easier for everyone, if we use the
> markers as the API to register event ids, keep their type table and to
> export the data at runtime.

No, absolutely not!

Sorry, I don't want to touch markers. I'm fine with tracepoints, but
there should be no need to use a damn marker if I want to use the tracer.
I shouldn't need to even touch tracepoints to use the trace buffer.
That is making things too complicated again. The tracepoints and markers 
should allow you to hook into the buffers. They are separate. I can 
imagine using tracepoints without needing buffers and I can see using the 
buffers without using tracepoints or markers. They are separate things. Do 
not bind the use of the buffers around markers.


Markers are great for you and for many others, but this is about the 
tracing mechanism and one should not be forced to use markers if they want 
to do a trace.

-- Steve


> 
> One addition that would be needed to the markers is to create a "binary
> blob" type (size %lu binary %pW ? W would be for "Write") which would
> express that this data is actually only parsable by specialized
> functions which know the types embedded within the binary blob. If
> necessary, systemtap could then incrementally develop specialized
> functions to deal with these blobs, only if required. All the other basic
> types would be easy to print or filter with a vsnprintf-like function.
> 
> Mathieu
> 
> > -- Steve
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 


* Re: Unified tracing buffer
  2008-09-20 13:51       ` Steven Rostedt
@ 2008-09-20 14:54         ` Steven Rostedt
  2008-09-22 18:45           ` Mathieu Desnoyers
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20 14:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od



On Sat, 20 Sep 2008, Steven Rostedt wrote:

> 
> 
> > > > 
> > > Markers and the buffers are two separate things. Perhaps I'm just tired, 
> > > but I'm thinking that you are thinking we are going to remove markers and 
> > > trace points.
> > > 
> > > This code is only to give the kernel a ring buffer to use. Not a way to 
> > > put hooks into kernel code. We have tracepoints and markers for that.
> > > 
> > 
> > I think what Frank tries to express is that we would not lose any
> > flexibility, but make life much easier for everyone, if we use the
> > markers as the API to register event ids, keep their type table and to
> > export the data at runtime.
> 
> No, absolutely not!
> 
> Sorry, I don't want to touch markers. I'm fine with tracepoints, but
> there should be no need to use a damn marker if I want to use the tracer.
> I shouldn't need to even touch tracepoints to use the trace buffer.
> That is making things too complicated again. The tracepoints and markers 
> should allow you to hook into the buffers. They are separate. I can 
> imagine using tracepoints without needing buffers and I can see using the 
> buffers without using tracepoints or markers. They are separate things. Do 
> not bind the use of the buffers around markers.
> 
> 
> Markers are great for you and for many others, but this is about the 
> tracing mechanism and one should not be forced to use markers if they want 
> to do a trace.
> 

Mathieu,

Think about the function tracer itself. It gets called at every function, 
where I record the interrupts enabled state, task pid, preempt state, 
function addr, and parent function addr. (that's just off the top of my 
head, I may even record more).

What I don't want is a:

function_call(unsigned long func, unsigned long parent)
{
	struct ftrace_event event;

	event.pid = current->pid;
	event.pc = preempt_count();
	event.irq = local_irq_flags();
	event.func = func;
	event.parent = parent;

	trace_mark(func_event_id, "%p",
		sizeof(event), &event);
}


and then to turn on function tracing, I need to hook into this marker. I'd 
rather just push the data right into the buffer here without having to 
make another function call to hook into this.

I'd rather have instead a simple:

	struct ftrace_event *event;

	event = ring_buffer_reserve(func_event_id,
				sizeof(*event));

	event->pid = current->pid;
	event->pc = preempt_count();
	event->irq = local_irq_flags();
	event->func = func;
	event->parent = parent;

	ring_buffer_commit(event);
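
A minimal userspace sketch of the reserve/commit pattern above, to make
the "write in place, no intermediate copy" point concrete. It is a model
under stated assumptions, not the proposed kernel code: it takes an
explicit buffer handle, does no locking and no wrap-around, and the
names ring_init(), ring_reserve(), ring_commit() and the record_header
layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record header; the proposal only says it starts with a length. */
struct record_header {
	uint16_t length;	/* payload bytes */
	uint16_t event_id;
	uint32_t committed;	/* set by ring_commit() */
};

struct ring {
	unsigned char *data;	/* malloc()ed, so aligned for any type */
	size_t size;
	size_t head;		/* offset of the next free byte */
	size_t nr_committed;
};

/* Example payload, modelled on the fields listed in the mail. */
struct ftrace_event {
	int pid;
	int pc;
	unsigned long irq;
	unsigned long func;
	unsigned long parent;
};

static int ring_init(struct ring *r, size_t size)
{
	assert(r && size);
	r->data = malloc(size);
	r->size = size;
	r->head = 0;
	r->nr_committed = 0;
	return r->data ? 0 : -1;
}

static void ring_destroy(struct ring *r)
{
	free(r->data);
	r->data = NULL;
}

/*
 * Reserve room for one record and return a pointer to its payload, or
 * NULL when it does not fit. Payloads are padded to 8 bytes so the next
 * header stays aligned.
 */
static void *ring_reserve(struct ring *r, uint16_t event_id, uint16_t length)
{
	size_t need = sizeof(struct record_header) +
		      (((size_t)length + 7) & ~(size_t)7);
	struct record_header *hdr;

	if (need > r->size - r->head)
		return NULL;
	hdr = (struct record_header *)(r->data + r->head);
	hdr->length = length;
	hdr->event_id = event_id;
	hdr->committed = 0;
	r->head += need;
	return hdr + 1;		/* payload starts right after the header */
}

/* Mark a previously reserved record as complete and visible to readers. */
static void ring_commit(struct ring *r, void *payload)
{
	struct record_header *hdr = (struct record_header *)payload - 1;

	assert(!hdr->committed);
	hdr->committed = 1;
	r->nr_committed++;
}
```

The caller fills the ftrace_event fields directly through the pointer
returned by ring_reserve() and then calls ring_commit(), so the event is
never built on the stack first.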


-- Steve



* Re: Unified tracing buffer
  2008-09-20 14:54         ` Steven Rostedt
@ 2008-09-22 18:45           ` Mathieu Desnoyers
  2008-09-22 21:39             ` Steven Rostedt
  0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-22 18:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> 
> On Sat, 20 Sep 2008, Steven Rostedt wrote:
> 
> > 
> > 
> > > > > 
> > > > Markers and the buffers are two separate things. Perhaps I'm just tired, 
> > > > but I'm thinking that you are thinking we are going to remove markers and 
> > > > trace points.
> > > > 
> > > > This code is only to give the kernel a ring buffer to use. Not a way to 
> > > > put hooks into kernel code. We have tracepoints and markers for that.
> > > > 
> > > 
> > > I think what Frank tries to express is that we would not lose any
> > > flexibility, but make life much easier for everyone, if we use the
> > > markers as the API to register event ids, keep their type table and to
> > > export the data at runtime.
> > 
> > No, absolutely not!
> > 
> > Sorry, I don't want to touch markers. I'm fine with tracepoints, but
> > there should be no need to use a damn marker if I want to use the tracer.
> > I shouldn't need to even touch tracepoints to use the trace buffer.
> > That is making things too complicated again. The tracepoints and markers 
> > should allow you to hook into the buffers. They are separate. I can 
> > imagine using tracepoints without needing buffers and I can see using the 
> > buffers without using tracepoints or markers. They are separate things. Do 
> > not bind the use of the buffers around markers.
> > 
> > 
> > Markers are great for you and for many others, but this is about the 
> > tracing mechanism and one should not be forced to use markers if they want 
> > to do a trace.
> > 
> 

Hi Steven,

As I expressed above, this is merely one way I propose data could be
exported to user-space. If you have other simpler design ideas in mind,
I look forward to hearing them so we can discuss the technical
difficulties associated with that kind of exercise : sending binary data
across the kernel-userspace boundary.

See below for comments.

> Mathieu,
> 
> Think about the function tracer itself. It gets called at every function, 
> where I record the interrupts enabled state, task pid, preempt state, 
> function addr, and parent function addr. (that's just off the top of my 
> head, I may even record more).
> 
> What I don't want is a:
> 
> function_call(unsigned long func, unsigned long parent)
> {
> 	struct ftrace_event event;
> 
> 	event.pid = current->pid;
> 	event.pc = preempt_count();
> 	event.irq = local_irq_flags();
> 	event.func = func;
> 	event.parent = parent;
> 
> 	trace_mark(func_event_id, "%p",
> 		sizeof(event), &event);
> }
> 
> 
> and then to turn on function tracing, I need to hook into this marker. I'd 
> rather just push the data right into the buffer here without having to 
> make another function call to hook into this.
> 
> I'd rather have instead a simple:
> 
> 	struct ftrace_event *event;
> 
> 	event = ring_buffer_reserve(func_event_id,
> 				sizeof(*event));
> 
> 	event->pid = current->pid;
> 	event->pc = preempt_count();
> 	event->irq = local_irq_flags();
> 	event->func = func;
> 	event->parent = parent;
> 
> 	ring_buffer_commit(event);
> 

The scheme you propose here is based on a few inherent assumptions :

- You assume ring_buffer_reserve() and ring_buffer_commit() are static
  inline and thus do not turn into function calls.
- You assume these are small enough so they can be inlined without
  causing L1 insn cache thrashing when tracing is activated.
- You therefore assume they use a locking scheme that lets them be
  really really compact (e.g. interrupt disable and spin lock).
- You assume that the performance impact of doing a function call is
  bigger than the impact of locking, which is false by at least a factor
  of 10.

Interrupt disable and spin locks are _really_ slow. So I think putting
the function call concern up front here is really a matter of premature
optimization gone wrong.

I got burned by this earlier in the history of LTTng. The first versions
had a code generator which created specialized code to serialize the
information into the buffers, exactly like you propose to do. But the
overall impact on kernel code size ended up being too big because we
had to repeat all the code to deal with the buffers for every different
type.

However, I think there might be a way to satisfy us both. An information
source like the dynamic function tracer happens to fit a particular
use-case where one single execution site is used to format the data
received as parameter for a _lot_ of instrumented sites, and the type
and event names happen to be the same everywhere. This would therefore
benefit greatly from having the capability to write directly into the
buffers.

The thing is that I would like ftrace to expose the types it expects to
write into the trace buffers so a generic trace buffer userspace
consumer could read it.

One way to do it, which would let you write data directly into the
buffers, would be something like : (before anyone says "that's many
lines of code", please compile it and look at the assembly result. A lot
of this translates into precomputed values, especially for the event size
computation). (And no, the following code has not been compile-tested)

include/linux/someheader.h :

/* Calculate the offset needed to align the type */
static inline unsigned int var_align(size_t align_drift,
                 size_t size_of_type)
{
        size_t alignment = min(sizeof(void *), size_of_type);
        return ((alignment - align_drift) & (alignment-1));
}

kernel/trace/ftrace.c :

/*
 * the following macro would only do the "declaration" part of the
 * markers, without doing all the function call stuff.
 */
DECLARE_MARKER(function_entry,
  "pid %d pc %d flags %lu func 0x%lX parent 0x%lX");

void ftrace_mcount(unsigned long ip, unsigned long parent_ip)
{
  size_t ev_size = 0;
  char *buffer;

  /*
   * We assume event payload aligned on sizeof(void *).
   * Event size calculated statically.
   */
  ev_size += sizeof(int);
  ev_size += var_align(ev_size, sizeof(int));
  ev_size += sizeof(int);
  ev_size += var_align(ev_size, sizeof(unsigned long));
  ev_size += sizeof(unsigned long);
  ev_size += var_align(ev_size, sizeof(unsigned long));
  ev_size += sizeof(unsigned long);
  ev_size += var_align(ev_size, sizeof(unsigned long));
  ev_size += sizeof(unsigned long);

  /*
   * Now reserve space and copy data.
   */
  buffer = ring_buffer_reserve(func_event_id, ev_size);
  /* Write pid */
  *(int *)buffer = current->pid;
  buffer += sizeof(int);

  /* Write pc */
  buffer += var_align((size_t)buffer, sizeof(int));
  *(int *)buffer = preempt_count();
  buffer += sizeof(int);

  /* Write flags */
  buffer += var_align((size_t)buffer, sizeof(unsigned long));
  *(unsigned long *)buffer = local_irq_flags();
  buffer += sizeof(unsigned long);

  /* Write func (the ip argument) */
  buffer += var_align((size_t)buffer, sizeof(unsigned long));
  *(unsigned long *)buffer = ip;
  buffer += sizeof(unsigned long);

  /* Write parent (the parent_ip argument) */
  buffer += var_align((size_t)buffer, sizeof(unsigned long));
  *(unsigned long *)buffer = parent_ip;
  buffer += sizeof(unsigned long);

  ring_buffer_commit(buffer, ev_size);
}

Would that be suitable for you ?
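
To make the "event size calculated statically" claim easier to check,
here is a small userspace sketch of the same padding arithmetic. It
copies the var_align() formula from the listing above (with a ternary in
place of the kernel's min()); add_field() and function_entry_size() are
hypothetical helpers added only for this illustration, and offsets are
taken relative to an event start that is assumed to be aligned on
sizeof(void *).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace copy of the var_align() formula: padding needed so that a
 * field of size_of_type bytes placed at offset align_drift is naturally
 * aligned, capped at pointer alignment. Only valid for power-of-two
 * sizes, which the assert checks.
 */
static size_t var_align(size_t align_drift, size_t size_of_type)
{
	size_t alignment = size_of_type < sizeof(void *) ?
			   size_of_type : sizeof(void *);

	assert(alignment && (alignment & (alignment - 1)) == 0);
	return (alignment - align_drift) & (alignment - 1);
}

/* Pad the running offset for one field, then account for the field. */
static size_t add_field(size_t offset, size_t size_of_type)
{
	offset += var_align(offset, size_of_type);
	return offset + size_of_type;
}

/* Size of the "pid pc flags func parent" payload from the example. */
static size_t function_entry_size(void)
{
	size_t off = 0;

	off = add_field(off, sizeof(int));		/* pid */
	off = add_field(off, sizeof(int));		/* pc */
	off = add_field(off, sizeof(unsigned long));	/* flags */
	off = add_field(off, sizeof(unsigned long));	/* func */
	off = add_field(off, sizeof(unsigned long));	/* parent */
	return off;
}
```

With 4-byte int the two leading ints already end on an 8-byte boundary,
so no padding is inserted and the total is simply the sum of the field
sizes. Because every input is a sizeof expression, a compiler can fold
the whole computation into a constant.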

We could also think of passing the function pointer of the bin to ascii
converter to DECLARE_MARKER(), such as :

void function_entry_show(struct seq_file *m, char *buffer);

DECLARE_MARKER(function_entry, function_entry_show,
  "pid %d pc %d flags %lu func 0x%lX parent 0x%lX");

void function_entry_show(struct seq_file *m, char *buffer)
{
  /* Read pid */
  seq_printf(m, "pid = %d ", *(int *)buffer);
  buffer += sizeof(int);

  /* Read pc */
  buffer += var_align((size_t)buffer, sizeof(int));
  seq_printf(m, "pc = %d ", *(int *)buffer);
  buffer += sizeof(int);

  /* Read flags */
  buffer += var_align((size_t)buffer, sizeof(unsigned long));
  seq_printf(m, "flags = %lu ", *(unsigned long *)buffer);
  buffer += sizeof(unsigned long);

  /* Read func */
  buffer += var_align((size_t)buffer, sizeof(unsigned long));
  seq_printf(m, "func = 0x%lX ", *(unsigned long *)buffer);
  buffer += sizeof(unsigned long);

  /* Read parent */
  buffer += var_align((size_t)buffer, sizeof(unsigned long));
  seq_printf(m, "parent = 0x%lX ", *(unsigned long *)buffer);
  buffer += sizeof(unsigned long);
}

Note that in this particular case, given we would not need any special
"dump everything as if it was an unorganized array of bytes" handling,
function_entry_show() would be totally useless if we provide a sane
vsnprintf-like decoder based on the format string.
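
As a rough illustration of what such a format-string-driven decoder
could look like, here is a userspace sketch. It is an assumption-laden
toy: decode_record() and pad_for() are hypothetical names, only the %d,
%lu and %lX directives used in the example are handled, and field
offsets are taken relative to a record start that is assumed to be
aligned on sizeof(void *).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Padding needed before a power-of-two sized field at the given offset. */
static size_t pad_for(size_t offset, size_t size_of_type)
{
	size_t alignment = size_of_type < sizeof(void *) ?
			   size_of_type : sizeof(void *);

	return (alignment - offset) & (alignment - 1);
}

/*
 * Walk fmt and the binary record rec in lock step: literal characters
 * are copied, each directive consumes one aligned field from the record.
 * Returns the number of characters written, or -1 on an unsupported
 * directive or when out is too small.
 */
static int decode_record(const char *fmt, const unsigned char *rec,
			 char *out, size_t size)
{
	size_t off = 0, len = 0;

	assert(fmt && rec && out && size > 0);
	out[0] = '\0';
	while (*fmt) {
		int n;

		if (*fmt != '%') {
			n = snprintf(out + len, size - len, "%c", *fmt);
			fmt += 1;
		} else if (strncmp(fmt, "%d", 2) == 0) {
			int v;

			off += pad_for(off, sizeof(v));
			memcpy(&v, rec + off, sizeof(v));
			off += sizeof(v);
			n = snprintf(out + len, size - len, "%d", v);
			fmt += 2;
		} else if (strncmp(fmt, "%lu", 3) == 0 ||
			   strncmp(fmt, "%lX", 3) == 0) {
			unsigned long v;

			off += pad_for(off, sizeof(v));
			memcpy(&v, rec + off, sizeof(v));
			off += sizeof(v);
			if (fmt[2] == 'u')
				n = snprintf(out + len, size - len, "%lu", v);
			else
				n = snprintf(out + len, size - len, "%lX", v);
			fmt += 3;
		} else {
			return -1;	/* directive not handled by this sketch */
		}
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}
	return (int)len;
}
```

With a decoder of this kind, the format string registered with the event
is the only per-event metadata a generic reader needs; a per-event
"show" function is then only required for opaque blobs.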

I did this example to show you how we could deal with the special cases
where people would be interested in writing a whole network packet (or
any similar structure) directly to the trace (given it has field
structures which are not too tied to the compiler internals and has
field sizes portable across architectures). We could do this without
much problem by adding a format string type which specifies such a
binary blob, and we could even leave room for people to provide their
ascii formatting function pointer, as my second example here shows.

Mathieu

> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-22 18:45           ` Mathieu Desnoyers
@ 2008-09-22 21:39             ` Steven Rostedt
  2008-09-23  3:27               ` Mathieu Desnoyers
  0 siblings, 1 reply; 122+ messages in thread
From: Steven Rostedt @ 2008-09-22 21:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od


On Mon, 22 Sep 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
[...]
> > 
> > Think about the function tracer itself. It gets called at every function, 
> > where I record the interrupts enabled state, task pid, preempt state, 
> > function addr, and parent function addr. (that's just off the top of my 
> > head, I may even record more).
> > 
> > What I don't want is a:
> > 
> > function_call(unsigned long func, unsigned long parent)
> > {
> > 	struct ftrace_event event;
> > 
> > 	event.pid = current->pid;
> > 	event.pc = preempt_count();
> > 	event.irq = local_irq_flags();
> > 	event.func = func;
> > 	event.parent = parent;
> > 
> > 	trace_mark(func_event_id, "%p",
> > 		sizeof(event), &event);
> > }
> > 
> > 
> > and then to turn on function tracing, I need to hook into this marker. I'd 
> > rather just push the data right into the buffer here without having to 
> > make another function call to hook into this.
> > 
> > I'd rather have instead a simple:
> > 
> > 	struct ftrace_event *event;
> > 
> > 	event = ring_buffer_reserve(func_event_id,
> > 				sizeof(*event));
> > 
> > 	event->pid = current->pid;
> > 	event->pc = preempt_count();
> > 	event->irq = local_irq_flags();
> > 	event->func = func;
> > 	event->parent = parent;
> > 
> > 	ring_buffer_commit(event);
> > 
> 
> The scheme you propose here is based on a few inherent assumptions :
> 
> - You assume ring_buffer_reserve() and ring_buffer_commit() are static
>   inline and thus do not turn into function calls.
> - You assume these are small enough so they can be inlined without
>   causing L1 insn cache thrashing when tracing is activated.
> - You therefore assume they use a locking scheme that lets them be
>   really really compact (e.g. interrupt disable and spin lock).
> - You assume that the performance impact of doing a function call is
>   bigger than the impact of locking, which is false by at least a factor
>   of 10.

I don't assume anything. I will have the requirement that reserve and
commit must be paired and, for the first version, hold locks.

Maybe I should rename them to ring_buffer_lock_reserve and
ring_buffer_unlock_commit, to show this.
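
The pairing requirement can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions: the toy_* names
are hypothetical, a plain flag stands in for the per-cpu lock taken with
interrupts disabled, records are rounded up to a word, and the buffer
does not wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_buffer {
	unsigned char *data;
	size_t size;
	size_t head;		/* next free byte */
	size_t reserved;	/* length of the record being written */
	int locked;		/* stand-in for the real lock */
};

static int toy_init(struct toy_buffer *rb, size_t size)
{
	rb->data = malloc(size);
	if (!rb->data)
		return -1;
	rb->size = size;
	rb->head = 0;
	rb->reserved = 0;
	rb->locked = 0;
	return 0;
}

/* Take the "lock" and reserve len bytes; NULL (lock not held) if full. */
static void *toy_lock_reserve(struct toy_buffer *rb, size_t len)
{
	assert(!rb->locked);	/* reserve/commit pairs must not nest */
	len = (len + sizeof(long) - 1) & ~(sizeof(long) - 1);
	if (len > rb->size - rb->head)
		return NULL;
	rb->locked = 1;
	rb->reserved = len;
	return rb->data + rb->head;
}

/* Publish the record from the matching reserve and drop the "lock". */
static void toy_unlock_commit(struct toy_buffer *rb)
{
	assert(rb->locked);
	rb->head += rb->reserved;
	rb->reserved = 0;
	rb->locked = 0;
}

struct toy_event {
	int pid;
	unsigned long func;
	unsigned long parent;
};

/* Write one event in place, with no intermediate copy. */
static int toy_trace(struct toy_buffer *rb, int pid,
		     unsigned long func, unsigned long parent)
{
	struct toy_event *event;

	event = toy_lock_reserve(rb, sizeof(*event));
	if (!event)
		return -1;
	event->pid = pid;
	event->func = func;
	event->parent = parent;
	toy_unlock_commit(rb);
	return 0;
}
```

Putting "lock" and "unlock" in the names makes it visible at the call
site that nothing slow may happen between the two calls.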

[...]

> kernel/trace/ftrace.c :
> 
> /*
>  * the following macro would only do the "declaration" part of the
>  * markers, without doing all the function call stuff.
>  */
> DECLARE_MARKER(function_entry,
>   "pid %d pc %d flags %lu func 0x%lX parent 0x%lX");
> 
> void ftrace_mcount(unsigned long ip, unsigned long parent_ip)
> {
>   size_t ev_size = 0;
>   char *buffer;
> 
>   /*
>    * We assume event payload aligned on sizeof(void *).
>    * Event size calculated statically.
>    */
>   ev_size += sizeof(int);
>   ev_size += var_align(ev_size, sizeof(int));
>   ev_size += sizeof(int);
>   ev_size += var_align(ev_size, sizeof(unsigned long));
>   ev_size += sizeof(unsigned long);
>   ev_size += var_align(ev_size, sizeof(unsigned long));
>   ev_size += sizeof(unsigned long);
>   ev_size += var_align(ev_size, sizeof(unsigned long));
>   ev_size += sizeof(unsigned long);
> 
>   /*
>    * Now reserve space and copy data.
>    */
>   buffer = ring_buffer_reserve(func_event_id, ev_size);
>   /* Write pid */
>   *(int *)buffer = current->pid;
>   buffer += sizeof(int);
> 
>   /* Write pc */
>   buffer += var_align(buffer, sizeof(int));
>   *(int *)buffer = preempt_count();
>   buffer += sizeof(int);
> 
>   /* Write flags */
>   buffer += var_align(buffer, sizeof(unsigned long));
>   *(unsigned long *)buffer = local_irq_flags();
>   buffer += sizeof(unsigned long);
> 
>   /* Write func */
>   buffer += var_align(buffer, sizeof(unsigned long));
>   *(unsigned long *)buffer = ip;
>   buffer += sizeof(unsigned long);
> 
>   /* Write parent */
>   buffer += var_align(buffer, sizeof(unsigned long));
>   *(unsigned long *)buffer = parent_ip;
>   buffer += sizeof(unsigned long);
> 
>   ring_buffer_commit(buffer, ev_size);
> }
> 
> 
> Would that be suitable for you ?

YUCK YUCK YUCK!!!!

Mathieu,

Do I have to bring up the argument of simplicity again? I will never use
such an API.  Mine was very simple; I would have to spend 10 minutes trying
to figure out what the above is. I only spent 5, so I'm still at a loss.

> 
> We could also think of passing the function pointer of the bin to ascii
> converter to DECLARE_MARKER(), such as :
> 
> void function_entry_show(struct seq_file *m, char *buffer);
> 
> DECLARE_MARKER(function_entry, function_entry_show,
>   "pid %d pc %d flags %lu func 0x%lX parent 0x%lX");
> 
> void function_entry_show(struct seq_file *m, char *buffer)
> {
>   /* Read pid */
>   seq_printf(m, "pid = %d ", *(int *)buffer);
>   buffer += sizeof(int);
> 
>   /* Read pc */
>   buffer += var_align(buffer, sizeof(int));
>   seq_printf(m, "pc = %d ", *(int *)buffer);
>   buffer += sizeof(int);
> 
>   /* Read flags */
>   buffer += var_align(buffer, sizeof(unsigned long));
>   seq_printf(m, "flags = %lu ", *(unsigned long *)buffer);
>   buffer += sizeof(unsigned long);
> 
>   /* Read func */
>   buffer += var_align(buffer, sizeof(unsigned long));
>   seq_printf(m, "func = 0x%lX ", *(unsigned long *)buffer);
>   buffer += sizeof(unsigned long);
> 
>   /* Read parent */
>   buffer += var_align(buffer, sizeof(unsigned long));
>   seq_printf(m, "parent = 0x%lX ", *(unsigned long *)buffer);
>   buffer += sizeof(unsigned long);
> }
> 
> Note that in this particular case, given we would not need any special
> "dump everything as if it was an unorganized array of bytes", the
> function_entry_show() would be totally useless if we provide a sane
> vsnprintf-like decoder based on the format string.
> 
> I did this example to show you how we could deal with the special cases
> where people would be interested to write a whole network packet (or any
> similar structure) directly to the trace (given it has field structures
> which are not too tied to the compiler internals and has field sizes
> portable across architectures). We could do this without much problem by
> adding a format string type which specified such a binary blob, and we
> could even leave room for people to provide their ascii formatting
> function pointer, as shows my second example here.
> 

These are not special cases, these are what I use often. They are not 
special for me.

-- Steve



* Re: Unified tracing buffer
  2008-09-22 21:39             ` Steven Rostedt
@ 2008-09-23  3:27               ` Mathieu Desnoyers
  0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  3:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frank Ch. Eigler, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 22 Sep 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
[...]
> > > and then to turn on function tracing, I need to hook into this marker. I'd 
> > > rather just push the data right into the buffer here without having to 
> > > make another function call to hook into this.
> > > 
> > 
> > The scheme you propose here is based on a few inherent assumptions :
> > 
> > - You assume ring_buffer_reserve() and ring_buffer_commit() are static
> >   inline and thus do not turn into function calls.
> > - You assume these are small enough so they can be inlined without
> >   causing L1 insn cache thrashing when tracing is activated.
> > - You therefore assume they use a locking scheme that lets them be
> >   really really compact (e.g. interrupt disable and spin lock).
> > - You assume that the performance impact of doing a function call is
> >   bigger than the impact of locking, which is false by at least a factor
> >   of 10.
> 
> I don't assume anything. I will have the requirement that reserve and 
> commit must be paired, and for the first version, hold locks.
> 

When you say you don't want to make any function call, the only technical
reason I can see for that is performance, and thus you would be assuming
the above. If not, why don't you want to make another function call?
This is all I mean by "assumption" here.

> Maybe I should rename it to: ring_buffer_lock_reserve and 
> ring_buffer_unlock_commit. To show this.
> 
> [...]
> 
> > kernel/trace/ftrace.c :
> > 
> > /*
> >  * the following macro would only do the "declaration" part of the
> >  * markers, without doing all the function call stuff.
> >  */
> > DECLARE_MARKER(function_entry,
> >   "pid %d pc %d flags %lu func 0x%lX parent 0x%lX");
> > 
> > void ftrace_mcount(unsigned long ip, unsigned long parent_ip)
> > {
> >   size_t ev_size = 0;
> >   char *buffer;
> > 
> >   /*
> >    * We assume event payload aligned on sizeof(void *).
> >    * Event size calculated statically.
> >    */
> >   ev_size += sizeof(int);
> >   ev_size += var_align(ev_size, sizeof(int));
> >   ev_size += sizeof(int);
> >   ev_size += var_align(ev_size, sizeof(unsigned long));
> >   ev_size += sizeof(unsigned long);
> >   ev_size += var_align(ev_size, sizeof(unsigned long));
> >   ev_size += sizeof(unsigned long);
> >   ev_size += var_align(ev_size, sizeof(unsigned long));
> >   ev_size += sizeof(unsigned long);
> > 
> >   /*
> >    * Now reserve space and copy data.
> >    */
> >   buffer = ring_buffer_reserve(func_event_id, ev_size);
> >   /* Write pid */
> >   *(int *)buffer = current->pid;
> >   buffer += sizeof(int);
> > 
> >   /* Write pc */
> >   buffer += var_align(buffer, sizeof(int));
> >   *(int *)buffer = preempt_count();
> >   buffer += sizeof(int);
> > 
> >   /* Write flags */
> >   buffer += var_align(buffer, sizeof(unsigned long));
> >   *(unsigned long *)buffer = local_irq_flags();
> >   buffer += sizeof(unsigned long);
> > 
> >   /* Write func */
> >   buffer += var_align(buffer, sizeof(unsigned long));
> >   *(unsigned long *)buffer = ip;
> >   buffer += sizeof(unsigned long);
> > 
> >   /* Write parent */
> >   buffer += var_align(buffer, sizeof(unsigned long));
> >   *(unsigned long *)buffer = parent_ip;
> >   buffer += sizeof(unsigned long);
> > 
> >   ring_buffer_commit(buffer, ev_size);
> > }
> > 
> > 
> > Would that be suitable for you ?
> 
> YUCK YUCK YUCK!!!!
> 
> Mathieu,
> 
> Do I have to bring up the argument of simplicity again? I will never use
> such an API.  Mine was very simple; I would have to spend 10 minutes trying
> to figure out what the above is. I only spent 5, so I'm still at a loss.
> 

I was actually waiting for you to propose an alternative, but I fear you
already did without me noticing :)

How do you deal with exporting data across the kernel/user boundary in
your proposal, exactly? How does this work on architectures with a 64-bit
kernel and 32-bit userland? A simple C structure copy might be simple to
_code_, but it is hellish to export to userspace and leads to
hard-to-debug binary incompatibilities (different gcc flags, 32/64-bit
user/kernel). And that is without even mentioning the non-portability of
the exported data.

If people knowledgeable about gcc/icc can reassure me by certifying it
won't generate a mess, fine, but until then I remain very doubtful about
solutions that rely on binary compatibility between kernel and
userland.
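
A small illustration of the layout concern, as a sketch rather than a
proposal: struct native_event mirrors the kind of structure that would
be copied straight into the buffer, and its size depends on whether
unsigned long is 32 or 64 bits wide. struct fixed_event is a
hypothetical alternative using fixed-width fields, ordered so that no
padding is needed on the common ABIs. Byte order is not addressed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ABI-dependent: 20 bytes on typical ILP32, 32 bytes on typical LP64. */
struct native_event {
	int pid;
	int pc;
	unsigned long flags;
	unsigned long func;
	unsigned long parent;
};

/* Fixed-width, naturally ordered fields: no implicit padding expected. */
struct fixed_event {
	uint32_t pid;
	uint32_t pc;
	uint64_t flags;
	uint64_t func;
	uint64_t parent;
};

/* Fail the build if the compiler lays the export format out differently. */
_Static_assert(sizeof(struct fixed_event) == 32, "unexpected size");
_Static_assert(offsetof(struct fixed_event, pc) == 4, "unexpected pc offset");
_Static_assert(offsetof(struct fixed_event, flags) == 8, "unexpected flags offset");
_Static_assert(offsetof(struct fixed_event, func) == 16, "unexpected func offset");
_Static_assert(offsetof(struct fixed_event, parent) == 24, "unexpected parent offset");

/* Widen a native record into the export layout. */
static void native_to_fixed(const struct native_event *in,
			    struct fixed_event *out)
{
	assert(in && out);
	out->pid = (uint32_t)in->pid;
	out->pc = (uint32_t)in->pc;
	out->flags = in->flags;
	out->func = in->func;
	out->parent = in->parent;
}
```

A 32-bit reader built against struct native_event would misparse records
written by a 64-bit kernel, while both sides agree on struct fixed_event
(modulo endianness).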

And come on.. 10 minutes to understand the above code? You _are_ kidding
me, right? Would it help if I created a small 4-line-ish wrapper around
the buffer write?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
                   ` (2 preceding siblings ...)
  2008-09-19 23:18 ` Frank Ch. Eigler
@ 2008-09-20  0:07 ` Peter Zijlstra
  2008-09-22 14:07   ` K.Prasad
  2008-09-20  0:26 ` Unified tracing buffer Marcel Holtmann
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-20  0:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

Oddly whitespace damaged mail..

On Fri, 2008-09-19 at 14:33 -0700, Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
> 
> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.
> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
> 
> Sketch of design.  Enjoy flaming me. Code will follow shortly.
> 
> 
> STORAGE
> -------
> 
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
> 
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.
> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> 
> 
> INPUT_FUNCTIONS
> ---------------
> 
> allocate_buffer (name, size)
>         return buffer_handle
> 
> register_event (buffer_handle, event_id, print_function)
>         You can pass in a requested event_id from a fixed set, and
> will be given it, or an error
>         0 means allocate me one dynamically
>         returns event_id     (or -E_ERROR)
> 
> record_event (buffer_handle, event_id, length, *buf)

I'd hoped for an interface like:

struct ringbuffer *ringbuffer_alloc(const char *name, size_t size);
void ringbuffer_free(struct ringbuffer *buffer);
int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size);
int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size);

On top of which you'd do the event thing. The register-event-with-a-callback
idea makes sense, except I'd split the consumption into two:
 - one method to pull the binary event out, which knows how long it
   ought to be, etc.
 - one method to convert the binary event to ASCII

You don't always need the latter one, especially if you're dumping to disk.

You can also generalize the merge sorting forward iterator when you have
that, by providing an event compare function.
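
A rough user-space sketch of that split, with a caller-supplied compare
function driving the merge across per-cpu streams. All names here
(toy_event, cpu_stream, merge_next, ...) are hypothetical, and events
are assumed to be already decoded and ordered by timestamp within each
stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct toy_event {
	unsigned long long ts;	/* timestamp */
	int cpu;
	int id;
};

/* One per-cpu stream of events, already ordered by timestamp. */
struct cpu_stream {
	const struct toy_event *events;
	size_t count;
	size_t pos;		/* next event to hand out */
};

typedef int (*event_cmp_fn)(const struct toy_event *a,
			    const struct toy_event *b);

static int cmp_by_timestamp(const struct toy_event *a,
			    const struct toy_event *b)
{
	return (a->ts > b->ts) - (a->ts < b->ts);
}

/*
 * Forward iterator: hand out the smallest pending event across all
 * streams according to cmp, or NULL once every stream is drained.
 */
static const struct toy_event *merge_next(struct cpu_stream *s, size_t n,
					  event_cmp_fn cmp)
{
	struct cpu_stream *best = NULL;
	size_t i;

	assert(s && cmp);
	for (i = 0; i < n; i++) {
		if (s[i].pos == s[i].count)
			continue;
		if (!best || cmp(&s[i].events[s[i].pos],
				 &best->events[best->pos]) < 0)
			best = &s[i];
	}
	return best ? &best->events[best->pos++] : NULL;
}

/* Separate, optional step: binary event to one line of text. */
static int event_to_ascii(const struct toy_event *e, char *out, size_t len)
{
	return snprintf(out, len, "[%llu] cpu%d id=%d", e->ts, e->cpu, e->id);
}
```

A consumer that spools to disk would call merge_next() only, and the
ASCII conversion could then run later or in user space.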

By doing it like this folks can focus on utterly optimizing the
ringbuffer to death, and other folks can toy around with doing fancy
event encodings (/me pitches asn.1-der encoded structured data and runs
like crazy)
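
For illustration, the four calls above could be modelled in user space
along these lines. This is a sketch under assumptions, not the kernel
implementation: a single buffer stands in for the per-cpu ones (so the
cpu argument is ignored), there is no locking, writes are all-or-nothing
and fail when the buffer is full rather than overwriting old data, and
the return values (bytes transferred, or -1) are my own choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct ringbuffer {
	const char *name;
	size_t size;	/* capacity in bytes */
	size_t head;	/* total bytes ever written */
	size_t tail;	/* total bytes ever read */
	char *data;
};

struct ringbuffer *ringbuffer_alloc(const char *name, size_t size)
{
	struct ringbuffer *rb;

	if (!size)
		return NULL;
	rb = malloc(sizeof(*rb));
	if (!rb)
		return NULL;
	rb->data = malloc(size);
	if (!rb->data) {
		free(rb);
		return NULL;
	}
	rb->name = name;
	rb->size = size;
	rb->head = rb->tail = 0;
	return rb;
}

void ringbuffer_free(struct ringbuffer *buffer)
{
	if (!buffer)
		return;
	free(buffer->data);
	free(buffer);
}

/* Returns size on success, -1 if the data does not fit. */
int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size)
{
	size_t pos, first;

	assert(buffer && buf);
	if (size > buffer->size - (buffer->head - buffer->tail))
		return -1;
	pos = buffer->head % buffer->size;
	first = buffer->size - pos;	/* bytes before the wrap point */
	if (first > size)
		first = size;
	memcpy(buffer->data + pos, buf, first);
	memcpy(buffer->data, buf + first, size - first);
	buffer->head += size;
	return (int)size;
}

/* Returns the number of bytes copied out (at most size). */
int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size)
{
	size_t avail, pos, first;

	assert(buffer && buf);
	(void)cpu;	/* single buffer in this sketch */
	avail = buffer->head - buffer->tail;
	if (size > avail)
		size = avail;
	pos = buffer->tail % buffer->size;
	first = buffer->size - pos;
	if (first > size)
		first = size;
	memcpy(buf, buffer->data + pos, first);
	memcpy(buf + first, buffer->data, size - first);
	buffer->tail += size;
	return (int)size;
}
```

Note that nothing in here knows about events; headers, ids and
timestamps would all live in the layer above.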





* Re: Unified tracing buffer
  2008-09-20  0:07 ` Peter Zijlstra
@ 2008-09-22 14:07   ` K.Prasad
  2008-09-22 14:45     ` Peter Zijlstra
  0 siblings, 1 reply; 122+ messages in thread
From: K.Prasad @ 2008-09-22 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi

On Sat, Sep 20, 2008 at 02:07:58AM +0200, Peter Zijlstra wrote:
> Oddly whitespace damaged mail..
> 
> On Fri, 2008-09-19 at 14:33 -0700, Martin Bligh wrote:
> > During kernel summit and Plumbers conference, Linus and others
> > expressed a desire for a unified
> > tracing buffer system for multiple tracing applications (eg ftrace,
> > lttng, systemtap, blktrace, etc) to use.
> > This provides several advantages, including the ability to interleave
> > data from multiple sources,
> > not having to learn 200 different tools, duplicated code/effort, etc.
> > 
> > Several of us got together last night and tried to cut this down to
> > the simplest usable system
> > we could agree on (and nobody got hurt!). This will form version 1.
> > I've sketched out a few
> > enhancements we know that we want, but have agreed to leave these
> > until version 2.
> > The answer to most questions about the below is "yes we know, we'll
> > fix that in version 2"
> > (or 3). Simplicity was the rule ...
> > 
> > Sketch of design.  Enjoy flaming me. Code will follow shortly.
> > 
> > 
> > STORAGE
> > -------
> > 
> > We will support multiple buffers for different tracing systems, with
> > separate names, event id spaces.
> > Event ids are 16 bit, dynamically allocated.
> > A "one line of text" print function will be provided for each event,
> > or use the default (probably hex printf)
> > Will provide a "flight data recorder" mode, and a "spool to disk" mode.
> > 
> > Circular buffer per cpu, protected by per-cpu spinlock_irq
> > Word aligned records.
> > Variable record length, header will start with length record.
> > Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> > 
> > 
> > INPUT_FUNCTIONS
> > ---------------
> > 
> > allocate_buffer (name, size)
> >         return buffer_handle
> > 
> > register_event (buffer_handle, event_id, print_function)
> >         You can pass in a requested event_id from a fixed set, and
> > will be given it, or an error
> >         0 means allocate me one dynamically
> >         returns event_id     (or -E_ERROR)
> > 
> > record_event (buffer_handle, event_id, length, *buf)
> 
> I'd hoped for an interface like:
> 
> struct ringbuffer *ringbuffer_alloc(const char *name, size_t size);
> void ringbuffer_free(struct ringbuffer *buffer);
> int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size);
> int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size);
> 
> On top of which you'd do the event thing, the register event with a
> callback idea makes sense, except I'd split the consumption into two:
>  - one method to pull the binary event out, which knows how long it
> ought to be etc..
>  - one method to convert the binary event to ASCII
>
In conjunction with the previous email on this thread
(http://lkml.org/lkml/2008/9/22/160), may I suggest
the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:

relay_printk(<some struct with default filenames/pathnames>, <string>, ...);
relay_dump(<some struct with default filenames/pathnames>, <binary data>);
and
relay_cleanup_all(<the struct name>); - Single interface that cleans up
all files/directories/output data created under a logical entity.

Thanks,
K.Prasad



* Re: Unified tracing buffer
  2008-09-22 14:07   ` K.Prasad
@ 2008-09-22 14:45     ` Peter Zijlstra
  2008-09-22 16:29       ` Martin Bligh
                         ` (5 more replies)
  0 siblings, 6 replies; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-22 14:45 UTC (permalink / raw)
  To: prasad
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi

On Mon, 2008-09-22 at 19:37 +0530, K.Prasad wrote:

> > > INPUT_FUNCTIONS
> > > ---------------
> > > 
> > > allocate_buffer (name, size)
> > >         return buffer_handle
> > > 
> > > register_event (buffer_handle, event_id, print_function)
> > >         You can pass in a requested event_id from a fixed set, and
> > > will be given it, or an error
> > >         0 means allocate me one dynamically
> > >         returns event_id     (or -E_ERROR)
> > > 
> > > record_event (buffer_handle, event_id, length, *buf)
> > 
> > I'd hoped for an interface like:
> > 
> > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size);
> > void ringbuffer_free(struct ringbuffer *buffer);
> > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size);
> > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size);
> > 
> > On top of which you'd do the event thing, the register event with a
> > callback idea makes sense, except I'd split the consumption into two:
> >  - one method to pull the binary event out, which knows how long it
> > ought to be etc..
> >  - one method to convert the binary event to ASCII
> >
> In conjunction with the previous email on this thread
> (http://lkml.org/lkml/2008/9/22/160), may I suggest
> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> 
> relay_printk(<some struct with default filenames/pathnames>, <string>,
> ....) ;
> relay_dump(<some struct with default filenames/pathnames>, <binary
> data>);
> and
> relay_cleanup_all(<the struct name>); - Single interface that cleans up
> all files/directories/output data created under a logical entity.

Dude, relayfs is such a badly performing mess that extending it seems like
a bad idea. Better to write something new and delete everything
relayfs-related.

Also, it seems prudent to separate the ring-buffer implementation from
the event encoding/decoding facilities.





* Re: Unified tracing buffer
  2008-09-22 14:45     ` Peter Zijlstra
@ 2008-09-22 16:29       ` Martin Bligh
  2008-09-22 16:36         ` Peter Zijlstra
  2008-09-23  2:49       ` Mathieu Desnoyers
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-22 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi

>> In conjunction with the previous email on this thread
>> (http://lkml.org/lkml/2008/9/22/160), may I suggest
>> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
>>
>> relay_printk(<some struct with default filenames/pathnames>, <string>,
>> ....) ;
>> relay_dump(<some struct with default filenames/pathnames>, <binary
>> data>);
>> and
>> relay_cleanup_all(<the struct name>); - Single interface that cleans up
>> all files/directories/output data created under a logical entity.
>
> Dude, relayfs is such a bad performing mess that extending it seems like
> a bad idea. Better to write something new and delete everything relayfs
> related.

There did seem to be pretty universal agreement that we'd rather not
use relayfs.

> Also, it seems prudent to separate the ring-buffer implementation from
> the event encoding/decoding facilities.

Right - in a conversation I had with Mathieu later, he suggested cleaning up
relayfs - I fear this would delay us far too long, and get us bogged down.
If we can get one clean circular buffer implementation, then both
relayfs and the tracing could share that common solution.


* Re: Unified tracing buffer
  2008-09-22 16:29       ` Martin Bligh
@ 2008-09-22 16:36         ` Peter Zijlstra
  2008-09-22 20:50           ` Masami Hiramatsu
  2008-09-23  3:05           ` Mathieu Desnoyers
  0 siblings, 2 replies; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-22 16:36 UTC (permalink / raw)
  To: Martin Bligh
  Cc: prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi

On Mon, 2008-09-22 at 09:29 -0700, Martin Bligh wrote:
> >> In conjunction with the previous email on this thread
> >> (http://lkml.org/lkml/2008/9/22/160), may I suggest
> >> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> >>
> >> relay_printk(<some struct with default filenames/pathnames>, <string>,
> >> ....) ;
> >> relay_dump(<some struct with default filenames/pathnames>, <binary
> >> data>);
> >> and
> >> relay_cleanup_all(<the struct name>); - Single interface that cleans up
> >> all files/directories/output data created under a logical entity.
> >
> > Dude, relayfs is such a bad performing mess that extending it seems like
> > a bad idea. Better to write something new and delete everything relayfs
> > related.
> 
> There did seem to be pretty universal agreement that we'd rather not
> use relayfs.
> 
> > Also, it seems prudent to separate the ring-buffer implementation from
> > the event encoding/decoding facilities.
> 
> Right - in conversation I had with Mathieu later, he suggested cleaning up
> relayfs - I fear this will delay us far too long, and get bogged down.
> If we can get one clean circular buffer implementation, then both
> relayfs and the tracing could share that common solution,

Currently only blktrace and kvmtrace use relayfs, and I've heard people
talk about converting both to use lttng/ftrace infrastructure. At which
point relayfs is orphaned and ready for removal.



* Re: Unified tracing buffer
  2008-09-22 16:36         ` Peter Zijlstra
@ 2008-09-22 20:50           ` Masami Hiramatsu
  2008-09-23  3:05           ` Mathieu Desnoyers
  1 sibling, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-22 20:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Bligh, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder, zanussi

Peter Zijlstra wrote:
> On Mon, 2008-09-22 at 09:29 -0700, Martin Bligh wrote:
>>>> In conjunction with the previous email on this thread
>>>> (http://lkml.org/lkml/2008/9/22/160), may I suggest
>>>> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
>>>>
>>>> relay_printk(<some struct with default filenames/pathnames>, <string>,
>>>> ....) ;
>>>> relay_dump(<some struct with default filenames/pathnames>, <binary
>>>> data>);
>>>> and
>>>> relay_cleanup_all(<the struct name>); - Single interface that cleans up
>>>> all files/directories/output data created under a logical entity.
>>> Dude, relayfs is such a bad performing mess that extending it seems like
>>> a bad idea. Better to write something new and delete everything relayfs
>>> related.
>> There did seem to be pretty universal agreement that we'd rather not
>> use relayfs.
>>
>>> Also, it seems prudent to separate the ring-buffer implementation from
>>> the event encoding/decoding facilities.
>> Right - in conversation I had with Mathieu later, he suggested cleaning up
>> relayfs - I fear this will delay us far too long, and get bogged down.
>> If we can get one clean circular buffer implementation, then both
>> relayfs and the tracing could share that common solution,
> 
> Currently only blktrace and kvmtrace use relayfs, and I've heard people
> talk about converting both to use lttng/ftrace infrastructure. At which
> point relayfs is orphaned and ready for removal.

Hi Peter,
Systemtap is still a heavy user of relayfs. :-)
Anyway, if the new buffering mechanism is good enough for us, I think
we'd be happy to move to it.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: Unified tracing buffer
  2008-09-22 16:36         ` Peter Zijlstra
  2008-09-22 20:50           ` Masami Hiramatsu
@ 2008-09-23  3:05           ` Mathieu Desnoyers
  1 sibling, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  3:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Bligh, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, zanussi

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Mon, 2008-09-22 at 09:29 -0700, Martin Bligh wrote:
> > >> In conjunction with the previous email on this thread
> > >> (http://lkml.org/lkml/2008/9/22/160), may I suggest
> > >> the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> > >>
> > >> relay_printk(<some struct with default filenames/pathnames>, <string>,
> > >> ....) ;
> > >> relay_dump(<some struct with default filenames/pathnames>, <binary
> > >> data>);
> > >> and
> > >> relay_cleanup_all(<the struct name>); - Single interface that cleans up
> > >> all files/directories/output data created under a logical entity.
> > >
> > > Dude, relayfs is such a bad performing mess that extending it seems like
> > > a bad idea. Better to write something new and delete everything relayfs
> > > related.
> > 
> > There did seem to be pretty universal agreement that we'd rather not
> > use relayfs.
> > 
> > > Also, it seems prudent to separate the ring-buffer implementation from
> > > the event encoding/decoding facilities.
> > 
> > Right - in conversation I had with Mathieu later, he suggested cleaning up
> > relayfs - I fear this will delay us far too long, and get bogged down.
> > If we can get one clean circular buffer implementation, then both
> > relayfs and the tracing could share that common solution,
> 
> Currently only blktrace and kvmtrace use relayfs, and I've heard people
> talk about converting both to use lttng/ftrace infrastructure. At which
> point relayfs is orphaned and ready for removal.
> 

LTTng sits on top of relay for buffer allocation and for the mmap
operation (that's about it, it overrides the rest).

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-22 14:45     ` Peter Zijlstra
  2008-09-22 16:29       ` Martin Bligh
@ 2008-09-23  2:49       ` Mathieu Desnoyers
  2008-09-23  5:25       ` Tom Zanussi
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  2:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, zanussi

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Mon, 2008-09-22 at 19:37 +0530, K.Prasad wrote:
> 
> > > > INPUT_FUNCTIONS
> > > > ---------------
> > > > 
> > > > allocate_buffer (name, size)
> > > >         return buffer_handle
> > > > 
> > > > register_event (buffer_handle, event_id, print_function)
> > > >         You can pass in a requested event_id from a fixed set, and
> > > > will be given it, or an error
> > > >         0 means allocate me one dynamically
> > > >         returns event_id     (or -E_ERROR)
> > > > 
> > > > record_event (buffer_handle, event_id, length, *buf)
> > > 
> > > I'd hoped for an interface like:
> > > 
> > > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size);
> > > void ringbuffer_free(struct ringbuffer *buffer);
> > > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size);
> > > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size);
> > > 
> > > On top of which you'd do the event thing, the register event with a
> > > callback idea makes sense, except I'd split the consumption into two:
> > >  - one method to pull the binary event out, which knows how long it
> > > ought to be etc..
> > >  - one method to convert the binary event to ASCII
> > >
> > In conjunction with the previous email on this thread
> > (http://lkml.org/lkml/2008/9/22/160), may I suggest
> > the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> > 
> > relay_printk(<some struct with default filenames/pathnames>, <string>,
> > ....) ;
> > relay_dump(<some struct with default filenames/pathnames>, <binary
> > data>);
> > and
> > relay_cleanup_all(<the struct name>); - Single interface that cleans up
> > all files/directories/output data created under a logical entity.
> 
> Dude, relayfs is such a bad performing mess that extending it seems like
> a bad idea. Better to write something new and delete everything relayfs
> related.
> 

LTTng only uses relay for buffer mapping and mmap to userspace. The rest
of internal buffer management is done within LTTng by overriding relay
callbacks. One thing we could think of is to incrementally fix relay
rather than deleting it completely.

> Also, it seems prudent to separate the ring-buffer implementation from
> the event encoding/decoding facilities.
> 

Sure, but I still think both are needed, even if they are separated as
two different layers (as they should be).

Mathieu

> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-22 14:45     ` Peter Zijlstra
  2008-09-22 16:29       ` Martin Bligh
  2008-09-23  2:49       ` Mathieu Desnoyers
@ 2008-09-23  5:25       ` Tom Zanussi
  2008-09-23  9:31         ` Peter Zijlstra
                           ` (2 more replies)
  2008-09-23  5:27       ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi
                         ` (2 subsequent siblings)
  5 siblings, 3 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-23  5:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

On Mon, 2008-09-22 at 16:45 +0200, Peter Zijlstra wrote:
> On Mon, 2008-09-22 at 19:37 +0530, K.Prasad wrote:
> 
> > > > INPUT_FUNCTIONS
> > > > ---------------
> > > > 
> > > > allocate_buffer (name, size)
> > > >         return buffer_handle
> > > > 
> > > > register_event (buffer_handle, event_id, print_function)
> > > >         You can pass in a requested event_id from a fixed set, and
> > > > will be given it, or an error
> > > >         0 means allocate me one dynamically
> > > >         returns event_id     (or -E_ERROR)
> > > > 
> > > > record_event (buffer_handle, event_id, length, *buf)
> > > 
> > > I'd hoped for an interface like:
> > > 
> > > struct ringbuffer *ringbuffer_alloc(const char *name, size_t size);
> > > void ringbuffer_free(struct ringbuffer *buffer);
> > > int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size);
> > > int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size);
> > > 
> > > On top of which you'd do the event thing, the register event with a
> > > callback idea makes sense, except I'd split the consumption into two:
> > >  - one method to pull the binary event out, which knows how long it
> > > ought to be etc..
> > >  - one method to convert the binary event to ASCII
> > >
> > In conjunction with the previous email on this thread
> > (http://lkml.org/lkml/2008/9/22/160), may I suggest
> > the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:
> > 
> > relay_printk(<some struct with default filenames/pathnames>, <string>,
> > ....) ;
> > relay_dump(<some struct with default filenames/pathnames>, <binary
> > data>);
> > and
> > relay_cleanup_all(<the struct name>); - Single interface that cleans up
> > all files/directories/output data created under a logical entity.
> 
> Dude, relayfs is such a bad performing mess that extending it seems like
> a bad idea. Better to write something new and delete everything relayfs
> related.

Hmm, I haven't seen complaints lately about relayfs being 'bad
performing'.  The write/reserve functions are pretty fast - they
don't do much else in the fast path other than update an index, but
if they're still too slow, please let me know how to make them faster.
In any case, I'll post a couple patches in a few minutes that give
complete control over the write path for anyone who doesn't want to be
hampered by the existing versions.
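
To illustrate the "update an index" fast path described above, here is a
minimal userspace sketch. It is not the relay code: struct sketch_buf,
sketch_reserve() and sketch_write() are hypothetical names, and the slow
path (switching to the next sub-buffer) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one per-CPU buffer. */
struct sketch_buf {
	char   *data;        /* start of the current sub-buffer */
	size_t  subbuf_size; /* total bytes in the sub-buffer */
	size_t  offset;      /* next free byte, always <= subbuf_size */
};

/*
 * Reserve 'len' bytes. The fast path only compares against the space
 * left and bumps the offset. Returns NULL when the record does not fit;
 * a real implementation would switch sub-buffers at that point.
 */
static void *sketch_reserve(struct sketch_buf *buf, size_t len)
{
	void *slot;

	assert(buf && buf->data);
	if (len > buf->subbuf_size - buf->offset)
		return NULL;	/* slow path omitted in this sketch */

	slot = buf->data + buf->offset;
	buf->offset += len;
	return slot;
}

/* Reserve space and copy the record in; returns 0 on success, -1 if full. */
static int sketch_write(struct sketch_buf *buf, const void *src, size_t len)
{
	void *slot = sketch_reserve(buf, len);

	if (!slot)
		return -1;
	memcpy(slot, src, len);
	return 0;
}
```

The sketch ignores concurrency entirely; a kernel version would at least
need interrupts disabled or an equivalent per-CPU protection around the
offset update.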

As for the interface, yeah, it has gathered some cruft over time
and has turned out to be too complex for most people.  The reason a lot
of that complexity is there in the first place though, ironically, is
that it was put there in explicit support of the requirements of
LTT/LTTng (sub-buffers, padding, mmap, etc), which supposedly
represented the needs of all 'industrial-strength' tracers at the time.
Well, four years after the 'troll merge' that initially got relayfs
streamlined and into the kernel, in anticipation of a soon-to-follow
streamlined LTT/LTTng which has yet to emerge, apparently those
requirements are no longer valid and neither LTTng nor anything else
needs the capabilities of relayfs.  That's fine, if it isn't needed, it
isn't needed.  But since it no longer has to conform to the requirements
of any imaginary tracer, maybe it should be put through yet another
streamlining effort and everything that's not required to support
current users removed:

- get rid of anything having to do with padding, nobody needs it and its
only effect has been to horribly distort and complicate a lot of the
code
- get rid of sub-buffers, they just cause confusion
- get rid of mmap, nobody uses it
- no sub-buffers and no mmap support means we can get rid of most of the
callbacks, and a lot of API confusion along with them
- add relay flags - they probably should have been used from the
beginning and options made explicit instead of being shoehorned into the
callback functions.

Going even further, why not just replace the current write functions
with versions that write into pages and SPLICE_F_MOVE them to their
destination - normally userspace doesn't want to see the data anyway -
and get rid of everything else.  Add support for splice_write() and
maybe you have an elegant way to do userspace tracing (via vmsplice)
too.

Another source of complexity has turned out to be the removal of the
'fs' part of relayfs - it basically meant adding callback hooks so relay
files could be used in other pseudo filesystems, which is great, but it
further complicated the API and scared away users.  We could add back
the fs part, but that would be going backwards, so those callbacks at
least would have to stay I guess.

Well, I'll post some patches shortly for a few of these things, but I
doubt I'll do much more than that, since on the one hand I only have a
few nights a week to work on this stuff and it's become a not-very-fun
hobby, and since I think you guys have already decided on the way
forward and anything I post would be removed soon anyway.

As for the relay_printk() etc stuff, the part that adds the common code
from blktrace for all tracers would definitely be a benefit, but I still
don't think it goes far enough in providing generic trace control - see
e.g. the kmemtrace-on-utt code where I still had to add code to add a
bunch of control files - it would be nice to have a standard and easy
way to do that.  For the printk() functionality itself, we submitted
something similar a year ago (dti_printk) and nobody was interested:

http://lwn.net/Articles/240330/
http://dti.sourceforge.net/

I told the folks in charge at IBM then that doing that kind of in-kernel
filtering and sorting might be interesting and useful for ad hoc kernel
hacking, but was basically a sideshow; the really useful part of the
blktrace tracing code and 90% of the work needed to make it into a
generically usable tracing system wasn't in the kernel at all, but in
the unglamorous userspace code that did the streaming and display of the
trace data via disk/network/live, etc.  Eventually I did go ahead and do
that 90%, which wasn't a small task, and now anyone can use the blktrace
code for generic tracing:

http://utt.sourceforge.net/

I can't say I did it justice, but it does work, and in fact, it didn't
take much time at all to convert the kmemtrace code to using it:

http://utt.sourceforge.net/kmemtrace-utt-kernel.patch
http://utt.sourceforge.net/kmemtrace-utt-user.patch

It should also be pretty straightforward to extend it to handle the
output from any number of trace sources as has been mentioned, assuming
you have a common sequencing source, so regardless of what you guys end
up replacing relayfs with, you might consider using it anyway...

> 
> Also, it seems prudent to separate the ring-buffer implementation from
> the event encoding/decoding facilities.
> 
> 
> 


* Re: Unified tracing buffer
  2008-09-23  5:25       ` Tom Zanussi
@ 2008-09-23  9:31         ` Peter Zijlstra
  2008-09-23 18:13           ` Mathieu Desnoyers
  2008-09-23 13:50         ` Mathieu Desnoyers
  2008-09-23 14:00         ` Martin Bligh
  2 siblings, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-23  9:31 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

On Tue, 2008-09-23 at 00:25 -0500, Tom Zanussi wrote:

> - get rid of anything having to do with padding, nobody needs it and its
> only effect has been to horribly distort and complicate a lot of the
> code
> - get rid of sub-buffers, they just cause confusion
> - get rid of mmap, nobody uses it
> - no sub-buffers and no mmap support means we can get rid of most of the
> callbacks, and a lot of API confusion along with them
> - add relay flags - they probably should have been used from the
> beginning and options made explicit instead of being shoehorned into the
> callback functions.

 - get rid of the vmap buffers as they cause tlb pressure and eat up
precious vspace on 32 bit platforms.



* Re: Unified tracing buffer
  2008-09-23  9:31         ` Peter Zijlstra
@ 2008-09-23 18:13           ` Mathieu Desnoyers
  2008-09-23 18:33             ` Christoph Lameter
  0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 18:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Zanussi, prasad, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder,
	Christoph Lameter, linux-mm

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Tue, 2008-09-23 at 00:25 -0500, Tom Zanussi wrote:
> 
> > - get rid of anything having to do with padding, nobody needs it and its
> > only effect has been to horribly distort and complicate a lot of the
> > code
> > - get rid of sub-buffers, they just cause confusion
> > - get rid of mmap, nobody uses it
> > - no sub-buffers and no mmap support means we can get rid of most of the
> > callbacks, and a lot of API confusion along with them
> > - add relay flags - they probably should have been used from the
> > beginning and options made explicit instead of being shoehorned into the
> > callback functions.
> 
>  - get rid of the vmap buffers as they cause tlb pressure and eat up
> precious vspace on 32 bit platforms.
> 

Although I agree on the basic idea, namely to use a sane amount of TLB
entries for tracing, I disagree on the way proposed to reach this goal.
Such memory management concerns belong to the mm field and should not be
done "oh so cleverly" by a buffer management infrastructure in the back
of the kernel memory management infrastructure.

I think we should instead try to figure out what is currently missing in
the kernel vmap mechanism (probably the ability to vmap from large 4MB
pages after boot), and fix _that_ instead (if possible), which would
benefit not only tracing, but also module support.

Also, I would like to keep a contiguous address mapping within buffers
so we could keep the buffer read/write code as simple as possible,
leveraging the existing CPU MM unit.

I added Christoph Lameter to the CC list, he always comes with clever
ideas. :)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23 18:13           ` Mathieu Desnoyers
@ 2008-09-23 18:33             ` Christoph Lameter
  2008-09-23 18:56               ` Linus Torvalds
  0 siblings, 1 reply; 122+ messages in thread
From: Christoph Lameter @ 2008-09-23 18:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Tom Zanussi, prasad, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch,
	David Wilder, linux-mm

Mathieu Desnoyers wrote:
> 
> I think we should instead try to figure out what is currently missing in
> the kernel vmap mechanism (probably the ability to vmap from large 4MB
> pages after boot), and fix _that_ instead (if possible), which would not
> only benefit to tracing, but also to module support.

With some custom code one can vmap 2MB pages on x86. See the VMEMMAP support
in the x86 arch. The code in mm/sparse-vmemmap.c could be abstracted for a
general 2MB mapping API to reduce TLB pressure for the buffers. If there are
concerns about fragmentation then one could fallback to 4kb TLBs. See the
virtualizable compound page patchset which does something similar.

> I added Christoph Lameter to the CC list, he always comes with clever
> ideas. :)

Oh mostly we are just recycling the old ideas.


* Re: Unified tracing buffer
  2008-09-23 18:33             ` Christoph Lameter
@ 2008-09-23 18:56               ` Linus Torvalds
  0 siblings, 0 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 18:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Peter Zijlstra, Tom Zanussi, prasad,
	Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
	Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch,
	David Wilder, linux-mm

On Tue, 23 Sep 2008, Christoph Lameter wrote:

> Mathieu Desnoyers wrote:
> > 
> > I think we should instead try to figure out what is currently missing in
> > the kernel vmap mechanism (probably the ability to vmap from large 4MB
> > pages after boot), and fix _that_ instead (if possible), which would not
> > only benefit to tracing, but also to module support.

No. Don't go there. Piece of absolute shit.

The problem with VMAP is that it's _limited_. We don't have reasonable 
virtual address space holes for x86-32.

The other is that physically contiguous buffers are hard to come by. 
Certainly not an acceptable solution.

The third is that if you have multiple buffers, you need to look them up 
in software anyway, so the whole notion of mis-using the TLB to avoid a 
software lookup is TOTAL CRAP.

Don't do virtual mapping. IT IS BROKEN. IT IS A TOTAL AND UTTER PIECE OF 
SHIT.

I will absolutely not take any general-purpose tracing code if I'm aware of 
it mis-using the TLB to play games.
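
As an illustration of the alternative to a virtual mapping (a sketch
under stated assumptions, not code from this thread): the buffer can be
kept as an array of separately allocated pages, with the page index and
in-page offset computed in software. The names sketch_pagebuf and
sketch_pagebuf_write() and the 4096-byte page size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* A buffer made of independent pages; no contiguous mapping needed. */
struct sketch_pagebuf {
	char   **pages;    /* one pointer per page */
	size_t   nr_pages;
};

/*
 * Copy 'len' bytes to logical offset 'pos', splitting the copy at page
 * boundaries. Returns the number of bytes actually written, which is
 * short if the write runs past the last page.
 */
static size_t sketch_pagebuf_write(struct sketch_pagebuf *pb, size_t pos,
				   const void *src, size_t len)
{
	const char *in = src;
	size_t done = 0;

	assert(pb && pb->pages);
	while (done < len) {
		size_t idx = (pos + done) / SKETCH_PAGE_SIZE;
		size_t off = (pos + done) % SKETCH_PAGE_SIZE;
		size_t chunk = SKETCH_PAGE_SIZE - off;

		if (idx >= pb->nr_pages)
			break;		/* past the end of the buffer */
		if (chunk > len - done)
			chunk = len - done;
		memcpy(pb->pages[idx] + off, in + done, chunk);
		done += chunk;
	}
	return done;
}
```

The cost is the divide/modulo and the split copy on each write; the
benefit is that the pages need neither physical nor virtual contiguity.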

		Linus


* Re: Unified tracing buffer
  2008-09-23  5:25       ` Tom Zanussi
  2008-09-23  9:31         ` Peter Zijlstra
@ 2008-09-23 13:50         ` Mathieu Desnoyers
  2008-09-23 14:00         ` Martin Bligh
  2 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 13:50 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Peter Zijlstra, prasad, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

* Tom Zanussi (zanussi@comcast.net) wrote:
 
> - get rid of anything having to do with padding, nobody needs it and its
> only effect has been to horribly distort and complicate a lot of the
> code
> - get rid of sub-buffers, they just cause confusion
> - get rid of mmap, nobody uses it

LTTng uses relay mmap. That's about the only feature of relay it uses
along with memory allocation.  It however implements its own buffer
management mechanism with poll() and ioctl GET_SUBBUF/PUT_SUBBUF to read
subbuffers. But these ops are all within LTTng.

BTW it would be good to change relay so it can take a buffer pointer as
input for relay_open. That would help getting memory mapped in the
linear mapping to be used for tracing when known at boot time.

Mathieu

> - no sub-buffers and no mmap support means we can get rid of most of the
> callbacks, and a lot of API confusion along with them
> - add relay flags - they probably should have been used from the
> beginning and options made explicit instead of being shoehorned into the
> callback functions.
> 
> Going even further, why not just replace the current write functions
> with versions that write into pages and SPLICE_F_MOVE them to their
> destination - normally userspace doesn't want to see the data anyway -
> and get rid of everything else.  Add support for splice_write() and
> maybe you have an elegant way to do userspace tracing (via vmsplice)
> too.

Sounds interesting. So then vmsplice would be used to support sending
trace data over the network or to disk ?

> 
> Another source of complexity has turned out to be the removal of the
> 'fs' part of relayfs - it basically meant adding callback hooks so relay
> files could be used in other pseudo filesystems, which is great, but it
> further complicated the API and scared away users.  We could add back
> the fs part, but that would be going backwards, so those callbacks at
> least would have to stay I guess.
> 
> Well, I'll post some patches shortly for a few of these things, but I
> doubt I'll do much more than that, since on the one hand I only have a
> few nights a week to work on this stuff and it's become a not-very-fun
> hobby, and since I think you guys have already decided on the way
> forward and anything I post would be removed soon anyway.
> 

I am not sure of that. I think there is some room for relay improvements
we could work on. As for the mechanism used to ensure data coherency, I
think relay does not provide any. Could it be changed to an interrupt
disable+spinlock ? Then, in a second phase, we can optimize it by using
a lockless mechanism like LTTng does.

> As for the relay_printk() etc stuff, the part that adds the common code
> from blktrace for all tracers would definitely be a benefit, but I still
> don't think it goes far enough in providing generic trace control - see
> e.g. the kmemtrace-on-utt code where I still had to add code to add a
> bunch of control files - it would be nice to have a standard and easy
> way to do that.  For the printk() functionality itself, we submitted
> something similar a year ago (dti_printk) and nobody was interested:
> 
> http://lwn.net/Articles/240330/
> http://dti.sourceforge.net/
> 
> I told the folks in charge at IBM then that doing that kind of in-kernel
> filtering and sorting might be interesting and useful for ad hoc kernel
> hacking, but was basically a sideshow; the really useful part of the
> blktrace tracing code and 90% of the work needed to make it into a
> generically usable tracing system wasn't in the kernel at all, but in
> the unglamorous userspace code that did the streaming and display of the
> trace data via disk/network/live, etc.  Eventually I did go ahead and do
> that 90%, which wasn't a small task, and now anyone can use the blktrace
> code for generic tracing:
> 
> http://utt.sourceforge.net/
> 
> I can't say I did it justice, but it does work, and in fact, it didn't
> take much time at all to convert the kmemtrace code to using it:
> 
> http://utt.sourceforge.net/kmemtrace-utt-kernel.patch
> http://utt.sourceforge.net/kmemtrace-utt-user.patch
> 
> It should also be pretty straightforward to extend it to handle the
> output from any number of trace sources as has been mentioned, assuming
> you have a common sequencing source, so regardless of what you guys end
> up replacing relayfs with, you might consider using it anyway...
> 

I did the same with LTTV :) Writing userspace tools, including GUIs and
everything, can be quite a big task.

It is worth keeping in mind that a layered infrastructure would be
good. A bit like network packet encapsulation, we could have :

Layer 2 : Event payload (dealt by a unified event encoding
                         infrastructure)
          - Structure defined by event ID/type mapping table

Layer 1 : Events (dealt by a unified buffer layout infrastructure)
          - Event header
            - Timestamp
            - Event ID
            - Event size

Layer 0 : Buffers (dealt by a unified buffering infrastructure)
          - Buffer header
          - Subbuffers
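
The Layer 1 event header above could be sketched in C as follows. This
is only an illustration: the 16-bit event ID comes from the original
proposal, while the other field widths and the names sketch_event_hdr
and sketch_event_encode() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layer 1: hypothetical fixed-size header placed in front of each event. */
struct sketch_event_hdr {
	uint64_t timestamp; /* fixed timebase, width assumed */
	uint16_t event_id;  /* 16-bit ID, as in the original proposal */
	uint16_t size;      /* payload bytes following the header */
};

/*
 * Write header plus Layer 2 payload into 'out'. Returns the total number
 * of bytes written, or 0 if the record does not fit in 'out_len'.
 */
static size_t sketch_event_encode(void *out, size_t out_len, uint64_t ts,
				  uint16_t id, const void *payload,
				  uint16_t size)
{
	struct sketch_event_hdr hdr;
	size_t total = sizeof(hdr) + size;

	assert(out && (payload || size == 0));
	if (total > out_len)
		return 0;

	memset(&hdr, 0, sizeof(hdr));	/* also clears padding bytes */
	hdr.timestamp = ts;
	hdr.event_id = id;
	hdr.size = size;
	memcpy(out, &hdr, sizeof(hdr));
	if (size)
		memcpy((char *)out + sizeof(hdr), payload, size);
	return total;
}
```

Layer 1 never interprets the payload; it only needs the size to skip to
the next event, which is what keeps it independent of Layer 2.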

Mathieu


> > 
> > Also, it seems prudent to separate the ring-buffer implementation from
> > the event encoding/decoding facilities.
> > 
> > 
> > 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23  5:25       ` Tom Zanussi
  2008-09-23  9:31         ` Peter Zijlstra
  2008-09-23 13:50         ` Mathieu Desnoyers
@ 2008-09-23 14:00         ` Martin Bligh
  2008-09-23 17:55           ` K.Prasad
  2008-09-24  3:50           ` Tom Zanussi
  2 siblings, 2 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 14:00 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

> - get rid of anything having to do with padding, nobody needs it and its
> only effect has been to horribly distort and complicate a lot of the
> code
> - get rid of sub-buffers, they just cause confusion
> - get rid of mmap, nobody uses it
> - no sub-buffers and no mmap support means we can get rid of most of the
> callbacks, and a lot of API confusion along with them
> - add relay flags - they probably should have been used from the
> beginning and options made explicit instead of being shoehorned into the
> callback functions.

Actually, I think if you did all that, it'd be pretty close to what we
want anyway ...


* Re: Unified tracing buffer
  2008-09-23 14:00         ` Martin Bligh
@ 2008-09-23 17:55           ` K.Prasad
  2008-09-23 18:27             ` Martin Bligh
  2008-09-24  3:50           ` Tom Zanussi
  1 sibling, 1 reply; 122+ messages in thread
From: K.Prasad @ 2008-09-23 17:55 UTC (permalink / raw)
  To: Martin Bligh, Tom Zanussi, Mathieu Desnoyers
  Cc: Peter Zijlstra, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

On Tue, Sep 23, 2008 at 07:00:38AM -0700, Martin Bligh wrote:
> > - get rid of anything having to do with padding, nobody needs it and its
> > only effect has been to horribly distort and complicate a lot of the
> > code
> > - get rid of sub-buffers, they just cause confusion
> > - get rid of mmap, nobody uses it
> > - no sub-buffers and no mmap support means we can get rid of most of the
> > callbacks, and a lot of API confusion along with them
> > - add relay flags - they probably should have been used from the
> > beginning and options made explicit instead of being shoehorned into the
> > callback functions.
> 
> Actually, I think if you did all that, it'd be pretty close to what we
> want anyway ...

In the perspective of having a layered infrastructure, can we consider
the interfaces later added over relay (to be used as a wrapper), namely
relay_printk() and relay_dump()?

Also add the following features to it and we get close to the
functionality that is sought:

- Add callbacks to append fine-granular timestamp information depending
upon user's requirement
- Ability to provision more custom-defined control files that can suit
independent tracer's requirements

These interfaces already come along with the following features (some
repetition here from my previous email for the sake of completeness):

- Very minimal work required to log data using the interfaces. Usage is
made simple to resemble the printk(). Like

	struct relay_printk_data *tpk;
	tpk->parent_dir = "PARENT";
	tpk->dir = "DIR";
	relay_printk(tpk, <String to be output>);
	relay_dump(tpk, <Some binary data to output>); 

Output at:
	<debugfs_mount>/PARENT/DIR/<TRACE FILES>

- Assumes default values for most tunables, such as per-CPU buffer size,
relay-flags (such as global vs local per-cpu buffers, flight recorder vs
overwrite mode), thus reducing the work required for setting up these
interfaces. They can be overridden in case of advanced needs.

- Well defined control operations to start, stop tracing operations, status
files to indicate buffer overflow, etc.

- Given the recent patches that Tom Zanussi has sent to bring in the
erstwhile 'trace' functionality into relay itself, there can be a lot of
code-reduction in relay.c (in -mm) thereby leading to a light-weight
implementation.

Thanks,
K.Prasad



* Re: Unified tracing buffer
  2008-09-23 17:55           ` K.Prasad
@ 2008-09-23 18:27             ` Martin Bligh
  0 siblings, 0 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 18:27 UTC (permalink / raw)
  To: prasad
  Cc: Tom Zanussi, Mathieu Desnoyers, Peter Zijlstra,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder

On Tue, Sep 23, 2008 at 10:55 AM, K.Prasad <prasad@linux.vnet.ibm.com> wrote:
> On Tue, Sep 23, 2008 at 07:00:38AM -0700, Martin Bligh wrote:
>> > - get rid of anything having to do with padding, nobody needs it and its
>> > only effect has been to horribly distort and complicate a lot of the
>> > code
>> > - get rid of sub-buffers, they just cause confusion
>> > - get rid of mmap, nobody uses it
>> > - no sub-buffers and no mmap support means we can get rid of most of the
>> > callbacks, and a lot of API confusion along with them
>> > - add relay flags - they probably should have been used from the
>> > beginning and options made explicit instead of being shoehorned into the
>> > callback functions.
>>
>> Actually, I think if you did all that, it'd be pretty close to what we
>> want anyway ...
>
> In the perspective of having a layered infrastructure, can we consider
> the interfaces later added over relay (to be used as a wrapper), namely
> relay_printk() and relay_dump()?

Might well work, but let's see what relayfs comes out looking like.
If it's heavily simplified, hopefully people will like it.

> - Very minimal work required to log data using the interfaces. Usage is
> made simple to resemble the printk(). Like
>
>        struct relay_printk_data *tpk;
>        tpk->parent_dir = "PARENT";
>        tpk->dir = "DIR";
>        relay_printk(tpk, <String to be output>);
>        relay_dump(tpk, <Some binary data to output>);

You really don't want to store strings in the buffer, it's horribly
inefficient. I think the intent was to store binary data from tagged
events, along with the format strings, and do all the expansion later.
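
A rough sketch of that idea, assuming a per-event print function that is
looked up by event ID at read time. All names here are hypothetical, and
the "0 means allocate dynamically" case from the original proposal is
not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_EVENTS 16

/* Expands one binary payload into a line of text; called only on read. */
typedef int (*sketch_print_fn)(char *out, size_t len, const void *payload);

static sketch_print_fn sketch_printers[SKETCH_MAX_EVENTS];

/* Bind a print function to an event ID; returns the ID, or -1 on error. */
static int sketch_register_event(uint16_t id, sketch_print_fn fn)
{
	if (!fn || id >= SKETCH_MAX_EVENTS || sketch_printers[id])
		return -1;
	sketch_printers[id] = fn;
	return id;
}

/* Example binary payload: this is all that gets stored in the buffer. */
struct sketch_switch_ev {
	int32_t prev_pid;
	int32_t next_pid;
};

static int sketch_print_switch(char *out, size_t len, const void *payload)
{
	struct sketch_switch_ev ev;

	memcpy(&ev, payload, sizeof(ev));	/* payload may be unaligned */
	return snprintf(out, len, "switch %d -> %d",
			(int)ev.prev_pid, (int)ev.next_pid);
}

/* Read side: look up the printer for 'id' and expand the payload. */
static int sketch_format_event(uint16_t id, const void *payload,
			       char *out, size_t len)
{
	assert(out && payload);
	if (id >= SKETCH_MAX_EVENTS || !sketch_printers[id])
		return -1;
	return sketch_printers[id](out, len, payload);
}
```

The write path stores 8 bytes for this event instead of a formatted
string, and the formatting cost is paid only when someone reads the
trace.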


* Re: Unified tracing buffer
  2008-09-23 14:00         ` Martin Bligh
  2008-09-23 17:55           ` K.Prasad
@ 2008-09-24  3:50           ` Tom Zanussi
  2008-09-24  5:42             ` K.Prasad
                               ` (9 more replies)
  1 sibling, 10 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-24  3:50 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder


On Tue, 2008-09-23 at 07:00 -0700, Martin Bligh wrote:
> > - get rid of anything having to do with padding, nobody needs it and its
> > only effect has been to horribly distort and complicate a lot of the
> > code
> > - get rid of sub-buffers, they just cause confusion
> > - get rid of mmap, nobody uses it
> > - no sub-buffers and no mmap support means we can get rid of most of the
> > callbacks, and a lot of API confusion along with them
> > - add relay flags - they probably should have been used from the
> > beginning and options made explicit instead of being shoehorned into the
> > callback functions.
> 
> Actually, I think if you did all that, it'd be pretty close to what we
> want anyway ...

OK, then, I'll continue with the cleanup patchset and see where it
goes...

Tom



* Re: Unified tracing buffer
  2008-09-24  3:50           ` Tom Zanussi
@ 2008-09-24  5:42             ` K.Prasad
  2008-09-25  6:07             ` [RFC PATCH 0/8] current relay cleanup patchset Tom Zanussi
                               ` (8 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: K.Prasad @ 2008-09-24  5:42 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Martin Bligh, Peter Zijlstra, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, od, Frank Ch. Eigler, Andrew Morton, hch,
	David Wilder

On Tue, Sep 23, 2008 at 10:50:15PM -0500, Tom Zanussi wrote:
> 
> On Tue, 2008-09-23 at 07:00 -0700, Martin Bligh wrote:
> > > - get rid of anything having to do with padding, nobody needs it and its
> > > only effect has been to horribly distort and complicate a lot of the
> > > code
> > > - get rid of sub-buffers, they just cause confusion
> > > - get rid of mmap, nobody uses it
> > > - no sub-buffers and no mmap support means we can get rid of most of the
> > > callbacks, and a lot of API confusion along with them
> > > - add relay flags - they probably should have been used from the
> > > beginning and options made explicit instead of being shoehorned into the
> > > callback functions.
> > 
> > Actually, I think if you did all that, it'd be pretty close to what we
> > want anyway ...
> 
> OK, then, I'll continue with the cleanup patchset and see where it
> goes...
> 
> Tom
>
Hi Tom,
	Kindly let us know if the patches are available in some
downloadable location or have been maintained in a git tree.

I'm planning to re-base the relay_* interfaces (erstwhile 'trace' code),
to work on top of your patches.

Thanks,
K.Prasad
 


* [RFC PATCH 0/8] current relay cleanup patchset
  2008-09-24  3:50           ` Tom Zanussi
  2008-09-24  5:42             ` K.Prasad
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional Tom Zanussi
                               ` (7 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Here's the current relay cleanup patchset.  The first two patches make
the write path completely replaceable, the third adds flags along with
some related cleanup, and the next 5 remove the padding in several
stages. 

It's a work in progress, but because I wanted the intermediate stages to
actually work and not break anything, some of these patches, especially
05, are just temporary and will be removed in the next iteration.

I didn't have time to clean up the first 3 either - I'll also do that
the next time around.

Anyway, removing the padding has simplified the read/splice code
significantly; it's always been a source of headaches so I'm glad it's
gone, and it doesn't seem to have broken anything - a quick test using
blktrace in both read() and splice() modes didn't show any problems.

In the next round I plan to do vmap and sub-buffer removal.

Tom


* [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional.
  2008-09-24  3:50           ` Tom Zanussi
  2008-09-24  5:42             ` K.Prasad
  2008-09-25  6:07             ` [RFC PATCH 0/8] current relay cleanup patchset Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable Tom Zanussi
                               ` (6 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Clean up relay_switch_subbuf() and make waking up consumers optional.

Over time, relay_switch_subbuf() has accumulated some cruft - this
patch cleans it up and at the same time makes some of it available
as common functions that any subbuf-switch implementor would
need (this is partially in preparation for the next patch, which makes
the subbuf-switch function completely replaceable).  It also removes
the hard-coded reader wakeup and moves it into a replaceable callback
called wakeup_readers(); this allows any given tracer to implement
consumer notification as it sees fit.
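
A rough user-space sketch of the callback arrangement described above may
help.  It is not kernel code: every model_* name is made up for
illustration, and the "default" here just bumps a counter where the real
default defers a wakeup through a timer.  It only shows the two ideas from
this patch: a NULL hook falls back to a default, and a tracer that wants no
wakeups can supply an empty function.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only; these are NOT the kernel definitions. */
struct model_buf {
	int wakeups;		/* times the wakeup hook did real work */
	int produced;		/* sub-buffers produced so far */
};

struct model_callbacks {
	void (*wakeup_readers)(struct model_buf *buf);
};

/* Stands in for the default callback (which defers a reader wakeup). */
static void model_wakeup_default(struct model_buf *buf)
{
	buf->wakeups++;
}

/* A tracer that wants no wakeups supplies an empty version. */
static void model_wakeup_none(struct model_buf *buf)
{
	(void)buf;
}

/* Mirrors the idea of setup_callbacks(): NULL hooks get the default. */
static void model_setup_callbacks(struct model_callbacks *cb)
{
	if (!cb->wakeup_readers)
		cb->wakeup_readers = model_wakeup_default;
}

/* Mirrors the tail of the sub-buffer switch: count, then notify. */
static void model_switch(struct model_buf *buf, struct model_callbacks *cb)
{
	assert(cb->wakeup_readers);	/* setup must have run first */
	buf->produced++;
	cb->wakeup_readers(buf);
}
```

The switch path itself never tests whether a tracer wants wakeups; that
decision lives entirely in whichever function was installed.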
---
 include/linux/relay.h |   51 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/relay.c        |   43 +++++++++++++++++++++++------------------
 2 files changed, 75 insertions(+), 19 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 953fc05..2242004 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -159,6 +159,15 @@ struct rchan_callbacks
 	 * The callback should return 0 if successful, negative if not.
 	 */
 	int (*remove_buf_file)(struct dentry *dentry);
+
+	/*
+	 * wakeup_readers - sub-buffer was switched, let readers know
+	 * @buf: the channel buffer
+	 *
+	 * Called during sub-buffer switch.  Users who don't want any
+	 * wakeups should implement an empty version.
+	 */
+        void (*wakeup_readers)(struct rchan_buf *buf);
 };
 
 /*
@@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf,
 				  size_t length);
 
 /**
+ *	relay_event_toobig - is event too big to fit in a sub-buffer?
+ *	@buf: relay channel buffer
+ *	@length: length of event
+ *
+ *	Returns 1 if too big, 0 otherwise.
+ *
+ *	switch_subbuf() helper function
+ */
+static inline int relay_event_toobig(struct rchan_buf *buf, size_t length)
+{
+	return length > buf->chan->subbuf_size;
+}
+
+/**
+ *	relay_update_filesize - add to filesize of relay file
+ *	@buf: relay channel buffer
+ *	@length: length to add
+ *
+ *	switch_subbuf() helper function
+ */
+static inline void relay_update_filesize(struct rchan_buf *buf, size_t length)
+{
+	if (buf->dentry)
+		buf->dentry->d_inode->i_size +=	length;
+	else
+		buf->early_bytes += length;
+
+	smp_mb();
+}
+
+/**
+ *	relay_inc_produced - add 1 to buf->produced
+ *	@buf: relay channel buffer
+ *
+ *	switch_subbuf() helper function
+ */
+static inline void relay_inc_produced(struct rchan_buf *buf)
+{
+	buf->subbufs_produced++;
+}
+
+/**
  *	relay_write - write data into the channel
  *	@chan: relay channel
  *	@data: data to be written
diff --git a/kernel/relay.c b/kernel/relay.c
index 8d13a78..7d588fe 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -324,6 +324,21 @@ static int remove_buf_file_default_callback(struct dentry *dentry)
 	return -EINVAL;
 }
 
+/*
+ * wakeup_readers() default callback.
+ */
+static void wakeup_readers_default_callback(struct rchan_buf *buf)
+{
+	if (waitqueue_active(&buf->read_wait))
+		/*
+		 * Calling wake_up_interruptible() from here
+		 * will deadlock if we happen to be logging
+		 * from the scheduler (trying to re-grab
+		 * rq->lock), so defer it.
+		 */
+		__mod_timer(&buf->timer, jiffies + 1);
+}
+
 /* relay channel default callbacks */
 static struct rchan_callbacks default_channel_callbacks = {
 	.subbuf_start = subbuf_start_default_callback,
@@ -331,6 +346,7 @@ static struct rchan_callbacks default_channel_callbacks = {
 	.buf_unmapped = buf_unmapped_default_callback,
 	.create_buf_file = create_buf_file_default_callback,
 	.remove_buf_file = remove_buf_file_default_callback,
+	.wakeup_readers = wakeup_readers_default_callback,
 };
 
 /**
@@ -508,6 +524,8 @@ static void setup_callbacks(struct rchan *chan,
 		cb->create_buf_file = create_buf_file_default_callback;
 	if (!cb->remove_buf_file)
 		cb->remove_buf_file = remove_buf_file_default_callback;
+	if (!cb->wakeup_readers)
+		cb->wakeup_readers = wakeup_readers_default_callback;
 	chan->cb = cb;
 }
 
@@ -726,30 +744,17 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
 	void *old, *new;
 	size_t old_subbuf, new_subbuf;
 
-	if (unlikely(length > buf->chan->subbuf_size))
+	if (unlikely(relay_event_toobig(buf, length)))
 		goto toobig;
 
 	if (buf->offset != buf->chan->subbuf_size + 1) {
 		buf->prev_padding = buf->chan->subbuf_size - buf->offset;
 		old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
 		buf->padding[old_subbuf] = buf->prev_padding;
-		buf->subbufs_produced++;
-		if (buf->dentry)
-			buf->dentry->d_inode->i_size +=
-				buf->chan->subbuf_size -
-				buf->padding[old_subbuf];
-		else
-			buf->early_bytes += buf->chan->subbuf_size -
-					    buf->padding[old_subbuf];
-		smp_mb();
-		if (waitqueue_active(&buf->read_wait))
-			/*
-			 * Calling wake_up_interruptible() from here
-			 * will deadlock if we happen to be logging
-			 * from the scheduler (trying to re-grab
-			 * rq->lock), so defer it.
-			 */
-			__mod_timer(&buf->timer, jiffies + 1);
+		relay_inc_produced(buf);
+		relay_update_filesize(buf, buf->chan->subbuf_size -
+				      buf->padding[old_subbuf]);
+		buf->chan->cb->wakeup_readers(buf);
 	}
 
 	old = buf->data;
@@ -763,7 +768,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
 	buf->data = new;
 	buf->padding[new_subbuf] = 0;
 
-	if (unlikely(length + buf->offset > buf->chan->subbuf_size))
+	if (unlikely(relay_event_toobig(buf, length + buf->offset)))
 		goto toobig;
 
 	return length;
-- 
1.5.3.5





* [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (2 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 1/8] relay - Clean up relay_switch_subbuf() and make waking up consumers optional Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param Tom Zanussi
                               ` (5 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Make the relay sub-buffer switch code replaceable.

With this patch, tracers now have complete control over the relay
write (or reserve) path if they choose to do so, by implementing their
own version of the sub-buffer switch function (switch_subbuf()), in
addition to their own local write/reserve functions.  Tracers who
choose not to do so automatically default to the normal behavior.
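
To make the effect on the write path concrete, here is a small user-space
model, offered as a sketch only.  The model_* names, the sizes and the
"drop" policy are all assumptions for illustration; the default here
simply wraps to the next sub-buffer without checking whether a reader has
consumed it.  The point is that the write function stays the same while
the policy behind switch_subbuf() changes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SUBBUF_SIZE 8
#define MODEL_N_SUBBUFS   2

struct model_buf;

struct model_callbacks {
	size_t (*switch_subbuf)(struct model_buf *buf, size_t length);
};

struct model_buf {
	char start[MODEL_SUBBUF_SIZE * MODEL_N_SUBBUFS];
	char *data;		/* start of the current sub-buffer */
	size_t offset;		/* write offset within it */
	size_t produced;	/* sub-buffers switched away from */
	size_t dropped;		/* events rejected by the drop policy */
	struct model_callbacks *cb;
};

static void model_init(struct model_buf *buf, struct model_callbacks *cb)
{
	memset(buf, 0, sizeof(*buf));
	buf->data = buf->start;
	buf->cb = cb;
}

/* Default-like policy: move to the next sub-buffer, wrapping around. */
static size_t model_switch_default(struct model_buf *buf, size_t length)
{
	if (length > MODEL_SUBBUF_SIZE)
		return 0;
	buf->produced++;
	buf->data = buf->start +
		(buf->produced % MODEL_N_SUBBUFS) * MODEL_SUBBUF_SIZE;
	buf->offset = 0;
	return length;
}

/* Hypothetical tracer policy: never switch, just count the drop. */
static size_t model_switch_drop(struct model_buf *buf, size_t length)
{
	(void)length;
	buf->dropped++;
	return 0;
}

/* Same shape as relay_write() after this patch, minus irq handling. */
static void model_write(struct model_buf *buf, const void *data, size_t length)
{
	assert(buf->cb && buf->cb->switch_subbuf);
	if (buf->offset + length > MODEL_SUBBUF_SIZE)
		length = buf->cb->switch_subbuf(buf, length);
	memcpy(buf->data + buf->offset, data, length);
	buf->offset += length;
}
```

A return value of 0 from the hook turns the copy into a no-op, which is
how the existing code already signals "full" or "too big".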
---
 include/linux/relay.h |   22 +++++++++++++++++-----
 kernel/relay.c        |   13 ++++++++-----
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 2242004..a1dcfc1 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -168,6 +168,18 @@ struct rchan_callbacks
 	 * wakeups should implement an empty version.
 	 */
         void (*wakeup_readers)(struct rchan_buf *buf);
+
+	/*
+	 * switch_subbuf - sub-buffer switch callback
+	 * @buf: the channel buffer
+	 * @length: size of current event
+	 *
+	 * Returns either the length passed in or 0 if full.
+	 *
+	 * Performs sub-buffer-switch tasks such as updating filesize,
+	 * waking up readers, etc.
+	 */
+	size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length);
 };
 
 /*
@@ -191,8 +203,8 @@ extern void relay_subbufs_consumed(struct rchan *chan,
 extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
-extern size_t relay_switch_subbuf(struct rchan_buf *buf,
-				  size_t length);
+extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
+					     size_t length);
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -259,7 +271,7 @@ static inline void relay_write(struct rchan *chan,
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
 	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
 	local_irq_restore(flags);
@@ -285,7 +297,7 @@ static inline void __relay_write(struct rchan *chan,
 
 	buf = chan->buf[get_cpu()];
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size))
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
 	put_cpu();
@@ -308,7 +320,7 @@ static inline void *relay_reserve(struct rchan *chan, size_t length)
 	struct rchan_buf *buf = chan->buf[smp_processor_id()];
 
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 		if (!length)
 			return NULL;
 	}
diff --git a/kernel/relay.c b/kernel/relay.c
index 7d588fe..f1f55ae 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -347,6 +347,7 @@ static struct rchan_callbacks default_channel_callbacks = {
 	.create_buf_file = create_buf_file_default_callback,
 	.remove_buf_file = remove_buf_file_default_callback,
 	.wakeup_readers = wakeup_readers_default_callback,
+	.switch_subbuf = switch_subbuf_default_callback,
 };
 
 /**
@@ -526,6 +527,8 @@ static void setup_callbacks(struct rchan *chan,
 		cb->remove_buf_file = remove_buf_file_default_callback;
 	if (!cb->wakeup_readers)
 		cb->wakeup_readers = wakeup_readers_default_callback;
+	if (!cb->switch_subbuf)
+		cb->switch_subbuf = switch_subbuf_default_callback;
 	chan->cb = cb;
 }
 
@@ -730,7 +733,7 @@ int relay_late_setup_files(struct rchan *chan,
 }
 
 /**
- *	relay_switch_subbuf - switch to a new sub-buffer
+ *	switch_subbuf_default_callback - switch to a new sub-buffer
  *	@buf: channel buffer
  *	@length: size of current event
  *
@@ -739,7 +742,7 @@ int relay_late_setup_files(struct rchan *chan,
  *	Performs sub-buffer-switch tasks such as invoking callbacks,
  *	updating padding counts, waking up readers, etc.
  */
-size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
+size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length)
 {
 	void *old, *new;
 	size_t old_subbuf, new_subbuf;
@@ -777,7 +780,7 @@ toobig:
 	buf->chan->last_toobig = length;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(relay_switch_subbuf);
+EXPORT_SYMBOL_GPL(switch_subbuf_default_callback);
 
 /**
  *	relay_subbufs_consumed - update the buffer's sub-buffers-consumed count
@@ -857,14 +860,14 @@ void relay_flush(struct rchan *chan)
 		return;
 
 	if (chan->is_global && chan->buf[0]) {
-		relay_switch_subbuf(chan->buf[0], 0);
+		chan->cb->switch_subbuf(chan->buf[0], 0);
 		return;
 	}
 
 	mutex_lock(&relay_channels_mutex);
 	for_each_possible_cpu(i)
 		if (chan->buf[i])
-			relay_switch_subbuf(chan->buf[i], 0);
+			chan->cb->switch_subbuf(chan->buf[i], 0);
 	mutex_unlock(&relay_channels_mutex);
 }
 EXPORT_SYMBOL_GPL(relay_flush);
-- 
1.5.3.5





* [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (3 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 2/8] relay - Make the relay sub-buffer switch code replaceable Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve Tom Zanussi
                               ` (4 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

relay should probably have had a flags param from the beginning; one
wasn't added originally because it wasn't needed at the time, but it
probably would have helped avoid some of the callback contortions
that were added due to the lack of flags.  This patch adds flags, does
a small amount of low-hanging cleanup, and prepares for some new flags
in future patches.
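
As a sketch of what the flags buy, the user-space model below tests them
the way the patch tests chan->flags.  The two flag names and values are
taken from the patch; struct model_chan, MODEL_NR_CPUS and both helper
functions are invented for illustration.  In particular, this patch only
defines RCHAN_MODE_OVERWRITE and does not act on it yet, so the
flight-mode helper is an assumption about later use.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names and values as defined by this patch. */
#define RCHAN_MODE_OVERWRITE	0x00000001UL	/* 'flight' mode */
#define RCHAN_GLOBAL_BUFFER	0x00000002UL	/* not using per-cpu */

#define MODEL_NR_CPUS 4		/* illustrative only */

/* User-space stand-in for the one field of struct rchan we need. */
struct model_chan {
	unsigned long flags;
};

/* Which buf[] slot a given CPU would use: slot 0 for a global buffer. */
static unsigned int model_buf_index(const struct model_chan *chan,
				    unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	if (chan->flags & RCHAN_GLOBAL_BUFFER)
		return 0;
	return cpu;
}

/* Hypothetical helper: has the caller asked for overwrite mode? */
static int model_is_flight_mode(const struct model_chan *chan)
{
	return (chan->flags & RCHAN_MODE_OVERWRITE) != 0;
}
```

The caller now states its intent up front when opening the channel,
instead of setting an out-parameter from inside create_buf_file().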
---
 block/blktrace.c      |    5 ++---
 include/linux/relay.h |   19 ++++++++++---------
 kernel/relay.c        |   20 ++++++++++----------
 virt/kvm/kvm_trace.c  |    9 ++++-----
 4 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/block/blktrace.c b/block/blktrace.c
index eb9651c..150c5f7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry)
 static struct dentry *blk_create_buf_file_callback(const char *filename,
 						   struct dentry *parent,
 						   int mode,
-						   struct rchan_buf *buf,
-						   int *is_global)
+						   struct rchan_buf *buf)
 {
 	return debugfs_create_file(filename, mode, parent, buf,
 					&relay_file_operations);
@@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 		goto err;
 
 	bt->rchan = relay_open("trace", dir, buts->buf_size,
-				buts->buf_nr, &blk_relay_callbacks, bt);
+			       buts->buf_nr, &blk_relay_callbacks, bt, 0UL);
 	if (!bt->rchan)
 		goto err;
 
diff --git a/include/linux/relay.h b/include/linux/relay.h
index a1dcfc1..648b4da 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -28,6 +28,12 @@
 #define RELAYFS_CHANNEL_VERSION		7
 
 /*
+ * relay channel flags
+ */
+#define RCHAN_MODE_OVERWRITE		0x00000001	/* 'flight' mode */
+#define RCHAN_GLOBAL_BUFFER		0x00000002	/* not using per-cpu */
+
+/*
  * Per-cpu relay channel buffer
  */
 struct rchan_buf
@@ -66,11 +72,11 @@ struct rchan
 	void *private_data;		/* for user-defined data */
 	size_t last_toobig;		/* tried to log event > subbuf size */
 	struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */
-	int is_global;			/* One global buffer ? */
 	struct list_head list;		/* for channel list */
 	struct dentry *parent;		/* parent dentry passed to open */
 	int has_base_filename;		/* has a filename associated? */
 	char base_filename[NAME_MAX];	/* saved base filename */
+	unsigned long flags;		/* relay flags for this channel */
 };
 
 /*
@@ -125,7 +131,6 @@ struct rchan_callbacks
 	 * @parent: the parent of the file to create
 	 * @mode: the mode of the file to create
 	 * @buf: the channel buffer
-	 * @is_global: outparam - set non-zero if the buffer should be global
 	 *
 	 * Called during relay_open(), once for each per-cpu buffer,
 	 * to allow the client to create a file to be used to
@@ -136,17 +141,12 @@ struct rchan_callbacks
 	 * The callback should return the dentry of the file created
 	 * to represent the relay buffer.
 	 *
-	 * Setting the is_global outparam to a non-zero value will
-	 * cause relay_open() to create a single global buffer rather
-	 * than the default set of per-cpu buffers.
-	 *
 	 * See Documentation/filesystems/relayfs.txt for more info.
 	 */
 	struct dentry *(*create_buf_file)(const char *filename,
 					  struct dentry *parent,
 					  int mode,
-					  struct rchan_buf *buf,
-					  int *is_global);
+					  struct rchan_buf *buf);
 
 	/*
 	 * remove_buf_file - remove file representing a relay channel buffer
@@ -191,7 +191,8 @@ struct rchan *relay_open(const char *base_filename,
 			 size_t subbuf_size,
 			 size_t n_subbufs,
 			 struct rchan_callbacks *cb,
-			 void *private_data);
+			 void *private_data,
+			 unsigned long rchan_flags);
 extern int relay_late_setup_files(struct rchan *chan,
 				  const char *base_filename,
 				  struct dentry *parent);
diff --git a/kernel/relay.c b/kernel/relay.c
index f1f55ae..d7a6458 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -310,8 +310,7 @@ static void buf_unmapped_default_callback(struct rchan_buf *buf,
 static struct dentry *create_buf_file_default_callback(const char *filename,
 						       struct dentry *parent,
 						       int mode,
-						       struct rchan_buf *buf,
-						       int *is_global)
+						       struct rchan_buf *buf)
 {
 	return NULL;
 }
@@ -411,7 +410,7 @@ void relay_reset(struct rchan *chan)
 	if (!chan)
 		return;
 
-	if (chan->is_global && chan->buf[0]) {
+	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) {
 		__relay_reset(chan->buf[0], 0);
 		return;
 	}
@@ -445,8 +444,7 @@ static struct dentry *relay_create_buf_file(struct rchan *chan,
 
 	/* Create file in fs */
 	dentry = chan->cb->create_buf_file(tmpname, chan->parent,
-					   S_IRUSR, buf,
-					   &chan->is_global);
+					   S_IRUSR, buf);
 
 	kfree(tmpname);
 
@@ -463,7 +461,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
  	struct rchan_buf *buf = NULL;
 	struct dentry *dentry;
 
- 	if (chan->is_global)
+	if (chan->flags & RCHAN_GLOBAL_BUFFER)
 		return chan->buf[0];
 
 	buf = relay_create_buf(chan);
@@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
  	buf->cpu = cpu;
  	__relay_reset(buf, 1);
 
- 	if(chan->is_global) {
+	if(chan->flags & RCHAN_GLOBAL_BUFFER) {
  		chan->buf[0] = buf;
  		buf->cpu = 0;
   	}
@@ -595,7 +593,8 @@ struct rchan *relay_open(const char *base_filename,
 			 size_t subbuf_size,
 			 size_t n_subbufs,
 			 struct rchan_callbacks *cb,
-			 void *private_data)
+			 void *private_data,
+			 unsigned long rchan_flags)
 {
 	unsigned int i;
 	struct rchan *chan;
@@ -612,6 +611,7 @@ struct rchan *relay_open(const char *base_filename,
 	chan->subbuf_size = subbuf_size;
 	chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs);
 	chan->parent = parent;
+	chan->flags = rchan_flags;
 	chan->private_data = private_data;
 	if (base_filename) {
 		chan->has_base_filename = 1;
@@ -828,7 +828,7 @@ void relay_close(struct rchan *chan)
 		return;
 
 	mutex_lock(&relay_channels_mutex);
-	if (chan->is_global && chan->buf[0])
+	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0])
 		relay_close_buf(chan->buf[0]);
 	else
 		for_each_possible_cpu(i)
@@ -859,7 +859,7 @@ void relay_flush(struct rchan *chan)
 	if (!chan)
 		return;
 
-	if (chan->is_global && chan->buf[0]) {
+	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) {
 		chan->cb->switch_subbuf(chan->buf[0], 0);
 		return;
 	}
diff --git a/virt/kvm/kvm_trace.c b/virt/kvm/kvm_trace.c
index 58141f3..d0a9e1c 100644
--- a/virt/kvm/kvm_trace.c
+++ b/virt/kvm/kvm_trace.c
@@ -130,10 +130,9 @@ static int kvm_subbuf_start_callback(struct rchan_buf *buf, void *subbuf,
 }
 
 static struct dentry *kvm_create_buf_file_callack(const char *filename,
-						 struct dentry *parent,
-						 int mode,
-						 struct rchan_buf *buf,
-						 int *is_global)
+						  struct dentry *parent,
+						  int mode,
+						  struct rchan_buf *buf)
 {
 	return debugfs_create_file(filename, mode, parent, buf,
 				   &relay_file_operations);
@@ -171,7 +170,7 @@ static int do_kvm_trace_enable(struct kvm_user_trace_setup *kuts)
 		goto err;
 
 	kt->rchan = relay_open("trace", kvm_debugfs_dir, kuts->buf_size,
-				kuts->buf_nr, &kvm_relay_callbacks, kt);
+			       kuts->buf_nr, &kvm_relay_callbacks, kt, 0UL);
 	if (!kt->rchan)
 		goto err;
 
-- 
1.5.3.5





* [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (4 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 3/8] relay - Add channel flags to relay, remove global callback param Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience Tom Zanussi
                               ` (3 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Add reserved param to switch-subbuf, in preparation for non-pad write/reserve.

Because a write/reserve can now cross sub-buffer boundaries, we use
the length returned as a remainder for the new sub-buffer, and use the
reserved param to return a pointer to the reserved space, or NULL if
it couldn't be reserved.  This patch also changes write/reserve to
preserve their current behavior despite that change.  This all goes
away in a future patch, but is here now so things don't break.
---
 include/linux/relay.h |   24 ++++++++++++++++--------
 kernel/relay.c        |   12 +++++++++---
 2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 648b4da..13163b0 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -173,13 +173,16 @@ struct rchan_callbacks
 	 * switch_subbuf - sub-buffer switch callback
 	 * @buf: the channel buffer
 	 * @length: size of current event
+	 * @reserved: a pointer to the space reserved
 	 *
 	 * Returns either the length passed in or 0 if full.
 	 *
 	 * Performs sub-buffer-switch tasks such as updating filesize,
 	 * waking up readers, etc.
 	 */
-	size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length);
+	size_t (*switch_subbuf)(struct rchan_buf *buf,
+				size_t length,
+				void **reserved);
 };
 
 /*
@@ -205,7 +208,8 @@ extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
 extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
-					     size_t length);
+					     size_t length,
+					     void **reserved);
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -268,12 +272,14 @@ static inline void relay_write(struct rchan *chan,
 {
 	unsigned long flags;
 	struct rchan_buf *buf;
+	void *reserved;
 
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
+	reserved = buf->data + buf->offset;
 	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = chan->cb->switch_subbuf(buf, length);
-	memcpy(buf->data + buf->offset, data, length);
+		length = chan->cb->switch_subbuf(buf, length, &reserved);
+	memcpy(reserved, data, length);
 	buf->offset += length;
 	local_irq_restore(flags);
 }
@@ -295,11 +301,13 @@ static inline void __relay_write(struct rchan *chan,
 				 size_t length)
 {
 	struct rchan_buf *buf;
+	void *reserved;
 
 	buf = chan->buf[get_cpu()];
+	reserved = buf->data + buf->offset;
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size))
-		length = chan->cb->switch_subbuf(buf, length);
-	memcpy(buf->data + buf->offset, data, length);
+		length = chan->cb->switch_subbuf(buf, length, &reserved);
+	memcpy(reserved, data, length);
 	buf->offset += length;
 	put_cpu();
 }
@@ -320,12 +328,12 @@ static inline void *relay_reserve(struct rchan *chan, size_t length)
 	void *reserved;
 	struct rchan_buf *buf = chan->buf[smp_processor_id()];
 
+	reserved = buf->data + buf->offset;
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
-		length = chan->cb->switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length, &reserved);
 		if (!length)
 			return NULL;
 	}
-	reserved = buf->data + buf->offset;
 	buf->offset += length;
 
 	return reserved;
diff --git a/kernel/relay.c b/kernel/relay.c
index d7a6458..9ea9240 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -736,13 +736,16 @@ int relay_late_setup_files(struct rchan *chan,
  *	switch_subbuf_default_callback - switch to a new sub-buffer
  *	@buf: channel buffer
  *	@length: size of current event
+ *	@reserved: a pointer to the space reserved
  *
  *	Returns either the length passed in or 0 if full.
  *
  *	Performs sub-buffer-switch tasks such as invoking callbacks,
  *	updating padding counts, waking up readers, etc.
  */
-size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length)
+size_t switch_subbuf_default_callback(struct rchan_buf *buf,
+				      size_t length,
+				      void **reserved)
 {
 	void *old, *new;
 	size_t old_subbuf, new_subbuf;
@@ -774,6 +777,9 @@ size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length)
 	if (unlikely(relay_event_toobig(buf, length + buf->offset)))
 		goto toobig;
 
+	if (reserved)
+		*reserved = buf->data;
+
 	return length;
 
 toobig:
@@ -860,14 +866,14 @@ void relay_flush(struct rchan *chan)
 		return;
 
 	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) {
-		chan->cb->switch_subbuf(chan->buf[0], 0);
+		chan->cb->switch_subbuf(chan->buf[0], 0, NULL);
 		return;
 	}
 
 	mutex_lock(&relay_channels_mutex);
 	for_each_possible_cpu(i)
 		if (chan->buf[i])
-			chan->cb->switch_subbuf(chan->buf[i], 0);
+			chan->cb->switch_subbuf(chan->buf[i], 0, NULL);
 	mutex_unlock(&relay_channels_mutex);
 }
 EXPORT_SYMBOL_GPL(relay_flush);
-- 
1.5.3.5





* [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (5 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 4/8] relay - Add reserved param to switch-subbuf, in preparation for non-pad write/reserve Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions Tom Zanussi
                               ` (2 subsequent siblings)
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Map the first sub-buffer at the end of the buffer, for temporary convenience.

Make relay buffers 'circular' for writing by mapping the first subbuf
at the end of the last subbuf.  This is so we can do writes across the
last subbuf boundary without adding special write logic.  This is a
temporary state of affairs and it all goes away in a future patch, but
it's here now so things will still work.
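
The aliasing trick can be modelled in user space with a plain table of
page pointers, as sketched below.  Nothing here calls vmap(); the
model_* names and the tiny sizes are assumptions chosen to keep the
example short.  The extra entries at the end of page_array point back at
the first sub-buffer's pages, so a write that runs past the last page
lands at the start of the buffer without any wrap logic in the writer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE    4	/* bytes per "page", tiny on purpose */
#define MODEL_N_PAGES      4	/* pages in the whole buffer */
#define MODEL_SUBBUF_PAGES 2	/* pages per sub-buffer */

struct model_buf {
	char pages[MODEL_N_PAGES][MODEL_PAGE_SIZE];
	char *page_array[MODEL_N_PAGES + MODEL_SUBBUF_PAGES];
};

static void model_map(struct model_buf *buf)
{
	size_t i;

	memset(buf->pages, 0, sizeof(buf->pages));
	for (i = 0; i < MODEL_N_PAGES; i++)
		buf->page_array[i] = buf->pages[i];
	/* Alias the first sub-buffer's pages after the last real page. */
	for (i = 0; i < MODEL_SUBBUF_PAGES; i++)
		buf->page_array[MODEL_N_PAGES + i] = buf->page_array[i];
}

/* Byte-wise write through the table, standing in for a mapped range. */
static void model_write_at(struct model_buf *buf, size_t pos,
			   const char *src, size_t len)
{
	size_t i;

	assert(pos + len <=
	       (MODEL_N_PAGES + MODEL_SUBBUF_PAGES) * MODEL_PAGE_SIZE);
	for (i = 0; i < len; i++, pos++)
		buf->page_array[pos / MODEL_PAGE_SIZE][pos % MODEL_PAGE_SIZE] =
			src[i];
}
```

Writing four bytes at position 14 of the 16-byte buffer puts two bytes in
the last page and two in the first, with no special case in the writer.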
---
 kernel/relay.c |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 9ea9240..9a08fec 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -125,20 +125,20 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
 /**
  *	relay_alloc_buf - allocate a channel buffer
  *	@buf: the buffer struct
- *	@size: total size of the buffer
  *
  *	Returns a pointer to the resulting buffer, %NULL if unsuccessful. The
  *	passed in size will get page aligned, if it isn't already.
  */
-static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size)
+static void *relay_alloc_buf(struct rchan_buf *buf)
 {
 	void *mem;
-	unsigned int i, j, n_pages;
+	unsigned int i, j, n_pages, n_subbuf_pages;
 
-	*size = PAGE_ALIGN(*size);
-	n_pages = *size >> PAGE_SHIFT;
+	buf->chan->alloc_size = PAGE_ALIGN(buf->chan->alloc_size);
+	n_pages = buf->chan->alloc_size >> PAGE_SHIFT;
+	n_subbuf_pages = PAGE_ALIGN(buf->chan->subbuf_size) >> PAGE_SHIFT;
 
-	buf->page_array = relay_alloc_page_array(n_pages);
+	buf->page_array = relay_alloc_page_array(n_pages + n_subbuf_pages);
 	if (!buf->page_array)
 		return NULL;
 
@@ -148,11 +148,14 @@ static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size)
 			goto depopulate;
 		set_page_private(buf->page_array[i], (unsigned long)buf);
 	}
-	mem = vmap(buf->page_array, n_pages, VM_MAP, PAGE_KERNEL);
+	for (i = 0; i < n_subbuf_pages; i++)
+		buf->page_array[n_pages + i] = buf->page_array[i];
+	mem = vmap(buf->page_array, n_pages + n_subbuf_pages, VM_MAP,
+		   PAGE_KERNEL);
 	if (!mem)
 		goto depopulate;
 
-	memset(mem, 0, *size);
+	memset(mem, 0, buf->chan->alloc_size);
 	buf->page_count = n_pages;
 	return mem;
 
@@ -179,12 +182,13 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan)
 	if (!buf->padding)
 		goto free_buf;
 
-	buf->start = relay_alloc_buf(buf, &chan->alloc_size);
+	buf->chan = chan;
+	kref_get(&buf->chan->kref);
+
+	buf->start = relay_alloc_buf(buf);
 	if (!buf->start)
 		goto free_buf;
 
-	buf->chan = chan;
-	kref_get(&buf->chan->kref);
 	return buf;
 
 free_buf:
-- 
1.5.3.5





* [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (6 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 5/8] relay - Map the first sub-buffer at the end of the buffer, for temporary convenience Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:07             ` [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al Tom Zanussi
  2008-09-25  6:08             ` [RFC PATCH 8/8] relay - Clean up remaining padding-related junk Tom Zanussi
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Replace relay_reserve/relay_write with non-padded versions.

The old versions of relay_reserve/relay_write would write/reserve an
event only if the whole thing could fit in the remaining space of the
current sub-buffer; if it couldn't it would add padding to the current
sub-buffer and reserve in the next.  The new versions don't add
padding but use up all the space in a sub-buffer and write the
remainder in the next sub-buffer.  They won't, however, write a partial
event - if there's not enough space for the event in the current
sub-buffer and the next sub-buffer isn't free, the whole reserve/write
will fail.
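
The all-or-nothing rule can be sketched in user space as follows.  This is
a simplified model, not the patch's implementation: the model_* names,
the sizes, and the use of a single "consumed" counter to decide whether
the next sub-buffer is free are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SUBBUF_SIZE 8
#define MODEL_N_SUBBUFS   2
#define MODEL_BUF_SIZE    (MODEL_SUBBUF_SIZE * MODEL_N_SUBBUFS)

struct model_buf {
	char start[MODEL_BUF_SIZE];
	size_t pos;		/* absolute write position, never wraps */
	size_t consumed;	/* sub-buffers the reader has released */
};

static void model_init(struct model_buf *buf)
{
	memset(buf, 0, sizeof(*buf));
}

/* Returns 1 if the whole event was written, 0 if it was rejected. */
static int model_write(struct model_buf *buf, const char *data, size_t length)
{
	size_t i, last_subbuf;

	assert(data != NULL);
	if (length == 0)
		return 1;
	if (length > MODEL_SUBBUF_SIZE)
		return 0;	/* event too big for a sub-buffer */

	/* Sub-buffer (counted since the start) holding the last byte. */
	last_subbuf = (buf->pos + length - 1) / MODEL_SUBBUF_SIZE;
	if (last_subbuf >= buf->consumed + MODEL_N_SUBBUFS)
		return 0;	/* next sub-buffer not free: write nothing */

	/* No padding: fill the current sub-buffer, spill into the next. */
	for (i = 0; i < length; i++)
		buf->start[(buf->pos + i) % MODEL_BUF_SIZE] = data[i];
	buf->pos += length;
	return 1;
}
```

With two 8-byte sub-buffers, two 6-byte events fit back to back across
the boundary; a third is rejected in full until the reader releases the
first sub-buffer.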
---
 include/linux/relay.h |   41 +++++++++++++++++++----------
 kernel/relay.c        |   69 +++++++++++++++++++++++++++----------------------
 2 files changed, 65 insertions(+), 45 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 13163b0..c42b2d3 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -207,9 +207,9 @@ extern void relay_subbufs_consumed(struct rchan *chan,
 extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
-extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
-					     size_t length,
-					     void **reserved);
+extern size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf,
+						   size_t length,
+						   void **reserved);
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -270,17 +270,23 @@ static inline void relay_write(struct rchan *chan,
 			       const void *data,
 			       size_t length)
 {
-	unsigned long flags;
+	size_t remainder = length;
 	struct rchan_buf *buf;
+	unsigned long flags;
 	void *reserved;
 
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
 	reserved = buf->data + buf->offset;
-	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = chan->cb->switch_subbuf(buf, length, &reserved);
+	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
+		remainder = chan->cb->switch_subbuf(buf, length, &reserved);
+		if (unlikely(!reserved)) {
+			local_irq_restore(flags);
+			return;
+		}
+	}
 	memcpy(reserved, data, length);
-	buf->offset += length;
+	buf->offset += remainder;
 	local_irq_restore(flags);
 }
 
@@ -300,15 +306,22 @@ static inline void __relay_write(struct rchan *chan,
 				 const void *data,
 				 size_t length)
 {
+	size_t remainder = length;
 	struct rchan_buf *buf;
+	unsigned long flags;
 	void *reserved;
 
 	buf = chan->buf[get_cpu()];
 	reserved = buf->data + buf->offset;
-	if (unlikely(buf->offset + length > buf->chan->subbuf_size))
-		length = chan->cb->switch_subbuf(buf, length, &reserved);
+	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
+		remainder = chan->cb->switch_subbuf(buf, length, &reserved);
+		if (unlikely(!reserved)) {
+			put_cpu();
+			return;
+		}
+	}
 	memcpy(reserved, data, length);
-	buf->offset += length;
+	buf->offset += remainder;
 	put_cpu();
 }
 
@@ -323,15 +336,15 @@ static inline void __relay_write(struct rchan *chan,
  *	Does not protect the buffer at all - caller must provide
  *	appropriate synchronization.
  */
-static inline void *relay_reserve(struct rchan *chan, size_t length)
+static inline void *relay_reserve(struct rchan *chan,
+				  size_t length)
 {
-	void *reserved;
 	struct rchan_buf *buf = chan->buf[smp_processor_id()];
+	void *reserved = buf->data + buf->offset;
 
-	reserved = buf->data + buf->offset;
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
 		length = chan->cb->switch_subbuf(buf, length, &reserved);
-		if (!length)
+		if (unlikely(!reserved))
 			return NULL;
 	}
 	buf->offset += length;
diff --git a/kernel/relay.c b/kernel/relay.c
index 9a08fec..15e4de2 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -350,7 +350,7 @@ static struct rchan_callbacks default_channel_callbacks = {
 	.create_buf_file = create_buf_file_default_callback,
 	.remove_buf_file = remove_buf_file_default_callback,
 	.wakeup_readers = wakeup_readers_default_callback,
-	.switch_subbuf = switch_subbuf_default_callback,
+	.switch_subbuf = relay_switch_subbuf_default_callback,
 };
 
 /**
@@ -530,7 +530,7 @@ static void setup_callbacks(struct rchan *chan,
 	if (!cb->wakeup_readers)
 		cb->wakeup_readers = wakeup_readers_default_callback;
 	if (!cb->switch_subbuf)
-		cb->switch_subbuf = switch_subbuf_default_callback;
+		cb->switch_subbuf = relay_switch_subbuf_default_callback;
 	chan->cb = cb;
 }
 
@@ -736,8 +736,20 @@ int relay_late_setup_files(struct rchan *chan,
 	return err;
 }
 
+static inline int next_subbuf_free(struct rchan_buf *buf)
+{
+	size_t full_subbufs;
+
+	if (buf->chan->flags & RCHAN_MODE_OVERWRITE)
+		return 1;
+
+	full_subbufs = buf->subbufs_produced - buf->subbufs_consumed;
+
+	return (full_subbufs < buf->chan->n_subbufs - 1);
+}
+
 /**
- *	switch_subbuf_default_callback - switch to a new sub-buffer
+ *	relay_switch_subbuf_default_callback - switch to a new sub-buffer
  *	@buf: channel buffer
  *	@length: size of current event
  *	@reserved: a pointer to the space reserved
@@ -747,50 +759,45 @@ int relay_late_setup_files(struct rchan *chan,
  *	Performs sub-buffer-switch tasks such as invoking callbacks,
  *	updating padding counts, waking up readers, etc.
  */
-size_t switch_subbuf_default_callback(struct rchan_buf *buf,
-				      size_t length,
-				      void **reserved)
+size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf,
+					    size_t length,
+					    void **reserved)
 {
-	void *old, *new;
-	size_t old_subbuf, new_subbuf;
+	size_t remainder, new_subbuf;
+	void *new_data;
 
 	if (unlikely(relay_event_toobig(buf, length)))
 		goto toobig;
 
-	if (buf->offset != buf->chan->subbuf_size + 1) {
-		buf->prev_padding = buf->chan->subbuf_size - buf->offset;
-		old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
-		buf->padding[old_subbuf] = buf->prev_padding;
-		relay_inc_produced(buf);
-		relay_update_filesize(buf, buf->chan->subbuf_size -
-				      buf->padding[old_subbuf]);
-		buf->chan->cb->wakeup_readers(buf);
+	/* don't write anything unless we can write it all. */
+	if (!next_subbuf_free(buf)) {
+		*reserved = NULL;
+		return 0;
 	}
 
-	old = buf->data;
+	if (reserved)
+		*reserved = buf->data + buf->offset;
+
+	remainder = length - (buf->chan->subbuf_size - buf->offset);
+	relay_inc_produced(buf);
+	relay_update_filesize(buf, buf->chan->subbuf_size + remainder);
+	buf->chan->cb->wakeup_readers(buf);
+
 	new_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
-	new = buf->start + new_subbuf * buf->chan->subbuf_size;
-	buf->offset = 0;
-	if (!buf->chan->cb->subbuf_start(buf, new, old, buf->prev_padding)) {
-		buf->offset = buf->chan->subbuf_size + 1;
-		return 0;
-	}
-	buf->data = new;
-	buf->padding[new_subbuf] = 0;
+	new_data = buf->start + new_subbuf * buf->chan->subbuf_size;
+
+	buf->data = new_data;
+	buf->offset = 0; /* remainder will be added by caller */
 
 	if (unlikely(relay_event_toobig(buf, length + buf->offset)))
 		goto toobig;
 
-	if (reserved)
-		*reserved = buf->data;
-
-	return length;
-
+	return remainder;
 toobig:
 	buf->chan->last_toobig = length;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(switch_subbuf_default_callback);
+EXPORT_SYMBOL_GPL(relay_switch_subbuf_default_callback);
 
 /**
  *	relay_subbufs_consumed - update the buffer's sub-buffers-consumed count
-- 
1.5.3.5





* [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (7 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 6/8] relay - Replace relay_reserve/relay_write with non-padded versions Tom Zanussi
@ 2008-09-25  6:07             ` Tom Zanussi
  2008-09-25  6:08             ` [RFC PATCH 8/8] relay - Clean up remaining padding-related junk Tom Zanussi
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:07 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Remove padding-related code from relay_read()/relay_splice_read() et al.

Because we no longer write padding, we no longer have to read it or
account for it anywhere else, greatly simplifying the related code.
---
 kernel/relay.c |  149 ++++++++------------------------------------------------
 1 files changed, 20 insertions(+), 129 deletions(-)
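
For illustration only, a userspace sketch of the read-side arithmetic
once padding is gone.  It mirrors the patched
relay_file_read_subbuf_avail() and relay_file_read_consume() below, but
the names (model_read_avail(), model_read_consume(), struct
model_reader) are hypothetical and the buffer state is passed in as
plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader-side state: what has been consumed so far. */
struct model_reader {
	size_t subbufs_consumed;	/* whole sub-buffers consumed */
	size_t bytes_consumed;		/* bytes consumed in current one */
};

/*
 * Bytes readable from read_pos up to the end of its sub-buffer, or up
 * to the write position if the writer is ahead in the same sub-buffer.
 */
static size_t model_read_avail(size_t read_pos, size_t write_subbuf,
			       size_t write_offset, size_t subbuf_size)
{
	size_t read_subbuf, read_offset, avail;

	assert(subbuf_size > 0);
	read_subbuf = read_pos / subbuf_size;
	read_offset = read_pos % subbuf_size;

	avail = subbuf_size - read_offset;
	if (read_subbuf == write_subbuf && read_offset < write_offset)
		avail = write_offset - read_offset;

	return avail;
}

/* A sub-buffer counts as consumed only once every byte has been read. */
static void model_read_consume(struct model_reader *r, size_t n,
			       size_t subbuf_size)
{
	r->bytes_consumed += n;
	if (r->bytes_consumed == subbuf_size) {
		r->subbufs_consumed++;
		r->bytes_consumed = 0;
	}
}
```

Note that, as in the posted patch, a reader that has caught up with
the writer inside the same sub-buffer still gets the rest of that
sub-buffer reported as available, so in this sketch an "is the buffer
empty" check would have to be done separately by the caller.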

diff --git a/kernel/relay.c b/kernel/relay.c
index 15e4de2..21b3e19 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -966,72 +966,13 @@ static void relay_file_read_consume(struct rchan_buf *buf,
 				    size_t bytes_consumed)
 {
 	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t read_subbuf;
-
-	if (buf->subbufs_produced == buf->subbufs_consumed &&
-	    buf->offset == buf->bytes_consumed)
-		return;
-
-	if (buf->bytes_consumed + bytes_consumed > subbuf_size) {
-		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
-		buf->bytes_consumed = 0;
-	}
 
 	buf->bytes_consumed += bytes_consumed;
-	if (!read_pos)
-		read_subbuf = buf->subbufs_consumed % n_subbufs;
-	else
-		read_subbuf = read_pos / buf->chan->subbuf_size;
-	if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) {
-		if ((read_subbuf == buf->subbufs_produced % n_subbufs) &&
-		    (buf->offset == subbuf_size))
-			return;
-		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
-		buf->bytes_consumed = 0;
-	}
-}
 
-/*
- *	relay_file_read_avail - boolean, are there unconsumed bytes available?
- */
-static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos)
-{
-	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t produced = buf->subbufs_produced;
-	size_t consumed = buf->subbufs_consumed;
-
-	relay_file_read_consume(buf, read_pos, 0);
-
-	consumed = buf->subbufs_consumed;
-
-	if (unlikely(buf->offset > subbuf_size)) {
-		if (produced == consumed)
-			return 0;
-		return 1;
-	}
-
-	if (unlikely(produced - consumed >= n_subbufs)) {
-		consumed = produced - n_subbufs + 1;
-		buf->subbufs_consumed = consumed;
+	if (buf->bytes_consumed == subbuf_size) {
+		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
 		buf->bytes_consumed = 0;
 	}
-
-	produced = (produced % n_subbufs) * subbuf_size + buf->offset;
-	consumed = (consumed % n_subbufs) * subbuf_size + buf->bytes_consumed;
-
-	if (consumed > produced)
-		produced += n_subbufs * subbuf_size;
-
-	if (consumed == produced) {
-		if (buf->offset == subbuf_size &&
-		    buf->subbufs_produced > buf->subbufs_consumed)
-			return 1;
-		return 0;
-	}
-
-	return 1;
 }
 
 /**
@@ -1042,21 +983,19 @@ static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos)
 static size_t relay_file_read_subbuf_avail(size_t read_pos,
 					   struct rchan_buf *buf)
 {
-	size_t padding, avail = 0;
+	size_t avail;
 	size_t read_subbuf, read_offset, write_subbuf, write_offset;
 	size_t subbuf_size = buf->chan->subbuf_size;
 
 	write_subbuf = (buf->data - buf->start) / subbuf_size;
-	write_offset = buf->offset > subbuf_size ? subbuf_size : buf->offset;
+	write_offset = buf->offset;
 	read_subbuf = read_pos / subbuf_size;
 	read_offset = read_pos % subbuf_size;
-	padding = buf->padding[read_subbuf];
 
-	if (read_subbuf == write_subbuf) {
-		if (read_offset + padding < write_offset)
-			avail = write_offset - (read_offset + padding);
-	} else
-		avail = (subbuf_size - padding) - read_offset;
+	avail = subbuf_size - read_offset;
+
+	if (read_subbuf == write_subbuf && read_offset < write_offset)
+		avail = write_offset - read_offset;
 
 	return avail;
 }
@@ -1066,28 +1005,17 @@ static size_t relay_file_read_subbuf_avail(size_t read_pos,
  *	@read_pos: file read position
  *	@buf: relay channel buffer
  *
- *	If the @read_pos is in the middle of padding, return the
- *	position of the first actually available byte, otherwise
- *	return the original value.
+ *	If the @read_pos is 0, return the position of the first
+ *	unconsumed byte, otherwise return the original value.
  */
 static size_t relay_file_read_start_pos(size_t read_pos,
 					struct rchan_buf *buf)
 {
-	size_t read_subbuf, padding, padding_start, padding_end;
 	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t consumed = buf->subbufs_consumed % n_subbufs;
+	size_t consumed = buf->subbufs_consumed % buf->chan->n_subbufs;
 
 	if (!read_pos)
 		read_pos = consumed * subbuf_size + buf->bytes_consumed;
-	read_subbuf = read_pos / subbuf_size;
-	padding = buf->padding[read_subbuf];
-	padding_start = (read_subbuf + 1) * subbuf_size - padding;
-	padding_end = (read_subbuf + 1) * subbuf_size;
-	if (read_pos >= padding_start && read_pos < padding_end) {
-		read_subbuf = (read_subbuf + 1) % n_subbufs;
-		read_pos = read_subbuf * subbuf_size;
-	}
 
 	return read_pos;
 }
@@ -1102,17 +1030,9 @@ static size_t relay_file_read_end_pos(struct rchan_buf *buf,
 				      size_t read_pos,
 				      size_t count)
 {
-	size_t read_subbuf, padding, end_pos;
-	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
+	size_t end_pos = read_pos + count;
 
-	read_subbuf = read_pos / subbuf_size;
-	padding = buf->padding[read_subbuf];
-	if (read_pos % subbuf_size + count + padding == subbuf_size)
-		end_pos = (read_subbuf + 1) * subbuf_size;
-	else
-		end_pos = read_pos + count;
-	if (end_pos >= subbuf_size * n_subbufs)
+	if (end_pos >= buf->chan->subbuf_size * buf->chan->n_subbufs)
 		end_pos = 0;
 
 	return end_pos;
@@ -1166,9 +1086,6 @@ static ssize_t relay_file_read_subbufs(struct file *filp, loff_t *ppos,
 
 	mutex_lock(&filp->f_path.dentry->d_inode->i_mutex);
 	do {
-		if (!relay_file_read_avail(buf, *ppos))
-			break;
-
 		read_start = relay_file_read_start_pos(*ppos, buf);
 		avail = relay_file_read_subbuf_avail(read_start, buf);
 		if (!avail)
@@ -1243,8 +1160,7 @@ static int subbuf_splice_actor(struct file *in,
 			       loff_t *ppos,
 			       struct pipe_inode_info *pipe,
 			       size_t len,
-			       unsigned int flags,
-			       int *nonpad_ret)
+			       unsigned int flags)
 {
 	unsigned int pidx, poff, total_len, subbuf_pages, nr_pages, ret;
 	struct rchan_buf *rbuf = in->private_data;
@@ -1252,9 +1168,6 @@ static int subbuf_splice_actor(struct file *in,
 	uint64_t pos = (uint64_t) *ppos;
 	uint32_t alloc_size = (uint32_t) rbuf->chan->alloc_size;
 	size_t read_start = (size_t) do_div(pos, alloc_size);
-	size_t read_subbuf = read_start / subbuf_size;
-	size_t padding = rbuf->padding[read_subbuf];
-	size_t nonpad_end = read_subbuf * subbuf_size + subbuf_size - padding;
 	struct page *pages[PIPE_BUFFERS];
 	struct partial_page partial[PIPE_BUFFERS];
 	struct splice_pipe_desc spd = {
@@ -1266,7 +1179,8 @@ static int subbuf_splice_actor(struct file *in,
 		.spd_release = relay_page_release,
 	};
 
-	if (rbuf->subbufs_produced == rbuf->subbufs_consumed)
+	if (rbuf->subbufs_produced == rbuf->subbufs_consumed &&
+	    rbuf->offset == rbuf->bytes_consumed)
 		return 0;
 
 	/*
@@ -1281,46 +1195,25 @@ static int subbuf_splice_actor(struct file *in,
 	nr_pages = min_t(unsigned int, subbuf_pages, PIPE_BUFFERS);
 
 	for (total_len = 0; spd.nr_pages < nr_pages; spd.nr_pages++) {
-		unsigned int this_len, this_end, private;
-		unsigned int cur_pos = read_start + total_len;
+		unsigned int this_len;
 
 		if (!len)
 			break;
 
 		this_len = min_t(unsigned long, len, PAGE_SIZE - poff);
-		private = this_len;
 
 		spd.pages[spd.nr_pages] = rbuf->page_array[pidx];
 		spd.partial[spd.nr_pages].offset = poff;
 
-		this_end = cur_pos + this_len;
-		if (this_end >= nonpad_end) {
-			this_len = nonpad_end - cur_pos;
-			private = this_len + padding;
-		}
 		spd.partial[spd.nr_pages].len = this_len;
-		spd.partial[spd.nr_pages].private = private;
 
 		len -= this_len;
 		total_len += this_len;
 		poff = 0;
 		pidx = (pidx + 1) % subbuf_pages;
-
-		if (this_end >= nonpad_end) {
-			spd.nr_pages++;
-			break;
-		}
 	}
 
-	if (!spd.nr_pages)
-		return 0;
-
-	ret = *nonpad_ret = splice_to_pipe(pipe, &spd);
-	if (ret < 0 || ret < total_len)
-		return ret;
-
-        if (read_start + ret == nonpad_end)
-                ret += padding;
+	ret = splice_to_pipe(pipe, &spd);
 
         return ret;
 }
@@ -1333,13 +1226,12 @@ static ssize_t relay_file_splice_read(struct file *in,
 {
 	ssize_t spliced;
 	int ret;
-	int nonpad_ret = 0;
 
 	ret = 0;
 	spliced = 0;
 
 	while (len && !spliced) {
-		ret = subbuf_splice_actor(in, ppos, pipe, len, flags, &nonpad_ret);
+		ret = subbuf_splice_actor(in, ppos, pipe, len, flags);
 		if (ret < 0)
 			break;
 		else if (!ret) {
@@ -1356,8 +1248,7 @@ static ssize_t relay_file_splice_read(struct file *in,
 			len = 0;
 		else
 			len -= ret;
-		spliced += nonpad_ret;
-		nonpad_ret = 0;
+		spliced += ret;
 	}
 
 	if (spliced)
-- 
1.5.3.5





* [RFC PATCH 8/8] relay - Clean up remaining padding-related junk.
  2008-09-24  3:50           ` Tom Zanussi
                               ` (8 preceding siblings ...)
  2008-09-25  6:07             ` [RFC PATCH 7/8] relay - Remove padding-related code from relay_read()/relay_splice_read() et al Tom Zanussi
@ 2008-09-25  6:08             ` Tom Zanussi
  9 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-25  6:08 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, prasad, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Clean up remaining padding-related junk.

Removes the rest of the padding-related junk.  Also simplifies the
subbuf_start callback a bit.
---
 block/blktrace.c      |    5 +++--
 include/linux/relay.h |   12 ++----------
 kernel/relay.c        |   19 ++++---------------
 virt/kvm/kvm_trace.c  |    7 ++++---
 4 files changed, 13 insertions(+), 30 deletions(-)
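
For illustration only, a userspace sketch of how a client might use
the simplified subbuf_start signature.  It is loosely modelled on the
kvm_trace callback changed below; struct model_buf, model_buf_full(),
model_subbuf_start() and the "HDR0" header are all hypothetical and
not part of relay.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for the per-buffer client state. */
struct model_buf {
	size_t produced;	/* sub-buffers produced */
	size_t consumed;	/* sub-buffers consumed */
	size_t n_subbufs;	/* sub-buffers in the channel */
	size_t lost;		/* times we found the buffer full */
	int    header_written;	/* set once the header is in place */
};

/* Full when every sub-buffer is produced but not yet consumed. */
static int model_buf_full(const struct model_buf *b)
{
	return b->produced - b->consumed >= b->n_subbufs;
}

/*
 * Return 1 to continue logging, 0 to stop.  first_subbuf replaces the
 * old "prev_subbuf == NULL" test for the very first sub-buffer.
 */
static int model_subbuf_start(struct model_buf *b, void *subbuf,
			      int first_subbuf)
{
	assert(subbuf != NULL);

	if (model_buf_full(b)) {
		b->lost++;	/* let user space know events were lost */
		return 0;
	}

	if (first_subbuf) {
		/* executed only once, when the buffer is set up */
		memcpy(subbuf, "HDR0", 4);
		b->header_written = 1;
	}

	return 1;
}
```

The callback no longer needs the previous sub-buffer pointer or its
padding count; the boolean carries the only information clients were
actually deriving from them.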

diff --git a/block/blktrace.c b/block/blktrace.c
index 150c5f7..271b7b7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -334,8 +334,9 @@ static const struct file_operations blk_msg_fops = {
  * Keep track of how many times we encountered a full subbuffer, to aid
  * the user space app in telling how many lost events there were.
  */
-static int blk_subbuf_start_callback(struct rchan_buf *buf, void *subbuf,
-				     void *prev_subbuf, size_t prev_padding)
+static int blk_subbuf_start_callback(struct rchan_buf *buf,
+				     void *subbuf,
+				     int first_subbuf)
 {
 	struct blk_trace *bt;
 
diff --git a/include/linux/relay.h b/include/linux/relay.h
index c42b2d3..85f43a0 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -51,8 +51,6 @@ struct rchan_buf
 	struct page **page_array;	/* array of current buffer pages */
 	unsigned int page_count;	/* number of current buffer pages */
 	unsigned int finalized;		/* buffer has been finalized */
-	size_t *padding;		/* padding counts per sub-buffer */
-	size_t prev_padding;		/* temporary variable */
 	size_t bytes_consumed;		/* bytes consumed in cur read subbuf */
 	size_t early_bytes;		/* bytes consumed before VFS inited */
 	unsigned int cpu;		/* this buf's cpu */
@@ -88,23 +86,17 @@ struct rchan_callbacks
 	 * subbuf_start - called on buffer-switch to a new sub-buffer
 	 * @buf: the channel buffer containing the new sub-buffer
 	 * @subbuf: the start of the new sub-buffer
-	 * @prev_subbuf: the start of the previous sub-buffer
-	 * @prev_padding: unused space at the end of previous sub-buffer
+	 * @first_subbuf: boolean, is this the first subbuf?
 	 *
 	 * The client should return 1 to continue logging, 0 to stop
 	 * logging.
 	 *
-	 * NOTE: subbuf_start will also be invoked when the buffer is
-	 *       created, so that the first sub-buffer can be initialized
-	 *       if necessary.  In this case, prev_subbuf will be NULL.
-	 *
 	 * NOTE: the client can reserve bytes at the beginning of the new
 	 *       sub-buffer by calling subbuf_start_reserve() in this callback.
 	 */
 	int (*subbuf_start) (struct rchan_buf *buf,
 			     void *subbuf,
-			     void *prev_subbuf,
-			     size_t prev_padding);
+			     int first_subbuf);
 
 	/*
 	 * buf_mapped - relay buffer mmap notification
diff --git a/kernel/relay.c b/kernel/relay.c
index 21b3e19..b2bf510 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -178,10 +178,6 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan)
 	if (!buf)
 		return NULL;
 
-	buf->padding = kmalloc(chan->n_subbufs * sizeof(size_t *), GFP_KERNEL);
-	if (!buf->padding)
-		goto free_buf;
-
 	buf->chan = chan;
 	kref_get(&buf->chan->kref);
 
@@ -192,7 +188,6 @@ static struct rchan_buf *relay_create_buf(struct rchan *chan)
 	return buf;
 
 free_buf:
-	kfree(buf->padding);
 	kfree(buf);
 	return NULL;
 }
@@ -225,7 +220,6 @@ static void relay_destroy_buf(struct rchan_buf *buf)
 		relay_free_page_array(buf->page_array);
 	}
 	chan->buf[buf->cpu] = NULL;
-	kfree(buf->padding);
 	kfree(buf);
 	kref_put(&chan->kref, relay_destroy_channel);
 }
@@ -283,8 +277,7 @@ EXPORT_SYMBOL_GPL(relay_buf_full);
  */
 static int subbuf_start_default_callback (struct rchan_buf *buf,
 					  void *subbuf,
-					  void *prev_subbuf,
-					  size_t prev_padding)
+					  int first_subbuf)
 {
 	if (relay_buf_full(buf))
 		return 0;
@@ -374,8 +367,6 @@ static void wakeup_readers(unsigned long data)
  */
 static void __relay_reset(struct rchan_buf *buf, unsigned int init)
 {
-	size_t i;
-
 	if (init) {
 		init_waitqueue_head(&buf->read_wait);
 		kref_init(&buf->kref);
@@ -390,10 +381,7 @@ static void __relay_reset(struct rchan_buf *buf, unsigned int init)
 	buf->data = buf->start;
 	buf->offset = 0;
 
-	for (i = 0; i < buf->chan->n_subbufs; i++)
-		buf->padding[i] = 0;
-
-	buf->chan->cb->subbuf_start(buf, buf->data, NULL, 0);
+	buf->chan->cb->subbuf_start(buf, buf->data, 1);
 }
 
 /**
@@ -757,7 +745,7 @@ static inline int next_subbuf_free(struct rchan_buf *buf)
  *	Returns either the length passed in or 0 if full.
  *
  *	Performs sub-buffer-switch tasks such as invoking callbacks,
- *	updating padding counts, waking up readers, etc.
+ *	waking up readers, etc.
  */
 size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf,
 					    size_t length,
@@ -788,6 +776,7 @@ size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf,
 
 	buf->data = new_data;
 	buf->offset = 0; /* remainder will be added by caller */
+	buf->chan->cb->subbuf_start(buf, new_data, 0);
 
 	if (unlikely(relay_event_toobig(buf, length + buf->offset)))
 		goto toobig;
diff --git a/virt/kvm/kvm_trace.c b/virt/kvm/kvm_trace.c
index d0a9e1c..4626caa 100644
--- a/virt/kvm/kvm_trace.c
+++ b/virt/kvm/kvm_trace.c
@@ -105,13 +105,14 @@ DEFINE_SIMPLE_ATTRIBUTE(kvm_trace_lost_ops, lost_records_get, NULL, "%llu\n");
  *  many times we encountered a full subbuffer, to tell user space app the
  *  lost records there were.
  */
-static int kvm_subbuf_start_callback(struct rchan_buf *buf, void *subbuf,
-				     void *prev_subbuf, size_t prev_padding)
+static int kvm_subbuf_start_callback(struct rchan_buf *buf,
+				     void *subbuf,
+				     int first_subbuf)
 {
 	struct kvm_trace *kt;
 
 	if (!relay_buf_full(buf)) {
-		if (!prev_subbuf) {
+		if (first_subbuf) {
 			/*
 			 * executed only once when the channel is opened
 			 * save metadata as first record
-- 
1.5.3.5





* [PATCH 1/3] relay - clean up subbuf switch
  2008-09-22 14:45     ` Peter Zijlstra
                         ` (2 preceding siblings ...)
  2008-09-23  5:25       ` Tom Zanussi
@ 2008-09-23  5:27       ` Tom Zanussi
  2008-09-23 20:15         ` Andrew Morton
  2008-09-23  5:27       ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi
  2008-09-23  5:27       ` [PATCH 3/3] relay - add channel flags Tom Zanussi
  5 siblings, 1 reply; 122+ messages in thread
From: Tom Zanussi @ 2008-09-23  5:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Clean up relay_switch_subbuf() and make waking up consumers optional.
    
Over time, relay_switch_subbuf() has accumulated some cruft - this
patch cleans it up and at the same time makes some of it
available as common functions that any subbuf-switch implementor would
need (this is partially in preparation for the next patch, which makes
the subbuf-switch function completely replaceable).  It also removes
the hard-coded reader wakeup and moves it into a replaceable callback
called notify_consumers(); this allows any given tracer to implement
consumer notification as it sees fit.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 953fc05..17f0515 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -159,6 +159,15 @@ struct rchan_callbacks
 	 * The callback should return 0 if successful, negative if not.
 	 */
 	int (*remove_buf_file)(struct dentry *dentry);
+
+	/*
+	 * notify_consumers - new sub-buffer available, let consumers know
+	 * @buf: the channel buffer
+	 *
+	 * Called during sub-buffer switch.  Applications which don't
+	 * want to notify anyone should implement an empty version.
+	 */
+	void (*notify_consumers)(struct rchan_buf *buf);
 };
 
 /*
@@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf,
 				  size_t length);
 
 /**
+ *	relay_event_toobig - is event too big to fit in a sub-buffer?
+ *	@buf: relay channel buffer
+ *	@length: length of event
+ *
+ *	Returns 1 if too big, 0 otherwise.
+ *
+ *	switch_subbuf() helper function.
+ */
+static inline int relay_event_toobig(struct rchan_buf *buf, size_t length)
+{
+	return length > buf->chan->subbuf_size;
+}
+
+/**
+ *	relay_update_filesize - increase relay file i_size by length
+ *	@buf: relay channel buffer
+ *	@length: length to add
+ *
+ *	switch_subbuf() helper function.
+ */
+static inline void relay_update_filesize(struct rchan_buf *buf, size_t length)
+{
+	if (buf->dentry)
+		buf->dentry->d_inode->i_size +=	length;
+	else
+		buf->early_bytes += length;
+
+	smp_mb();
+}
+
+/**
+ *	relay_inc_produced - increase number of sub-buffers produced by 1
+ *	@buf: relay channel buffer
+ *
+ *	switch_subbuf() helper function.
+ */
+static inline void relay_inc_produced(struct rchan_buf *buf)
+{
+	buf->subbufs_produced++;
+}
+
+/**
  *	relay_write - write data into the channel
  *	@chan: relay channel
  *	@data: data to be written
diff --git a/kernel/relay.c b/kernel/relay.c
index 8d13a78..53652f1 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -324,6 +324,21 @@ static int remove_buf_file_default_callback(struct dentry *dentry)
 	return -EINVAL;
 }
 
+/*
+ * notify_consumers() default callback.
+ */
+static void notify_consumers_default_callback(struct rchan_buf *buf)
+{
+	if (waitqueue_active(&buf->read_wait))
+		/*
+		 * Calling wake_up_interruptible() from here
+		 * will deadlock if we happen to be logging
+		 * from the scheduler (trying to re-grab
+		 * rq->lock), so defer it.
+		 */
+		__mod_timer(&buf->timer, jiffies + 1);
+}
+
 /* relay channel default callbacks */
 static struct rchan_callbacks default_channel_callbacks = {
 	.subbuf_start = subbuf_start_default_callback,
@@ -331,6 +346,7 @@ static struct rchan_callbacks default_channel_callbacks = {
 	.buf_unmapped = buf_unmapped_default_callback,
 	.create_buf_file = create_buf_file_default_callback,
 	.remove_buf_file = remove_buf_file_default_callback,
+	.notify_consumers = notify_consumers_default_callback,
 };
 
 /**
@@ -508,6 +524,8 @@ static void setup_callbacks(struct rchan *chan,
 		cb->create_buf_file = create_buf_file_default_callback;
 	if (!cb->remove_buf_file)
 		cb->remove_buf_file = remove_buf_file_default_callback;
+	if (!cb->notify_consumers)
+		cb->notify_consumers = notify_consumers_default_callback;
 	chan->cb = cb;
 }
 
@@ -726,30 +744,17 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
 	void *old, *new;
 	size_t old_subbuf, new_subbuf;
 
-	if (unlikely(length > buf->chan->subbuf_size))
+	if (unlikely(relay_event_toobig(buf, length)))
 		goto toobig;
 
 	if (buf->offset != buf->chan->subbuf_size + 1) {
 		buf->prev_padding = buf->chan->subbuf_size - buf->offset;
 		old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
 		buf->padding[old_subbuf] = buf->prev_padding;
-		buf->subbufs_produced++;
-		if (buf->dentry)
-			buf->dentry->d_inode->i_size +=
-				buf->chan->subbuf_size -
-				buf->padding[old_subbuf];
-		else
-			buf->early_bytes += buf->chan->subbuf_size -
-					    buf->padding[old_subbuf];
-		smp_mb();
-		if (waitqueue_active(&buf->read_wait))
-			/*
-			 * Calling wake_up_interruptible() from here
-			 * will deadlock if we happen to be logging
-			 * from the scheduler (trying to re-grab
-			 * rq->lock), so defer it.
-			 */
-			__mod_timer(&buf->timer, jiffies + 1);
+		relay_inc_produced(buf);
+		relay_update_filesize(buf, buf->chan->subbuf_size -
+				      buf->padding[old_subbuf]);
+		buf->chan->cb->notify_consumers(buf);
 	}
 
 	old = buf->data;
@@ -763,7 +768,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
 	buf->data = new;
 	buf->padding[new_subbuf] = 0;
 
-	if (unlikely(length + buf->offset > buf->chan->subbuf_size))
+	if (unlikely(relay_event_toobig(buf, length + buf->offset)))
 		goto toobig;
 
 	return length;




* Re: [PATCH 1/3] relay - clean up subbuf switch
  2008-09-23  5:27       ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi
@ 2008-09-23 20:15         ` Andrew Morton
  0 siblings, 0 replies; 122+ messages in thread
From: Andrew Morton @ 2008-09-23 20:15 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx,
	compudj, rostedt, od, fche, hch, dwilder

On Tue, 23 Sep 2008 00:27:02 -0500
Tom Zanussi <zanussi@comcast.net> wrote:

> Clean up relay_switch_subbuf() and make waking up consumers optional.
>     
> Over time, relay_switch_subbuf() has accumulated some cruft - this
> patch cleans it up and at the same time makes some of it
> available as common functions that any subbuf-switch implementor would
> need (this is partially in preparation for the next patch, which makes
> the subbuf-switch function completely replaceable).  It also removes
> the hard-coded reader wakeup and moves it into a replaceable callback
> called notify_consumers(); this allows any given tracer to implement
> consumer notification as it sees fit.
> 
> Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
> 
> diff --git a/include/linux/relay.h b/include/linux/relay.h
> index 953fc05..17f0515 100644
> --- a/include/linux/relay.h
> +++ b/include/linux/relay.h
> @@ -159,6 +159,15 @@ struct rchan_callbacks
>  	 * The callback should return 0 if successful, negative if not.
>  	 */
>  	int (*remove_buf_file)(struct dentry *dentry);
> +
> +	/*
> +	 * notify_consumers - new sub-buffer available, let consumers know
> +	 * @buf: the channel buffer
> +	 *
> +	 * Called during sub-buffer switch.  Applications which don't
> +	 * want to notify anyone should implement an empty version.
> +	 */
> +	void (*notify_consumers)(struct rchan_buf *buf);
>  };

Does this comment format and placement get properly processed by the
kerneldoc tools?

>  /*
> @@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf,
>  				  size_t length);
>  
>  /**
> + *	relay_event_toobig - is event too big to fit in a sub-buffer?
> + *	@buf: relay channel buffer
> + *	@length: length of event
> + *
> + *	Returns 1 if too big, 0 otherwise.
> + *
> + *	switch_subbuf() helper function.
> + */
> +static inline int relay_event_toobig(struct rchan_buf *buf, size_t length)
> +{
> +	return length > buf->chan->subbuf_size;
> +}
> +
> +/**
> + *	relay_update_filesize - increase relay file i_size by length
> + *	@buf: relay channel buffer
> + *	@length: length to add
> + *
> + *	switch_subbuf() helper function.
> + */
> +static inline void relay_update_filesize(struct rchan_buf *buf, size_t length)
> +{
> +	if (buf->dentry)
> +		buf->dentry->d_inode->i_size +=	length;
> +	else
> +		buf->early_bytes += length;
> +
> +	smp_mb();
> +}

What locking protects the non-atomic modification of the 64-bit i_size
here?
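
To make the concern concrete: on a 32-bit machine a 64-bit store is
two 32-bit stores, so a concurrent reader can see half of the old value
and half of the new one.  The kernel's i_size_read()/i_size_write()
helpers deal with this on 32-bit SMP with a sequence counter.  Below is
a minimal userspace sketch of that idea using C11 atomics; struct
model_size and the model_size_*() functions are hypothetical, and like
i_size_write() the writer side still assumes that writers are
serialized by some other lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit size split in two halves, guarded by a seqcount. */
struct model_size {
	atomic_uint      seq;	/* odd while a write is in progress */
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

static void model_size_init(struct model_size *s)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->lo, 0);
	atomic_init(&s->hi, 0);
}

/* Writers must be serialized by the caller. */
static void model_size_write(struct model_size *s, uint64_t v)
{
	unsigned int seq = atomic_load_explicit(&s->seq,
						memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->lo, (uint32_t)v, memory_order_relaxed);
	atomic_store_explicit(&s->hi, (uint32_t)(v >> 32),
			      memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Readers retry until they see the same even count before and after. */
static uint64_t model_size_read(struct model_size *s)
{
	unsigned int s1, s2;
	uint32_t lo, hi;

	do {
		s1 = atomic_load_explicit(&s->seq, memory_order_acquire);
		lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return ((uint64_t)hi << 32) | lo;
}
```

The reader side needs no lock, but the "+=" in the patch is a
read-modify-write, so the writer side would still need to be
serialized.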

> +/**
> + *	relay_inc_produced - increase number of sub-buffers produced by 1
> + *	@buf: relay channel buffer
> + *
> + *	switch_subbuf() helper function.
> + */
> +static inline void relay_inc_produced(struct rchan_buf *buf)
> +{
> +	buf->subbufs_produced++;
> +}

This also needs caller-provided locking.  That's part of the function's
interface and should be documented.

> +/**
>   *	relay_write - write data into the channel
>   *	@chan: relay channel
>   *	@data: data to be written
> diff --git a/kernel/relay.c b/kernel/relay.c
> index 8d13a78..53652f1 100644
> --- a/kernel/relay.c
> +++ b/kernel/relay.c
> @@ -324,6 +324,21 @@ static int remove_buf_file_default_callback(struct dentry *dentry)
>  	return -EINVAL;
>  }
>  
> +/*
> + * notify_consumers() default callback.
> + */
> +static void notify_consumers_default_callback(struct rchan_buf *buf)
> +{
> +	if (waitqueue_active(&buf->read_wait))
> +		/*
> +		 * Calling wake_up_interruptible() from here
> +		 * will deadlock if we happen to be logging
> +		 * from the scheduler (trying to re-grab
> +		 * rq->lock), so defer it.
> +		 */
> +		__mod_timer(&buf->timer, jiffies + 1);
> +}
> +
>  /* relay channel default callbacks */
>  static struct rchan_callbacks default_channel_callbacks = {
>  	.subbuf_start = subbuf_start_default_callback,
> @@ -331,6 +346,7 @@ static struct rchan_callbacks default_channel_callbacks = {
>  	.buf_unmapped = buf_unmapped_default_callback,
>  	.create_buf_file = create_buf_file_default_callback,
>  	.remove_buf_file = remove_buf_file_default_callback,
> +	.notify_consumers = notify_consumers_default_callback,
>  };
>  
>  /**
> @@ -508,6 +524,8 @@ static void setup_callbacks(struct rchan *chan,
>  		cb->create_buf_file = create_buf_file_default_callback;
>  	if (!cb->remove_buf_file)
>  		cb->remove_buf_file = remove_buf_file_default_callback;
> +	if (!cb->notify_consumers)
> +		cb->notify_consumers = notify_consumers_default_callback;
>  	chan->cb = cb;
>  }
>  
> @@ -726,30 +744,17 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
>  	void *old, *new;
>  	size_t old_subbuf, new_subbuf;
>  
> -	if (unlikely(length > buf->chan->subbuf_size))
> +	if (unlikely(relay_event_toobig(buf, length)))
>  		goto toobig;
>  
>  	if (buf->offset != buf->chan->subbuf_size + 1) {
>  		buf->prev_padding = buf->chan->subbuf_size - buf->offset;
>  		old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
>  		buf->padding[old_subbuf] = buf->prev_padding;
> -		buf->subbufs_produced++;
> -		if (buf->dentry)
> -			buf->dentry->d_inode->i_size +=
> -				buf->chan->subbuf_size -
> -				buf->padding[old_subbuf];
> -		else
> -			buf->early_bytes += buf->chan->subbuf_size -
> -					    buf->padding[old_subbuf];
> -		smp_mb();
> -		if (waitqueue_active(&buf->read_wait))
> -			/*
> -			 * Calling wake_up_interruptible() from here
> -			 * will deadlock if we happen to be logging
> -			 * from the scheduler (trying to re-grab
> -			 * rq->lock), so defer it.
> -			 */
> -			__mod_timer(&buf->timer, jiffies + 1);
> +		relay_inc_produced(buf);
> +		relay_update_filesize(buf, buf->chan->subbuf_size -
> +				      buf->padding[old_subbuf]);
> +		buf->chan->cb->notify_consumers(buf);
>  	}
>  
>  	old = buf->data;
> @@ -763,7 +768,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
>  	buf->data = new;
>  	buf->padding[new_subbuf] = 0;
>  
> -	if (unlikely(length + buf->offset > buf->chan->subbuf_size))
> +	if (unlikely(relay_event_toobig(buf, length + buf->offset)))
>  		goto toobig;

I think you can put the unlikely() into relay_event_toobig() and gcc
will do the right thing, if that has any value.
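A minimal sketch of that suggestion, assuming a GCC-compatible compiler: the branch hint moves into the inline helper so call sites use a plain if (). The `demo_` structures and functions are hypothetical stand-ins for the relay ones, and whether the hint survives inlining is up to the compiler.

```c
#include <stddef.h>

/* Branch-prediction hint, defined the way the kernel does for GCC. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-ins for the relay structures; only the fields used here. */
struct demo_chan {
	size_t subbuf_size;
};

struct demo_buf {
	struct demo_chan *chan;
	size_t offset;
};

/* The hint lives inside the helper rather than at every call site. */
static inline int demo_event_toobig(const struct demo_buf *buf, size_t length)
{
	return unlikely(length > buf->chan->subbuf_size);
}

/* Caller with no unlikely() of its own: returns 0 when the event cannot fit. */
static size_t demo_switch_check(const struct demo_buf *buf, size_t length)
{
	if (demo_event_toobig(buf, length))
		return 0;
	return length;
}
```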

>  	return length;
> 


* [PATCH 2/3] relay - make subbuf switch replaceable
  2008-09-22 14:45     ` Peter Zijlstra
                         ` (3 preceding siblings ...)
  2008-09-23  5:27       ` [PATCH 1/3] relay - clean up subbuf switch Tom Zanussi
@ 2008-09-23  5:27       ` Tom Zanussi
  2008-09-23 20:17         ` Andrew Morton
  2008-09-23  5:27       ` [PATCH 3/3] relay - add channel flags Tom Zanussi
  5 siblings, 1 reply; 122+ messages in thread
From: Tom Zanussi @ 2008-09-23  5:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Make the relay sub-buffer switch code replaceable.
    
With this patch, tracers now have complete control over the relay
write (or reserve) path if they choose to do so, by implementing their
own version of the sub-buffer switch function (switch_subbuf()), in
addition to their own local write/reserve functions.  Tracers who
choose not to do so automatically default to the normal behavior.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 17f0515..52e4d61 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -168,6 +168,18 @@ struct rchan_callbacks
 	 * want to notify anyone should implement an empty version.
 	 */
         void (*notify_consumers)(struct rchan_buf *buf);
+
+	/*
+	 * switch_subbuf - sub-buffer switch callback
+	 * @buf: the channel buffer
+	 * @length: size of current event
+	 *
+	 * Returns either the length passed in or 0 if full.
+	 *
+	 * Performs sub-buffer-switch tasks such as updating filesize,
+	 * waking up readers, etc.
+	 */
+	size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length);
 };
 
 /*
@@ -191,8 +203,9 @@ extern void relay_subbufs_consumed(struct rchan *chan,
 extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
-extern size_t relay_switch_subbuf(struct rchan_buf *buf,
-				  size_t length);
+extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
+					     size_t length);
+
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -259,7 +272,7 @@ static inline void relay_write(struct rchan *chan,
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
 	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
 	local_irq_restore(flags);
@@ -285,7 +298,7 @@ static inline void __relay_write(struct rchan *chan,
 
 	buf = chan->buf[get_cpu()];
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size))
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
 	put_cpu();
@@ -308,7 +321,7 @@ static inline void *relay_reserve(struct rchan *chan, size_t length)
 	struct rchan_buf *buf = chan->buf[smp_processor_id()];
 
 	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) {
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 		if (!length)
 			return NULL;
 	}
diff --git a/kernel/relay.c b/kernel/relay.c
index 53652f1..e299f49 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -347,6 +347,7 @@ static struct rchan_callbacks default_channel_callbacks = {
 	.create_buf_file = create_buf_file_default_callback,
 	.remove_buf_file = remove_buf_file_default_callback,
 	.notify_consumers = notify_consumers_default_callback,
+	.switch_subbuf = switch_subbuf_default_callback,
 };
 
 /**
@@ -526,6 +527,8 @@ static void setup_callbacks(struct rchan *chan,
 		cb->remove_buf_file = remove_buf_file_default_callback;
 	if (!cb->notify_consumers)
 		cb->notify_consumers = notify_consumers_default_callback;
+	if (!cb->switch_subbuf)
+		cb->switch_subbuf = switch_subbuf_default_callback;
 	chan->cb = cb;
 }
 
@@ -730,7 +733,7 @@ int relay_late_setup_files(struct rchan *chan,
 }
 
 /**
- *	relay_switch_subbuf - switch to a new sub-buffer
+ *	switch_subbuf_default_callback - switch to a new sub-buffer
  *	@buf: channel buffer
  *	@length: size of current event
  *
@@ -739,7 +742,7 @@ int relay_late_setup_files(struct rchan *chan,
  *	Performs sub-buffer-switch tasks such as invoking callbacks,
  *	updating padding counts, waking up readers, etc.
  */
-size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
+size_t switch_subbuf_default_callback(struct rchan_buf *buf, size_t length)
 {
 	void *old, *new;
 	size_t old_subbuf, new_subbuf;
@@ -777,7 +780,7 @@ toobig:
 	buf->chan->last_toobig = length;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(relay_switch_subbuf);
+EXPORT_SYMBOL_GPL(switch_subbuf_default_callback);
 
 /**
  *	relay_subbufs_consumed - update the buffer's sub-buffers-consumed count
@@ -857,14 +860,14 @@ void relay_flush(struct rchan *chan)
 		return;
 
 	if (chan->is_global && chan->buf[0]) {
-		relay_switch_subbuf(chan->buf[0], 0);
+		chan->cb->switch_subbuf(chan->buf[0], 0);
 		return;
 	}
 
 	mutex_lock(&relay_channels_mutex);
 	for_each_possible_cpu(i)
 		if (chan->buf[i])
-			relay_switch_subbuf(chan->buf[i], 0);
+			chan->cb->switch_subbuf(chan->buf[i], 0);
 	mutex_unlock(&relay_channels_mutex);
 }
 EXPORT_SYMBOL_GPL(relay_flush);




* Re: [PATCH 2/3] relay - make subbuf switch replaceable
  2008-09-23  5:27       ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi
@ 2008-09-23 20:17         ` Andrew Morton
  0 siblings, 0 replies; 122+ messages in thread
From: Andrew Morton @ 2008-09-23 20:17 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx,
	compudj, rostedt, od, fche, hch, dwilder

On Tue, 23 Sep 2008 00:27:30 -0500
Tom Zanussi <zanussi@comcast.net> wrote:

> Make the relay sub-buffer switch code replaceable.
>     
> With this patch, tracers now have complete control over the relay
> write (or reserve) path if they choose to do so, by implementing their
> own version of the sub-buffer switch function (switch_subbuf()), in
> addition to their own local write/reserve functions.  Tracers who
> choose not to do so automatically default to the normal behavior.
> 
>
> ...
>
> -EXPORT_SYMBOL_GPL(relay_switch_subbuf);
> +EXPORT_SYMBOL_GPL(switch_subbuf_default_callback);

It would be nice to keep the `relay_' prefix on the exported relay
interface.  Something called `switch_subbuf_default_callback' could
belong pretty much anywhere in the kernel.




* [PATCH 3/3] relay - add channel flags
  2008-09-22 14:45     ` Peter Zijlstra
                         ` (4 preceding siblings ...)
  2008-09-23  5:27       ` [PATCH 2/3] relay - make subbuf switch replaceable Tom Zanussi
@ 2008-09-23  5:27       ` Tom Zanussi
  2008-09-23 20:20         ` Andrew Morton
  5 siblings, 1 reply; 122+ messages in thread
From: Tom Zanussi @ 2008-09-23  5:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: prasad, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, Andrew Morton, hch, David Wilder

Add channel flags to relay, remove global callback param.
    
relay should probably have had a flags param from the beginning; it
wasn't originally added because it wasn't originally needed - it
probably would have helped avoid some of the callback contortions
that were added due to a lack of flags.  This adds them and does a
small amount of low-hanging cleanup, and is also in preparation for
some new flags in future patches.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/block/blktrace.c b/block/blktrace.c
index eb9651c..150c5f7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry)
 static struct dentry *blk_create_buf_file_callback(const char *filename,
 						   struct dentry *parent,
 						   int mode,
-						   struct rchan_buf *buf,
-						   int *is_global)
+						   struct rchan_buf *buf)
 {
 	return debugfs_create_file(filename, mode, parent, buf,
 					&relay_file_operations);
@@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 		goto err;
 
 	bt->rchan = relay_open("trace", dir, buts->buf_size,
-				buts->buf_nr, &blk_relay_callbacks, bt);
+			       buts->buf_nr, &blk_relay_callbacks, bt, 0UL);
 	if (!bt->rchan)
 		goto err;
 
diff --git a/include/linux/relay.h b/include/linux/relay.h
index 52e4d61..18fd269 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -25,7 +25,13 @@
 /*
  * Tracks changes to rchan/rchan_buf structs
  */
-#define RELAYFS_CHANNEL_VERSION		7
+#define RELAYFS_CHANNEL_VERSION		8
+
+/*
+ * relay channel flags
+ */
+#define RCHAN_MODE_OVERWRITE		0x00000001	/* 'flight' mode */
+#define RCHAN_GLOBAL_BUFFER		0x00000002	/* not using per-cpu */
 
 /*
  * Per-cpu relay channel buffer
@@ -66,11 +72,11 @@ struct rchan
 	void *private_data;		/* for user-defined data */
 	size_t last_toobig;		/* tried to log event > subbuf size */
 	struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */
-	int is_global;			/* One global buffer ? */
 	struct list_head list;		/* for channel list */
 	struct dentry *parent;		/* parent dentry passed to open */
 	int has_base_filename;		/* has a filename associated? */
 	char base_filename[NAME_MAX];	/* saved base filename */
+	unsigned long flags;		/* relay flags for this channel */
 };
 
 /*
@@ -125,7 +131,6 @@ struct rchan_callbacks
 	 * @parent: the parent of the file to create
 	 * @mode: the mode of the file to create
 	 * @buf: the channel buffer
-	 * @is_global: outparam - set non-zero if the buffer should be global
 	 *
 	 * Called during relay_open(), once for each per-cpu buffer,
 	 * to allow the client to create a file to be used to
@@ -136,17 +141,12 @@ struct rchan_callbacks
 	 * The callback should return the dentry of the file created
 	 * to represent the relay buffer.
 	 *
-	 * Setting the is_global outparam to a non-zero value will
-	 * cause relay_open() to create a single global buffer rather
-	 * than the default set of per-cpu buffers.
-	 *
 	 * See Documentation/filesystems/relayfs.txt for more info.
 	 */
 	struct dentry *(*create_buf_file)(const char *filename,
 					  struct dentry *parent,
 					  int mode,
-					  struct rchan_buf *buf,
-					  int *is_global);
+					  struct rchan_buf *buf);
 
 	/*
 	 * remove_buf_file - remove file representing a relay channel buffer
@@ -191,7 +191,8 @@ struct rchan *relay_open(const char *base_filename,
 			 size_t subbuf_size,
 			 size_t n_subbufs,
 			 struct rchan_callbacks *cb,
-			 void *private_data);
+			 void *private_data,
+			 unsigned long rchan_flags);
 extern int relay_late_setup_files(struct rchan *chan,
 				  const char *base_filename,
 				  struct dentry *parent);
diff --git a/kernel/relay.c b/kernel/relay.c
index e299f49..a2e06b0 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -310,8 +310,7 @@ static void buf_unmapped_default_callback(struct rchan_buf *buf,
 static struct dentry *create_buf_file_default_callback(const char *filename,
 						       struct dentry *parent,
 						       int mode,
-						       struct rchan_buf *buf,
-						       int *is_global)
+						       struct rchan_buf *buf)
 {
 	return NULL;
 }
@@ -411,7 +410,7 @@ void relay_reset(struct rchan *chan)
 	if (!chan)
 		return;
 
-	if (chan->is_global && chan->buf[0]) {
+	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) {
 		__relay_reset(chan->buf[0], 0);
 		return;
 	}
@@ -445,8 +444,7 @@ static struct dentry *relay_create_buf_file(struct rchan *chan,
 
 	/* Create file in fs */
 	dentry = chan->cb->create_buf_file(tmpname, chan->parent,
-					   S_IRUSR, buf,
-					   &chan->is_global);
+					   S_IRUSR, buf);
 
 	kfree(tmpname);
 
@@ -463,7 +461,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
  	struct rchan_buf *buf = NULL;
 	struct dentry *dentry;
 
- 	if (chan->is_global)
+	if (chan->flags & RCHAN_GLOBAL_BUFFER)
 		return chan->buf[0];
 
 	buf = relay_create_buf(chan);
@@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
  	buf->cpu = cpu;
  	__relay_reset(buf, 1);
 
- 	if(chan->is_global) {
+	if(chan->flags & RCHAN_GLOBAL_BUFFER) {
  		chan->buf[0] = buf;
  		buf->cpu = 0;
   	}
@@ -595,7 +593,8 @@ struct rchan *relay_open(const char *base_filename,
 			 size_t subbuf_size,
 			 size_t n_subbufs,
 			 struct rchan_callbacks *cb,
-			 void *private_data)
+			 void *private_data,
+			 unsigned long rchan_flags)
 {
 	unsigned int i;
 	struct rchan *chan;
@@ -612,6 +611,7 @@ struct rchan *relay_open(const char *base_filename,
 	chan->subbuf_size = subbuf_size;
 	chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs);
 	chan->parent = parent;
+	chan->flags = rchan_flags;
 	chan->private_data = private_data;
 	if (base_filename) {
 		chan->has_base_filename = 1;
@@ -828,7 +828,7 @@ void relay_close(struct rchan *chan)
 		return;
 
 	mutex_lock(&relay_channels_mutex);
-	if (chan->is_global && chan->buf[0])
+	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0])
 		relay_close_buf(chan->buf[0]);
 	else
 		for_each_possible_cpu(i)
@@ -859,7 +859,7 @@ void relay_flush(struct rchan *chan)
 	if (!chan)
 		return;
 
-	if (chan->is_global && chan->buf[0]) {
+	if (chan->flags & RCHAN_GLOBAL_BUFFER && chan->buf[0]) {
 		chan->cb->switch_subbuf(chan->buf[0], 0);
 		return;
 	}
diff --git a/virt/kvm/kvm_trace.c b/virt/kvm/kvm_trace.c
index 58141f3..d0a9e1c 100644
--- a/virt/kvm/kvm_trace.c
+++ b/virt/kvm/kvm_trace.c
@@ -130,10 +130,9 @@ static int kvm_subbuf_start_callback(struct rchan_buf *buf, void *subbuf,
 }
 
 static struct dentry *kvm_create_buf_file_callack(const char *filename,
-						 struct dentry *parent,
-						 int mode,
-						 struct rchan_buf *buf,
-						 int *is_global)
+						  struct dentry *parent,
+						  int mode,
+						  struct rchan_buf *buf)
 {
 	return debugfs_create_file(filename, mode, parent, buf,
 				   &relay_file_operations);
@@ -171,7 +170,7 @@ static int do_kvm_trace_enable(struct kvm_user_trace_setup *kuts)
 		goto err;
 
 	kt->rchan = relay_open("trace", kvm_debugfs_dir, kuts->buf_size,
-				kuts->buf_nr, &kvm_relay_callbacks, kt);
+			       kuts->buf_nr, &kvm_relay_callbacks, kt, 0UL);
 	if (!kt->rchan)
 		goto err;
 




* Re: [PATCH 3/3] relay - add channel flags
  2008-09-23  5:27       ` [PATCH 3/3] relay - add channel flags Tom Zanussi
@ 2008-09-23 20:20         ` Andrew Morton
  2008-09-24  3:57           ` Tom Zanussi
  0 siblings, 1 reply; 122+ messages in thread
From: Andrew Morton @ 2008-09-23 20:20 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx,
	compudj, rostedt, od, fche, hch, dwilder

On Tue, 23 Sep 2008 00:27:56 -0500
Tom Zanussi <zanussi@comcast.net> wrote:

> Add channel flags to relay, remove global callback param.
>     
> relay should probably have had a flags param from the beginning; it
> wasn't originally added because it wasn't originally needed - it
> probably would have helped avoid some of the callback contortions
> that were added due to a lack of flags.  This adds them and does a
> small amount of low-hanging cleanup, and is also in preparation for
> some new flags in future patches.
> 
> Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
> 
> diff --git a/block/blktrace.c b/block/blktrace.c
> index eb9651c..150c5f7 100644
> --- a/block/blktrace.c
> +++ b/block/blktrace.c
> @@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry)
>  static struct dentry *blk_create_buf_file_callback(const char *filename,
>  						   struct dentry *parent,
>  						   int mode,
> -						   struct rchan_buf *buf,
> -						   int *is_global)
> +						   struct rchan_buf *buf)
>  {
>  	return debugfs_create_file(filename, mode, parent, buf,
>  					&relay_file_operations);
> @@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>  		goto err;
>  
>  	bt->rchan = relay_open("trace", dir, buts->buf_size,
> -				buts->buf_nr, &blk_relay_callbacks, bt);
> +			       buts->buf_nr, &blk_relay_callbacks, bt, 0UL);
>  	if (!bt->rchan)
>  		goto err;
>  
> diff --git a/include/linux/relay.h b/include/linux/relay.h
> index 52e4d61..18fd269 100644
> --- a/include/linux/relay.h
> +++ b/include/linux/relay.h
> @@ -25,7 +25,13 @@
>  /*
>   * Tracks changes to rchan/rchan_buf structs
>   */
> -#define RELAYFS_CHANNEL_VERSION		7
> +#define RELAYFS_CHANNEL_VERSION		8

What is the significance of this change?  Does it affect the
kernel<->userspace interface?  Is it back-compatible with existing
userspace?


> +/*
> + * relay channel flags
> + */
> +#define RCHAN_MODE_OVERWRITE		0x00000001	/* 'flight' mode */
> +#define RCHAN_GLOBAL_BUFFER		0x00000002	/* not using per-cpu */
>  
>
> ...
>
> @@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
>   	buf->cpu = cpu;
>   	__relay_reset(buf, 1);
>  
> - 	if(chan->is_global) {
> +	if(chan->flags & RCHAN_GLOBAL_BUFFER) {
	
Please use checkpatch.  It's a little thing, but it's so easy to fix..

>   		chan->buf[0] = buf;
>   		buf->cpu = 0;
>    	}
>
> ...


* Re: [PATCH 3/3] relay - add channel flags
  2008-09-23 20:20         ` Andrew Morton
@ 2008-09-24  3:57           ` Tom Zanussi
  0 siblings, 0 replies; 122+ messages in thread
From: Tom Zanussi @ 2008-09-24  3:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: a.p.zijlstra, prasad, mbligh, linux-kernel, torvalds, tglx,
	compudj, rostedt, od, fche, hch, dwilder

Thanks for the comments on these patches - I'll include the changes in
the next go-round.

On Tue, 2008-09-23 at 13:20 -0700, Andrew Morton wrote:
> On Tue, 23 Sep 2008 00:27:56 -0500
> Tom Zanussi <zanussi@comcast.net> wrote:
> 
> > Add channel flags to relay, remove global callback param.
> >     
> > relay should probably have had a flags param from the beginning; it
> > wasn't originally added because it wasn't originally needed - it
> > probably would have helped avoid some of the callback contortions
> > that were added due to a lack of flags.  This adds them and does a
> > small amount of low-hanging cleanup, and is also in preparation for
> > some new flags in future patches.
> > 
> > Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
> > 
> > diff --git a/block/blktrace.c b/block/blktrace.c
> > index eb9651c..150c5f7 100644
> > --- a/block/blktrace.c
> > +++ b/block/blktrace.c
> > @@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry)
> >  static struct dentry *blk_create_buf_file_callback(const char *filename,
> >  						   struct dentry *parent,
> >  						   int mode,
> > -						   struct rchan_buf *buf,
> > -						   int *is_global)
> > +						   struct rchan_buf *buf)
> >  {
> >  	return debugfs_create_file(filename, mode, parent, buf,
> >  					&relay_file_operations);
> > @@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >  		goto err;
> >  
> >  	bt->rchan = relay_open("trace", dir, buts->buf_size,
> > -				buts->buf_nr, &blk_relay_callbacks, bt);
> > +			       buts->buf_nr, &blk_relay_callbacks, bt, 0UL);
> >  	if (!bt->rchan)
> >  		goto err;
> >  
> > diff --git a/include/linux/relay.h b/include/linux/relay.h
> > index 52e4d61..18fd269 100644
> > --- a/include/linux/relay.h
> > +++ b/include/linux/relay.h
> > @@ -25,7 +25,13 @@
> >  /*
> >   * Tracks changes to rchan/rchan_buf structs
> >   */
> > -#define RELAYFS_CHANNEL_VERSION		7
> > +#define RELAYFS_CHANNEL_VERSION		8
> 
> What is the significance of this change?  Does it affect the
> kernel<->userspace interface?  Is it back-compatible with existing
> userspace?
> 

No, nothing to do with the kernel-userspace interface.  The channel
version is included in the channel struct and was meant as an aid in
deciphering channel data in crash dumps.

Tom

> 
> > +/*
> > + * relay channel flags
> > + */
> > +#define RCHAN_MODE_OVERWRITE		0x00000001	/* 'flight' mode */
> > +#define RCHAN_GLOBAL_BUFFER		0x00000002	/* not using per-cpu */
> >  
> >
> > ...
> >
> > @@ -480,7 +478,7 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
> >   	buf->cpu = cpu;
> >   	__relay_reset(buf, 1);
> >  
> > - 	if(chan->is_global) {
> > +	if(chan->flags & RCHAN_GLOBAL_BUFFER) {
> 	
> Please use checkpatch.  It's a little thing, but it's so easy to fix..
> 
> >   		chan->buf[0] = buf;
> >   		buf->cpu = 0;
> >    	}
> >
> > ...



* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
                   ` (3 preceding siblings ...)
  2008-09-20  0:07 ` Peter Zijlstra
@ 2008-09-20  0:26 ` Marcel Holtmann
  2008-09-20  9:03 ` Steven Rostedt
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 122+ messages in thread
From: Marcel Holtmann @ 2008-09-20  0:26 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

Hi Martin,

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.

I talked with Thomas and Steven about it during lunch, and we might also
want to use it for sniffing/monitoring/capturing efforts. Inaky and I
talked about unifying things like usbmon, hcidump, tcpdump etc. with a
common interface into the kernel. Currently every subsystem does it
differently. Especially when it came to adding monitor support to SDIO, we
ran into the problem that we don't want to just invent another interface.

Right now I haven't spent much time on this and regret that I didn't sit
together with you guys yesterday, but it was not that high a priority on
my list of nice things to have. Anyway, since I am mostly interested in
subsystems that copy a lot of packets around, it would be nice to only
reference the data, for example for structures like SKBs and URBs, and
only have to copy the data if we have a consumer. My thoughts :)

Regards

Marcel


* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
                   ` (4 preceding siblings ...)
  2008-09-20  0:26 ` Unified tracing buffer Marcel Holtmann
@ 2008-09-20  9:03 ` Steven Rostedt
  2008-09-20 13:55   ` Mathieu Desnoyers
  2008-09-22  9:57   ` Peter Zijlstra
  2008-09-22 13:57 ` K.Prasad
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-20  9:03 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, od, Frank Ch. Eigler



Martin,

First I'd like to express my appreciation to you for writing this up. Not
only that, but for being the one person keeping us from killing each
other ;-)


On Fri, 19 Sep 2008, Martin Bligh wrote:

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
> 
> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.

Yes, we kept the chairs on the floor the whole time.

> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
> 
> Sketch of design.  Enjoy flaming me. Code will follow shortly.
> 
> 
> STORAGE
> -------
> 
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.

I don't remember talking about the "spool to disk" mode for version 1.
Do we still want to do this? I thought we would have an overwrite mode
(flight data recorder), and a "throw all new data away when the producer
fills the buffer before the consumer takes it" mode.
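The two policies can be sketched on a tiny single-threaded ring, purely as an illustration: in overwrite ("flight recorder") mode a full buffer drops its oldest event, in discard mode it drops the new one. All `demo_` names and the fixed four-slot size are assumptions, not part of any proposed interface.

```c
#include <stddef.h>

/* Hypothetical policy selector for a full buffer. */
enum demo_mode {
	DEMO_OVERWRITE,   /* flight recorder: newest data wins */
	DEMO_DISCARD      /* producer loses new data until the consumer catches up */
};

#define DEMO_SLOTS 4

struct demo_ring {
	int slot[DEMO_SLOTS];
	size_t head;          /* total events written */
	size_t tail;          /* total events consumed or overwritten */
	enum demo_mode mode;
};

/* Returns 1 if the event was stored, 0 if it was thrown away. */
static int demo_ring_write(struct demo_ring *r, int ev)
{
	if (r->head - r->tail == DEMO_SLOTS) {
		if (r->mode == DEMO_DISCARD)
			return 0;
		r->tail++;        /* overwrite mode: give up the oldest event */
	}
	r->slot[r->head % DEMO_SLOTS] = ev;
	r->head++;
	return 1;
}

/* Returns 1 and stores the oldest event in *ev, or 0 if the ring is empty. */
static int demo_ring_read(struct demo_ring *r, int *ev)
{
	if (r->head == r->tail)
		return 0;
	*ev = r->slot[r->tail % DEMO_SLOTS];
	r->tail++;
	return 1;
}
```

Writing six events into each flavour leaves events 3..6 in the overwrite ring and events 1..4 in the discard ring.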

> 
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.

As stated in another email, "8 byte aligned" words should be fine.
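As a small illustration of what 8-byte alignment means for record sizes, a record length can be rounded up to the next multiple of eight with the usual mask trick. The macro and function names below are hypothetical.

```c
#include <stddef.h>

/* Assumed record alignment in bytes; must be a power of two for the mask. */
#define DEMO_RECORD_ALIGN 8

/* Round a record length up to the next multiple of DEMO_RECORD_ALIGN. */
static size_t demo_record_size(size_t len)
{
	return (len + DEMO_RECORD_ALIGN - 1) & ~(size_t)(DEMO_RECORD_ALIGN - 1);
}
```

With this rounding a 13-byte record occupies 16 bytes, so the record after it starts on an 8-byte boundary again.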

> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> 
> 
> INPUT_FUNCTIONS
> ---------------
> 
> allocate_buffer (name, size)
>         return buffer_handle
> 
> register_event (buffer_handle, event_id, print_function)
>         You can pass in a requested event_id from a fixed set, and
> will be given it, or an error
>         0 means allocate me one dynamically
>         returns event_id     (or -E_ERROR)
> 
> record_event (buffer_handle, event_id, length, *buf)

I was talking with Thomas about this, and we probably want (and I'm sure 
Mathieu and others would agree), a...

  event_handle = reserve_event(buffer_handle, event_id, length)

as well as a..

  commit_event(event_handle).
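A rough sketch of how such a reserve/commit pair could look, for a single writer and without wrap-around or locking: the writer reserves space, fills the payload in place, and only the commit makes it visible to readers. The `demo_` names, the header layout and the 64-byte buffer are assumptions for illustration only.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 64

/* Hypothetical per-record header: event id followed by payload length. */
struct demo_header {
	unsigned short id;
	unsigned short length;
};

struct demo_buffer {
	unsigned char data[DEMO_BUF_SIZE];
	size_t reserved;      /* bytes handed out to the writer */
	size_t committed;     /* bytes visible to readers */
};

/*
 * Reserve room for one event and return a pointer to its payload area,
 * or NULL if the record does not fit.  Records are padded to 8 bytes.
 */
static void *demo_reserve_event(struct demo_buffer *b, unsigned short id,
				unsigned short length)
{
	struct demo_header hdr = { id, length };
	size_t total = (sizeof(hdr) + length + 7) & ~(size_t)7;
	unsigned char *rec;

	if (total > DEMO_BUF_SIZE - b->reserved)
		return NULL;

	rec = b->data + b->reserved;
	memcpy(rec, &hdr, sizeof(hdr));   /* header is written at reserve time */
	b->reserved += total;
	return rec + sizeof(hdr);
}

/* Publish everything reserved so far; readers only look below 'committed'. */
static void demo_commit_event(struct demo_buffer *b)
{
	b->committed = b->reserved;
}
```

The point of the split is that the caller writes its data straight into the buffer instead of building the record elsewhere and copying it in, which is what record_event (buffer_handle, event_id, length, *buf) would require.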


Oh, and all commands should start with the namespace.

  ring_buffer_alloc()
  ring_buffer_free()
  ring_buffer_record_event()

  etc.

> 
> 
> OUTPUT
> ------
> 
> Data will be output via debugfs, and provide the following output streams:
> 
> /debugfs/tracing/<name>/buffers/text
>     clear text stream (will merge the per-cpu streams via insertion
> sort, and use the print functions)
> 
> /debugfs/tracing/<name>/buffers/binary[cpu_number]
>     per-cpu binary data

Ah, I thought we were going to have:

  /debugfs/tracing/buffers/<name>/<buffer crap>

and each tracer have

  /debugfs/tracing/<name>/<trace command crap>

This way we can easily see all the buffers in one place that are allocated
without having to see a tracer name first.

The reason I like the layout I propose is that a utility that needs to read
all the buffers doesn't need to go into directories that don't even have
buffers. Not all tracers will allocate a buffer.


> 
> 
> CONTROL
> -------
> 
> Sysfs style tree under debugfs
> 
> /debugfs/tracing/<name>/buffers/enabed         <--- binary value
> 
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
>     etc ...

I wonder if we should make this another sub dir:

 /debugfs/tracing/buffers/events/<event-name>


>     provides a way to enable/disable events, see what's available, and
> what's enabled.
> 
> 
> KNOWN ISSUES / PLANS
> -------------------
> 
> No way to unregister buffers and events.
>     Will provide an unregister_buffer and unregister_event call

I can see registering events, but shouldn't we "allocate" buffers?

> 
> 
> Generating systemwide time is hard on some platforms
>     Yes. Time-based output provides a lot of simplicity for the user though
>     We won't support these platforms at first, we'll add functionality
> to make it work for them later.
>     (plan based on tick-based ms timing, plus counter offset from that
> if needed).
> 
> Spinlock_irq is ineffecient, and doesn't support tracing in NMIs
>     True. We'll implement a lockless scheme later (see lttng)
> 
> Putting a length record in every event is inefficient
>     True. Fixed record length with optional extensions is better, but
> more complex. v2.
> 
> Putting a full timestamp rather than an offset in every event is inefficient
>     See above. True, but v2.
> 
> Relayfs already exists! use that!
>     People were universally not keen on that idea. Complexity, interface, etc.
>     We're also providing some higher level shared functions for time &
> event ids.
> 
> There's no way to decode the binary data stream
>     Code will be shared from the kernel to decode it, so that we can
> get the compact binary
>     format and decode it later. That code will be kept in the kernel
> tree (it's a trivial piece of C).
>     Version 1.1 ;-)
> 

Sounds good,

Thanks!

-- Steve



* Re: Unified tracing buffer
  2008-09-20  9:03 ` Steven Rostedt
@ 2008-09-20 13:55   ` Mathieu Desnoyers
  2008-09-20 14:12     ` Arjan van de Ven
  2008-09-22  3:09     ` KOSAKI Motohiro
  2008-09-22  9:57   ` Peter Zijlstra
  1 sibling, 2 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-20 13:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, od, Frank Ch. Eigler

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> 
> Martin,
> 
> First I like to express my appreciation to you for writing this up. Not 
> only that, but being the one person from keeping us from killing each 
> other ;-)
> 
> 
> On Fri, 19 Sep 2008, Martin Bligh wrote:
> 
> > During kernel summit and Plumbers conference, Linus and others
> > expressed a desire for a unified
> > tracing buffer system for multiple tracing applications (eg ftrace,
> > lttng, systemtap, blktrace, etc) to use.
> > This provides several advantages, including the ability to interleave
> > data from multiple sources,
> > not having to learn 200 different tools, duplicated code/effort, etc.
> > 
> > Several of us got together last night and tried to cut this down to
> > the simplest usable system
> > we could agree on (and nobody got hurt!). This will form version 1.
> 
> Yes, we kept the chairs on the floor the whole time.
> 

Yes, they were too heavy. ;)

> > I've sketched out a few
> > enhancements we know that we want, but have agreed to leave these
> > until version 2.
> > The answer to most questions about the below is "yes we know, we'll
> > fix that in version 2"
> > (or 3). Simplicity was the rule ...
> > 
> > Sketch of design.  Enjoy flaming me. Code will follow shortly.
> > 
> > 
> > STORAGE
> > -------
> > 
> > We will support multiple buffers for different tracing systems, with
> > separate names, event id spaces.
> > Event ids are 16 bit, dynamically allocated.
> > A "one line of text" print function will be provided for each event,
> > or use the default (probably hex printf)
> > Will provide a "flight data recorder" mode, and a "spool to disk" mode.
> 
> I don't remember talking about the "spool to disk" for version 1.
> We still want to do this? I thought we would have overwrite mode (flight
> data record), and a "throw all new data away when the producer fills the 
> buffer before the consumer takes" mode.
> 

Yes, I think the spool to disk mode will be the default mode needed by
a large number of people who want to stream data out continuously. The flight
recorder is needed mostly for event backlog analysis. I think we have to
provide both.
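To make the difference between the two policies concrete, here is a rough userspace sketch. It is an illustration only, not proposed kernel code: the names (struct rb, rb_write, rb_read, RB_OVERWRITE, RB_DISCARD) and the 4-slot size are made up, and there is no locking. Overwrite mode is the flight recorder; discard mode is what a spooling consumer would sit behind, dropping new events when the reader falls behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; single-threaded sketch, no locking. */
enum rb_mode { RB_OVERWRITE, RB_DISCARD };

#define RB_SLOTS 4

struct rb {
	enum rb_mode mode;
	uint32_t slot[RB_SLOTS];
	size_t head;		/* next write position */
	size_t count;		/* number of unread entries */
	unsigned long lost;	/* events dropped (old or new) */
};

/* Returns 0 if the event was stored, -1 if it was thrown away. */
static int rb_write(struct rb *rb, uint32_t ev)
{
	if (rb->count == RB_SLOTS) {
		rb->lost++;
		if (rb->mode == RB_DISCARD)
			return -1;	/* keep old data, drop the new event */
		rb->count--;		/* flight recorder: forget the oldest */
	}
	rb->slot[rb->head] = ev;
	rb->head = (rb->head + 1) % RB_SLOTS;
	rb->count++;
	return 0;
}

/* Returns 0 and fills *ev with the oldest unread event, -1 if empty. */
static int rb_read(struct rb *rb, uint32_t *ev)
{
	size_t tail;

	assert(rb->count <= RB_SLOTS);
	if (rb->count == 0)
		return -1;
	tail = (rb->head + RB_SLOTS - rb->count) % RB_SLOTS;
	*ev = rb->slot[tail];
	rb->count--;
	return 0;
}
```

After writing six events into each kind of buffer, the flight recorder holds the last four and the discard buffer holds the first four; both report two lost events.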

> > 
> > Circular buffer per cpu, protected by per-cpu spinlock_irq
> > Word aligned records.
> 
> As stated in another email "8 byte aligned" words should be fine.
> 

It's also easy to be sizeof(void *) aligned, as long as we export
sizeof(void *) in the buffer header so we keep portability. But we can
keep that for v2. It's also good to write a magic number in the trace
header to auto-detect endianness.
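Something along these lines is what I have in mind for the header. This is only a sketch under assumptions: the struct layout, the magic value and all the names (trace_header, TRACE_MAGIC, trace_header_check, trace_align) are invented for the example, and ptr_size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic; any value that changes when byte-swapped works. */
#define TRACE_MAGIC 0x54524342u

struct trace_header {
	uint32_t magic;		/* written in the writer's byte order */
	uint8_t ptr_size;	/* sizeof(void *) on the traced machine */
};

enum trace_endian { TRACE_NATIVE, TRACE_SWAPPED, TRACE_INVALID };

static uint32_t swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Writer side: record magic and pointer size. */
static void trace_header_init(struct trace_header *h)
{
	h->magic = TRACE_MAGIC;
	h->ptr_size = (uint8_t)sizeof(void *);
}

/* Reader side: decide whether the trace needs byte swapping. */
static enum trace_endian trace_header_check(const struct trace_header *h)
{
	if (h->magic == TRACE_MAGIC)
		return TRACE_NATIVE;
	if (swab32(h->magic) == TRACE_MAGIC)
		return TRACE_SWAPPED;
	return TRACE_INVALID;
}

/* Round a record offset up to the writer's alignment (power of two). */
static size_t trace_align(size_t off, uint8_t ptr_size)
{
	assert(ptr_size != 0 && (ptr_size & (ptr_size - 1)) == 0);
	return (off + ptr_size - 1) & ~((size_t)ptr_size - 1);
}
```

A decoder reads the magic first, picks the byte order, and then uses ptr_size from the header rather than its own sizeof(void *) when walking records.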

> > Variable record length, header will start with length record.
> > Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> > 
> > 
> > INPUT_FUNCTIONS
> > ---------------
> > 
> > allocate_buffer (name, size)
> >         return buffer_handle
> > 
> > register_event (buffer_handle, event_id, print_function)
> >         You can pass in a requested event_id from a fixed set, and
> > will be given it, or an error
> >         0 means allocate me one dynamically
> >         returns event_id     (or -E_ERROR)
> > 
> > record_event (buffer_handle, event_id, length, *buf)
> 
> I was talking with Thomas about this, and we probably want (and I'm sure 
> Mathieu and others would agree), a...
> 
>   event_handle = reserve_event(buffer_handle, event_id, length)
> 
> as well as a..
> 
>   comit_event(event_handle).
> 

How about :

  trace_mark(ftrace_evname, "size %lu binary %pW",
    sizeof(mystruct), mystruct);
  or
  trace_mark(sched_wakeup, "target_pid %ld", task->pid);

Note the namespacing with buffers being "ftrace" and "sched" here.

That would encapsulate the whole
  - Event ID registration
  - Event type registration
  - Sending data out
  - Enabling the event source directly at the source

We can then export the markers through a debugfs file and let userland
enable them one by one and possibly connect systemtap filters on them
(one table of registered filters, one table for the markers, a command
file to connect/disconnect filters to/from markers).
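As a userspace illustration of the "one table for the markers" part, a sketch could look like the following. It is not the markers implementation: struct marker, marker_buffer_len and marker_set_enabled are hypothetical names, and splitting the buffer name at the first underscore is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker table entry: "<buffer>_<event>" plus format. */
struct marker {
	const char *name;
	const char *format;
	int enabled;
};

static struct marker markers[] = {
	{ "ftrace_evname", "size %lu binary %pW", 0 },
	{ "sched_wakeup",  "target_pid %ld",      0 },
};

#define NR_MARKERS (sizeof(markers) / sizeof(markers[0]))

/* Length of the buffer prefix before the first '_', or 0 if none. */
static size_t marker_buffer_len(const char *name)
{
	const char *sep = strchr(name, '_');

	return sep ? (size_t)(sep - name) : 0;
}

/* What a write to the debugfs control file might boil down to. */
static int marker_set_enabled(const char *name, int on)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < NR_MARKERS; i++) {
		if (strcmp(markers[i].name, name) == 0) {
			markers[i].enabled = !!on;
			return 0;
		}
	}
	return -1;	/* no such marker */
}
```

Enabling "sched_wakeup" flips only that entry; an unknown name returns an error instead of creating a marker.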

> 
> Oh, and all commands should start with the namespace.
> 
>   ring_buffer_alloc()
>   ring_buffer_free()
>   ring_buffer_record_event()
> 

We could even rename markers if required, I don't really care. e.g. :
  trace_mark -> ring_buffer_record_event()
  but note that this would contain all the event ID registration.

>   etc.
> 
> > 
> > 
> > OUTPUT
> > ------
> > 
> > Data will be output via debugfs, and provide the following output streams:
> > 
> > /debugfs/tracing/<name>/buffers/text
> >     clear text stream (will merge the per-cpu streams via insertion
> > sort, and use the print functions)
> > 
> > /debugfs/tracing/<name>/buffers/binary[cpu_number]
> >     per-cpu binary data
> 
> Ah, I thought we were going to have:
> 
>   /debugfs/tracing/buffers/<name>/<buffer crap>
> 
> and each tracer have
> 
>   /debugfs/tracing/<name>/<trace command crap>
> 
> This way we can easily see all the buffers in one place that are allocated
> without having to see a tracer name first.
> 
> The reason I like the way I propose, is that a utility that needs to read 
> all the buffers, doesn't need to go into directories that don't even have 
> buffers. Not all tracers will allocate a buffer.
> 

People can still do ls debugfs/tracing/*/buffers/. But yes, we did agree
on having the buffers/ subdir outside of the "trace command crap". It
makes the buffers easier to see in the directory tree, and makes it
clear that those buffers can be used by other users than the actual
tracer that controls their input.

> 
> > 
> > 
> > CONTROL
> > -------
> > 
> > Sysfs style tree under debugfs
> > 
> > /debugfs/tracing/<name>/buffers/enabed         <--- binary value
> > 
> > /debugfs/tracing/<name>/<event1>
> > /debugfs/tracing/<name>/<event2>
> >     etc ...
> 
> I wonder if we should make this another sub dir:
> 
>  /debugfs/tracing/buffers/events/<event-name>
> 

Sure.

If needed, we could change the markers to take two separate parameters :

trace_mark(tracer_name, event_name, "format", args)

Mathieu

> 
> >     provides a way to enable/disable events, see what's available, and
> > what's enabled.
> > 
> > 
> > KNOWN ISSUES / PLANS
> > -------------------
> > 
> > No way to unregister buffers and events.
> >     Will provide an unregister_buffer and unregister_event call
> 
> I can see registering events, but shouldn't we "allocate" buffers?
> 
> > 
> > 
> > Generating systemwide time is hard on some platforms
> >     Yes. Time-based output provides a lot of simplicity for the user though
> >     We won't support these platforms at first, we'll add functionality
> > to make it work for them later.
> >     (plan based on tick-based ms timing, plus counter offset from that
> > if needed).
> > 
> > Spinlock_irq is ineffecient, and doesn't support tracing in NMIs
> >     True. We'll implement a lockless scheme later (see lttng)
> > 
> > Putting a length record in every event is inefficient
> >     True. Fixed record length with optional extensions is better, but
> > more complex. v2.
> > 
> > Putting a full timestamp rather than an offset in every event is inefficient
> >     See above. True, but v2.
> > 
> > Relayfs already exists! use that!
> >     People were universally not keen on that idea. Complexity, interface, etc.
> >     We're also providing some higher level shared functions for time &
> > event ids.
> > 
> > There's no way to decode the binary data stream
> >     Code will be shared from the kernel to decode it, so that we can
> > get the compact binary
> >     format and decode it later. That code will be kept in the kernel
> > tree (it's a trivial piece of C).
> >     Version 1.1 ;-)
> > 
> 
> Sounds good,
> 
> Thanks!
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-20 13:55   ` Mathieu Desnoyers
@ 2008-09-20 14:12     ` Arjan van de Ven
  2008-09-22 18:52       ` Mathieu Desnoyers
  2008-09-22  3:09     ` KOSAKI Motohiro
  1 sibling, 1 reply; 122+ messages in thread
From: Arjan van de Ven @ 2008-09-20 14:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

On Sat, 20 Sep 2008 09:55:48 -0400
Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:

> How about :
> 
>   trace_mark(ftrace_evname, "size %lu binary %pW",
>     sizeof(mystruct), mystruct);
>   or
>   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> 
> Note the namespacing with buffers being "ftrace" and "sched" here.
> 
> That would encapsulate the whole
>   - Event ID registration
>   - Event type registration
>   - Sending data out
>   - Enabling the event source directly at the source
> 
> We can then export the markers through a debugfs file and let userland
> enable them one by one and possibly connect systemtap filters on them
> (one table of registered filters, one table for the markers, a command
> file to connect/disconnect filters to/from markers).

I would like to ask for the following from the start: have a field for
a longer description of the marker that describes its usage and
context. Getting this there from the start is critical, because only
when adding the marker point do people still really remember why/what
(and having to type a good description also helps them to realize if
this is the right point or not). This can then be exposed to the user
so he has a standing chance of knowing what the marker is about.

It also has a standing chance of being updated when the code changes
this way.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: Unified tracing buffer
  2008-09-20 14:12     ` Arjan van de Ven
@ 2008-09-22 18:52       ` Mathieu Desnoyers
  2008-10-02 15:28         ` Jason Baron
  0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-22 18:52 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Steven Rostedt, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

* Arjan van de Ven (arjan@infradead.org) wrote:
> On Sat, 20 Sep 2008 09:55:48 -0400
> Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:
> 
> > How about :
> > 
> >   trace_mark(ftrace_evname, "size %lu binary %pW",
> >     sizeof(mystruct), mystruct);
> >   or
> >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > 
> > Note the namespacing with buffers being "ftrace" and "sched" here.
> > 
> > That would encapsulate the whole
> >   - Event ID registration
> >   - Event type registration
> >   - Sending data out
> >   - Enabling the event source directly at the source
> > 
> > We can then export the markers through a debugfs file and let userland
> > enable them one by one and possibly connect systemtap filters on them
> > (one table of registered filters, one table for the markers, a command
> > file to connect/disconnect filters to/from markers).
> 
> I would like to ask for the following from the start: have a field for
> a longer description of the marker that describes it's usage and
> context. Getting this there from the start is critical, because only
> when adding the marker point do people still really remember why/what
> (and having to type a good description also helps them to realize if
> this is the right point or not). This can then be exposed to the user
> so he has a standing chance of knowing what the marker is about.
> 
> It also has a standing chance of being updated when the code changes
> this way
> 

I agree, and I think it might be required in both markers and
tracepoints.

Given that tracepoints are declared in a global header
(DECLARE_TRACE()), I would add this kind of description here. Tracepoint
uses within the kernel code (statements like :
  trace_sched_switch(prev, next);
added to the scheduler) would therefore be tied to the description
without having to contain it in the core kernel code.
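Roughly, the declaration could carry the text like this. This is a userspace sketch only: DECLARE_TRACE_DESC, struct tracepoint_info and tracepoint_describe are hypothetical names, not the existing DECLARE_TRACE() macro, and the description string is an invented example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-tracepoint metadata kept next to the declaration. */
struct tracepoint_info {
	const char *name;
	const char *desc;
};

/* Made-up macro: declares the metadata object for one tracepoint. */
#define DECLARE_TRACE_DESC(tpname, description)			\
	static const struct tracepoint_info tpinfo_##tpname = {	\
		.name = #tpname,				\
		.desc = description,				\
	}

/* Lives in the global header; the call site stays a one-liner. */
DECLARE_TRACE_DESC(sched_switch,
	"Fires when the scheduler switches from task prev to task next.");

/* What an interface exposing descriptions to the user would read. */
static const char *tracepoint_describe(const struct tracepoint_info *tp)
{
	assert(tp != NULL);
	return tp->desc;
}
```

The point is that the scheduler code still only contains the trace_sched_switch(prev, next) statement, while the long description sits with the declaration where it can be updated along with the prototype.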

Markers, on the other hand, could become the "event description"
interface which is exported to userspace. Considering that, I guess it's
as important to let a precise description follow the markers.

Mathieu


> -- 
> Arjan van de Ven 	Intel Open Source Technology Centre
> For development, discussion and tips for power savings, 
> visit http://www.lesswatts.org
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-22 18:52       ` Mathieu Desnoyers
@ 2008-10-02 15:28         ` Jason Baron
  2008-10-03 16:11           ` Mathieu Desnoyers
  0 siblings, 1 reply; 122+ messages in thread
From: Jason Baron @ 2008-10-02 15:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

On Mon, Sep 22, 2008 at 02:52:09PM -0400, Mathieu Desnoyers wrote:
> > On Sat, 20 Sep 2008 09:55:48 -0400
> > Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:
> > 
> > > How about :
> > > 
> > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > >     sizeof(mystruct), mystruct);
> > >   or
> > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > 
> > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > 
> > > That would encapsulate the whole
> > >   - Event ID registration
> > >   - Event type registration
> > >   - Sending data out
> > >   - Enabling the event source directly at the source
> > > 
> > > We can then export the markers through a debugfs file and let userland
> > > enable them one by one and possibly connect systemtap filters on them
> > > (one table of registered filters, one table for the markers, a command
> > > file to connect/disconnect filters to/from markers).
> > 
> > I would like to ask for the following from the start: have a field for
> > a longer description of the marker that describes it's usage and
> > context. Getting this there from the start is critical, because only
> > when adding the marker point do people still really remember why/what
> > (and having to type a good description also helps them to realize if
> > this is the right point or not). This can then be exposed to the user
> > so he has a standing chance of knowing what the marker is about.
> > 
> > It also has a standing chance of being updated when the code changes
> > this way
> > 
> 
> I agree, and I think it might be required in both markers and
> tracepoints.
> 
> Given that tracepoints are declared in a global header
> (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> uses within the kernel code (statements like :
>   trace_sched_switch(prev, next);
> added to the scheduler) would therefore be tied to the description
> without having to contain it in the core kernel code.
> 
> Markers, on the other hand, could become the "event description"
> interface which is exported to userspace. Considering that, I guess it's
> as important to let a precise description follow the markers.
> 
> Mathieu
> 
>

hi,

Tracepoints and markers seem to both have their place, with tracepoints
being integral to kernel users, and markers being important for
userspace. However, it seems to me like there is overlap in the
code and an extra level of indirection when markers are layered on
tracepoints. Could they be merged a bit more?

What if we extended DEFINE_TRACE() to also create a
'set_marker(marker_cb)' function where 'marker_cb' has the function signature:

marker_cb(<tracepoint prototype>, *marker_probe_func);

We then also create 'register_marker_##name' function in DEFINE_TRACE(),
which allows one to register marker callbacks in the usual way.

The 'marker_cb' function is then called in '__DO_TRACE' if anybody has
registered a marker (which can set the tracepoint.state appropriately).

The 'marker_cb' function then marshals its arguments and passes them
through to the marker functions that were registered.

I think in this way we can simplify the tracepoints and markers by
combining them to a large extent.

thanks,

-Jason






* Re: Unified tracing buffer
  2008-10-02 15:28         ` Jason Baron
@ 2008-10-03 16:11           ` Mathieu Desnoyers
  2008-10-03 18:37             ` Jason Baron
  0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 16:11 UTC (permalink / raw)
  To: Jason Baron
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

* Jason Baron (jbaron@redhat.com) wrote:
> On Mon, Sep 22, 2008 at 02:52:09PM -0400, Mathieu Desnoyers wrote:
> > > On Sat, 20 Sep 2008 09:55:48 -0400
> > > Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:
> > > 
> > > > How about :
> > > > 
> > > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > > >     sizeof(mystruct), mystruct);
> > > >   or
> > > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > > 
> > > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > > 
> > > > That would encapsulate the whole
> > > >   - Event ID registration
> > > >   - Event type registration
> > > >   - Sending data out
> > > >   - Enabling the event source directly at the source
> > > > 
> > > > We can then export the markers through a debugfs file and let userland
> > > > enable them one by one and possibly connect systemtap filters on them
> > > > (one table of registered filters, one table for the markers, a command
> > > > file to connect/disconnect filters to/from markers).
> > > 
> > > I would like to ask for the following from the start: have a field for
> > > a longer description of the marker that describes it's usage and
> > > context. Getting this there from the start is critical, because only
> > > when adding the marker point do people still really remember why/what
> > > (and having to type a good description also helps them to realize if
> > > this is the right point or not). This can then be exposed to the user
> > > so he has a standing chance of knowing what the marker is about.
> > > 
> > > It also has a standing chance of being updated when the code changes
> > > this way
> > > 
> > 
> > I agree, and I think it might be required in both markers and
> > tracepoints.
> > 
> > Given that tracepoints are declared in a global header
> > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > uses within the kernel code (statements like :
> >   trace_sched_switch(prev, next);
> > added to the scheduler) would therefore be tied to the description
> > without having to contain it in the core kernel code.
> > 
> > Markers, on the other hand, could become the "event description"
> > interface which is exported to userspace. Considering that, I guess it's
> > as important to let a precise description follow the markers.
> > 
> > Mathieu
> > 
> >
> 
> hi,
> 
> Tracepoints and markers seem to both have their place, with tracepoints
> being integral to kernel users, and markers being important for
> userspace. However, it seems to me like there is overlap in the
> code and an extra level of indirection when markers are layered on
> tracespoints. could they be merged a bit more?
> 
> What if we extended DEFINE_TRACE() to also create a
> 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
> 
> marker_cb(<tracepoint prototype>, *marker_probe_func);
> 
> We then also create 'register_marker_##name' function in DEFINE_TRACE(),
> which allows one to regiser marker callbacks in the usual way.
> 
> Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> registered a marker (which can set the tracepoint.state appropriately).
> 
> The 'marker_cb' function then marshalls its arguemnts and passes them
> through to the marker functions that were registered.
> 
> I think in this way we can simplify the tracepoints and markers by
> combining them to a large extent.
> 
> thanks,
> 
> -Jason
> 

I think what you propose here is already in my LTTng tree in a different
form. It's a patch to markers to allow declaring a marker which enables
an associated tracepoint when enabled. This way, we can have a marker
(exposed to userspace) connecting itself automatically to a tracepoint
when enabled.

It's here :
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09

Note that locking depends on the psrwlock patch so we can have nested
module list readers. Otherwise locking becomes _really_ messy. :-(

Mathieu

> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-10-03 16:11           ` Mathieu Desnoyers
@ 2008-10-03 18:37             ` Jason Baron
  2008-10-03 19:10               ` Mathieu Desnoyers
  0 siblings, 1 reply; 122+ messages in thread
From: Jason Baron @ 2008-10-03 18:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote:
> > > > > How about :
> > > > > 
> > > > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > > > >     sizeof(mystruct), mystruct);
> > > > >   or
> > > > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > > > 
> > > > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > > > 
> > > > > That would encapsulate the whole
> > > > >   - Event ID registration
> > > > >   - Event type registration
> > > > >   - Sending data out
> > > > >   - Enabling the event source directly at the source
> > > > > 
> > > > > We can then export the markers through a debugfs file and let userland
> > > > > enable them one by one and possibly connect systemtap filters on them
> > > > > (one table of registered filters, one table for the markers, a command
> > > > > file to connect/disconnect filters to/from markers).
> > > > 
> > > > I would like to ask for the following from the start: have a field for
> > > > a longer description of the marker that describes it's usage and
> > > > context. Getting this there from the start is critical, because only
> > > > when adding the marker point do people still really remember why/what
> > > > (and having to type a good description also helps them to realize if
> > > > this is the right point or not). This can then be exposed to the user
> > > > so he has a standing chance of knowing what the marker is about.
> > > > 
> > > > It also has a standing chance of being updated when the code changes
> > > > this way
> > > > 
> > > 
> > > I agree, and I think it might be required in both markers and
> > > tracepoints.
> > > 
> > > Given that tracepoints are declared in a global header
> > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > > uses within the kernel code (statements like :
> > >   trace_sched_switch(prev, next);
> > > added to the scheduler) would therefore be tied to the description
> > > without having to contain it in the core kernel code.
> > > 
> > > Markers, on the other hand, could become the "event description"
> > > interface which is exported to userspace. Considering that, I guess it's
> > > as important to let a precise description follow the markers.
> > > 
> > > Mathieu
> > > 
> > >
> > 
> > hi,
> > 
> > Tracepoints and markers seem to both have their place, with tracepoints
> > being integral to kernel users, and markers being important for
> > userspace. However, it seems to me like there is overlap in the
> > code and an extra level of indirection when markers are layered on
> > tracespoints. could they be merged a bit more?
> > 
> > What if we extended DEFINE_TRACE() to also create a
> > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
> > 
> > marker_cb(<tracepoint prototype>, *marker_probe_func);
> > 
> > We then also create 'register_marker_##name' function in DEFINE_TRACE(),
> > which allows one to regiser marker callbacks in the usual way.
> > 
> > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> > registered a marker (which can set the tracepoint.state appropriately).
> > 
> > The 'marker_cb' function then marshalls its arguemnts and passes them
> > through to the marker functions that were registered.
> > 
> > I think in this way we can simplify the tracepoints and markers by
> > combining them to a large extent.
> > 
> > thanks,
> > 
> > -Jason
> > 
> 
> I think what you propose here is already in y LTTng tree in a different
> form. It's a patch to markers to allow declaring a marker which enables
> an associated tracepoint when enabled. This way, we can have a marker
> (exposed to userspace) connecting itself automatically to a tracepoint
> when enabled.
> 
> It's here :
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09
> 
> Note that locking depends on the psrwlock patch so we can have nested
> module list readers. Otherwise locking becomes _really_ messy. :-(
> 
> Mathieu
> 

That patch simplifies using markers with tracepoints and couples
markers and tracepoints much more closely. But I was proposing to make
the coupling tighter...

Couldn't 'marker_probe_register()' register the marker directly with
the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
which references a marker callback function. That function would look
like (very loose C code):

marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
private_data)
{
	probe(private_data, "%arg1 %arg2", arg1->a, arg2->b);
}

The 'marker_blah_callback()' would be invoked from within DO_TRACE() for
each marker that has been registered with the associated tracepoint.
Just as we iterate over the tracepoint callbacks, we can iterate over
the registered markers and pass them to the 'marker_blah_callback()'
function.

By associating the marker_blah_callback() in DEFINE_TRACE(), we only
need to look in one file to understand what is associated with a
particular tracepoint. I think marker.c and tracepoint.c could also be
consolidated at that point.
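To show the shape of the idea in something that compiles, here is a userspace mock-up. It is a sketch under assumptions, not kernel code: struct tracepoint, tracepoint_register_marker, do_trace_sched_wakeup, snprintf_probe and the fixed-size marker array are all invented, and the tracepoint arguments (pid, prio) are just an example prototype.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MAX_MARKERS 4
#define PROBE_BUF_LEN 64

typedef void (*marker_probe_func)(void *private_data, const char *fmt, ...);

struct marker_entry {
	marker_probe_func probe;
	void *private_data;
};

/* Hypothetical tracepoint carrying its own list of marker probes. */
struct tracepoint {
	const char *name;
	struct marker_entry markers[MAX_MARKERS];
	int nr_markers;
};

static int tracepoint_register_marker(struct tracepoint *tp,
				      marker_probe_func probe, void *priv)
{
	assert(tp != NULL && probe != NULL);
	if (tp->nr_markers == MAX_MARKERS)
		return -1;
	tp->markers[tp->nr_markers].probe = probe;
	tp->markers[tp->nr_markers].private_data = priv;
	tp->nr_markers++;
	return 0;
}

/* Per-tracepoint callback: marshals typed arguments into a format. */
static void marker_sched_wakeup_callback(int pid, int prio,
					 marker_probe_func probe,
					 void *private_data)
{
	probe(private_data, "pid %d prio %d", pid, prio);
}

/* Stand-in for DO_TRACE(): walk the registered markers. */
static void do_trace_sched_wakeup(struct tracepoint *tp, int pid, int prio)
{
	int i;

	for (i = 0; i < tp->nr_markers; i++)
		marker_sched_wakeup_callback(pid, prio, tp->markers[i].probe,
					     tp->markers[i].private_data);
}

/* Example probe: formats the event into a PROBE_BUF_LEN byte buffer. */
static void snprintf_probe(void *private_data, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf((char *)private_data, PROBE_BUF_LEN, fmt, ap);
	va_end(ap);
}
```

With one snprintf_probe registered, firing the tracepoint with pid 42 and prio 120 leaves "pid 42 prio 120" in the probe's buffer.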

thanks,

-Jason









* Re: Unified tracing buffer
  2008-10-03 18:37             ` Jason Baron
@ 2008-10-03 19:10               ` Mathieu Desnoyers
  2008-10-03 19:25                 ` Jason Baron
  2008-10-03 21:52                 ` Frank Ch. Eigler
  0 siblings, 2 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 19:10 UTC (permalink / raw)
  To: Jason Baron
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

* Jason Baron (jbaron@redhat.com) wrote:
> On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote:
> > > > > > How about :
> > > > > > 
> > > > > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > > > > >     sizeof(mystruct), mystruct);
> > > > > >   or
> > > > > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > > > > 
> > > > > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > > > > 
> > > > > > That would encapsulate the whole
> > > > > >   - Event ID registration
> > > > > >   - Event type registration
> > > > > >   - Sending data out
> > > > > >   - Enabling the event source directly at the source
> > > > > > 
> > > > > > We can then export the markers through a debugfs file and let userland
> > > > > > enable them one by one and possibly connect systemtap filters on them
> > > > > > (one table of registered filters, one table for the markers, a command
> > > > > > file to connect/disconnect filters to/from markers).
> > > > > 
> > > > > I would like to ask for the following from the start: have a field for
> > > > > a longer description of the marker that describes it's usage and
> > > > > context. Getting this there from the start is critical, because only
> > > > > when adding the marker point do people still really remember why/what
> > > > > (and having to type a good description also helps them to realize if
> > > > > this is the right point or not). This can then be exposed to the user
> > > > > so he has a standing chance of knowing what the marker is about.
> > > > > 
> > > > > It also has a standing chance of being updated when the code changes
> > > > > this way
> > > > > 
> > > > 
> > > > I agree, and I think it might be required in both markers and
> > > > tracepoints.
> > > > 
> > > > Given that tracepoints are declared in a global header
> > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > > > uses within the kernel code (statements like :
> > > >   trace_sched_switch(prev, next);
> > > > added to the scheduler) would therefore be tied to the description
> > > > without having to contain it in the core kernel code.
> > > > 
> > > > Markers, on the other hand, could become the "event description"
> > > > interface which is exported to userspace. Considering that, I guess it's
> > > > as important to let a precise description follow the markers.
> > > > 
> > > > Mathieu
> > > > 
> > > >
> > > 
> > > hi,
> > > 
> > > Tracepoints and markers seem to both have their place, with tracepoints
> > > being integral to kernel users, and markers being important for
> > > userspace. However, it seems to me like there is overlap in the
> > > code and an extra level of indirection when markers are layered on
> > > tracespoints. could they be merged a bit more?
> > > 
> > > What if we extended DEFINE_TRACE() to also create a
> > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
> > > 
> > > marker_cb(<tracepoint prototype>, *marker_probe_func);
> > > 
> > > We then also create 'register_marker_##name' function in DEFINE_TRACE(),
> > > which allows one to regiser marker callbacks in the usual way.
> > > 
> > > Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> > > registered a marker (which can set the tracepoint.state appropriately).
> > > 
> > > The 'marker_cb' function then marshalls its arguemnts and passes them
> > > through to the marker functions that were registered.
> > > 
> > > I think in this way we can simplify the tracepoints and markers by
> > > combining them to a large extent.
> > > 
> > > thanks,
> > > 
> > > -Jason
> > > 
> > 
> > I think what you propose here is already in y LTTng tree in a different
> > form. It's a patch to markers to allow declaring a marker which enables
> > an associated tracepoint when enabled. This way, we can have a marker
> > (exposed to userspace) connecting itself automatically to a tracepoint
> > when enabled.
> > 
> > It's here :
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09
> > 
> > Note that locking depends on the psrwlock patch so we can have nested
> > module list readers. Otherwise locking becomes _really_ messy. :-(
> > 
> > Mathieu
> > 
> 
> That patch simplifies using markers with tracepoints and couples
> markers and tracepoints much more closely. But I was proposing to make
> the coupling tighter...
> 
> Couldn't 'marker_probe_register()' register the marker directly with
> the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
> which references a marker callback function. That function would look
> like (very loose C code):
> 
> marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,

I don't want the tracepoints to be coupled with markers (which are a
userspace API). The other way around is fine : letting a marker
automatically enable a tracepoint makes sense, but the opposite would
tie the in-kernel API (tracepoint) to the external marker
representation, and I would like to avoid that.

And how do you plan to deal with :

TPPROTO(arg1, arg2) == void ?

C won't let you define stuff like :

blah(void, marker_probe_func *probe, void *private_data)

The devil is in the details.... ;)
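
To make the problem concrete, here is a minimal userspace sketch, not kernel
code. TPPROTO() is redefined locally as a simple pass-through, and probe_fn,
record_probe(), cb_two_args() and cb_no_args() are names made up for this
illustration only:

```c
#include <stdarg.h>
#include <stdio.h>

/* Local stand-in for the kernel macro: it simply forwards the prototype. */
#define TPPROTO(...) __VA_ARGS__

#define PROBE_BUF_LEN 64

/* Hypothetical marker probe type: printf-style, as in Jason's loose C. */
typedef void (*probe_fn)(void *private_data, const char *fmt, ...);

/* Example probe: formats the event into a caller-supplied buffer. */
void record_probe(void *private_data, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(private_data, PROBE_BUF_LEN, fmt, ap);
	va_end(ap);
}

/* Works: TPPROTO(int a, int b) expands to "int a, int b". */
void cb_two_args(TPPROTO(int a, int b), probe_fn probe, void *private_data)
{
	probe(private_data, "a %d b %d", a, b);
}

/*
 * With TPPROTO(void) the same pattern would expand to
 *     void cb_no_args(void, probe_fn probe, void *private_data)
 * which does not compile: "void" is only valid as the sole parameter.
 * The no-argument case therefore has to be written out separately.
 */
void cb_no_args(probe_fn probe, void *private_data)
{
	probe(private_data, "(no arguments)");
}
```

The two-argument form compiles because the prototype expands to an ordinary
parameter list; the void case needs a callback written differently, which is
the detail any generic DEFINE_TRACE() argument would have to cope with.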

Mathieu

> private_data)
> {
> 	probe(private_data, "%arg1 %arg2", arg1->a, arg2->b);
> }
> 
> The 'marker_blah_callback()' would be invoked from within DO_TRACE() for
> each marker that has been registered with the associated tracepoint, in
> a similar way to how we iterate over the tracepoint callbacks, we can
> iterate over the registered markers and pass them to the
> 'marker_blah_callback()' function. 
> 
> By associating the marker_blah_callback() in DEFINE_TRACE(), we only
> need to look in one file to understand what is associated with a
> particular tracepoint. I think marker.c and tracepoint.c could also be
> consolidated at that point.
> 
> thanks,
> 
> -Jason
> 
> 
> 
> 
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-10-03 19:10               ` Mathieu Desnoyers
@ 2008-10-03 19:25                 ` Jason Baron
  2008-10-03 19:56                   ` Mathieu Desnoyers
  2008-10-03 21:52                 ` Frank Ch. Eigler
  1 sibling, 1 reply; 122+ messages in thread
From: Jason Baron @ 2008-10-03 19:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
> > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote:
> > > > > > > How about :
> > > > > > > 
> > > > > > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > > > > > >     sizeof(mystruct), mystruct);
> > > > > > >   or
> > > > > > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > > > > > 
> > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > > > > > 
> > > > > > > That would encapsulate the whole
> > > > > > >   - Event ID registration
> > > > > > >   - Event type registration
> > > > > > >   - Sending data out
> > > > > > >   - Enabling the event source directly at the source
> > > > > > > 
> > > > > > > We can then export the markers through a debugfs file and let userland
> > > > > > > enable them one by one and possibly connect systemtap filters on them
> > > > > > > (one table of registered filters, one table for the markers, a command
> > > > > > > file to connect/disconnect filters to/from markers).
> > > > > > 
> > > > > > I would like to ask for the following from the start: have a field for
> > > > > > a longer description of the marker that describes its usage and
> > > > > > context. Getting this there from the start is critical, because only
> > > > > > when adding the marker point do people still really remember why/what
> > > > > > (and having to type a good description also helps them to realize if
> > > > > > this is the right point or not). This can then be exposed to the user
> > > > > > so he has a standing chance of knowing what the marker is about.
> > > > > > 
> > > > > > It also has a standing chance of being updated when the code changes
> > > > > > this way
> > > > > > 
> > > > > 
> > > > > I agree, and I think it might be required in both markers and
> > > > > tracepoints.
> > > > > 
> > > > > Given that tracepoints are declared in a global header
> > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > > > > uses within the kernel code (statements like :
> > > > >   trace_sched_switch(prev, next);
> > > > > added to the scheduler) would therefore be tied to the description
> > > > > without having to contain it in the core kernel code.
> > > > > 
> > > > > Markers, on the other hand, could become the "event description"
> > > > > interface which is exported to userspace. Considering that, I guess it's
> > > > > as important to let a precise description follow the markers.
> > > > > 
> > > > > Mathieu
> > > > > 
> > > > >
> > > > 
> > > > hi,
> > > > 
> > > > Tracepoints and markers seem to both have their place, with tracepoints
> > > > being integral to kernel users, and markers being important for
> > > > userspace. However, it seems to me like there is overlap in the
> > > > code and an extra level of indirection when markers are layered on
> > > > tracepoints. Could they be merged a bit more?
> > > > 
> > > > What if we extended DEFINE_TRACE() to also create a
> > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
> > > > 
> > > > marker_cb(<tracepoint prototype>, *marker_probe_func);
> > > > 
> > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(),
> > > > which allows one to register marker callbacks in the usual way.
> > > > 
> > > > The 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> > > > registered a marker (which can set the tracepoint.state appropriately).
> > > > 
> > > > The 'marker_cb' function then marshalls its arguments and passes them
> > > > through to the marker functions that were registered.
> > > > 
> > > > I think in this way we can simplify the tracepoints and markers by
> > > > combining them to a large extent.
> > > > 
> > > > thanks,
> > > > 
> > > > -Jason
> > > > 
> > > 
> > > I think what you propose here is already in my LTTng tree in a different
> > > form. It's a patch to markers to allow declaring a marker which enables
> > > an associated tracepoint when enabled. This way, we can have a marker
> > > (exposed to userspace) connecting itself automatically to a tracepoint
> > > when enabled.
> > > 
> > > It's here :
> > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09
> > > 
> > > Note that locking depends on the psrwlock patch so we can have nested
> > > module list readers. Otherwise locking becomes _really_ messy. :-(
> > > 
> > > Mathieu
> > > 
> > 
> > That patch simplifies using markers with tracepoints and couples
> > markers and tracepoints much more closely. But I was proposing to make
> > the coupling tighter...
> > 
> > Couldn't 'marker_probe_register()' register the marker directly with
> > the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
> > which references a marker callback function. That function would look
> > like (very loose C code):
> > 
> > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
> 
> I don't want the tracepoints to be coupled with markers (which are a
> userspace API). The other way around is fine : letting a marker
> automatically enable a tracepoint makes sense, but the opposite would
> tie the in-kernel API (tracepoint) to the external marker
> representation, and I would like to avoid that.
> 

The interface to markers is still marker_probe_register() and
marker_probe_unregister(). I don't see how that changes with this
proposal?


> And how do you plan to deal with :
> 
> TPPROTO(arg1, arg2) == void ?
> 
> C won't let you define stuff like :
> 
> blah(void, marker_probe_func *probe, void *private_data)
> 

It'd be simple enough to pass the noargs requirement down as an
extra argument to DO_TRACE(), and then invoke the callback with no arguments.

thanks,

-Jason



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-10-03 19:25                 ` Jason Baron
@ 2008-10-03 19:56                   ` Mathieu Desnoyers
  2008-10-03 20:25                     ` Jason Baron
  0 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 19:56 UTC (permalink / raw)
  To: Jason Baron
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

* Jason Baron (jbaron@redhat.com) wrote:
> On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote:
> > * Jason Baron (jbaron@redhat.com) wrote:
> > > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote:
> > > > > > > > How about :
> > > > > > > > 
> > > > > > > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > > > > > > >     sizeof(mystruct), mystruct);
> > > > > > > >   or
> > > > > > > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > > > > > > 
> > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > > > > > > 
> > > > > > > > That would encapsulate the whole
> > > > > > > >   - Event ID registration
> > > > > > > >   - Event type registration
> > > > > > > >   - Sending data out
> > > > > > > >   - Enabling the event source directly at the source
> > > > > > > > 
> > > > > > > > We can then export the markers through a debugfs file and let userland
> > > > > > > > enable them one by one and possibly connect systemtap filters on them
> > > > > > > > (one table of registered filters, one table for the markers, a command
> > > > > > > > file to connect/disconnect filters to/from markers).
> > > > > > > 
> > > > > > > I would like to ask for the following from the start: have a field for
> > > > > > > a longer description of the marker that describes its usage and
> > > > > > > context. Getting this there from the start is critical, because only
> > > > > > > when adding the marker point do people still really remember why/what
> > > > > > > (and having to type a good description also helps them to realize if
> > > > > > > this is the right point or not). This can then be exposed to the user
> > > > > > > so he has a standing chance of knowing what the marker is about.
> > > > > > > 
> > > > > > > It also has a standing chance of being updated when the code changes
> > > > > > > this way
> > > > > > > 
> > > > > > 
> > > > > > I agree, and I think it might be required in both markers and
> > > > > > tracepoints.
> > > > > > 
> > > > > > Given that tracepoints are declared in a global header
> > > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > > > > > uses within the kernel code (statements like :
> > > > > >   trace_sched_switch(prev, next);
> > > > > > added to the scheduler) would therefore be tied to the description
> > > > > > without having to contain it in the core kernel code.
> > > > > > 
> > > > > > Markers, on the other hand, could become the "event description"
> > > > > > interface which is exported to userspace. Considering that, I guess it's
> > > > > > as important to let a precise description follow the markers.
> > > > > > 
> > > > > > Mathieu
> > > > > > 
> > > > > >
> > > > > 
> > > > > hi,
> > > > > 
> > > > > Tracepoints and markers seem to both have their place, with tracepoints
> > > > > being integral to kernel users, and markers being important for
> > > > > userspace. However, it seems to me like there is overlap in the
> > > > > code and an extra level of indirection when markers are layered on
> > > > > tracepoints. Could they be merged a bit more?
> > > > > 
> > > > > What if we extended DEFINE_TRACE() to also create a
> > > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
> > > > > 
> > > > > marker_cb(<tracepoint prototype>, *marker_probe_func);
> > > > > 
> > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(),
> > > > > which allows one to register marker callbacks in the usual way.
> > > > > 
> > > > > The 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> > > > > registered a marker (which can set the tracepoint.state appropriately).
> > > > > 
> > > > > The 'marker_cb' function then marshalls its arguments and passes them
> > > > > through to the marker functions that were registered.
> > > > > 
> > > > > I think in this way we can simplify the tracepoints and markers by
> > > > > combining them to a large extent.
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > -Jason
> > > > > 
> > > > 
> > > > I think what you propose here is already in my LTTng tree in a different
> > > > form. It's a patch to markers to allow declaring a marker which enables
> > > > an associated tracepoint when enabled. This way, we can have a marker
> > > > (exposed to userspace) connecting itself automatically to a tracepoint
> > > > when enabled.
> > > > 
> > > > It's here :
> > > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09
> > > > 
> > > > Note that locking depends on the psrwlock patch so we can have nested
> > > > module list readers. Otherwise locking becomes _really_ messy. :-(
> > > > 
> > > > Mathieu
> > > > 
> > > 
> > > That patch simplifies using markers with tracepoints and couples
> > > markers and tracepoints much more closely. But I was proposing to make
> > > the coupling tighter...
> > > 
> > > Couldn't 'marker_probe_register()' register the marker directly with
> > > the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
> > > which references a marker callback function. That function would look
> > > like (very loose C code):
> > > 
> > > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
> > 
> > I don't want the tracepoints to be coupled with markers (which are a
> > userspace API). The other way around is fine : letting a marker
> > automatically enable a tracepoint makes sense, but the opposite would
> > tie the in-kernel API (tracepoint) to the external marker
> > representation, and I would like to avoid that.
> > 
> 
> The interface to markers is still marker_probe_register() and
> marker_probe_unregister(). I don't see how that changes with this
> proposal?
> 

"Have DEFINE_TRACE() take an additional argument which references a
marker callback function." -> it would tie the tracepoint definition to a
marker. Or am I misunderstanding something ?

Mathieu

> 
> > And how do you plan to deal with :
> > 
> > TPPROTO(arg1, arg2) == void ?
> > 
> > C won't let you define stuff like :
> > 
> > blah(void, marker_probe_func *probe, void *private_data)
> > 
> 
> it'd be simple enough to pass the noargs requirement down as an
> extra argument to DO_TRACE(), and then invoke the callback with no arguments.
> 
> thanks,
> 
> -Jason
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-10-03 19:56                   ` Mathieu Desnoyers
@ 2008-10-03 20:25                     ` Jason Baron
  0 siblings, 0 replies; 122+ messages in thread
From: Jason Baron @ 2008-10-03 20:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

On Fri, Oct 03, 2008 at 03:56:40PM -0400, Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
> > On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote:
> > > * Jason Baron (jbaron@redhat.com) wrote:
> > > > On Fri, Oct 03, 2008 at 12:11:54PM -0400, Mathieu Desnoyers wrote:
> > > > > > > > > How about :
> > > > > > > > > 
> > > > > > > > >   trace_mark(ftrace_evname, "size %lu binary %pW",
> > > > > > > > >     sizeof(mystruct), mystruct);
> > > > > > > > >   or
> > > > > > > > >   trace_mark(sched_wakeup, "target_pid %ld", task->pid);
> > > > > > > > > 
> > > > > > > > > Note the namespacing with buffers being "ftrace" and "sched" here.
> > > > > > > > > 
> > > > > > > > > That would encapsulate the whole
> > > > > > > > >   - Event ID registration
> > > > > > > > >   - Event type registration
> > > > > > > > >   - Sending data out
> > > > > > > > >   - Enabling the event source directly at the source
> > > > > > > > > 
> > > > > > > > > We can then export the markers through a debugfs file and let userland
> > > > > > > > > enable them one by one and possibly connect systemtap filters on them
> > > > > > > > > (one table of registered filters, one table for the markers, a command
> > > > > > > > > file to connect/disconnect filters to/from markers).
> > > > > > > > 
> > > > > > > > I would like to ask for the following from the start: have a field for
> > > > > > > > a longer description of the marker that describes its usage and
> > > > > > > > context. Getting this there from the start is critical, because only
> > > > > > > > when adding the marker point do people still really remember why/what
> > > > > > > > (and having to type a good description also helps them to realize if
> > > > > > > > this is the right point or not). This can then be exposed to the user
> > > > > > > > so he has a standing chance of knowing what the marker is about.
> > > > > > > > 
> > > > > > > > It also has a standing chance of being updated when the code changes
> > > > > > > > this way
> > > > > > > > 
> > > > > > > 
> > > > > > > I agree, and I think it might be required in both markers and
> > > > > > > tracepoints.
> > > > > > > 
> > > > > > > Given that tracepoints are declared in a global header
> > > > > > > (DECLARE_TRACE()), I would add this kind of description here. Tracepoint
> > > > > > > uses within the kernel code (statements like :
> > > > > > >   trace_sched_switch(prev, next);
> > > > > > > added to the scheduler) would therefore be tied to the description
> > > > > > > without having to contain it in the core kernel code.
> > > > > > > 
> > > > > > > Markers, on the other hand, could become the "event description"
> > > > > > > interface which is exported to userspace. Considering that, I guess it's
> > > > > > > as important to let a precise description follow the markers.
> > > > > > > 
> > > > > > > Mathieu
> > > > > > > 
> > > > > > >
> > > > > > 
> > > > > > hi,
> > > > > > 
> > > > > > Tracepoints and markers seem to both have their place, with tracepoints
> > > > > > being integral to kernel users, and markers being important for
> > > > > > userspace. However, it seems to me like there is overlap in the
> > > > > > code and an extra level of indirection when markers are layered on
> > > > > > tracepoints. Could they be merged a bit more?
> > > > > > 
> > > > > > What if we extended DEFINE_TRACE() to also create a
> > > > > > 'set_marker(marker_cb)' function where 'marker_cb' has the function signature:
> > > > > > 
> > > > > > marker_cb(<tracepoint prototype>, *marker_probe_func);
> > > > > > 
> > > > > > We then also create 'register_marker_##name' function in DEFINE_TRACE(),
> > > > > > which allows one to register marker callbacks in the usual way.
> > > > > > 
> > > > > > The 'marker_cb' function is then called in '__DO_TRACE' if anybody has
> > > > > > registered a marker (which can set the tracepoint.state appropriately).
> > > > > > 
> > > > > > The 'marker_cb' function then marshalls its arguments and passes them
> > > > > > through to the marker functions that were registered.
> > > > > > 
> > > > > > I think in this way we can simplify the tracepoints and markers by
> > > > > > combining them to a large extent.
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > > -Jason
> > > > > > 
> > > > > 
> > > > > I think what you propose here is already in my LTTng tree in a different
> > > > > form. It's a patch to markers to allow declaring a marker which enables
> > > > > an associated tracepoint when enabled. This way, we can have a marker
> > > > > (exposed to userspace) connecting itself automatically to a tracepoint
> > > > > when enabled.
> > > > > 
> > > > > It's here :
> > > > > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d52ea7c48f47a1179aee01636d515cfea4ff6ede;hp=0a7b5c02209f3582ed1369ec818a1b389bd45a09
> > > > > 
> > > > > Note that locking depends on the psrwlock patch so we can have nested
> > > > > module list readers. Otherwise locking becomes _really_ messy. :-(
> > > > > 
> > > > > Mathieu
> > > > > 
> > > > 
> > > > That patch simplifies using markers with tracepoints and couples
> > > > markers and tracepoints much more closely. But I was proposing to make
> > > > the coupling tighter...
> > > > 
> > > > Couldn't 'marker_probe_register()' register the marker directly with
> > > > the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
> > > > which references a marker callback function. That function would look
> > > > like (very loose C code):
> > > > 
> > > > marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
> > > 
> > > I don't want the tracepoints to be coupled with markers (which are a
> > > userspace API). The other way around is fine : letting a marker
> > > automatically enable a tracepoint makes sense, but the opposite would
> > > tie the in-kernel API (tracepoint) to the external marker
> > > representation, and I would like to avoid that.
> > > 
> > 
> > The interface to markers is still marker_probe_register() and
> > marker_probe_unregister(). I don't see how that changes with this
> > proposal?
> > 
> 
> "Have DEFINE_TRACE() take an additional argument which references a
> marker callback function." -> it would tie the tracepoint definition to a
> marker. Or am I misunderstanding something ?
> 

Not sure. Maybe the confusion is that I am really talking about two
callbacks here. First, there is a tracepoint->marker callback, which is
the 'marker_blah_callback()' that I mentioned above and is the one
referenced in DEFINE_TRACE(). There is also the marker->userspace
callback, which is registered via something similar to marker_probe_register(),
only it is registered directly with the tracepoint.

I think this potentially addresses Arjan's concern better, because it ties
the 'tracepoint->marker' callback directly to the tracepoint. This
'tracepoint->marker' callback function in essence documents the marker
interface for a tracepoint, so the proposal documents the interfaces
(both tracepoints and markers) all in one place.
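
A rough userspace mock-up of the two layers might look like the following.
It is only a sketch under stated assumptions: struct blah_tracepoint,
register_marker_blah(), trace_blah(), buffer_probe() and the argument types
struct foo / struct bar are all invented for illustration and are not
existing kernel interfaces, and locking, RCU and the ordinary tracepoint
probes are left out entirely.

```c
#include <stdarg.h>
#include <stdio.h>

#define MAX_MARKER_PROBES 4
#define MARKER_BUF_LEN 64

struct foo { int a; };
struct bar { int b; };

/* marker->consumer callback: gets a format string plus marshalled values. */
typedef void (*marker_probe_func)(void *private_data, const char *fmt, ...);

/* tracepoint->marker callback: typed like the tracepoint, plus the probe. */
typedef void (*blah_marker_cb)(const struct foo *arg1, const struct bar *arg2,
			       marker_probe_func probe, void *private_data);

struct marker_probe {
	marker_probe_func func;
	void *private_data;
};

struct blah_tracepoint {
	blah_marker_cb marker_cb;	/* fixed when the tracepoint is defined */
	struct marker_probe probes[MAX_MARKER_PROBES];
	int nr_probes;			/* non-zero acts as the "enabled" state */
};

/* The one place that documents how "blah" is presented to markers. */
void marker_blah_callback(const struct foo *arg1, const struct bar *arg2,
			  marker_probe_func probe, void *private_data)
{
	probe(private_data, "a %d b %d", arg1->a, arg2->b);
}

/* Register a marker probe directly with the tracepoint. */
int register_marker_blah(struct blah_tracepoint *tp, marker_probe_func func,
			 void *private_data)
{
	if (tp->nr_probes == MAX_MARKER_PROBES)
		return -1;
	tp->probes[tp->nr_probes].func = func;
	tp->probes[tp->nr_probes].private_data = private_data;
	tp->nr_probes++;
	return 0;
}

/* Call site: iterate over the registered marker probes, if any. */
void trace_blah(struct blah_tracepoint *tp, const struct foo *arg1,
		const struct bar *arg2)
{
	int i;

	for (i = 0; i < tp->nr_probes; i++)
		tp->marker_cb(arg1, arg2, tp->probes[i].func,
			      tp->probes[i].private_data);
}

/* Example probe that formats the event into a caller-supplied buffer. */
void buffer_probe(void *private_data, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(private_data, MARKER_BUF_LEN, fmt, ap);
	va_end(ap);
}
```

A caller would initialise the tracepoint with
{ .marker_cb = marker_blah_callback }, register buffer_probe() together with
a MARKER_BUF_LEN-byte buffer, and then call trace_blah(); with no probe
registered, trace_blah() does nothing.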

If I'm not clear, I can prototype it if you think that would help?

thanks,

-Jason


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-10-03 19:10               ` Mathieu Desnoyers
  2008-10-03 19:25                 ` Jason Baron
@ 2008-10-03 21:52                 ` Frank Ch. Eigler
  1 sibling, 0 replies; 122+ messages in thread
From: Frank Ch. Eigler @ 2008-10-03 21:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, Arjan van de Ven, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od

Hi -

On Fri, Oct 03, 2008 at 03:10:26PM -0400, Mathieu Desnoyers wrote:
> [...]
> I don't want the tracepoints to be coupled with markers (which are a
> userspace API). [...]

I'm glad the discussion seems to be slowly turning toward the event
layering issues, but ... "markers being a userspace API"?  That seems
to be grossly misleading terminology.  It's a kernel API like anything
else being discussed here.  What's different about it
(vs. tracepoints) is that it could be one of the first clients of the
unified ring buffer widget that concretely addresses the issue of
encoding abstract traceworthy events into serialized data.

(Of course, merging the lower level aspects of the marker/tracepoint
implementations is all well and good.)

- FChE

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-20 13:55   ` Mathieu Desnoyers
  2008-09-20 14:12     ` Arjan van de Ven
@ 2008-09-22  3:09     ` KOSAKI Motohiro
  1 sibling, 0 replies; 122+ messages in thread
From: KOSAKI Motohiro @ 2008-09-22  3:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kosaki.motohiro, Steven Rostedt, Martin Bligh,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, od,
	Frank Ch. Eigler

> > > CONTROL
> > > -------
> > > 
> > > Sysfs style tree under debugfs
> > > 
> > > /debugfs/tracing/<name>/buffers/enabled        <--- binary value
> > > 
> > > /debugfs/tracing/<name>/<event1>
> > > /debugfs/tracing/<name>/<event2>
> > >     etc ...
> > 
> > I wonder if we should make this another sub dir:
> > 
> >  /debugfs/tracing/buffers/events/<event-name>
> > 
> 
> Sure.
> 
> If needed, we could change the markers to take two separate parameters :
> 
> trace_mark(tracer_name, event_name, "format", args)

Sorry, I disagree.

I believe any marker point should be independent of the tracer.
Otherwise, the number of trace_mark() calls will blow up easily.




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-20  9:03 ` Steven Rostedt
  2008-09-20 13:55   ` Mathieu Desnoyers
@ 2008-09-22  9:57   ` Peter Zijlstra
  2008-09-23  2:36     ` Mathieu Desnoyers
  1 sibling, 1 reply; 122+ messages in thread
From: Peter Zijlstra @ 2008-09-22  9:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, od, Frank Ch. Eigler

On Sat, 2008-09-20 at 05:03 -0400, Steven Rostedt wrote:

> Oh, and all commands should start with the namespace.
> 
>   ring_buffer_alloc()
>   ring_buffer_free()
>   ring_buffer_record_event()

I really think we should separate the ringbuffer management from the
event stuff.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22  9:57   ` Peter Zijlstra
@ 2008-09-23  2:36     ` Mathieu Desnoyers
  0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  2:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Sat, 2008-09-20 at 05:03 -0400, Steven Rostedt wrote:
> 
> > Oh, and all commands should start with the namespace.
> > 
> >   ring_buffer_alloc()
> >   ring_buffer_free()
> >   ring_buffer_record_event()
> 
> I really think we should separate the ringbuffer management from the
> event stuff.
> 

Sure, I am strongly in favor of separating those two, given they
represent two different things. However, the requirement I have heard at
KS2008 was to provide 

- Unified buffering mechanism
- Timestamps synchronized across all buffers
- Unified event IDs management, so events from various sources could be
  shared between tools.
- As I understand it, a unified event structure, which can be exported
  to userspace and be shared across different tools.
- Unified buffer control/management mechanism.

These all represent different infrastructure parts, but are all needed
if we want tools to be able to share the data exported through those
buffers.

Relay is a good example of having only a _single_ one of these layers in
common : there is currently no way for the different relay users to share
the data they collect, because they simply have no idea how the others
structure their data.
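
As a rough illustration of keeping those layers separate while still sharing
them, here is a small userspace sketch. Everything in it is an assumption
made for the example: struct ring_buffer, ring_buffer_write(),
ring_buffer_read(), struct event_header, event_record() and event_read() are
hypothetical names, not the API under discussion, and there is no locking,
no per-CPU handling and no overwrite ("flight recorder") mode.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RB_SIZE 256

/* Layer 1: buffer management. It moves opaque bytes and knows no events. */
struct ring_buffer {
	uint8_t data[RB_SIZE];
	size_t head;	/* next write offset */
	size_t used;	/* bytes currently stored */
};

/* Append len bytes; returns -1 instead of overwriting when full. */
int ring_buffer_write(struct ring_buffer *rb, const void *buf, size_t len)
{
	const uint8_t *src = buf;
	size_t i;

	if (len > RB_SIZE - rb->used)
		return -1;
	for (i = 0; i < len; i++) {
		rb->data[rb->head] = src[i];
		rb->head = (rb->head + 1) % RB_SIZE;
	}
	rb->used += len;
	return 0;
}

/* Consume the oldest len bytes; buf may be NULL to discard them. */
int ring_buffer_read(struct ring_buffer *rb, void *buf, size_t len)
{
	uint8_t *dst = buf;
	size_t tail, i;

	if (len > rb->used)
		return -1;
	tail = (rb->head + RB_SIZE - rb->used) % RB_SIZE;
	for (i = 0; dst && i < len; i++)
		dst[i] = rb->data[(tail + i) % RB_SIZE];
	rb->used -= len;
	return 0;
}

/* Layer 2: the event format every tool has to agree on. */
struct event_header {
	uint16_t length;	/* payload bytes following the header */
	uint16_t event_id;
	uint64_t timestamp;
};

int event_record(struct ring_buffer *rb, uint16_t event_id,
		 uint64_t timestamp, const void *payload, uint16_t length)
{
	struct event_header hdr;

	if (sizeof(hdr) + length > RB_SIZE - rb->used)
		return -1;	/* never store a header without its payload */
	memset(&hdr, 0, sizeof(hdr));	/* also clears padding bytes */
	hdr.length = length;
	hdr.event_id = event_id;
	hdr.timestamp = timestamp;
	ring_buffer_write(rb, &hdr, sizeof(hdr));
	ring_buffer_write(rb, payload, length);
	return 0;
}

/* Read one event; an event larger than max is dropped and -1 returned. */
int event_read(struct ring_buffer *rb, struct event_header *hdr,
	       void *payload, size_t max)
{
	if (ring_buffer_read(rb, hdr, sizeof(*hdr)))
		return -1;
	if (hdr->length > max) {
		ring_buffer_read(rb, NULL, hdr->length);
		return -1;
	}
	return ring_buffer_read(rb, payload, hdr->length);
}
```

The buffer layer could be reused by anything that needs to move bytes, but
two tools can only exchange data once they also agree on the event_header
layer, which is the part relay does not provide.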

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
                   ` (5 preceding siblings ...)
  2008-09-20  9:03 ` Steven Rostedt
@ 2008-09-22 13:57 ` K.Prasad
  2008-09-22 19:45 ` Masami Hiramatsu
  2008-09-23  3:33 ` Andi Kleen
  8 siblings, 0 replies; 122+ messages in thread
From: K.Prasad @ 2008-09-22 13:57 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, zanussi

On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
> 

With due apologies for pitching in late, I thought I'd bring visibility to
two new interfaces, namely relay_printk() and relay_dump(), which are now part
of the -mm tree (since 2.6.27-rc5-mm1) and are meant to address such needs;
not completely in their present form, but quite substantially
(refer to Documentation/filesystems/relay.txt). As far as re-usability is
concerned, many parts of these interfaces are adopted directly from
SystemTap's runtime. Blktrace has been made to work using these interfaces
(http://tinyurl.com/4q9d4p), removing about 130 lines of code from the
blktrace-related files.

With more effort, additions such as a) the ability to specify custom
names for files and b) the ability to create user-defined control files (in
addition to what comes by default) would make it usable along with tracers
such as ftrace (ref: http://tinyurl.com/3ppbwh); this is something that I
intended to work on.

The relay_printk() interface provides a high-level abstraction
over 'relay' that masks all the setup/tear-down details and offers the ability
to use per-CPU buffers; relay_dump() is its equivalent for
binary dumping through the debugfs interface (a requirement for the unified
tracing buffer, as I gather from the email). Also, the use of default
file names and a default debugfs output path greatly reduces the setup code
required of the end user, who can still override the defaults
if a special case requires it. Examples of the resulting code brevity can
be seen in samples/relay/*.c in the 2.6.27-rc5-mm1 tree.

I am quite sure that with minimal changes to the infrastructure underlying
these two interfaces, we can meet most of the requirements
stated above, and I am open to suggestions.

Kindly let me know what the community thinks.

Thanks,
K.Prasad

> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.
> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
> 
> Sketch of design.  Enjoy flaming me. Code will follow shortly.
> 
> 
> STORAGE
> -------
> 
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
> 
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.
> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
> 
> 
> INPUT_FUNCTIONS
> ---------------
> 
> allocate_buffer (name, size)
>         return buffer_handle
> 
> register_event (buffer_handle, event_id, print_function)
>         You can pass in a requested event_id from a fixed set, and
> will be given it, or an error
>         0 means allocate me one dynamically
>         returns event_id     (or -E_ERROR)
> 
> record_event (buffer_handle, event_id, length, *buf)
> 
> 
> OUTPUT
> ------
> 
> Data will be output via debugfs, and provide the following output streams:
> 
> /debugfs/tracing/<name>/buffers/text
>     clear text stream (will merge the per-cpu streams via insertion
> sort, and use the print functions)
> 
> /debugfs/tracing/<name>/buffers/binary[cpu_number]
>     per-cpu binary data
> 
> 
> CONTROL
> -------
> 
> Sysfs style tree under debugfs
> 
> /debugfs/tracing/<name>/buffers/enabled        <--- binary value
> 
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
>     etc ...
>     provides a way to enable/disable events, see what's available, and
> what's enabled.
> 
> 
> KNOWN ISSUES / PLANS
> -------------------
> 
> No way to unregister buffers and events.
>     Will provide an unregister_buffer and unregister_event call
> 
> 
> Generating systemwide time is hard on some platforms
>     Yes. Time-based output provides a lot of simplicity for the user though
>     We won't support these platforms at first, we'll add functionality
> to make it work for them later.
>     (plan based on tick-based ms timing, plus counter offset from that
> if needed).
> 
> Spinlock_irq is inefficient, and doesn't support tracing in NMIs
>     True. We'll implement a lockless scheme later (see lttng)
> 
> Putting a length record in every event is inefficient
>     True. Fixed record length with optional extensions is better, but
> more complex. v2.
> 
> Putting a full timestamp rather than an offset in every event is inefficient
>     See above. True, but v2.
> 
> Relayfs already exists! use that!
>     People were universally not keen on that idea. Complexity, interface, etc.
>     We're also providing some higher level shared functions for time &
> event ids.
> 
> There's no way to decode the binary data stream
>     Code will be shared from the kernel to decode it, so that we can
> get the compact binary
>     format and decode it later. That code will be kept in the kernel
> tree (it's a trivial piece of C).
>     Version 1.1 ;-)

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
                   ` (6 preceding siblings ...)
  2008-09-22 13:57 ` K.Prasad
@ 2008-09-22 19:45 ` Masami Hiramatsu
  2008-09-22 20:13   ` Martin Bligh
  2008-09-23  3:33 ` Andi Kleen
  8 siblings, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-22 19:45 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
	systemtap-ml

Hi Martin,

Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave
> data from multiple sources,
> not having to learn 200 different tools, duplicated code/effort, etc.
> 
> Several of us got together last night and tried to cut this down to
> the simplest usable system
> we could agree on (and nobody got hurt!). This will form version 1.
> I've sketched out a few
> enhancements we know that we want, but have agreed to leave these
> until version 2.
> The answer to most questions about the below is "yes we know, we'll
> fix that in version 2"
> (or 3). Simplicity was the rule ...
>
> Sketch of design.  Enjoy flaming me. Code will follow shortly.
>
>
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
>
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.
> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)

I agree with integrating the tracing buffer mechanisms, but I don't think
your proposal is the simplest one.

To simplify, I think a layered buffering mechanism is desirable.
- The lowest layer just provides named circular buffers (both per-cpu and
  a single system-wide buffer) and a read/write scheme.
- The next layer provides the user/kernel interface, including controls.
- The top layer defines packet (and event) formatting utilities.
- Additionally, it would be better to provide some library routines
  (timestamp, event-id synchronization, and so on).

Since this unified buffer will be used by different kinds of tracers/loggers,
I don't think all of them (and future tracers) want to be tied down to an
"event-id"+"parameter" format.
So, sorry, I disagree that the tracing buffer should define its *data format*;
that's just overkill for me.

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22 19:45 ` Masami Hiramatsu
@ 2008-09-22 20:13   ` Martin Bligh
  2008-09-22 22:25     ` Masami Hiramatsu
  0 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-22 20:13 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
	systemtap-ml

> I agree to integrate tracing buffer mechanism, but I don't think
> your proposal is the simplest one.
>
> To simplify, I think the layered buffering mechanism is desirable.
> - The lowest layer just provides named circular buffers(both per-cpu and
>  uni-buffer in system) and read/write scheme.
> - Next layer provides user/kernel interface including controls.
> - Top layer defines packet(and event) formatting utilities.
> - Additionally, it would better provide some library routines(timestamp,
>  event-id synchronize, and so on).
>
> Since this unified buffer is used from different kind of tracers/loggers,
> I don't think all of them(and future tracers) want to be tied down by
> "event-id"+"parameter" format.
> So, Sorry, I disagree about that the tracing buffer defines its *data format*,
> it's just overkill for me.

I think you're right that we can layer this, and we didn't do a particularly
good job of splitting those things out. I'll try to pull together another
revision to reflect this and other suggested changes.

One thing that I think is still important is to have a unified timestamp
mechanism on everything, so we can co-ordinate different things back
together in userspace from different trace tools, so I intend to keep
that at a lower level, but I think you're right that the event id, etc can
move up into separate layers.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22 20:13   ` Martin Bligh
@ 2008-09-22 22:25     ` Masami Hiramatsu
  2008-09-22 23:11       ` Darren Hart
                         ` (2 more replies)
  0 siblings, 3 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-22 22:25 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
	systemtap-ml

Hi Martin,

Martin Bligh wrote:
>> I agree to integrate tracing buffer mechanism, but I don't think
>> your proposal is the simplest one.
>>
>> To simplify, I think the layered buffering mechanism is desirable.
>> - The lowest layer just provides named circular buffers(both per-cpu and
>>  uni-buffer in system) and read/write scheme.
>> - Next layer provides user/kernel interface including controls.
>> - Top layer defines packet(and event) formatting utilities.
>> - Additionally, it would better provide some library routines(timestamp,
>>  event-id synchronize, and so on).
>>
>> Since this unified buffer is used from different kind of tracers/loggers,
>> I don't think all of them(and future tracers) want to be tied down by
>> "event-id"+"parameter" format.
>> So, Sorry, I disagree about that the tracing buffer defines its *data format*,
>> it's just overkill for me.
> 
> I think you're right that we can layer this, and we didn't make a particularly
> good job of splitting those things out. I'll try to pull together
> another revision
> to reflect this and other suggested changes.

I'm happy to hear that. :-)

> One thing that I think is still important is to have a unified timestamp
> mechanism on everything, so we can co-ordinate different things back
> together in userspace from different trace tools, so I intend to keep
> that at a lower level, but I think you're right that the event id, etc can
> move up into separate layers.

I'm not so sure that a unified 'timestamp' must be required by all tracers.
If you just need to merge and sort per-cpu data, you can use an atomic
sequential number for it.
IMHO, the unified 'timestamp' would be better as an option, because some
architectures can't support it. I think providing a timestamp-callback
function would help us.

By the way, systemtap uses two modes:

- single-channel mode
 In this mode, all cpus share one buffer channel for writing and reading.
 Each writer takes a spinlock and writes its probe-local data to the buffer.

- per-cpu buffer mode
 In this mode, we use an atomic sequential number for ordering data. If
 users don't need it (because they have their own timestamps), they can
 choose not to use that seq-number.
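
The per-cpu mode's ordering scheme can be sketched roughly as below. This
is only an illustration in userspace C11, not systemtap's code: the names
trace_seq, struct seq_record, stamp_record and merge_by_seq are all
hypothetical, and real per-cpu buffers would hold variable-length data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One global counter shared by all writers (hypothetical name). */
static atomic_uint_fast64_t trace_seq;

struct seq_record {
    uint64_t seq;   /* global order of this record */
    int cpu;        /* which per-cpu buffer it was written to */
};

/* Stamp a record with the next sequence number. */
static void stamp_record(struct seq_record *rec, int cpu)
{
    rec->seq = atomic_fetch_add(&trace_seq, 1);
    rec->cpu = cpu;
}

/*
 * Merge two per-cpu arrays, each already in increasing seq order,
 * into out[], which must have room for na + nb records.
 * Returns the number of records written.
 */
static size_t merge_by_seq(const struct seq_record *a, size_t na,
                           const struct seq_record *b, size_t nb,
                           struct seq_record *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);
    while (i < na && j < nb)
        out[n++] = (a[i].seq <= b[j].seq) ? a[i++] : b[j++];
    while (i < na)
        out[n++] = a[i++];
    while (j < nb)
        out[n++] = b[j++];
    return n;
}
```

The shared counter gives a total order without any clock, at the price of
one contended atomic operation per event.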

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22 22:25     ` Masami Hiramatsu
@ 2008-09-22 23:11       ` Darren Hart
  2008-09-23  0:04         ` Masami Hiramatsu
  2008-09-22 23:16       ` Martin Bligh
  2008-09-23 14:36       ` KOSAKI Motohiro
  2 siblings, 1 reply; 122+ messages in thread
From: Darren Hart @ 2008-09-22 23:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, systemtap-ml

On Mon, Sep 22, 2008 at 3:25 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:

>> One thing that I think is still important is to have a unified timestamp
>> mechanism on everything, so we can co-ordinate different things back
>> together in userspace from different trace tools, so I intend to keep
>> that at a lower level, but I think you're right that the event id, etc can
>> move up into separate layers.
>
> I'm not so sure that the unified 'timestamp' must be required by all tracers.
> If you just need to merge and sort per-cpu data, you can use an atomic
> sequential number for it.
> IMHO, the unified 'timestamp' would better be an option, because some
> architectures can't support it. I think preparing timestamp-callback
> function will help us.
>

There have been several posts on the timestamp for the events.  From a
real-time perspective, this timestamp will be a very important datapoint for
each event, and the more accurate/higher resolution the better.  Some thoughts.

o pretty print resolution should definitely be nanosecond (IMHO)

o internal storage should be "whatever is fastest" with the transformation to
 ns data stored in the trace header (as I believe Mathieu mentioned).

o for archs where the clock isn't synchronized across CPUs, perhaps for now it
 would be adequate to record the per cpu timestamps in the trace header and
 include the cpu id for each event as well.  This is in keeping with the
 previous suggestion to collect the most primitive data available without
 doing any sort of transformation at trace time.
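
As a rough sketch of "store raw, convert later": the trace header could
carry a multiplier and shift, and the reader applies them when printing.
The struct trace_header layout, the fixed shift of 10 and the function
names are assumptions made for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-trace header: how to turn raw counts into ns. */
struct trace_header {
    uint32_t mult;    /* ns = (cycles * mult) >> shift */
    uint32_t shift;
};

/* Fill in mult/shift for a counter running at freq_khz. */
static void trace_header_init(struct trace_header *h, uint32_t freq_khz)
{
    assert(freq_khz != 0);
    h->shift = 10;
    /* 1e6 ns per ms, divided by cycles per ms, scaled by 2^shift */
    h->mult = (uint32_t)((UINT64_C(1000000) << h->shift) / freq_khz);
}

/*
 * Done at read time, not at trace time. The delta must be small
 * enough that delta * mult does not overflow 64 bits.
 */
static uint64_t cycles_to_ns(const struct trace_header *h, uint64_t delta)
{
    return (delta * h->mult) >> h->shift;
}
```

The event records then only need the raw counter value (and a cpu id);
the multiply and shift are paid once per event when the trace is read.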

--
Darren Hart

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22 23:11       ` Darren Hart
@ 2008-09-23  0:04         ` Masami Hiramatsu
  0 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23  0:04 UTC (permalink / raw)
  To: Darren Hart
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, systemtap-ml

Hi Darren,

Darren Hart wrote:
> On Mon, Sep 22, 2008 at 3:25 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:
> 
>>> One thing that I think is still important is to have a unified timestamp
>>> mechanism on everything, so we can co-ordinate different things back
>>> together in userspace from different trace tools, so I intend to keep
>>> that at a lower level, but I think you're right that the event id, etc can
>>> move up into separate layers.
>> I'm not so sure that the unified 'timestamp' must be required by all tracers.
>> If you just need to merge and sort per-cpu data, you can use an atomic
>> sequential number for it.
>> IMHO, the unified 'timestamp' would better be an option, because some
>> architectures can't support it. I think preparing timestamp-callback
>> function will help us.
>>
> 
> There have been several posts on the timestamp for the events.  From a
> real-time perspective, this timestamp will be a very important datapoint for
> each event, and the more accurate/higher resolution the better.  Some thoughts.

Sure, I know that a precise timestamp is required for real-time sensitive
tracers. However, there are some other cases. For example, when debugging,
we don't need timestamps, but just want to know the order of events. :-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22 22:25     ` Masami Hiramatsu
  2008-09-22 23:11       ` Darren Hart
@ 2008-09-22 23:16       ` Martin Bligh
  2008-09-23  0:05         ` Masami Hiramatsu
  2008-09-23 14:36       ` KOSAKI Motohiro
  2 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-22 23:16 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler,
	systemtap-ml

>> One thing that I think is still important is to have a unified timestamp
>> mechanism on everything, so we can co-ordinate different things back
>> together in userspace from different trace tools, so I intend to keep
>> that at a lower level, but I think you're right that the event id, etc can
>> move up into separate layers.
>
> I'm not so sure that the unified 'timestamp' must be required by all tracers.
> If you just need to merge and sort per-cpu data, you can use an atomic
> sequential number for it.
> IMHO, the unified 'timestamp' would better be an option, because some
> architectures can't support it. I think preparing timestamp-callback
> function will help us.

An atomic sequential number is:

(a) far less meaningful than a timestamp for the user
(b) more expensive to compute in many cases.

I think we came up with a way to approximate this, using a callback every
ms or so as the higher order bits, and a sequential counter in the lower
order for those broken platforms.
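
A minimal sketch of that approximation, assuming a 20-bit split between
counter and tick (the split, struct hybrid_clock and both function names
are made up for illustration; locking and per-cpu handling are left out):

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_BITS 20u                              /* assumed split */
#define SEQ_MASK ((UINT64_C(1) << SEQ_BITS) - 1)

struct hybrid_clock {
    uint64_t tick_ms;   /* advanced by the periodic callback */
    uint64_t seq;       /* events seen since the last tick */
};

/* Periodic callback (roughly once per ms): new tick, counter restarts. */
static void hybrid_tick(struct hybrid_clock *c)
{
    c->tick_ms++;
    c->seq = 0;
}

/* Stamp one event: tick in the high bits, counter in the low bits. */
static uint64_t hybrid_stamp(struct hybrid_clock *c)
{
    /* more events per tick than this would spill into the tick bits */
    assert(c->seq <= SEQ_MASK);
    return (c->tick_ms << SEQ_BITS) | c->seq++;
}
```

Stamps compare in event order, and the high bits still give wall time to
within about one tick.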

But perhaps it would be better if we started with a discussion of which
platforms can't do global timestamps, and why not? I know some of them
are fixable, but perhaps not all.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-22 23:16       ` Martin Bligh
@ 2008-09-23  0:05         ` Masami Hiramatsu
  2008-09-23  0:12           ` Martin Bligh
  2008-09-23  0:39           ` Linus Torvalds
  0 siblings, 2 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23  0:05 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

Hi Martin,

Martin Bligh wrote:
>>> One thing that I think is still important is to have a unified timestamp
>>> mechanism on everything, so we can co-ordinate different things back
>>> together in userspace from different trace tools, so I intend to keep
>>> that at a lower level, but I think you're right that the event id, etc can
>>> move up into separate layers.
>> I'm not so sure that the unified 'timestamp' must be required by all tracers.
>> If you just need to merge and sort per-cpu data, you can use an atomic
>> sequential number for it.
>> IMHO, the unified 'timestamp' would better be an option, because some
>> architectures can't support it. I think preparing timestamp-callback
>> function will help us.
> 
> An atomic sequential number is:
> 
> (a) far less meaningful than a timestamp for the user
> (b) more expensive to compute in many cases.

Sure, an atomic counter might be more expensive, but it is accurate for
ordering (and it can be used on almost all architectures).
The cost depends on the architecture and system configuration.

So, I think it is preferable to let users choose their timestamp rather
than fixing it. For example, we could call a callback when writing a log
entry, as follows:

void write_log(struct buffer *buffer, const char *data, int len)
{
    /* reserve logging space for the timestamp plus the body */
    char *buf = reserve(buffer, len + buffer->timestamp.len);

    /* write the timestamp; the callback returns the position after it */
    buf = buffer->timestamp.write(buf);

    /* write the body */
    memcpy(buf, data, len);
}

And the unified buffer provides default timestamp.write callbacks:

char * timestamp_write(char * buf); // write arch-specific timestamp
char * seqnum_write(char * buf); // write a sequence number
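
For illustration, here is a self-contained userspace version of the idea
with a seqnum_write callback plugged in. The struct layouts, the fixed
256-byte buffer, the next_seq counter and the int return value are
assumptions added to make the sketch compile; only the
write_log/reserve/timestamp.write shape comes from the pseudocode above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pluggable stamp writer: writes at buf, returns the byte after it. */
struct timestamp_ops {
    size_t len;                 /* bytes the stamp occupies */
    char *(*write)(char *buf);
};

struct buffer {
    char data[256];             /* assumed fixed size for the sketch */
    size_t used;
    struct timestamp_ops timestamp;
};

static uint64_t next_seq;       /* hypothetical global sequence counter */

/* Default callback: store a sequence number in host byte order. */
static char *seqnum_write(char *buf)
{
    uint64_t n = next_seq++;

    memcpy(buf, &n, sizeof(n));
    return buf + sizeof(n);
}

/* Hand out len bytes from the buffer, or NULL if it is full. */
static char *reserve(struct buffer *b, size_t len)
{
    char *p;

    if (len > sizeof(b->data) - b->used)
        return NULL;
    p = b->data + b->used;
    b->used += len;
    return p;
}

/* Returns 0 on success, -1 if the record does not fit. */
static int write_log(struct buffer *b, const char *data, size_t len)
{
    char *buf;

    assert(b != NULL && b->timestamp.write != NULL);
    buf = reserve(b, b->timestamp.len + len);
    if (!buf)
        return -1;
    buf = b->timestamp.write(buf);   /* stamp first */
    memcpy(buf, data, len);          /* then the body */
    return 0;
}
```

A tracer that wants real timestamps would only swap the callback and its
len; write_log itself does not change.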

What would you think about it?

> I think we came up with a way to approximate this, using a callback every
> ms or so as the higher order bits, and a sequential counter in the lower
> order for those broken platforms.

Sure, that will work.

> But perhaps it would be better if we started with a discussion of which
> platforms can't do global timestamps, and why not? I know some of them
> are fixable, but perhaps not all.

For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-(

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23  0:05         ` Masami Hiramatsu
@ 2008-09-23  0:12           ` Martin Bligh
  2008-09-23 14:49             ` Masami Hiramatsu
  2008-09-23  0:39           ` Linus Torvalds
  1 sibling, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23  0:12 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

>> I think we came up with a way to approximate this, using a callback every
>> ms or so as the higher order bits, and a sequential counter in the lower
>> order for those broken platforms.
>
> Sure, that will work.

OK, that'd fix 99% of it, even if we only get time within 1ms or so (but still
good ordering).

>> But perhaps it would be better if we started with a discussion of which
>> platforms can't do global timestamps, and why not? I know some of them
>> are fixable, but perhaps not all.
>
> For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-(

Can you define incorrect for me (in this case)?

We had similar problems with some AMD platforms that we can fix by syncing
the TSCs on exit_idle, etc.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23  0:12           ` Martin Bligh
@ 2008-09-23 14:49             ` Masami Hiramatsu
  2008-09-23 15:04               ` Mathieu Desnoyers
  2008-09-23 15:46               ` Linus Torvalds
  0 siblings, 2 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 14:49 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

Hi Martin,

Martin Bligh wrote:
>>> But perhaps it would be better if we started with a discussion of which
>>> platforms can't do global timestamps, and why not? I know some of them
>>> are fixable, but perhaps not all.
>> For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-(
> 
> Can you define incorrect for me (in this case)?

On my laptop, TSC is disabled at boot time.

$ dmesg | grep TSC
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to: check_tsc_sync_source failed.

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3998.45
clflush size	: 64
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3994.43
clflush size	: 64
power management:

Actually, I measured the TSC drift and reported it to the systemtap bugzilla:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19

Curiously, I tested another Core2Duo laptop whose CPU is the same model and
same stepping, but on that laptop I couldn't see any TSC drift.
So I think this might be a product-level issue and a rare case...


> We had similar problems with some AMD platforms that we can fix by syncing
> the TSCs on exit_idle, etc.

Hmm, very interesting. :-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 14:49             ` Masami Hiramatsu
@ 2008-09-23 15:04               ` Mathieu Desnoyers
  2008-09-23 15:30                 ` Masami Hiramatsu
  2008-09-23 15:46               ` Linus Torvalds
  1 sibling, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 15:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
[...]
> $ cat /proc/cpuinfo
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 15
> model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
> stepping	: 6
> cpu MHz		: 1000.000
[...]

... @ 2.00GHz and cpu MHz : 1000.000 ; isn't that a bit odd ? (same for
both cpus)

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 15:04               ` Mathieu Desnoyers
@ 2008-09-23 15:30                 ` Masami Hiramatsu
  2008-09-23 16:01                   ` Linus Torvalds
  0 siblings, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 15:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> [...]
>> $ cat /proc/cpuinfo
>> processor	: 0
>> vendor_id	: GenuineIntel
>> cpu family	: 6
>> model		: 15
>> model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
>> stepping	: 6
>> cpu MHz		: 1000.000
> [...]
> 
> ... @ 2.00GHz and cpu MHz : 1000.000 ; isn't that a bit odd ? (same for
> both cpus)

2.00GHz is the maximum(model) frequency. And 'cpu MHz' means
current frequency. (yep, now I'm using cpufreq)
Anyway, when I measured TSC drift, I killed cpuspeed service and
fixed freq to 2000. ;-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 15:30                 ` Masami Hiramatsu
@ 2008-09-23 16:01                   ` Linus Torvalds
  2008-09-23 17:04                     ` Masami Hiramatsu
  0 siblings, 1 reply; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 16:01 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
	Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> 
> 2.00GHz is the maximum(model) frequency. And 'cpu MHz' means
> current frequency. (yep, now I'm using cpufreq)
> Anyway, when I measured TSC drift, I killed cpuspeed service and
> fixed freq to 2000. ;-)

Ahh. I have an idea..

Maybe that thing does thermal throttling? 

Fixing the frequency at the highest setting is actually one of the worst 
things you can do, because if the device is thermally limited, it will 
still do the whole throttling thing, but now it won't do it by changing 
the frequency any more, it will do it by essentially forcing the external 
frequency down.

And that is going to be *very* inefficient. You really really don't want 
that. Your performance will actually be _worse_ than if the CPU went to a 
lower frequency. And it might explain the unreliable TSC too, because I 
suspect constant TSC is really constant only wrt the bus clock to the CPU.

The thermal throttling thing is a "protect the CPU from overheating" last 
ditch effort, and because it doesn't lower voltage, it isn't actually at 
all as efficient at saving power (and thus cooling the CPU) as a real 
frequency change event would be.

And fixing the frequency to the highest frequency in a tight laptop 
enclosure is the best way to force that behavior (in contrast - in a 
desktop system with sufficient cooling, it's usually not a problem at all 
to just say "run at highest frequency").

And btw, that also explains why you had so *big* changes in frequency: the 
throttling I think happens with a 1/8 duty cycle thing, iirc.

It's supposed to be very rare with Core 2. Thermal throttling was quite 
common with the P4 one, and was the main overheating protection initially. 
These days, you should only see it for really badly designed devices that 
simply don't have enough thermal cooling, but if the design calls for 
mostly running at low frequency because it's some thin-and-light notebook 
with insufficient cooling (or some thick-and-heavy thing that is just 
total crap), and you force it to always run at full speed, I can imagine 
it kicking in to protect the CPU.

It's obviously also going to be much easier to see if the ambient 
temperature is high. If you want to get best performance, take one of those 
compressed-air spray-cans, and spray on the laptop with the can held 
upside down (the can will generally tell you _not_ to do that, because 
then you'll get the liquid itself rather than gas, but that's what you 
want for cooling).

So if you can test this, try it with 
 (a) cpufreq at a fixed _low_ value (to not cause overheating)
 (b) with the spray-can cooling the thing and cpufreq at a fixed high 
     value
and see if the TSC is constant then.

			Linus

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 16:01                   ` Linus Torvalds
@ 2008-09-23 17:04                     ` Masami Hiramatsu
  2008-09-23 17:30                       ` Thomas Gleixner
  0 siblings, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 17:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
	Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

Linus Torvalds wrote:
> 
> On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>> 2.00GHz is the maximum(model) frequency. And 'cpu MHz' means
>> current frequency. (yep, now I'm using cpufreq)
>> Anyway, when I measured TSC drift, I killed cpuspeed service and
>> fixed freq to 2000. ;-)
> 
> Ahh. I have an idea..
> 
> Maybe that thing does thermal throttling? 
> 
> Fixing the frequency at the highest setting is actually one of the worst 
> things you can do, because if the device is thermally limited, it will 
> still do the whole throttling thing, but now it won't do it by changing 
> the frequency any more, it will do it by essentially forcing the external 
> frequency down.
> 
> And that is going to be *very* inefficient. You really really don't want 
> that. Your performance will actually be _worse_ than if the CPU went to a 
> lower frequency. And it might explain the unreliable TSC too, because I 
> suspect constant TSC is really constant only wrt the bus clock to the CPU.
> 
> The thermal throttling thing is a "protect the CPU from overheating" last 
> ditch effort, and because it doesn't lower voltage, it isn't actually at 
> all as efficient at saving power (and thus cooling the CPU) as a real 
> frequency change event would be.
> 
> And fixing the frequency to the highest frequency in a tight laptop 
> enclosure is the best way to force that behavior (in contrast - in a 
> desktop system with sufficient cooling, it's usually not a problem at all 
> to just say "run at highest frequency").
> 
> And btw, that also explains why you had so *big* changes in frequency: the 
> throttling I think happens with a 1/8 duty cycle thing, iirc.
> 
> It's supposed to be very rare with Core 2. Thermal throttling was quite 
> common with the P4 one, and was the main overheating protection initially. 
> These days, you should only see it for really badly designed devices that 
> simply don't have enough thermal cooling, but if the design calls for 
> mostly running at low frequency because it's some thin-and-light notebook 
> with insufficient cooling (or some thick-and-heavy thing that is just 
> total crap), and you force it to always run at full speed, I can imagine 
> it kicking in to protect the CPU.
> 
> It's obviously also going to be much easier to see if the ambient 
> temperature is high. If you want to get best performance, take one of those 
> compressed-air spray-cans, and spray on the laptop with the can held 
> upside down (the can will generally tell you _not_ to do that, because 
> then you'll get the liquid itself rather than gas, but that's what you 
> want for cooling).
> 
> So if you can test this, try it with 
>  (a) cpufreq at a fixed _low_ value (to not cause overheating)
>  (b) with the spray-can cooling the thing and cpufreq at a fixed high 
>      value
> and see if the TSC is constant then.

Hi Linus,

Thank you for your advice. I tested it again according to your advice,
I did:
- service cpuspeed stop
- echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
  and checked /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is
  1000000.
- echo 1 >  /proc/acpi/thermal_zone/THM/polling_frequency
- cooling with spray-can :)
- cat /proc/acpi/thermal_zone/THM/temperature
  temperature:             39 C

and ran the test.
---
p0: c:1107576,	 ns:990280	 ratio:111
p0: c:1805640,	 ns:1008787	 ratio:178
p0: c:1998324,	 ns:1000127	 ratio:199
p0: c:946380,	 ns:990280	 ratio:95
p0: c:871728,	 ns:1000267	 ratio:87
p0: c:1807380,	 ns:1007949	 ratio:179
p0: c:1784808,	 ns:1000127	 ratio:178
p0: c:1768488,	 ns:991676	 ratio:178
p0: c:1802292,	 ns:1008299	 ratio:178
p0: c:1787088,	 ns:1000406	 ratio:178
p0: c:1999176,	 ns:1000896	 ratio:199
p0: c:881364,	 ns:991956	 ratio:88
p0: c:1802712,	 ns:1008019	 ratio:178
p0: c:1787088,	 ns:998590	 ratio:178
---
this seems not so stable yet. :-(
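
For readers who want to check the numbers, here is a minimal sketch that
recomputes the ratio column from the c (TSC cycles) and ns columns above.
The formula (cycles * 100 / ns, truncated) is an assumption that happens to
match every reported value; the test program itself is not shown in this
thread, and the names SAMPLES and ratio_percent are made up for the example.

```python
# (cycles, nanoseconds) pairs copied from the p0 lines above.
SAMPLES = [
    (1107576, 990280),
    (1805640, 1008787),
    (1998324, 1000127),
    (946380, 990280),
    (871728, 1000267),
    (1807380, 1007949),
    (1784808, 1000127),
    (1768488, 991676),
    (1802292, 1008299),
    (1787088, 1000406),
    (1999176, 1000896),
    (881364, 991956),
    (1802712, 1008019),
    (1787088, 998590),
]


def ratio_percent(cycles, ns):
    """TSC cycles per nanosecond, scaled by 100 and truncated (assumed formula)."""
    return cycles * 100 // ns


ratios = [ratio_percent(c, ns) for c, ns in SAMPLES]
spread = max(ratios) - min(ratios)

print(ratios)
print("min", min(ratios), "max", max(ratios), "spread", spread)
```

A TSC ticking at a fixed rate would give roughly the same ratio for every
sample; here the values range from 87 to 199.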

After test I checked temperature again.
# cat /proc/acpi/thermal_zone/THM/temperature
temperature:             39 C

Hmm, 39 C is not so high. I wouldn't be surprised even if this
is an individual product bug. Anyway, currently, Linux itself
works well on this laptop with hpet.:-)

Thank you,


> 
> 			Linus

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 17:04                     ` Masami Hiramatsu
@ 2008-09-23 17:30                       ` Thomas Gleixner
  2008-09-23 18:59                         ` Masami Hiramatsu
  0 siblings, 1 reply; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 17:30 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> > So if you can test this, try it with 
> >  (a) cpufreq at a fixed _low_ value (to not cause overheating)
> >  (b) with the spray-can cooling the thing and cpufreq at a fixed high 
> >      value
> > and see if the TSC is constant then.
> 
> Hi Linus,
> 
> Thank you for your advice. I tested it again according to your advice,
> I did:
> - service cpuspeed stop
> - echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
>   and checked /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is
>   1000000.
> - echo 1 >  /proc/acpi/thermal_zone/THM/polling_frequency
> - cooling with spray-can :)
> - cat /proc/acpi/thermal_zone/THM/temperature
>   temperature:             39 C
> 
> and ran the test.
> ---
> p0: c:1107576,	 ns:990280	 ratio:111
> p0: c:1805640,	 ns:1008787	 ratio:178
> p0: c:1998324,	 ns:1000127	 ratio:199
> p0: c:946380,	 ns:990280	 ratio:95
> p0: c:871728,	 ns:1000267	 ratio:87
> p0: c:1807380,	 ns:1007949	 ratio:179
> p0: c:1784808,	 ns:1000127	 ratio:178
> p0: c:1768488,	 ns:991676	 ratio:178
> p0: c:1802292,	 ns:1008299	 ratio:178
> p0: c:1787088,	 ns:1000406	 ratio:178
> p0: c:1999176,	 ns:1000896	 ratio:199
> p0: c:881364,	 ns:991956	 ratio:88
> p0: c:1802712,	 ns:1008019	 ratio:178
> p0: c:1787088,	 ns:998590	 ratio:178
> ---
> this seems not so stable yet. :-(
> 
> After test I checked temperature again.
> # cat /proc/acpi/thermal_zone/THM/temperature
> temperature:             39 C
> 
> Hmm, 39 C is not so high. I wouldn't be surprised even if this
> is an individual product bug. Anyway, currently, Linux itself
> works well on this laptop with hpet.:-)

Do you have C-States enabled on that machine ?

ls /sys/devices/system/cpu/cpu0/cpuidle/

has it more than a state0 entry ?

If yes, please do:

cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage

where X is the highest number in there.

cat /proc/acpi/processor/CPU0/power

might be useful as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 17:30                       ` Thomas Gleixner
@ 2008-09-23 18:59                         ` Masami Hiramatsu
  2008-09-23 19:36                           ` Thomas Gleixner
  0 siblings, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 18:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

Hi Thomas,

Thomas Gleixner wrote:
> 
> Do you have C-States enabled on that machine ?
> 
> ls /sys/devices/system/cpu/cpu0/cpuidle/
> 
> has it more than a state0 entry ?

Yes, there are state0 to state3
# ls /sys/devices/system/cpu/cpu0/cpuidle/
state0  state1  state2  state3

> If yes, please do:
> 
> cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage
> 
> where X is the highest number in there.

# cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
171210

> 
> cat /proc/acpi/processor/CPU0/power
> 
> might be useful as well.

# cat /proc/acpi/processor/CPU0/power
active state:            C0
max_cstate:              C8
bus master activity:     00000000
maximum allowed latency: 2000000000 usec
states:
    C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
    C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
    C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]

Could these help you?

Thank you,

> 
> Thanks,
> 
> 	tglx

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 18:59                         ` Masami Hiramatsu
@ 2008-09-23 19:36                           ` Thomas Gleixner
  2008-09-23 19:38                             ` Martin Bligh
  2008-09-23 20:03                             ` Masami Hiramatsu
  0 siblings, 2 replies; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 19:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
> 171210

C3 stops the TSC. So depending on how many C3 entries you have on the
different cores, your TSCs will drift apart. Some BIOSes even do a
lousy job trying to fix up the TSCs on exit from C3, which makes things
even worse.
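
As a rough illustration of the drift, the sketch below models a TSC that
simply does not count while its core sits in C3. The residency figures and
the 2 GHz rate are invented for the example, and tsc_after is a hypothetical
helper, not a kernel function.

```python
def tsc_after(elapsed_ns, c3_ns, tsc_khz):
    """Cycles counted by a TSC that halts while the core is in C3.

    elapsed_ns: wall-clock time since both counters were in sync
    c3_ns:      part of that time this core spent in C3 (counter stopped)
    tsc_khz:    TSC rate in kHz while the counter is running
    """
    running_ns = elapsed_ns - c3_ns
    return running_ns * tsc_khz // 1_000_000


ELAPSED_NS = 1_000_000_000   # one second of wall-clock time (made-up)
TSC_KHZ = 2_000_000          # 2 GHz (made-up)

# Hypothetical C3 residency per core over that second.
tsc_cpu0 = tsc_after(ELAPSED_NS, 400_000_000, TSC_KHZ)
tsc_cpu1 = tsc_after(ELAPSED_NS, 100_000_000, TSC_KHZ)

skew_cycles = tsc_cpu1 - tsc_cpu0
print(tsc_cpu0, tsc_cpu1, skew_cycles)
```

In this model the skew between the cores is just the difference in C3
residency multiplied by the TSC rate, so it grows for as long as the cores
idle unevenly.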

>     C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
>     C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
>     C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
> 
> Could these help you?

Yup, explains your TSC observation. Nothing we can do about it. Broken by
system design :( Welcome to the wonderful world of Inhell/BIOS/ACPI !

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 19:36                           ` Thomas Gleixner
@ 2008-09-23 19:38                             ` Martin Bligh
  2008-09-23 19:41                               ` Thomas Gleixner
  2008-09-23 20:03                             ` Masami Hiramatsu
  1 sibling, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 19:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
>> 171210
>
> C3 stops the TSC. So depending on how many C3 entries you have on the
> different cores, your TSCs will drift apart. Some BIOSes do even a
> lousy job trying to fixup the TSCs on exit from C3, which makes things
> even worse.
>
>>     C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
>>     C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
>>     C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
>>
>> Could these help you?
>
> Yup, explains your TSC observation. Nothing we can do about. Broken by
> system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !

We have linux patches that sync the TSC on exit_idle. I'll see if I can get
Michael to send them out.

M.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 19:38                             ` Martin Bligh
@ 2008-09-23 19:41                               ` Thomas Gleixner
  2008-09-23 19:50                                 ` Martin Bligh
  0 siblings, 1 reply; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 19:41 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Martin Bligh wrote:

> On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
> >> 171210
> >
> > C3 stops the TSC. So depending on how many C3 entries you have on the
> > different cores, your TSCs will drift apart. Some BIOSes do even a
> > lousy job trying to fixup the TSCs on exit from C3, which makes things
> > even worse.
> >
> >>     C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
> >>     C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
> >>     C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
> >>
> >> Could these help you?
> >
> > Yup, explains your TSC observation. Nothing we can do about. Broken by
> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
> 
> We have linux patches that sync the TSC on exit_idle. I'll see if I can get
> Michael to send them out.

Are you sure that they sync it precisely enough that there is no user
space observable way of time going backwards between cores ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 19:41                               ` Thomas Gleixner
@ 2008-09-23 19:50                                 ` Martin Bligh
  2008-09-23 20:03                                   ` Thomas Gleixner
  0 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 19:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

>> > Yup, explains your TSC observation. Nothing we can do about. Broken by
>> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
>>
>> We have linux patches that sync the TSC on exit_idle. I'll see if I can get
>> Michael to send them out.
>
> Are you sure that they sync it precisely enough that there is no user
> space observable way of time going backwards between cores ?

I think the tolerance is about 500 cycles. If that's not sufficient, I guess
we'll have to live with some slight misordering (which people have
pointed out is kind of inevitable anyway) on these broken machines.
It was sufficient for what we were using it for, but maybe not for everyone.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 19:50                                 ` Martin Bligh
@ 2008-09-23 20:03                                   ` Thomas Gleixner
  2008-09-23 21:02                                     ` Martin Bligh
  0 siblings, 1 reply; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 20:03 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Martin Bligh wrote:

> >> > Yup, explains your TSC observation. Nothing we can do about. Broken by
> >> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
> >>
> >> We have linux patches that sync the TSC on exit_idle. I'll see if I can get
> >> Michael to send them out.
> >
> > Are you sure that they sync it precisely enough that there is no user
> > space observable way of time going backwards between cores ?
> 
> I think the tolerance is about 500 cycles. If that's not sufficient, I guess
> we'll have to either live with some slight misordering (which people have
> pointed out is kind of inevitable anyway) on these broken machines?
> It was sufficient for what we were using it for, but maybe not for everyone.

Well, I don't care about the trace reordering at all. I care about user
space visible time going backwards issues observed via the
gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can
migrate in less than that :)
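
A minimal sketch of that argument, assuming the destination core's TSC is
behind by the full sync tolerance and that a task needs some number of
cycles to migrate. The function name and the migration costs are
hypothetical; only the 500-cycle tolerance comes from this thread.

```python
def second_read_goes_backwards(skew_cycles, migration_cycles):
    """True if a task that reads the TSC, migrates, and reads again on a
    core whose TSC lags by skew_cycles would see the value decrease."""
    first_read = 1_000_000                       # arbitrary value on the source core
    second_read = first_read + migration_cycles - skew_cycles
    return second_read < first_read


TOLERANCE = 500  # cycles, the sync tolerance quoted above

# Migration taking longer than the skew hides it completely.
print(second_read_goes_backwards(TOLERANCE, 2000))
# A (hypothetical) migration faster than the skew would expose it.
print(second_read_goes_backwards(TOLERANCE, 300))
```

So the skew is only user-visible if the task can get from one core to the
other in fewer cycles than the residual offset between their TSCs.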

I guess you try this only for machines where the TSC runs with
constant frequency, right ?

Looking forward to your patches.

Thanks,

	tglx


 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 20:03                                   ` Thomas Gleixner
@ 2008-09-23 21:02                                     ` Martin Bligh
  0 siblings, 0 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 21:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

>> I think the tolerance is about 500 cycles. If that's not sufficient, I guess
>> we'll have to either live with some slight misordering (which people have
>> pointed out is kind of inevitable anyway) on these broken machines?
>> It was sufficient for what we were using it for, but maybe not for everyone.
>
> Well, I don't care about the trace reordering at all. I care about user
> space visible time going backwards issues observed via the
> gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can
> migrate in less than that :)

Right, that's what we were interested in.

> I guess you try this only for machines where the TSC runs with
> constant frequency, right ?

We don't do DVFS on the 'broken' TSC machines, just halt.
But yes, it's selective

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 19:36                           ` Thomas Gleixner
  2008-09-23 19:38                             ` Martin Bligh
@ 2008-09-23 20:03                             ` Masami Hiramatsu
  2008-09-23 20:08                               ` Thomas Gleixner
  1 sibling, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 20:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

Thomas Gleixner wrote:
> On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
>> 171210
> 
> C3 stops the TSC. So depending on how many C3 entries you have on the
> different cores, your TSCs will drift apart. Some BIOSes do even a
> lousy job trying to fixup the TSCs on exit from C3, which makes things
> even worse.
> 
>>     C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
>>     C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
>>     C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
>>
>> Could these help you?
> 
> Yup, explains your TSC observation. Nothing we can do about. Broken by
> system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !

Thank you for analyzing! :-)
Hmm, then could I fix that by fixing my dsdt...?

Thanks again,

> 
> Thanks,
> 
> 	tglx

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 20:03                             ` Masami Hiramatsu
@ 2008-09-23 20:08                               ` Thomas Gleixner
  0 siblings, 0 replies; 122+ messages in thread
From: Thomas Gleixner @ 2008-09-23 20:08 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
	Linux Kernel Mailing List, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:

> Thomas Gleixner wrote:
> > On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage
> >> 171210
> > 
> > C3 stops the TSC. So depending on how many C3 entries you have on the
> > different cores, your TSCs will drift apart. Some BIOSes do even a
> > lousy job trying to fixup the TSCs on exit from C3, which makes things
> > even worse.
> > 
> >>     C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
> >>     C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
> >>     C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]
> >>
> >> Could these help you?
> > 
> > Yup, explains your TSC observation. Nothing we can do about. Broken by
> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !
> 
> Thank you for analyzing! :-)
> Hmm, then could I fix that by fixing my dsdt...?

You can limit c-states so you don't go down to the C3 state, but there
is a trade off vs. power saving.

Let's wait for Martin's magic TSC patches first :)

     tglx

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23 14:49             ` Masami Hiramatsu
  2008-09-23 15:04               ` Mathieu Desnoyers
@ 2008-09-23 15:46               ` Linus Torvalds
  1 sibling, 0 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23 15:46 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
> 
> $ dmesg | grep TSC
> checking TSC synchronization [CPU#0 -> CPU#1]:
> Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock.
> Marking TSC unstable due to: check_tsc_sync_source failed.

Hmm.. Very interesting.

It smells of a non-stable TSC, but your Core2 Cpu shouldn't have that 
issue:

> 
> $ cat /proc/cpuinfo
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 15
> model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
> stepping	: 6
> cpu MHz	: 1000.000
> cache size	: 4096 KB
> flags		: ... constant_tsc ...
> 
> Actually, I had measured TSC drifting and reported to systemtap-bugzilla
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19

I'd have assumed it was just some initial offset issue, but your 
bug-report seems to say that it really does change the TSC frequency when 
the CPU frequency changes. That should _not_ happen on a core2 CPU, afaik. 
I didn't even know it could be a setup issue, but it does really smell 
like your TSC frequency changes.

Now, unstable TSC's are not uncommon per se, and most older Intel CPU's 
will do it, it's just that I thought it was fixed in Core2 (and later P4's 
for that matter).

The rule *should* be that:
 - family = 15 (netburst), model 3 or higher has constant TSC
 - family = 6 (PPro), model 14 or higher (Core, Core 2)
have constant TSCs. This is quite clearly documented: see Intel ia docs, 
vol 3B, 18.10 "Time-stamp counter".
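
Written out as code, the rule above looks roughly like this. It is only a
sketch of the two conditions as stated here; the function name is made up,
and the extended family/model encoding from CPUID is ignored.

```python
def intel_has_constant_tsc(family, model):
    """Rule of thumb from the text above for Intel CPUs."""
    if family == 15:          # NetBurst (P4)
        return model >= 3
    if family == 6:           # PPro lineage, including Core and Core 2
        return model >= 14
    return False


# The T7200 from the /proc/cpuinfo output above: family 6, model 15.
print(intel_has_constant_tsc(6, 15))
```

By this rule the T7200 (family 6, model 15) should have a constant TSC,
which is what makes the measured behaviour odd.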

Very odd. I wonder what your laptop does to screw this up. 

I also suspect that since we already _noticed_ that the TSC isn't stable, 
we should also have then cleared the "constant TSC" bit. And we apparently 
didn't.

Btw, your CPU looks quite odd in other respects too. Judging by your 
bugzilla entry, the TSC sometimes ticks at 2GHz (fine), sometimes at 1GHz 
(also fine), and sometimes at 667/500MHz judging by the ratios you show 
for TSC/timer tick.

And that last one is really odd, afaik most 2GHz Core 2 duo's will have a 
lower frequency of 1GHz. Is that some special low-power version, perhaps? 
Or maybe it isn't a speedstep-able CPU at all, and the system actually 
changes the *bus* frequency (and then the CPU frequency is some constant 
factor of that). If so, the system messes with the CPU in bad ways.

And btw, I'm almost certain that what you see isn't actually any "CPU 
drift" in the sense that I strongly suspect that the TSC's for both cores 
will change frequency together. So because the TSC isn't stable, it's not 
a good time-source, but despite that it's not necessarily a bad way to 
compare events across cores.

To actually have different CPU's TSC drift wrt each other, you almost have 
to have them in different clock domains. And that is *very* rare. It 
happens when the CPU's are on different boards, and sure it happens if the 
CPU's have non-constant TSCs with different frequencies, but neither of 
those should be very common at all. The latter is uncommon because it's 
almost unheard of to have multi-socket devices with old CPU's that also 
do frequency changes.

Older multi-core CPU's tend to do frequency changes the whole chip at a 
time, and newer multi-core CPU's should all basically have a fixed TSC so 
even when they do frequency changes independently, the TSC should still be 
off the same clock on the die.

			Linus

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23  0:05         ` Masami Hiramatsu
  2008-09-23  0:12           ` Martin Bligh
@ 2008-09-23  0:39           ` Linus Torvalds
  2008-09-23  1:26             ` Roland Dreier
                               ` (2 more replies)
  1 sibling, 3 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23  0:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
> 
> Sure, atomic counter might be more expensive but accurate for ordering.

Don't be silly.

An atomic counter is no more accurate for ordering than anything else.

Why?

Because all it tells you is the ordering of the atomic increment, not of 
the caller. The atomic increment is not related to all the other ops that 
the code that you trace actually does in any shape or form, and so the 
ordering of the trace doesn't actually imply anything for the ordering of 
the operations you are tracing!

Except for a single CPU, of course, but for that case you don't need a 
sequence number either, since the ordering is entirely determined by the 
ring buffer itself.

So the counter will be more expensive (cross-cpu cache bouncing for EVERY 
SINGLE EVENT), less useful (no real meaning for people who DO want to have 
a timestamp), and it's really no more "ordered" than anything that bases 
itself on a TSC.

The fact is, you cannot order operations based on log messages unless you 
have a lock around the whole caller - absolutely _no_ amount of locking or 
atomic accesses in the log itself will guarantee ordering of the upper 
layers.

And sure, if you have locking at a higher layer, then a sequence number is 
sufficient, but on the other hand, so is a well-synchronized TSC.

So personally, I think that the optimal solution is:

 - let each ring buffer be associated with a "gettimestamp()" function, so 
   that everybody _can_ set it to something of their own. But default to 
   something sane, namely a raw TSC thing.

 - Add synchronization events to the ring buffer often enough that you can 
   make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply 
   noticing when the upper 32 bits change, although you could possibly do 
   it with a heartbeat too.

 - Similarly, add a synchronization event when the TSC frequency changes.

 - Make the synchronization packet contain the full 64-bit TSC base, in 
   addition to TSC frequency info _and_ the timebase.

 - From those synchronization events, you should be able to get a very 
   accurate timestamp *after* the fact from the raw TSC numbers (ie do all 
   the scaling not when you gather the info, but when you present it), 
   even if you only spent 32 bits of TSC info on 99% of all events (and 
   just had an overflow log occasionally to get the rest of the info)

 - Most people will be _way_ happier with a timestamp that has enough 
   precision to also show ordering (assuming that the caller holds a 
   lock over the operation _including_ the tracing) than they would ever 
   be with a sequence number.

 - people who really want to can consider the incrementing counter a TSC, 
   but it will suck in so many ways that I bet it will not be very popular 
   at all. But having the option to set a special timestamp function will
   give people the option (on a per-buffer level) to make the "TSC" be a 
   simple incrementing 32-bit counter using xaddl and the upper bits 
   incrementing from a timer, but keep that as a "ok, the TSC is really 
   broken, or this architecture doesn't support any fast cycle counters at 
   all, or I really don't care about time, just sequence, and I guarantee 
   I have a single lock in all callers that makes things unambiguous"
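
To make the "scale after the fact" idea concrete, here is a small user-space
sketch of a decoder for such a buffer. The record layout is an assumption
made up for the example: ("sync", tsc64, khz, wall_ns) carries the full TSC
base, the TSC frequency and the timebase, and ("event", tsc32, payload)
carries only the low 32 bits of the raw TSC. It also assumes a sync record
is logged at least once every 2^32 cycles.

```python
MASK32 = (1 << 32) - 1


def decode(records):
    """Turn raw 32-bit event timestamps into nanoseconds using the most
    recent sync record. All scaling happens here, at presentation time."""
    out = []
    base = None  # (tsc64, khz, wall_ns) from the last sync record
    for rec in records:
        if rec[0] == "sync":
            _, tsc64, khz, wall_ns = rec
            base = (tsc64, khz, wall_ns)
            continue
        _, tsc32, payload = rec
        if base is None:
            raise ValueError("event seen before any sync record")
        tsc64, khz, wall_ns = base
        # Cycles since the sync point; the modular subtraction copes with
        # the low 32 bits wrapping once between sync and event.
        delta_cycles = (tsc32 - (tsc64 & MASK32)) & MASK32
        # khz is cycles per millisecond, so ns = cycles * 1e6 / khz.
        ns = wall_ns + delta_cycles * 1_000_000 // khz
        out.append((ns, payload))
    return out


trace = [
    ("sync", 0x1_FFFF_FF00, 2_000_000, 1000),   # 2 GHz, timebase 1000 ns
    ("event", 0x0000_0100, "a"),                # low bits wrapped since sync
    ("sync", 0x2_0000_0100, 1_000_000, 1256),   # TSC frequency changed to 1 GHz
    ("event", 0x0000_0164, "b"),
]
decoded = decode(trace)
print(decoded)
```

The hot path only stores 32 bits per event; the occasional sync record is
what lets the reader rebuild full timestamps, including across a frequency
change.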

Note the "single lock" part. It's not enough that you make any trace thing 
under a lock. They must be under the _same_ lock for all relevant events 
for you to be able to say anything about ordering. And that's actually 
pretty rare for any complex behavior.

The timestamping, btw, is likely the most important part of the whole 
logging thing. So we need to get it right. But by "right" I mean really 
really low-latency so that it's acceptable to everybody, real-time enough 
that you can tell how far apart events were, and precise enough that you 
really _can_ see ordering.

The "raw TSC value with correction information" should be able to give you 
all of that. At least on x86. On some platforms, the TSC may not give you 
enough resolution to get reasonable guesses on event ordering.

			Linus

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23  0:39           ` Linus Torvalds
@ 2008-09-23  1:26             ` Roland Dreier
  2008-09-23  1:39               ` Steven Rostedt
                                 ` (2 more replies)
  2008-09-23  2:30             ` Mathieu Desnoyers
  2008-09-23  3:06             ` Masami Hiramatsu
  2 siblings, 3 replies; 122+ messages in thread
From: Roland Dreier @ 2008-09-23  1:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

 > Because all it tells you is the ordering of the atomic increment, not of 
 > the caller. The atomic increment is not related to all the other ops that 
 > the code that you trace actually does in any shape or form, and so the 
 > ordering of the trace doesn't actually imply anything for the ordering of 
 > the operations you are tracing!

This reminds me of a naive question that occurred to me while we were
discussing this at KS.  Namely, what does "ordering" mean for events?

An example I'm all too familiar with is the lack of ordering of MMIO on
big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
spinlock and doing writel() inside the spinlock and then dropping the
spinlock (which should be enough to "order" things) might see the
writel() reach the final device "out of order" because the write has to
travel through a routed system fabric.

Just like Einstein said, it really seems to me that the order of things
depends on your frame of reference.

 - R.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Unified tracing buffer
  2008-09-23  1:26             ` Roland Dreier
@ 2008-09-23  1:39               ` Steven Rostedt
  2008-09-23  2:02               ` Mathieu Desnoyers
  2008-09-23  3:26               ` Linus Torvalds
  2 siblings, 0 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-23  1:39 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Roland Dreier wrote:

>  > Because all it tells you is the ordering of the atomic increment, not of 
>  > the caller. The atomic increment is not related to all the other ops that 
>  > the code that you trace actually does in any shape or form, and so the 
>  > ordering of the trace doesn't actually imply anything for the ordering of 
>  > the operations you are tracing!
> 
> This reminds me of a naive question that occurred to me while we were
> discussing this at KS.  Namely, what does "ordering" mean for events?
> 
> An example I'm all too familiar with is the lack of ordering of MMIO on
> big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
> spinlock and doing writel() inside the spinlock and then dropping the
> spinlock (which should be enough to "order" things) might see the
> writel() reach the final device "out of order" because the write has to
> travel through a routed system fabric.
> 
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.

In my logdev tracer (see http://rostedt.homelinux.com/logdev) I used an 
atomic counter to keep "order". But what I tell people about what this 
order means is that it is an order among multiple traces across multiple CPUs.
That is, if you have:

   CPU 1                                CPU 2
trace_point_a                        trace_point_c
trace_point_b                        trace_point_d

If you see in the trace:

trace_point_a
trace_point_c

You really do not know which happened first. Simply because trace_point_c 
could have been hit first, but for interrupts and nmis and what not, 
trace_point_a could have easily been recorded first. But to me, 
trace_points are more like memory barriers.

If I see:

trace_point_c
trace_point_a
trace_point_b
trace_point_d

I can assume that everything before trace_point_c happened before 
everything after trace_point_a, and that all before trace_point_b happened 
before trace_point_d.

One cannot assume that the trace points themselves are in order. But you 
can assume that the things outside the trace points are, as with memory 
barriers. I have found lots of race conditions with my logdev, and it was 
this "memory barrier" likeness that let me see the races.

Unfortunately, if you are using an out of sync TSC, you lose even the 
memory barrier characteristic of the trace.

-- Steve


* Re: Unified tracing buffer
  2008-09-23  1:26             ` Roland Dreier
  2008-09-23  1:39               ` Steven Rostedt
@ 2008-09-23  2:02               ` Mathieu Desnoyers
  2008-09-23  2:26                 ` Darren Hart
  2008-09-23  3:26               ` Linus Torvalds
  2 siblings, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  2:02 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	darren, Frank Ch. Eigler, systemtap-ml

* Roland Dreier (rdreier@cisco.com) wrote:
>  > Because all it tells you is the ordering of the atomic increment, not of 
>  > the caller. The atomic increment is not related to all the other ops that 
>  > the code that you trace actually does in any shape or form, and so the 
>  > ordering of the trace doesn't actually imply anything for the ordering of 
>  > the operations you are tracing!
> 
> This reminds me of a naive question that occurred to me while we were
> discussing this at KS.  Namely, what does "ordering" mean for events?
> 
> An example I'm all too familiar with is the lack of ordering of MMIO on
> big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
> spinlock and doing writel() inside the spinlock and then dropping the
> spinlock (which should be enough to "order" things) might see the
> writel() reach the final device "out of order" because the write has to
> travel through a routed system fabric.
> 
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.
> 
>  - R.
> 

Exactly as Linus said, event ordering comes down to this: a choice
between heavy locking around the real operation traced and the tracing
statement itself (irq disable/spinlock) or the acknowledgement that the
ordering is only ensured across the actual tracing _instrumentation_.

A worst-case scenario would be to get an interrupt between the "real"
operation (e.g. a memory or mmio write) and the tracing statement, and
then be scheduled out, which would let a lot of stuff happen between the
actual impact of the operation on kernel memory and the tracing statement
itself.

If we want to be _sure_ such a thing never happens, we would then have to
pay the price of heavy locking and that would not be pretty, especially
when complex data structure modifications come into play. I don't really
think anyone with a half-sane mind would want to slow down such
critical kernel operations for the benefit of totally ordered tracing.

However, in many cases where ordering matters, e.g. to instrument
spinlocks themselves, if we put the instrumentation within the critical
section rather than outside of it, then we benefit from the existing
kernel locking (but only for events related to this specific spinlock).
This is the same for many synchronization primitives, except for atomic
operations, where we have to accept that the order will be imperfect.

So only in the specific case of instrumentation of things like locking,
where it is possible to ensure that instrumentation is synchronized with
the instrumented operation, does it make a difference to choose the TSC
(which implies a slight delta between the TSCs due to cache line delays
at synchronization and delay due to TSC drift caused by temperature)
over an atomic increment.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23  2:02               ` Mathieu Desnoyers
@ 2008-09-23  2:26                 ` Darren Hart
  2008-09-23  2:31                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 122+ messages in thread
From: Darren Hart @ 2008-09-23  2:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	Frank Ch. Eigler, systemtap-ml

On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:
> So only in the specific case of instrumentation of things like locking,
> where it is possible to ensure that instrumentation is synchronized with
> the instrumented operation, does it make a difference to choose the TSC
> (which implies a slight delta between the TSCs due to cache line delays
> at synchronization and delay due to TSC drift caused by temperature)
> over an atomic increment.
>

Hrm, I think that overlooks the other reason to use a time based counter over
an atomic increment: you might care about time.  Perhaps one might be less
concerned with the actual order of tightly grouped events and more concerned
with the actual time delta between more temporally distant events.  In that
case, using a clocksource would still be valuable. Admittedly the caller
could embed that in their payload, but since we seem to agree we need some
kind of counter, the time-based counter appears to be the most flexible.
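
As a rough illustration of that point (a sketch only: struct trace_event
and trace_delta_ns() are hypothetical names, not part of any proposed API),
a time-based stamp lets a reader compute the elapsed time between two
distant events directly, which a bare sequence number cannot provide:

```c
#include <stdint.h>

/* Hypothetical event record: a timestamp plus a 16-bit event id. */
struct trace_event {
	uint64_t ts_ns;     /* time of the event, in nanoseconds */
	uint16_t event_id;  /* which event was hit */
};

/*
 * Elapsed nanoseconds from 'earlier' to 'later'.
 * Returns 0 if 'later' does not actually carry a later timestamp,
 * rather than wrapping around on the unsigned subtraction.
 */
static uint64_t trace_delta_ns(const struct trace_event *earlier,
			       const struct trace_event *later)
{
	if (later->ts_ns < earlier->ts_ns)
		return 0;
	return later->ts_ns - earlier->ts_ns;
}
```

With a sequence counter stored in place of ts_ns, the same subtraction
would only count how many events were logged in between.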

Thanks,

-- 
Darren Hart


* Re: Unified tracing buffer
  2008-09-23  2:26                 ` Darren Hart
@ 2008-09-23  2:31                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  2:31 UTC (permalink / raw)
  To: Darren Hart
  Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	Frank Ch. Eigler, systemtap-ml

* Darren Hart (darren@dvhart.com) wrote:
> On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org> wrote:
> > So only in the specific case of instrumentation of things like locking,
> > where it is possible to ensure that instrumentation is synchronized with
> > the instrumented operation, does it make a difference to choose the TSC
> > (which implies a slight delta between the TSCs due to cache line delays
> > at synchronization and delay due to TSC drift caused by temperature)
> > over an atomic increment.
> >
> 
> Hrm, I think that overlooks the other reason to use a time based counter over
> an atomic increment: you might care about time.  Perhaps one might be less
> concerned with the actual order of tightly grouped events and more concerned
> with the actual time delta between more temporally distant events.  In that
> case, using a clocksource would still be valuable. Admittedly the caller
> could embed that in their payload, but since we seem to agree we need some
> kind of counter, the time-based counter appears to be the most flexible.
> 
> Thanks,
> 

See my answer to Linus for a proposal on how to do both :)

Mathieu

> -- 
> Darren Hart
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23  1:26             ` Roland Dreier
  2008-09-23  1:39               ` Steven Rostedt
  2008-09-23  2:02               ` Mathieu Desnoyers
@ 2008-09-23  3:26               ` Linus Torvalds
  2008-09-23  3:36                 ` Mathieu Desnoyers
  2008-09-23  3:43                 ` Steven Rostedt
  2 siblings, 2 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23  3:26 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren,
	Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Roland Dreier wrote:
>
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.

Heh. Yes. In general, there is no single ordering unless you actually use 
a serializing lock on all CPU's involved. 

And exactly as in the theory of relativity, two people on different CPU's 
can actually validly _disagree_ about the ordering of the same event. 
There are things that act as "light-cones" and are borders for what 
everybody can agree on, but basically, in the absence of explicit locks, 
it is very possible that no such thing as "ordering" may even exist.

Now, an atomic increment on a single counter obviously does imply *one* 
certain ordering, but it really only defines the ordering of that counter 
itself. It does not at all necessarily imply any ordering on the events 
that go on around the counter in other unrelated cachelines.

Which is exactly why even a global counter in no way orders "events" in 
general, unless those events have something else that does so.

		Linus


* Re: Unified tracing buffer
  2008-09-23  3:26               ` Linus Torvalds
@ 2008-09-23  3:36                 ` Mathieu Desnoyers
  2008-09-23  4:05                   ` Linus Torvalds
  2008-09-23  3:43                 ` Steven Rostedt
  1 sibling, 1 reply; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  3:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	darren, Frank Ch. Eigler, systemtap-ml

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 22 Sep 2008, Roland Dreier wrote:
> >
> > Just like Einstein said, it really seems to me that the order of things
> > depends on your frame of reference.
> 
> Heh. Yes. In general, there is no single ordering unless you actually use 
> a serializing lock on all CPU's involved. 
> 
> And exactly as in the theory of relativity, two people on different CPU's 
> can actually validly _disagree_ about the ordering of the same event. 
> There are things that act as "light-cones" and are borders for what 
> everybody can agree on, but basically, in the absence of explicit locks, 
> it is very possible that no such thing as "ordering" may even exist.
> 
> Now, an atomic increment on a single counter obviously does imply *one* 
> certain ordering, but it really only defines the ordering of that counter 
> itself. It does not at all necessarily imply any ordering on the events 
> that go on around the counter in other unrelated cachelines.
> 
> Which is exactly why even a global counter in no way orders "events" in 
> general, unless those events have something else that does so.
> 
> 		Linus
> 

Unless I am missing something, in the case we use an atomic operation
which implies memory barriers (cmpxchg and atomic_add_return do), one
can be sure that all memory operations done before the barrier are
completed at the barrier and that all memory ops following the barrier
will happen after.
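
For reference, a user-space analogue of such a counter can be sketched
with C11 atomics. This is a stand-in under stated assumptions, not the
kernel primitive: trace_next_seq() is a made-up name, and
atomic_fetch_add() with its default sequentially consistent ordering
plays the role of atomic_add_return().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global sequence counter shared by all threads (hypothetical). */
static atomic_uint_fast64_t trace_seq;

/*
 * Return the next sequence number, starting at 1.
 * atomic_fetch_add() returns the old value, so add 1 to mimic the
 * "add and return the new value" behaviour of atomic_add_return().
 */
static uint64_t trace_next_seq(void)
{
	return (uint64_t)atomic_fetch_add(&trace_seq, 1) + 1;
}
```

Each caller gets a unique, increasing number, at the price of every
caller updating the same shared counter.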

Did you have something else in mind?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23  3:36                 ` Mathieu Desnoyers
@ 2008-09-23  4:05                   ` Linus Torvalds
  0 siblings, 0 replies; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23  4:05 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt,
	darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Mathieu Desnoyers wrote:
> 
> Unless I am missing something, in the case we use an atomic operation
> which implies memory barriers (cmpxchg and atomic_add_return do), one
> can be sure that all memory operations done before the barrier are
> completed at the barrier and that all memory ops following the barrier
> will happen after.

Sure (if you have a barrier - not all architectures will imply that for an 
increment).

But that still doesn't mean a thing.

You have two events (a) and (b), and you put trace-points on each. In your 
trace, you see (a) before (b) by comparing the numbers. But what does that 
mean? 

The actual event that you traced is not the trace-point - the trace-point 
is more like a fancy "printk". And the fact that one showed up before 
another in the trace buffer, doesn't mean that the events _around_ the 
trace happened in the same order.

You can use the barriers to make a partial ordering, and if you have a 
separate tracepoint for entry into a region and exit, you can perhaps show 
that they were totally disjoint. Or maybe they were partially overlapping, 
and you'll never know exactly how they overlapped.

Example:

	trace(..);
	do_X();

being executed on two different CPU's. In the trace, CPU#1 was before 
CPU#2. Does that mean that "do_X()" happened first on CPU#1?

No.

The only way to show that would be to put a lock around the whole trace 
_and_ operation X, ie

	spin_lock(..);
	trace(..);
	do_X();
	spin_unlock(..);

and now, if CPU#1 shows up in the trace first, then you know that do_X() 
really did happen first on CPU#1. Otherwise you basically know *nothing*, 
and the ordering of the trace events was totally and utterly meaningless.
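
A user-space sketch of the locked variant, assuming pthreads in place of
kernel spinlocks; traced_do_x() and the two counters are illustrative
stand-ins for trace(..) and do_X(). Because both steps run under one
lock, the order recorded by the trace matches the order of the operation:

```c
#include <pthread.h>

static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long trace_count; /* stands in for the trace buffer position */
static unsigned long x_count;     /* stands in for the state do_X() changes */

struct x_result {
	unsigned long trace_order; /* position of this call in the trace */
	unsigned long x_order;     /* position of this call's do_X() */
};

/* Log the trace entry and perform the operation under the same lock. */
static struct x_result traced_do_x(void)
{
	struct x_result r;

	pthread_mutex_lock(&x_lock);
	r.trace_order = ++trace_count; /* trace(..) */
	r.x_order = ++x_count;         /* do_X() */
	pthread_mutex_unlock(&x_lock);
	return r;
}
```

Without the lock, two concurrent callers could interleave the two steps,
and the trace order would say nothing about the do_X() order.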

See? Trace events themselves may be ordered, but the point of the trace 
event is never to know the ordering of the trace itself - it's to know the 
ordering of the code we're interested in tracing. The ordering of the 
trace events themselves is irrelevant and not useful.

And I'd rather see people _understand_ that, than if they think the 
ordering is somehow something they can trust.

Btw, if you _do_ have locking, then you can also know that the "do_X()" 
operations will be essentially as far apart in some theoretical notion of 
"time" (let's imagine that we do have global time, even if we don't) as 
the cost of the trace operation and do_X() itself.

So if we _do_ have locking (and thus a valid ordering that actually can 
matter), then the TSC doesn't even have to be synchronized on a cycle 
basis across CPU's - it just needs to be close enough that you can tell 
which one happened first (and with ordering, that's a valid thing to do). 

So you don't even need "perfect" synchronization, you just need something 
reasonably close, and you'll be able to see ordering from TSC counts 
without having that horrible bouncing cross-CPU thing that will impact 
performance a lot.

Quite frankly, I suspect that anybody who wants to have a global counter 
might as well almost just have a global ring-buffer. The trace events 
aren't going to be CPU-local anyway if you need to always update a shared 
cacheline - and you might as well make the shared cacheline be the ring 
buffer head with a spinlock in it.

That may not be _quite_ true, but it's probably close enough.

		Linus


* Re: Unified tracing buffer
  2008-09-23  3:26               ` Linus Torvalds
  2008-09-23  3:36                 ` Mathieu Desnoyers
@ 2008-09-23  3:43                 ` Steven Rostedt
  2008-09-23  4:10                   ` Masami Hiramatsu
  2008-09-23  4:19                   ` Linus Torvalds
  1 sibling, 2 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-23  3:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Linus Torvalds wrote:

> 
> 
> On Mon, 22 Sep 2008, Roland Dreier wrote:
> >
> > Just like Einstein said, it really seems to me that the order of things
> > depends on your frame of reference.
> 
> Heh. Yes. In general, there is no single ordering unless you actually use 
> a serializing lock on all CPU's involved. 
> 
> And exactly as in the theory of relativity, two people on different CPU's 
> can actually validly _disagree_ about the ordering of the same event. 
> There are things that act as "light-cones" and are borders for what 
> everybody can agree on, but basically, in the absence of explicit locks, 
> it is very possible that no such thing as "ordering" may even exist.
> 
> Now, an atomic increment on a single counter obviously does imply *one* 
> certain ordering, but it really only defines the ordering of that counter 
> itself. It does not at all necessarily imply any ordering on the events 
> that go on around the counter in other unrelated cachelines.
> 
> Which is exactly why even a global counter in no way orders "events" in 
> general, unless those events have something else that does so.

Hmm, I've been pretty spoiled by x86 mostly ordering things correctly, and 
the non-x86 boxes I've used have mostly been UP.

But, with that, with a global atomic counter, and the following trace:

cpu 0: trace_point_a
cpu 1: trace_point_c
cpu 0: trace_point_b
cpu 1: trace_point_d

Could the event a really come after event d, even though we already hit 
event b?

But I guess you are stating the fact that what the computer does 
internally, no one really knows. Without the help of real memory barriers, 
ordering of memory accesses is mostly determined by tarot cards.

But basically, the perceived result of assembly commands is supposed to be 
accurate at the user level.  The traces that I've used not only show the 
order (or perceived order) of events, but also the output of the corrupted 
data when the race happens. Usually with the two together, you can pretty 
much guarantee that the traced events actually did occur in the order 
presented.

But without some perceived accurate ordering, even when you do see the 
corrupted data, the events can easily be misleading, even on an arch that 
usually orders the events as seen by the user.

-- Steve


* Re: Unified tracing buffer
  2008-09-23  3:43                 ` Steven Rostedt
@ 2008-09-23  4:10                   ` Masami Hiramatsu
  2008-09-23  4:17                     ` Martin Bligh
  2008-09-23 10:53                     ` Steven Rostedt
  2008-09-23  4:19                   ` Linus Torvalds
  1 sibling, 2 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23  4:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Roland Dreier, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml

Steven Rostedt wrote:
> But, with that, with a global atomic counter, and the following trace:
> 
> cpu 0: trace_point_a
> cpu 1: trace_point_c
> cpu 0: trace_point_b
> cpu 1: trace_point_d
> 
> 
> Could the event a really come after event d, even though we already hit 
> event b?

Yes, if event c is an interrupt event :-).

  cpu 0       cpu 1
             hit event d
hit event a
log event a
             irq event c
             log event c
hit event b
log event b
             log event d

So I think if we really need to order events, we have to disable
irqs right after hitting an event.

Anyway, in most cases I think it works, but it is only as accurate as a
synchronized TSC, if the hardware supports one.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: Unified tracing buffer
  2008-09-23  4:10                   ` Masami Hiramatsu
@ 2008-09-23  4:17                     ` Martin Bligh
  2008-09-23 15:23                       ` Masami Hiramatsu
  2008-09-23 10:53                     ` Steven Rostedt
  1 sibling, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23  4:17 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Linus Torvalds, Roland Dreier,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml

> yes, if event c is an interrupt event :-).
>
>  cpu 0       cpu 1
>            hit event d
> hit event a
> log event a
>            irq event c
>            log event c
> hit event b
> log event b
>            log event d
>
> so, I think if we really need to order events, we have to stop
> irq right after hitting an event.

How could you fix that in any practical way?


* Re: Unified tracing buffer
  2008-09-23  4:17                     ` Martin Bligh
@ 2008-09-23 15:23                       ` Masami Hiramatsu
  0 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 15:23 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Steven Rostedt, Linus Torvalds, Roland Dreier,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml

Hi Martin,

Martin Bligh wrote:
>>            log event c
>> hit event b
>> log event b
>>            log event d
>>
>> so, I think if we really need to order events, we have to stop
>> irq right after hitting an event.
> 
> How could you fix that in any practical way?

In that case, I think it's easy to tell that event c is an irq-related
event by checking the source code. :-)
And practically, in most cases, I think we can infer what
happened from the event arguments and subsequent events.

Anyway, since there must be a delay between hitting an event (and,
as Linus said, the event actually happening) and logging it, I think
the buffering mechanism itself should focus only on ensuring write and
read order.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* Re: Unified tracing buffer
  2008-09-23  4:10                   ` Masami Hiramatsu
  2008-09-23  4:17                     ` Martin Bligh
@ 2008-09-23 10:53                     ` Steven Rostedt
  1 sibling, 0 replies; 122+ messages in thread
From: Steven Rostedt @ 2008-09-23 10:53 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Linus Torvalds, Roland Dreier, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml



On Tue, 23 Sep 2008, Masami Hiramatsu wrote:

> Steven Rostedt wrote:
> > But, with that, with a global atomic counter, and the following trace:
> > 
> > cpu 0: trace_point_a
> > cpu 1: trace_point_c
> > cpu 0: trace_point_b
> > cpu 1: trace_point_d
> > 
> > 
> > Could the event a really come after event d, even though we already hit
> > event b?
> 
> yes, if event c is an interrupt event :-).
> 
>  cpu 0       cpu 1
>             hit event d
> hit event a
> log event a
>             irq event c
>             log event c

Heh, this is assuming that event c is in an IRQ handler.

Since I control where event c is, I can prevent that. I'm talking about
the CPU doing something funny that would have c come after d.

But I didn't specify exactly what the events were, so I'll accept that 
explanation ;-)

-- Steve

> hit event b
> log event b
>             log event d
> 
> so, I think if we really need to order events, we have to stop
> irq right after hitting an event.
> 
> Anyway, in most cases I think it works, but it is only as accurate as a
> synchronized TSC, if the hardware supports one.


* Re: Unified tracing buffer
  2008-09-23  3:43                 ` Steven Rostedt
  2008-09-23  4:10                   ` Masami Hiramatsu
@ 2008-09-23  4:19                   ` Linus Torvalds
  2008-09-23 14:12                     ` Mathieu Desnoyers
  1 sibling, 1 reply; 122+ messages in thread
From: Linus Torvalds @ 2008-09-23  4:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers,
	darren, Frank Ch. Eigler, systemtap-ml

On Mon, 22 Sep 2008, Steven Rostedt wrote:
> 
> But, with that, with a global atomic counter, and the following trace:
> 
> cpu 0: trace_point_a
> cpu 1: trace_point_c
> cpu 0: trace_point_b
> cpu 1: trace_point_d
> 
> Could the event a really come after event d, even though we already hit 
> event b?

Each tracepoint will basically give a partial ordering (if you make it so, 
of course - and on x86 it's hard to avoid it).

And with many trace-points, you can narrow down ordering if you're lucky.

But say that you have code like

	CPU#1		CPU#2

	trace_a		trace_c
	..		..
	trace_b		trace_d

and since each CPU itself is obviously strictly ordered, you a priori know 
that a < b, and c < d. But your trace buffer can look many different ways:

 - a -> b -> c -> d
   c -> d -> a -> b

   Now you do know that what happened between c and d must all have 
   happened entirely after/before the things that happened between
   a and b, and there is no overlap.

   This is only assuming the x86 full memory barrier from a "lock xadd" of 
   course, but those are the semantics you'd get on x86. On others, the 
   ordering might not be that strong.

 - a -> c -> b -> d
   a -> c -> d -> b

   With these trace point orderings, you really don't know anything at all 
   about the order of any access that happened in between. CPU#1 might 
   have gone first. Or not. Or partially. You simply do not know.
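
The two cases above can be sketched as a small check, assuming each
tracepoint logs a global sequence number taken with a full barrier (as
on x86); struct trace_window and classify_windows() are hypothetical
names used only for illustration:

```c
#include <stdint.h>

enum window_relation {
	FIRST_BEFORE_SECOND, /* a..b entirely before c..d */
	SECOND_BEFORE_FIRST, /* c..d entirely before a..b */
	WINDOWS_OVERLAP      /* tracepoints interleave: order unknown */
};

/* The region between two tracepoints on one CPU. */
struct trace_window {
	uint64_t enter_seq; /* sequence number logged at trace_a / trace_c */
	uint64_t exit_seq;  /* sequence number logged at trace_b / trace_d */
};

/*
 * Decide what the trace can tell us about two regions.
 * Only fully disjoint windows give an ordering of the work inside them.
 */
static enum window_relation classify_windows(const struct trace_window *first,
					      const struct trace_window *second)
{
	if (first->exit_seq < second->enter_seq)
		return FIRST_BEFORE_SECOND;
	if (second->exit_seq < first->enter_seq)
		return SECOND_BEFORE_FIRST;
	return WINDOWS_OVERLAP;
}
```

For a -> c -> b -> d the check returns WINDOWS_OVERLAP, which is the
"you simply do not know" case.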

> But I guess you are stating the fact that what the computer does 
> internally, no one really knows. Without the help of real memory barriers, 
> ordering of memory accesses is mostly determined by tarot cards.

Well, x86 defines a memory order. But what I'm trying to explain is that 
memory order still doesn't actually specify what happens to the code that 
actually does tracing! The trace is only going to show the order of the 
tracepoints, not the _other_ memory accesses. So you'll have *some* 
information, but it's very partial.

And the thing is, all those other memory accesses are the ones that do all 
the real work. You'll know they happened _somewhere_ between two 
tracepoints, but not much more than that.

This is why timestamps aren't really any worse than sequence numbers in 
all practical matters. They'll get you close enough that you can consider 
them equivalent to a cache-coherent counter, just one that you don't have 
to take a cache miss for, and that increments on its own!

Quite a lot of CPU's have nice, dependable, TSC's that run at constant 
frequency. 

And quite a lot of traces care a _lot_ about real time. When you do IO 
tracing, the problem is almost never about lock ordering or anything like 
that. You want to see how long a request took. You don't care AT ALL how 
many tracepoints were in between the beginning and end, you care about how 
many microseconds there were!

			Linus


* Re: Unified tracing buffer
  2008-09-23  4:19                   ` Linus Torvalds
@ 2008-09-23 14:12                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23 14:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Roland Dreier, Masami Hiramatsu, Martin Bligh,
	Linux Kernel Mailing List, Thomas Gleixner, darren,
	Frank Ch. Eigler, systemtap-ml

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 22 Sep 2008, Steven Rostedt wrote:
> > 
> > But, with that, with a global atomic counter, and the following trace:
> > 
> > cpu 0: trace_point_a
> > cpu 1: trace_point_c
> > cpu 0: trace_point_b
> > cpu 1: trace_point_d
> > 
> > Could the event a really come after event d, even though we already hit 
> > event b?
> 
> Each tracepoint will basically give a partial ordering (if you make it so, 
> of course - and on x86 it's hard to avoid it).
> 
> And with many trace-points, you can narrow down ordering if you're lucky.
> 
> But say that you have code like
> 
> 	CPU#1		CPU#2
> 
> 	trace_a		trace_c
> 	..		..
> 	trace_b		trace_d
> 
> and since each CPU itself is obviously strictly ordered, you a priori know 
> that a < b, and c < d. But your trace buffer can look many different ways:
> 
>  - a -> b -> c -> d
>    c -> d -> a -> b
> 
>    Now you do know that what happened between c and d must all have 
>    happened entirely after/before the things that happened between
>    a and b, and there is no overlap.
> 
>    This is only assuming the x86 full memory barrier from a "lock xadd" of 
>    course, but those are the semantics you'd get on x86. On others, the 
>    ordering might not be that strong.
> 

Hrm, Documentation/atomic_ops.txt states that:

"Unlike the above routines, it is required that explicit memory
barriers are performed before and after the operation.  It must be
done such that all memory operations before and after the atomic
operation calls are strongly ordered with respect to the atomic
operation itself."

So on architectures with weaker ordering, the kernel atomic ops already
require that explicit smp_mb() calls be inserted before and after the
atomic increment. The same applies to cmpxchg.

Therefore I think it's ok, given the semantics provided by these two
atomic operations, to assume they imply a smp_mb() for any given
architecture. If not, then the architecture-specific implementation
would be broken wrt those semantics.

>  - a -> c -> b -> d
>    a -> c -> d -> b
> 
>    With these trace point orderings, you really don't know anything at all 
>    about the order of any access that happened in between. CPU#1 might 
>    have gone first. Or not. Or partially. You simply do not know.
> 

Yep. If two "real kernel" events happen to belong to the same
overlapping time window, there is not much we can know about their
order. Adding tracing statements before and after traced kernel
operations could help to make this window as small as possible, but I
doubt it's worth the performance penalty and event duplication (and
increased trace size).

Mathieu


> > But I guess you are stating the fact that what the computer does 
> > internally, no one really knows. Without the help of real memory barriers, 
> > ording of memory accesses is mostly determined by tarot cards.
> 
> Well, x86 defines a memory order. But what I'm trying to explain is that 
> memory order still doesn't actually specify what happens to the code that 
> actually does tracing! The trace is only going to show the order of the 
> tracepoints, not the _other_ memory accesses. So you'll have *some* 
> information, but it's very partial.
> 
> And the thing is, all those other memory accesses are the ones that do all 
> the real work. You'll know they happened _somewhere_ between two 
> tracepoints, but not much more than that.
> 
> This is why timestamps aren't really any worse than sequence numbers in 
> all practical matters. They'll get you close enough that you can consider 
> them equivalent to a cache-coherent counter, just one that you don't have 
> to take a cache miss for, and that increments on its own!
> 
> Quite a lot of CPU's have nice, dependable, TSC's that run at constant 
> frequency. 
> 
> And quite a lot of traces care a _lot_ about real time. When you do IO 
> tracing, the problem is almost never about lock ordering or anything like 
> that. You want to see how long a request took. You don't care AT ALL how 
> many tracepoints were in between the beginning and end, you care about how 
> many microseconds there were!
> 
> 			Linus
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23  0:39           ` Linus Torvalds
  2008-09-23  1:26             ` Roland Dreier
@ 2008-09-23  2:30             ` Mathieu Desnoyers
  2008-09-23  3:06             ` Masami Hiramatsu
  2 siblings, 0 replies; 122+ messages in thread
From: Mathieu Desnoyers @ 2008-09-23  2:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
	Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
> > 
> > Sure, atomic counter might be more expensive but accurate for ordering.
> 
> Don't be silly.
> 
> An atomic counter is no more accurate for ordering than anything else.
> 
> Why?
> 
> Because all it tells you is the ordering of the atomic increment, not of 
> the caller. The atomic increment is not related to all the other ops that 
> the code that you trace actually does in any shape or form, and so the 
> ordering of the trace doesn't actually imply anything for the ordering of 
> the operations you are tracing!
> 
> Except for a single CPU, of course, but for that case you don't need a 
> sequence number either, since the ordering is entirely determined by the 
> ring buffer itself.
> 
> So the counter will be more expensive (cross-cpu cache bouncing for EVERY 
> SINGLE EVENT), less useful (no real meaning for people who DO want to have 
> a timestamp), and it's really no more "ordered" than anything that bases 
> itself on a TSC.
> 
> The fact is, you cannot order operations based on log messages unless you 
> have a lock around the whole caller - absolutely _no_ amount of locking or 
> atomic accesses in the log itself will guarantee ordering of the upper 
> layers.
> 
> And sure, if you have locking at a higher layer, then a sequence number is 
> sufficient, but on the other hand, so is a well-synchronized TSC.
> 
> So personally, I think that the optimal solution is:
> 
>  - let each ring buffer be associated with a "gettimestamp()" function, so 
>    that everybody _can_ set it to something of their own. But default to 
>    something sane, namely a raw TSC thing.
> 
>  - Add synchronization events to the ring buffer often enough that you can 
>    make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply 
>    noticing when the upper 32 bits change, although you could possibly do 
>    it with a heartbeat too.
> 

Hi Linus,

I totally agree with all of the above,

>  - Similarly, add a synchronization event when the TSC frequency changes.
> 
>  - Make the synchronization packet contain the full 64-bit TSC base, in 
>    addition to TSC frequency info _and_ the timebase.
> 

If possible, I would recommend staying as far away as possible from
using frequency change events to support broken frequency scaling.
First, some architectures, like AMD, do not trigger an event for every
frequency change on the CPU (if my memory is right, I think the
southbridge temperature throttling does not generate any event), and we
also have to consider the delay between the frequency change and the
moment the instrumentation would be called. Each frequency change event
would therefore increase the overall TSC imprecision, and I doubt we
really want to rely on this for our time base. I welcome people to prove
me wrong or paranoid by providing results proving the correct
synchronicity of such an approach over time. :-) I do, however, explain
below how LTTng deals with such architectures, which I consider to fall
into the "broken" category.

>  - From those synchronization events, you should be able to get a very 
>    accurate timestamp *after* the fact from the raw TSC numbers (ie do all 
>    the scaling not when you gather the info, but when you present it), 
>    even if you only spent 32 bits of TSC info on 99% of all events (and
>    just had an overflow log occasionally to get the rest of the info)
> 
>  - Most people will be _way_ happier with a timestamp that has enough 
>    precision to also show ordering (assuming that the caller holds a 
>    lock over the operation _including_ the tracing) than they would ever 
>    be with a sequence number.
> 

One gettimestamp() we could think of, to satisfy both the people who
expect tracing to perform a memory barrier and the people who expect a
timestamp, would be to write timestamps taken from synchronized TSCs
into a variable with a cmpxchg, which would succeed only if the TSC
value is higher than the value present in this variable. That would
give both the memory barrier behavior and the timestamping. Given the
cache-line bouncing implied, I would recommend activating this only
when really needed, e.g. when debugging race issues.
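
A minimal user-space sketch of the cmpxchg idea above, using C11 atomics
in place of the kernel's cmpxchg(). The names last_ts and
trace_clock_publish() are hypothetical, and handing back the stored
value when the local reading is not newer is an assumption of this
sketch, not something specified above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Newest timestamp published by any CPU (shared, hence the cache-line bouncing). */
static _Atomic uint64_t last_ts;

/*
 * Publish a cycle-counter reading. The compare-and-swap is only attempted
 * while `tsc` is newer than the stored value; otherwise the stored value
 * is returned (assumption of this sketch).
 */
static uint64_t trace_clock_publish(uint64_t tsc)
{
	uint64_t seen = atomic_load(&last_ts);

	while (tsc > seen) {
		/* On failure, `seen` is refreshed with the current stored value. */
		if (atomic_compare_exchange_weak(&last_ts, &seen, tsc))
			return tsc;
	}

	/* Another CPU already published a value at least as new as ours. */
	assert(seen >= tsc);
	return seen;
}
```

Note that two callers can receive the same value in this form, so it
gives a non-decreasing clock rather than a strict sequence.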

I currently use this scheme in LTTng to deal with broken x86
architectures with non-synchronized TSCs. Basically, any CPU lagging
behind the others takes the last cycle counter value written to memory
by the previous time base read and adds 200 cycles (or however many
cycles it takes to actually read the TSC). I also plan to run a
heartbeat on every CPU someday to give an upper bound on the
imprecision of such TSCs, so that they could be used for some sort of
performance measurement.

>  - people who really want to can consider the incrementing counter a TSC, 
>    but it will suck in so many ways that I bet it will not be very popular 
>    at all. But having the option to set a special timestamp function will
>    give people the option (on a per-buffer level) to make the "TSC" be a 
>    simple incrementing 32-bit counter using xaddl and the upper bits 
>    incrementing from a timer, but keep that as a "ok, the TSC is really 
>    broken, or this architecture doesn't support any fast cycle counters at 
>    all, or I really don't care about time, just sequence, and I guarantee 
>    I have a single lock in all callers that makes things unambiguous"
> 

I currently use the scheme you propose here on architectures
lacking TSC. I hook in the timer interrupt to increment the MSBs.
I doubt anyone will ever go the way of locking every caller, but yeah,
why not.

Also note that getting only 32-bit TSCs on MIPS makes things a bit less
simple, but I did an RCU-style adaptation layer which generates a full
64-bit TSC from a 32-bit time base. It's currently in the -lttng tree.

> Note the "single lock" part. It's not enough that you make any trace thing 
> under a lock. They must be under the _same_ lock for all relevant events 
> for you to be able to say anything about ordering. And that's actually 
> pretty rare for any complex behavior.
> 
> The timestamping, btw, is likely the most important part of the whole 
> logging thing. So we need to get it right. But by "right" I mean really 
> really low-latency so that it's acceptable to everybody, real-time enough 
> that you can tell how far apart events were, and precise enough that you 
> really _can_ see ordering.
> 
> The "raw TSC value with correction information" should be able to give you 
> all of that. At least on x86. On some platforms, the TSC may not give you 
> enough resolution to get reasonable guesses on event ordering.
> 

Mathieu

> 			Linus
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Unified tracing buffer
  2008-09-23  0:39           ` Linus Torvalds
  2008-09-23  1:26             ` Roland Dreier
  2008-09-23  2:30             ` Mathieu Desnoyers
@ 2008-09-23  3:06             ` Masami Hiramatsu
  2 siblings, 0 replies; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23  3:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
	systemtap-ml

Hi Linus,

Linus Torvalds wrote:
> 
> On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
>> Sure, atomic counter might be more expensive but accurate for ordering.
> 
> Don't be silly.
> 
> An atomic counter is no more accurate for ordering than anything else.
> 
> Why?
> 
> Because all it tells you is the ordering of the atomic increment, not of 
> the caller. The atomic increment is not related to all the other ops that 
> the code that you trace actually does in any shape or form, and so the 
> ordering of the trace doesn't actually imply anything for the ordering of 
> the operations you are tracing!
> 
> Except for a single CPU, of course, but for that case you don't need a 
> sequence number either, since the ordering is entirely determined by the 
> ring buffer itself.
> 
> So the counter will be more expensive (cross-cpu cache bouncing for EVERY 
> SINGLE EVENT), less useful (no real meaning for people who DO want to have 
> a timestamp), and it's really no more "ordered" than anything that bases 
> itself on a TSC.
> 
> The fact is, you cannot order operations based on log messages unless you 
> have a lock around the whole caller - absolutely _no_ amount of locking or 
> atomic accesses in the log itself will guarantee ordering of the upper 
> layers.

Indeed.
If the TSC (or a similar time counter) can provide synchronized time, I
don't have any comment on that (AFAIK, the latest x86 and ia64 can
provide it).
# I might be a bit nervous about broken TSCs...

> And sure, if you have locking at a higher layer, then a sequence number is 
> sufficient, but on the other hand, so is a well-synchronized TSC.
> 
> So personally, I think that the optimal solution is:
> 
>  - let each ring buffer be associated with a "gettimestamp()" function, so 
>    that everybody _can_ set it to something of their own. But default to 
>    something sane, namely a raw TSC thing.

I agree, defaulting to the TSC is enough.

>  - Add synchronization events to the ring buffer often enough that you can 
>    make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply 
>    noticing when the upper 32 bits change, although you could possibly do 
>    it with a heartbeat too.
> 
>  - Similarly, add a synchronization event when the TSC frequency changes.
> 
>  - Make the synchronization packet contain the full 64-bit TSC base, in 
>    addition to TSC frequency info _and_ the timebase.
> 
>  - From those synchronization events, you should be able to get a very 
>    accurate timestamp *after* the fact from the raw TSC numbers (ie do all 
>    the scaling not when you gather the info, but when you present it), 
>    even if you only spent 32 bits of TSC info on 99% of all events (and
>    just had an overflow log occasionally to get the rest of the info)
> 
>  - Most people will be _way_ happier with a timestamp that has enough 
>    precision to also show ordering (assuming that the caller holds a 
>    lock over the operation _including_ the tracing) than they would ever 
>    be with a sequence number.
> 
>  - people who really want to can consider the incrementing counter a TSC, 
>    but it will suck in so many ways that I bet it will not be very popular 
>    at all. But having the option to set a special timestamp function will
>    give people the option (on a per-buffer level) to make the "TSC" be a 
>    simple incrementing 32-bit counter using xaddl and the upper bits 
>    incrementing from a timer, but keep that as a "ok, the TSC is really 
>    broken, or this architecture doesn't support any fast cycle counters at 
>    all, or I really don't care about time, just sequence, and I guarantee 
>    I have a single lock in all callers that makes things unambiguous"

Thank you very much for giving me a good idea!
I agree with you.

> Note the "single lock" part. It's not enough that you make any trace thing 
> under a lock. They must be under the _same_ lock for all relevant events 
> for you to be able to say anything about ordering. And that's actually 
> pretty rare for any complex behavior.
> 
> The timestamping, btw, is likely the most important part of the whole 
> logging thing. So we need to get it right. But by "right" I mean really 
> really low-latency so that it's acceptable to everybody, real-time enough 
> that you can tell how far apart events were, and precise enough that you 
> really _can_ see ordering.
> 
> The "raw TSC value with correction information" should be able to give you 
> all of that. At least on x86. On some platforms, the TSC may not give you 
> enough resolution to get reasonable guesses on event ordering.
> 
> 			Linus

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: Unified tracing buffer
  2008-09-22 22:25     ` Masami Hiramatsu
  2008-09-22 23:11       ` Darren Hart
  2008-09-22 23:16       ` Martin Bligh
@ 2008-09-23 14:36       ` KOSAKI Motohiro
  2008-09-23 15:02         ` Frank Ch. Eigler
  2008-09-23 15:21         ` Masami Hiramatsu
  2 siblings, 2 replies; 122+ messages in thread
From: KOSAKI Motohiro @ 2008-09-23 14:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml


> By the way, systemtap uses two modes;
> 
> - single-channel mode
>  In this mode, all cpus share one buffer channel to write and read.
>  each writer locks spinlock and write a probe-local data to buffer.
> 
> - per-cpu buffer mode
>  In this mode, we use an atomic sequential number for ordering data. If
>  user doesn't need it(because they have their own timestamps), they can
>  choose not to use that seq-number.

I can't imagine a merit of the single-channel mode.
Could you please explain it?

Is it because some architectures don't have a fine-grained timestamp?
If so, could you explain which architectures don't have it?





* Re: Unified tracing buffer
  2008-09-23 14:36       ` KOSAKI Motohiro
@ 2008-09-23 15:02         ` Frank Ch. Eigler
  2008-09-23 15:21         ` Masami Hiramatsu
  1 sibling, 0 replies; 122+ messages in thread
From: Frank Ch. Eigler @ 2008-09-23 15:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, od, systemtap-ml

Hi -

On Tue, Sep 23, 2008 at 11:36:26PM +0900, KOSAKI Motohiro wrote:
> > By the way, systemtap uses two modes;
> > 
> > - single-channel mode
> >  In this mode, all cpus share one buffer channel to write and read.
> >  each writer locks spinlock and write a probe-local data to buffer.
> > - per-cpu buffer mode [...]
> 
> I can't imagine a merit of the single-channel mode.
> Could you please explain it?

It could be a way of saving some memory and merging hassle for
low-throughput data.  (Remember that systemtap enables in-situ
analysis of events so that often only brief final results need be
sent out.)  If timestampwise cross-cpu merging can be
done on demand by the hypothetical future buffer widget, then little
reason remains not to use it.

- FChE


* Re: Unified tracing buffer
  2008-09-23 14:36       ` KOSAKI Motohiro
  2008-09-23 15:02         ` Frank Ch. Eigler
@ 2008-09-23 15:21         ` Masami Hiramatsu
  2008-09-23 17:59           ` KOSAKI Motohiro
  1 sibling, 1 reply; 122+ messages in thread
From: Masami Hiramatsu @ 2008-09-23 15:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, systemtap-ml

Hi Motohiro,

KOSAKI Motohiro wrote:
>> By the way, systemtap uses two modes;
>>
>> - single-channel mode
>>  In this mode, all cpus share one buffer channel to write and read.
>>  each writer locks spinlock and write a probe-local data to buffer.
>>
>> - per-cpu buffer mode
>>  In this mode, we use an atomic sequential number for ordering data. If
>>  user doesn't need it(because they have their own timestamps), they can
>>  choose not to use that seq-number.
> 
> I can't imagine a merit of the single-channel mode.
> Could you please explain it?

Actually, single-channel mode is for infrequent event tracing.
At least in the systemtap case, sometimes we just want to collect data
and watch it periodically (like 'top'), or just monitor
errors as an additional printk. In these cases, overhead is not so
important.

I think the main reason for using single-channel mode is the simplicity
of the userspace reader. We can use 'cat' or 'tail' to read the buffer
on-line. I'm not sure how much overhead an ftrace-like buffer merging
routine has, but if the kernel provides an interface which gives us a
single merged buffer image (like the ftrace buffer), we are glad to use
it. :-)

> Is it because some architectures don't have a fine-grained timestamp?
> If so, could you explain which architectures don't have it?

I heard that get_cycles always returns 0 on some CPUs (ARM, MIPS, etc.)...
(I think some of their platforms support variants of get_cycles.)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* Re: Unified tracing buffer
  2008-09-23 15:21         ` Masami Hiramatsu
@ 2008-09-23 17:59           ` KOSAKI Motohiro
  2008-09-23 18:28             ` Martin Bligh
  0 siblings, 1 reply; 122+ messages in thread
From: KOSAKI Motohiro @ 2008-09-23 17:59 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

> > I can't imagine a merit of the single-channel mode.
> > Could you please explain it?
> 
> Actually, single-channel mode is for infrequent event tracing.
> At least in the systemtap case, sometimes we just want to collect data
> and watch it periodically (like 'top'), or just monitor
> errors as an additional printk. In these cases, overhead is not so
> important.
> 
> I think the main reason for using single-channel mode is the simplicity
> of the userspace reader. We can use 'cat' or 'tail' to read the buffer
> on-line. I'm not sure how much overhead an ftrace-like buffer merging
> routine has, but if the kernel provides an interface which gives us a
> single merged buffer image (like the ftrace buffer), we are glad to use
> it. :-)

Yup, I also think it is better.

Thanks.





* Re: Unified tracing buffer
  2008-09-23 17:59           ` KOSAKI Motohiro
@ 2008-09-23 18:28             ` Martin Bligh
  0 siblings, 0 replies; 122+ messages in thread
From: Martin Bligh @ 2008-09-23 18:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Masami Hiramatsu, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler, systemtap-ml

>> I think the main reason for using single-channel mode is the simplicity
>> of the userspace reader. We can use 'cat' or 'tail' to read the buffer
>> on-line. I'm not sure how much overhead an ftrace-like buffer merging
>> routine has, but if the kernel provides an interface which gives us a
>> single merged buffer image (like the ftrace buffer), we are glad to use
>> it. :-)
>
> Yup, I also think it is better.

That was the plan, yes. Merge sort is cheap ;-)
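
To make the merge step concrete, here is a small user-space sketch that
interleaves per-cpu event streams by timestamp. It is only an
illustration: the struct names and merge_streams() are hypothetical, and
it assumes each per-cpu stream is already in timestamp order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct trace_event {
	uint64_t ts;	/* timestamp in the common timebase */
	int cpu;	/* cpu that logged the event */
};

struct cpu_stream {
	const struct trace_event *events;	/* sorted by ts */
	size_t len;
	size_t pos;	/* next event not yet consumed */
};

/*
 * Repeatedly take the oldest pending event across all streams.
 * Returns the number of events written to `out`.
 */
static size_t merge_streams(struct cpu_stream *streams, size_t nr,
			    struct trace_event *out, size_t out_cap)
{
	size_t n = 0;

	assert(streams != NULL && out != NULL);

	while (n < out_cap) {
		struct cpu_stream *best = NULL;

		for (size_t i = 0; i < nr; i++) {
			struct cpu_stream *s = &streams[i];

			if (s->pos == s->len)
				continue;	/* this cpu is drained */
			if (!best ||
			    s->events[s->pos].ts < best->events[best->pos].ts)
				best = s;
		}
		if (!best)
			break;	/* every stream is drained */
		out[n++] = best->events[best->pos++];
	}
	return n;
}
```

The linear scan per event is fine for a handful of cpus; a heap keyed on
the head timestamp would be the usual choice for larger cpu counts.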


* Re: Unified tracing buffer
  2008-09-19 21:33 Unified tracing buffer Martin Bligh
                   ` (7 preceding siblings ...)
  2008-09-22 19:45 ` Masami Hiramatsu
@ 2008-09-23  3:33 ` Andi Kleen
  2008-09-23  3:47   ` Martin Bligh
  8 siblings, 1 reply; 122+ messages in thread
From: Andi Kleen @ 2008-09-23  3:33 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

"Martin Bligh" <mbligh@google.com> writes:

> During kernel summit and Plumbers conference, Linus and others
> expressed a desire for a unified
> tracing buffer system for multiple tracing applications (eg ftrace,
> lttng, systemtap, blktrace, etc) to use.

This is what relayfs always was promised to be, but apparently
never quite became. But before adding a new one I would recommend
removing relayfs first.

-Andi


* Re: Unified tracing buffer
  2008-09-23  3:33 ` Andi Kleen
@ 2008-09-23  3:47   ` Martin Bligh
  2008-09-23  5:04     ` Andi Kleen
  0 siblings, 1 reply; 122+ messages in thread
From: Martin Bligh @ 2008-09-23  3:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler

>> During kernel summit and Plumbers conference, Linus and others
>> expressed a desire for a unified
>> tracing buffer system for multiple tracing applications (eg ftrace,
>> lttng, systemtap, blktrace, etc) to use.
>
> This is what relayfs always was promised to be, but apparently
> never quite became. But before adding a new one I would recommend
> removing relayfs first.

It's a little different, though similar. relayfs is an unstructured buffer.
This would be a sequence of events with a common timestamp format,
and hopefully other commonalities too.

I agree that the underlying buffer structure could be shared, as has
been pointed out (buried in this long thread). However, in another buried
comment, it was pointed out that relayfs would have no users once this
was done, so ...

I don't think we can remove relayfs before adding this and switching over
the users though (possibly it could all be done at the same time, but messy)


* Re: Unified tracing buffer
  2008-09-23  3:47   ` Martin Bligh
@ 2008-09-23  5:04     ` Andi Kleen
  0 siblings, 0 replies; 122+ messages in thread
From: Andi Kleen @ 2008-09-23  5:04 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Andi Kleen, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
	Frank Ch. Eigler

> I don't think we can remove relayfs before adding this and switching over
> the users though (possibly it could all be done at the same time, but messy)

There are only two in-tree users (+ systemtap), so it shouldn't be a big
hassle. And they're all tracers, so presumably they all pass time stamps.
But please don't add a multitude of new methods to do this.

BTW I've been using something like that since early 2000. It was called
ktrace and was based on an old SGI patch.

-Andi

-- 
ak@linux.intel.com

