Kernel marker has no performance impact on ia64.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Kernel marker has no performance impact on ia64.
@ 2008-06-02 22:12 Hideo AOKI
  2008-06-02 22:32 ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Hideo AOKI @ 2008-06-02 22:12 UTC (permalink / raw)
  To: mingo, mathieu.desnoyers, peterz
  Cc: Masami Hiramatsu, linux-kernel, Hideo AOKI

Hello,

I evaluated overhead of kernel marker using linux-2.6-sched-fixes
git tree, which includes several markers for LTTng, using an ia64
server.

While the immediate trace mark feature isn't implemented on ia64,
there is no major performance regression. So, I think that we 
don't have any issues to propose merging marker point patches 
into Linus's tree from the viewpoint of performance impact.

I prepared two kernels to evaluate. The first one was compiled
without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS.

I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c

I ran hackbench 5 times in each condition and calculated the
average and difference between the kernels.  

    The parameter of hackbench: every 50 from 50 to 800
    The number of CPUs of the server: 2, 4, and 8

Below is the results. As you can see, major performance
regression wasn't found in any case. Even if number of processes
increases, differences between marker-enabled kernel and marker-
disabled kernel doesn't increase. Moreover, if number of CPUs 
increases, the differences doesn't increase either.

Curiously, marker-enabled kernel is better than marker-disabled
kernel in more than half cases, although I guess it comes from
the difference of memory access pattern.


* 2 CPUs 

Number of | without      | with         | diff     | diff    |
processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
--------------------------------------------------------------
       50 |      4.811   |       4.872  |  +0.061  |  +1.27  |
      100 |      9.854   |      10.309  |  +0.454  |  +4.61  |
      150 |     15.602   |      15.040  |  -0.562  |  -3.6   |
      200 |     20.489   |      20.380  |  -0.109  |  -0.53  |
      250 |     25.798   |      25.652  |  -0.146  |  -0.56  |
      300 |     31.260   |      30.797  |  -0.463  |  -1.48  |
      350 |     36.121   |      35.770  |  -0.351  |  -0.97  |
      400 |     42.288   |      42.102  |  -0.186  |  -0.44  |
      450 |     47.778   |      47.253  |  -0.526  |  -1.1   |
      500 |     51.953   |      52.278  |  +0.325  |  +0.63  |
      550 |     58.401   |      57.700  |  -0.701  |  -1.2   | 
      600 |     63.334   |      63.222  |  -0.112  |  -0.18  |
      650 |     68.816   |      68.511  |  -0.306  |  -0.44  |
      700 |     74.667   |      74.088  |  -0.579  |  -0.78  |
      750 |     78.612   |      79.582  |  +0.970  |  +1.23  |
      800 |     85.431   |      85.263  |  -0.168  |  -0.2   |
--------------------------------------------------------------

* 4 CPUs 

Number of | without      | with         | diff     | diff    |
processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
--------------------------------------------------------------
       50 |      2.586   |       2.584  |  -0.003  |  -0.1   |
      100 |      5.254   |       5.283  |  +0.030  |  +0.56  |
      150 |      8.012   |       8.074  |  +0.061  |  +0.76  |
      200 |     11.172   |      11.000  |  -0.172  |  -1.54  |
      250 |     13.917   |      14.036  |  +0.119  |  +0.86  |
      300 |     16.905   |      16.543  |  -0.362  |  -2.14  |
      350 |     19.901   |      20.036  |  +0.135  |  +0.68  |
      400 |     22.908   |      23.094  |  +0.186  |  +0.81  |
      450 |     26.273   |      26.101  |  -0.172  |  -0.66  |
      500 |     29.554   |      29.092  |  -0.461  |  -1.56  |
      550 |     32.377   |      32.274  |  -0.103  |  -0.32  |
      600 |     35.855   |      35.322  |  -0.533  |  -1.49  |
      650 |     39.192   |      38.388  |  -0.804  |  -2.05  |
      700 |     41.744   |      41.719  |  -0.025  |  -0.06  |
      750 |     45.016   |      44.496  |  -0.520  |  -1.16  |
      800 |     48.212   |      47.603  |  -0.609  |  -1.26  |
--------------------------------------------------------------

* 8 CPUs 

Number of | without      | with         | diff     | diff    |
processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
--------------------------------------------------------------
       50 |      2.094   |       2.072  |  -0.022  |  -1.07  |
      100 |      4.162   |       4.273  |  +0.111  |  +2.66  |
      150 |      6.485   |       6.540  |  +0.055  |  +0.84  |
      200 |      8.556   |       8.478  |  -0.078  |  -0.91  |
      250 |     10.458   |      10.258  |  -0.200  |  -1.91  |
      300 |     12.425   |      12.750  |  +0.325  |  +2.62  |
      350 |     14.807   |      14.839  |  +0.032  |  +0.22  |
      400 |     16.801   |      16.959  |  +0.158  |  +0.94  |
      450 |     19.478   |      19.009  |  -0.470  |  -2.41  |
      500 |     21.296   |      21.504  |  +0.208  |  +0.98  |
      550 |     23.842   |      23.979  |  +0.137  |  +0.57  |
      600 |     26.309   |      26.111  |  -0.198  |  -0.75  |
      650 |     28.705   |      28.446  |  -0.259  |  -0.9   |
      700 |     31.233   |      31.394  |  +0.161  |  +0.52  |
      750 |     34.064   |      33.720  |  -0.344  |  -1.01  |
      800 |     36.320   |      36.114  |  -0.206  |  -0.57  |
--------------------------------------------------------------

Best regards,
Hideo


P.S. When I compiled the linux-2.6-sched-fixes tree on ia64, I
had to revert the following git commit since pteval_t is defined
on x86 only.

commit 8686f2b37e7394b51dd6593678cbfd85ecd28c65
Date:   Tue May 6 15:42:40 2008 -0700

    generic, x86, PAT: fix mprotect

-- 
Hitachi Computer Products (America) Inc.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-02 22:12 Kernel marker has no performance impact on ia64 Hideo AOKI
@ 2008-06-02 22:32 ` Peter Zijlstra
  2008-06-02 23:21   ` Mathieu Desnoyers
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-02 22:32 UTC (permalink / raw)
  To: Hideo AOKI; +Cc: mingo, mathieu.desnoyers, Masami Hiramatsu, linux-kernel

On Mon, 2008-06-02 at 18:12 -0400, Hideo AOKI wrote:
> Hello,
> 
> I evaluated overhead of kernel marker using linux-2.6-sched-fixes
> git tree, which includes several markers for LTTng, using an ia64
> server.
> 
> While the immediate trace mark feature isn't implemented on ia64,
> there is no major performance regression. So, I think that we 
> don't have any issues to propose merging marker point patches 
> into Linus's tree from the viewpoint of performance impact.

Performance is atm the least of the concerns regarding this work.

I'm still convinced markers are too ugly to live.

I also worry greatly about the fact that its too easy to expose too much
to user-space. There are no clear rules and the free form marker format
just begs for an inconsistent mess to arise.

IMHO the current free-form trace_mark() should be removed from the tree
- its great for ad-hoc debugging but its a disaster waiting to happen
for anything else. Anybody doing ad-hoc debugging can patch it in
themselves if needed.

Regular trace points can be custom made; this has the advantages that it
raises the implementation barrier and hopefully that encourages some
thought in the process. It also avoid the code from growing into
something that looks like someone had a long night of debugging.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-02 22:32 ` Peter Zijlstra
@ 2008-06-02 23:21   ` Mathieu Desnoyers
  2008-06-03  6:07     ` Takashi Nishiie
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Mathieu Desnoyers @ 2008-06-02 23:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Hideo AOKI, mingo, Masami Hiramatsu, linux-kernel

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Mon, 2008-06-02 at 18:12 -0400, Hideo AOKI wrote:
> > Hello,
> > 
> > I evaluated overhead of kernel marker using linux-2.6-sched-fixes
> > git tree, which includes several markers for LTTng, using an ia64
> > server.
> > 
> > While the immediate trace mark feature isn't implemented on ia64,
> > there is no major performance regression. So, I think that we 
> > don't have any issues to propose merging marker point patches 
> > into Linus's tree from the viewpoint of performance impact.
> 
> Performance is atm the least of the concerns regarding this work.
> 
> I'm still convinced markers are too ugly to live.
> 
> I also worry greatly about the fact that its too easy to expose too much
> to user-space. There are no clear rules and the free form marker format
> just begs for an inconsistent mess to arise.
> 
> IMHO the current free-form trace_mark() should be removed from the tree
> - its great for ad-hoc debugging but its a disaster waiting to happen
> for anything else. Anybody doing ad-hoc debugging can patch it in
> themselves if needed.
> 
> Regular trace points can be custom made; this has the advantages that it
> raises the implementation barrier and hopefully that encourages some
> thought in the process. It also avoid the code from growing into
> something that looks like someone had a long night of debugging.
> 

Maybe we could settle for an intermediate solution : I agree with you
that defining the trace points in headers, like you did for the
scheduler, makes the code much cleaner and makes things much easier to
maintain afterward. However, having the trace_mark mechanism underneath
helps a lot in plugging a generic tracer (actually, if we can settle the
marker issue, I've got a kernel tracer, LTTng, that I've been waiting
for quite a while to push to mainline that I would like to post someday).

So I would be in favor of requiring tracing statements to be described
in static inline functions, in header files, that could preferably call
trace_mark() and optionally also call other in-kernel tracers directly.

Ideally, we could re-use the immediate values infrastructure to control
activation of these trace points with minimal impact on the system.

One of my goal is to provide a mechanism that can feed both non-debug
and debug information to a generic tracing mechanism to allow
system-wide analysis of the kernel, both for production system and
kernel debugging.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Kernel marker has no performance impact on ia64.
  2008-06-02 23:21   ` Mathieu Desnoyers
@ 2008-06-03  6:07     ` Takashi Nishiie
  2008-06-04  4:58     ` Masami Hiramatsu
  2008-06-04 22:27     ` Peter Zijlstra
  2 siblings, 0 replies; 30+ messages in thread
From: Takashi Nishiie @ 2008-06-03  6:07 UTC (permalink / raw)
  To: 'Mathieu Desnoyers', 'Peter Zijlstra'
  Cc: 'Hideo AOKI', mingo, 'Masami Hiramatsu',
	linux-kernel

Hello.

> One of my goal is to provide a mechanism that can feed both non-debug
> and debug information to a generic tracing mechanism to allow
> system-wide analysis of the kernel, both for production system and
> kernel debugging.

 I agree with the target.
 I understand there is a person who is the neat paranoiac that it 
doesn't want to leave the code for debugging for a final code. If 
it is a situation in which the code for debugging without united 
interfaces lies scattered, the opinion might be correct. 
 It is an effective method only to set up the marker if the code 
for debugging without united interfaces can be appropriately set 
up with the marker. 
 I think that the mechanism that can offer both non-debugging and
debugging information to the generic pursuit mechanism to permit 
the analysis of the entire kernel system in the meaning of 
obtaining dynamic information for a long time not obtained by a 
static necropsy by kdump is indispensable in a mission-critical 
system.

Regards.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-02 23:21   ` Mathieu Desnoyers
  2008-06-03  6:07     ` Takashi Nishiie
@ 2008-06-04  4:58     ` Masami Hiramatsu
  2008-06-04 23:26       ` Mathieu Desnoyers
  2008-06-04 22:27     ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-04  4:58 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra; +Cc: Hideo AOKI, mingo, linux-kernel

Hi Peter and Mathieu,

Thank you for your comments.

Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
>> On Mon, 2008-06-02 at 18:12 -0400, Hideo AOKI wrote:
>>> Hello,
>>>
>>> I evaluated overhead of kernel marker using linux-2.6-sched-fixes
>>> git tree, which includes several markers for LTTng, using an ia64
>>> server.
>>>
>>> While the immediate trace mark feature isn't implemented on ia64,
>>> there is no major performance regression. So, I think that we 
>>> don't have any issues to propose merging marker point patches 
>>> into Linus's tree from the viewpoint of performance impact.
>> Performance is atm the least of the concerns regarding this work.
>>
>> I'm still convinced markers are too ugly to live.
>>
>> I also worry greatly about the fact that its too easy to expose too much
>> to user-space. There are no clear rules and the free form marker format
>> just begs for an inconsistent mess to arise.

Sure, I think we should review each point carefully
and should make clear rules what is acceptable or not and why.

>>
>> IMHO the current free-form trace_mark() should be removed from the tree
>> - its great for ad-hoc debugging but its a disaster waiting to happen
>> for anything else. Anybody doing ad-hoc debugging can patch it in
>> themselves if needed.
>>
>> Regular trace points can be custom made; this has the advantages that it
>> raises the implementation barrier and hopefully that encourages some
>> thought in the process. It also avoid the code from growing into
>> something that looks like someone had a long night of debugging.
>>
> 
> Maybe we could settle for an intermediate solution : I agree with you
> that defining the trace points in headers, like you did for the
> scheduler, makes the code much cleaner and makes things much easier to
> maintain afterward. However, having the trace_mark mechanism underneath
> helps a lot in plugging a generic tracer (actually, if we can settle the
> marker issue, I've got a kernel tracer, LTTng, that I've been waiting
> for quite a while to push to mainline that I would like to post someday).

That's good to me.
BTW, I'd like to know your plan, would those static inline functions be
defined in new headers or marker.h(or other existing headers)?

Regards,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-04  4:58     ` Masami Hiramatsu
@ 2008-06-04 23:26       ` Mathieu Desnoyers
  2008-06-04 23:40         ` Masami Hiramatsu
  0 siblings, 1 reply; 30+ messages in thread
From: Mathieu Desnoyers @ 2008-06-04 23:26 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: Peter Zijlstra, Hideo AOKI, mingo, linux-kernel

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Hi Peter and Mathieu,
> 
> Thank you for your comments.
> 
> Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> >> On Mon, 2008-06-02 at 18:12 -0400, Hideo AOKI wrote:
> >>> Hello,
> >>>
> >>> I evaluated overhead of kernel marker using linux-2.6-sched-fixes
> >>> git tree, which includes several markers for LTTng, using an ia64
> >>> server.
> >>>
> >>> While the immediate trace mark feature isn't implemented on ia64,
> >>> there is no major performance regression. So, I think that we 
> >>> don't have any issues to propose merging marker point patches 
> >>> into Linus's tree from the viewpoint of performance impact.
> >> Performance is atm the least of the concerns regarding this work.
> >>
> >> I'm still convinced markers are too ugly to live.
> >>
> >> I also worry greatly about the fact that its too easy to expose too much
> >> to user-space. There are no clear rules and the free form marker format
> >> just begs for an inconsistent mess to arise.
> 
> Sure, I think we should review each point carefully
> and should make clear rules what is acceptable or not and why.
> 
> >>
> >> IMHO the current free-form trace_mark() should be removed from the tree
> >> - its great for ad-hoc debugging but its a disaster waiting to happen
> >> for anything else. Anybody doing ad-hoc debugging can patch it in
> >> themselves if needed.
> >>
> >> Regular trace points can be custom made; this has the advantages that it
> >> raises the implementation barrier and hopefully that encourages some
> >> thought in the process. It also avoid the code from growing into
> >> something that looks like someone had a long night of debugging.
> >>
> > 
> > Maybe we could settle for an intermediate solution : I agree with you
> > that defining the trace points in headers, like you did for the
> > scheduler, makes the code much cleaner and makes things much easier to
> > maintain afterward. However, having the trace_mark mechanism underneath
> > helps a lot in plugging a generic tracer (actually, if we can settle the
> > marker issue, I've got a kernel tracer, LTTng, that I've been waiting
> > for quite a while to push to mainline that I would like to post someday).
> 
> That's good to me.
> BTW, I'd like to know your plan, would those static inline functions be
> defined in new headers or marker.h(or other existing headers)?
> 

Hi Masami,

What do you think of kernel/sched-trace.h for the scheduler as proposed
by Peter ? Having these headers close to the c file instrumentation they
deal with seems to scale maintenance better. Placing all these in one
big kernel header included everywhere would require to recompile the
whole kernel when the header is touched, which is, I guess, not what we
want.

Mathieu

> Regards,
> 
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-04 23:26       ` Mathieu Desnoyers
@ 2008-06-04 23:40         ` Masami Hiramatsu
  0 siblings, 0 replies; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-04 23:40 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Peter Zijlstra, Hideo AOKI, mingo, linux-kernel

Hi Mathieu,

Mathieu Desnoyers wrote:
>>> Maybe we could settle for an intermediate solution : I agree with you
>>> that defining the trace points in headers, like you did for the
>>> scheduler, makes the code much cleaner and makes things much easier to
>>> maintain afterward. However, having the trace_mark mechanism underneath
>>> helps a lot in plugging a generic tracer (actually, if we can settle the
>>> marker issue, I've got a kernel tracer, LTTng, that I've been waiting
>>> for quite a while to push to mainline that I would like to post someday).
>> That's good to me.
>> BTW, I'd like to know your plan, would those static inline functions be
>> defined in new headers or marker.h(or other existing headers)?
>>
> 
> Hi Masami,
> 
> What do you think of kernel/sched-trace.h for the scheduler as proposed
> by Peter ? Having these headers close to the c file instrumentation they
> deal with seems to scale maintenance better. Placing all these in one
> big kernel header included everywhere would require to recompile the
> whole kernel when the header is touched, which is, I guess, not what we
> want.

I agree with you, one big kernel header is hard to maintain, especially
by patches :-)

Thanks,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-02 23:21   ` Mathieu Desnoyers
  2008-06-03  6:07     ` Takashi Nishiie
  2008-06-04  4:58     ` Masami Hiramatsu
@ 2008-06-04 22:27     ` Peter Zijlstra
  2008-06-04 23:22       ` Mathieu Desnoyers
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-04 22:27 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Hideo AOKI, mingo, Masami Hiramatsu, linux-kernel

On Mon, 2008-06-02 at 19:21 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Mon, 2008-06-02 at 18:12 -0400, Hideo AOKI wrote:
> > > Hello,
> > > 
> > > I evaluated overhead of kernel marker using linux-2.6-sched-fixes
> > > git tree, which includes several markers for LTTng, using an ia64
> > > server.
> > > 
> > > While the immediate trace mark feature isn't implemented on ia64,
> > > there is no major performance regression. So, I think that we 
> > > don't have any issues to propose merging marker point patches 
> > > into Linus's tree from the viewpoint of performance impact.
> > 
> > Performance is atm the least of the concerns regarding this work.
> > 
> > I'm still convinced markers are too ugly to live.
> > 
> > I also worry greatly about the fact that its too easy to expose too much
> > to user-space. There are no clear rules and the free form marker format
> > just begs for an inconsistent mess to arise.
> > 
> > IMHO the current free-form trace_mark() should be removed from the tree
> > - its great for ad-hoc debugging but its a disaster waiting to happen
> > for anything else. Anybody doing ad-hoc debugging can patch it in
> > themselves if needed.
> > 
> > Regular trace points can be custom made; this has the advantages that it
> > raises the implementation barrier and hopefully that encourages some
> > thought in the process. It also avoid the code from growing into
> > something that looks like someone had a long night of debugging.
> > 
> 
> Maybe we could settle for an intermediate solution : I agree with you
> that defining the trace points in headers, like you did for the
> scheduler, makes the code much cleaner and makes things much easier to
> maintain afterward. However, having the trace_mark mechanism underneath
> helps a lot in plugging a generic tracer (actually, if we can settle the
> marker issue, I've got a kernel tracer, LTTng, that I've been waiting
> for quite a while to push to mainline that I would like to post someday).
> 
> So I would be in favor of requiring tracing statements to be described
> in static inline functions, in header files, that could preferably call
> trace_mark() and optionally also call other in-kernel tracers directly.
> 
> Ideally, we could re-use the immediate values infrastructure to control
> activation of these trace points with minimal impact on the system.
> 
> One of my goal is to provide a mechanism that can feed both non-debug
> and debug information to a generic tracing mechanism to allow
> system-wide analysis of the kernel, both for production system and
> kernel debugging.

So are you proposing something like:

static inline void 
trace_sched_switch(struct task_struct *prev, struct task_struct *next)
{
	trace_mark(sched_switch, prev, next);
}

dropping the silly fmt string but using the multiplex of trace_mark, and
then doing the stringify bit:

       "prev_pid %d next_pid %d prev_state %ld\n"

in the actual tracer?


IMHO the 'type safety' of the fmt string is over-rated, since it cannot
distinguish between a task_struct * or a bio *, both are a pointers -
and half arsed type safely is worse than no type safety.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-04 22:27     ` Peter Zijlstra
@ 2008-06-04 23:22       ` Mathieu Desnoyers
  2008-06-05  8:12         ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Mathieu Desnoyers @ 2008-06-04 23:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Hideo AOKI, mingo, Masami Hiramatsu, linux-kernel

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Mon, 2008-06-02 at 19:21 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Mon, 2008-06-02 at 18:12 -0400, Hideo AOKI wrote:
> > > > Hello,
> > > > 
> > > > I evaluated overhead of kernel marker using linux-2.6-sched-fixes
> > > > git tree, which includes several markers for LTTng, using an ia64
> > > > server.
> > > > 
> > > > While the immediate trace mark feature isn't implemented on ia64,
> > > > there is no major performance regression. So, I think that we 
> > > > don't have any issues to propose merging marker point patches 
> > > > into Linus's tree from the viewpoint of performance impact.
> > > 
> > > Performance is atm the least of the concerns regarding this work.
> > > 
> > > I'm still convinced markers are too ugly to live.
> > > 
> > > I also worry greatly about the fact that its too easy to expose too much
> > > to user-space. There are no clear rules and the free form marker format
> > > just begs for an inconsistent mess to arise.
> > > 
> > > IMHO the current free-form trace_mark() should be removed from the tree
> > > - its great for ad-hoc debugging but its a disaster waiting to happen
> > > for anything else. Anybody doing ad-hoc debugging can patch it in
> > > themselves if needed.
> > > 
> > > Regular trace points can be custom made; this has the advantages that it
> > > raises the implementation barrier and hopefully that encourages some
> > > thought in the process. It also avoid the code from growing into
> > > something that looks like someone had a long night of debugging.
> > > 
> > 
> > Maybe we could settle for an intermediate solution : I agree with you
> > that defining the trace points in headers, like you did for the
> > scheduler, makes the code much cleaner and makes things much easier to
> > maintain afterward. However, having the trace_mark mechanism underneath
> > helps a lot in plugging a generic tracer (actually, if we can settle the
> > marker issue, I've got a kernel tracer, LTTng, that I've been waiting
> > for quite a while to push to mainline that I would like to post someday).
> > 
> > So I would be in favor of requiring tracing statements to be described
> > in static inline functions, in header files, that could preferably call
> > trace_mark() and optionally also call other in-kernel tracers directly.
> > 
> > Ideally, we could re-use the immediate values infrastructure to control
> > activation of these trace points with minimal impact on the system.
> > 
> > One of my goal is to provide a mechanism that can feed both non-debug
> > and debug information to a generic tracing mechanism to allow
> > system-wide analysis of the kernel, both for production system and
> > kernel debugging.
> 
> So are you proposing something like:
> 
> static inline void 
> trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> {
> 	trace_mark(sched_switch, prev, next);
> }
> 

Not exactly. Something more along the lines of

static inline void 
trace_sched_switch(struct task_struct *prev, struct task_struct *next)
{
  /* Internal tracers. */
  ftrace_sched_switch(prev, next);
  othertracer_sched_switch(prev, next);
  /*
   * System-wide tracing. Useful information is exported here.
   * Probes connecting to these markers are expected to only use the
   * information provided to them for data collection purpose. Type
   * casting pointers is discouraged.
   */
	trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
    prev->pid, next->pid, prev->state);
}

> dropping the silly fmt string but using the multiplex of trace_mark, and
> then doing the stringify bit:
> 
>        "prev_pid %d next_pid %d prev_state %ld\n"
> 
> in the actual tracer?
> 

It would make much more sense to put this formatting information along
with the trace point (e.g. in a a kernel/sched-trace.h header) rather
that to hide it in a tracer (loadable module) because this information
is an interface to the trace point.

> 
> IMHO the 'type safety' of the fmt string is over-rated, since it cannot
> distinguish between a task_struct * or a bio *, both are a pointers -
> and half arsed type safely is worse than no type safety.
> 

I totally agree with you that not having the capacity to inspect pointer
types is a problem for tracers which wants to receive the "raw" pointer
and deal with the data they need like big boys. On the other hand, it
requires them to be closely tied to the kernel internals and therefore
it makes sense to call them directly from the tracing site, thus
bypassing the marker format string.

However, letting the marker specify the data format so a tracer could
format it into a memory buffer (in a binary or text format, depending on
the implementation) or so that a tool like systemtap can use this
identified information without having to be closely tied to the kernel
makes sense to me.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-04 23:22       ` Mathieu Desnoyers
@ 2008-06-05  8:12         ` Peter Zijlstra
  2008-06-05 14:28           ` Masami Hiramatsu
  2008-06-12 13:53           ` Mathieu Desnoyers
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-05  8:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hideo AOKI, mingo, Masami Hiramatsu, linux-kernel, Steven Rostedt

On Wed, 2008-06-04 at 19:22 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:

> > So are you proposing something like:
> > 
> > static inline void 
> > trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> > {
> > 	trace_mark(sched_switch, prev, next);
> > }
> > 
> 
> Not exactly. Something more along the lines of
> 
> static inline void 
> trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> {
>   /* Internal tracers. */
>   ftrace_sched_switch(prev, next);
>   othertracer_sched_switch(prev, next);
>   /*
>    * System-wide tracing. Useful information is exported here.
>    * Probes connecting to these markers are expected to only use the
>    * information provided to them for data collection purpose. Type
>    * casting pointers is discouraged.
>    */
> 	trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
>     prev->pid, next->pid, prev->state);
> }

Advantage of my method would be that ftrace (and othertracer) can use
the same marker and doesn't need yet another hoook.

> > dropping the silly fmt string but using the multiplex of trace_mark, and
> > then doing the stringify bit:
> > 
> >        "prev_pid %d next_pid %d prev_state %ld\n"
> > 
> > in the actual tracer?
> > 
> 
> It would make much more sense to put this formatting information along
> with the trace point (e.g. in a a kernel/sched-trace.h header) rather
> that to hide it in a tracer (loadable module) because this information
> is an interface to the trace point.

I'm not sure - it seems to me it should be part of the tracer because
its a detail/subset of the actual data - rendering it useless for others
who'd like a different set.

> > IMHO the 'type safety' of the fmt string is over-rated, since it cannot
> > distinguish between a task_struct * or a bio *, both are a pointers -
> > and half arsed type safely is worse than no type safety.
> > 
> 
> I totally agree with you that not having the capacity to inspect pointer
> types is a problem for tracers which wants to receive the "raw" pointer
> and deal with the data they need like big boys. On the other hand, it
> requires them to be closely tied to the kernel internals and therefore
> it makes sense to call them directly from the tracing site, thus
> bypassing the marker format string.
> 
> However, letting the marker specify the data format so a tracer could
> format it into a memory buffer (in a binary or text format, depending on
> the implementation) or so that a tool like systemtap can use this
> identified information without having to be closely tied to the kernel
> makes sense to me.

So s-tap is meant to parse this sting and interpret the varargs without
being closely tied to the kernel? - Somehow that doesn't make me feel
warm and fuzzy. That not only ties userspace to the information present
in the marker, but to the actual string as well.

The stronger you make this bind the less I like it.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-05  8:12         ` Peter Zijlstra
@ 2008-06-05 14:28           ` Masami Hiramatsu
  2008-06-12 14:04             ` Mathieu Desnoyers
  2008-06-12 13:53           ` Mathieu Desnoyers
  1 sibling, 1 reply; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-05 14:28 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers
  Cc: Hideo AOKI, mingo, linux-kernel, Steven Rostedt

Hi Peter and Mathieu,

Peter Zijlstra wrote:
> On Wed, 2008-06-04 at 19:22 -0400, Mathieu Desnoyers wrote:
>> * Peter Zijlstra (peterz@infradead.org) wrote:
> 
>>> So are you proposing something like:
>>>
>>> static inline void 
>>> trace_sched_switch(struct task_struct *prev, struct task_struct *next)
>>> {
>>> 	trace_mark(sched_switch, prev, next);
>>> }
>>>
>> Not exactly. Something more along the lines of
>>
>> static inline void 
>> trace_sched_switch(struct task_struct *prev, struct task_struct *next)
>> {
>>   /* Internal tracers. */
>>   ftrace_sched_switch(prev, next);
>>   othertracer_sched_switch(prev, next);
>>   /*
>>    * System-wide tracing. Useful information is exported here.
>>    * Probes connecting to these markers are expected to only use the
>>    * information provided to them for data collection purpose. Type
>>    * casting pointers is discouraged.
>>    */
>> 	trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
>>     prev->pid, next->pid, prev->state);
>> }
> 
> Advantage of my method would be that ftrace (and othertracer) can use
> the same marker and doesn't need yet another hoook.

If so, I'd like to suggest below changes,

- introduce below macro in marker.h

#define DEFINE_TRACE(name, vargs, args...)	\
static inline void trace_##name vargs		\
{						\
	trace_mark(name, #vargs, ##args);	\
}

- remove __marker_check_format from __trace_mark
- and write a definition in sched_trace.h

DEFINE_TRACE(sched_switch, (struct task_struct *prev, struct task_struct *next),
	     prev, next);

Thus, we can remove fmt string and also ensure the type checking, because;
- Type checking at the trace point is done by the compiler.
- Type checking of probe handler can be done by comparing #vargs strings.

Thanks,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-05 14:28           ` Masami Hiramatsu
@ 2008-06-12 14:04             ` Mathieu Desnoyers
  2008-06-12 15:31               ` Masami Hiramatsu
  0 siblings, 1 reply; 30+ messages in thread
From: Mathieu Desnoyers @ 2008-06-12 14:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Hideo AOKI, mingo, linux-kernel, Steven Rostedt

Hi Masami,

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Hi Peter and Mathieu,
> 
> Peter Zijlstra wrote:
> > On Wed, 2008-06-04 at 19:22 -0400, Mathieu Desnoyers wrote:
> >> * Peter Zijlstra (peterz@infradead.org) wrote:
> > 
> >>> So are you proposing something like:
> >>>
> >>> static inline void 
> >>> trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> >>> {
> >>> 	trace_mark(sched_switch, prev, next);
> >>> }
> >>>
> >> Not exactly. Something more along the lines of
> >>
> >> static inline void 
> >> trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> >> {
> >>   /* Internal tracers. */
> >>   ftrace_sched_switch(prev, next);
> >>   othertracer_sched_switch(prev, next);
> >>   /*
> >>    * System-wide tracing. Useful information is exported here.
> >>    * Probes connecting to these markers are expected to only use the
> >>    * information provided to them for data collection purpose. Type
> >>    * casting pointers is discouraged.
> >>    */
> >> 	trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
> >>     prev->pid, next->pid, prev->state);
> >> }
> > 
> > Advantage of my method would be that ftrace (and othertracer) can use
> > the same marker and doesn't need yet another hoook.
> 
> If so, I'd like to suggest below changes,
> 
> - introduce below macro in marker.h
> 
> #define DEFINE_TRACE(name, vargs, args...)	\
> static inline void trace_##name vargs		\
> {						\
> 	trace_mark(name, #vargs, ##args);	\
> }
> 
> - remove __marker_check_format from __trace_mark
> - and write a definition in sched_trace.h
> 
> DEFINE_TRACE(sched_switch, (struct task_struct *prev, struct task_struct *next),
> 	     prev, next);
> 
> Thus, we can remove fmt string and also ensure the type checking, because;
> - Type checking at the trace point is done by the compiler.
> - Type checking of probe handler can be done by comparing #vargs strings.
> 

Hrm, interesting! The only problem I see with this is that it won't
allow a tracer to efficiently parse the "format information". Parsing C
code is not as straightforward and compact as parsing a format string.

However, Peter and you are about to convince me that an hybrid between
the solution you propose here and the marker scheme could be used.

Your scheme could be used to declare the markers and probes
(DEFINE_TRACE()) in header files. It would declare a static inline
function called at the instrumentation site and a probe prototype
that could then be used in the probe module to declare the probe that
will have to be connected to the marker. This part would allow
custom-made probes.

Within the tracer, we would declare custom-made probes for each of these
DEFINE_TRACE statements and associate them with format strings. Because
the probe has to match the prototype, type correctness is ensured. The
format strings would at that point be the exact same as the current
trace_mark() statements. The information passed to trace_mark() would be
send for direct interpretation or serialization with only basic types
available, similar to printk().

We sould leave the trace_mark() statements available for people who want
to add their own debug-style instrumentation to their kernel without
having to add DEFINE_TRACE() statements and modify the tracer
accordingly.

I guess a bit of polishing will come with the implementation, which I
plan to start soon.

Thanks!

Mathieu

> Thanks,
> 
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 14:04             ` Mathieu Desnoyers
@ 2008-06-12 15:31               ` Masami Hiramatsu
  0 siblings, 0 replies; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-12 15:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Hideo AOKI, mingo, linux-kernel, Steven Rostedt

Hi Mathieu,

Mathieu Desnoyers wrote:
>> If so, I'd like to suggest below changes,
>>
>> - introduce below macro in marker.h
>>
>> #define DEFINE_TRACE(name, vargs, args...)	\
>> static inline void trace_##name vargs		\
>> {						\
>> 	trace_mark(name, #vargs, ##args);	\
>> }
>>
>> - remove __marker_check_format from __trace_mark
>> - and write a definition in sched_trace.h
>>
>> DEFINE_TRACE(sched_switch, (struct task_struct *prev, struct task_struct *next),
>> 	     prev, next);
>>
>> Thus, we can remove fmt string and also ensure the type checking, because;
>> - Type checking at the trace point is done by the compiler.
>> - Type checking of probe handler can be done by comparing #vargs strings.
>>
> 
> Hrm, interesting! The only problem I see with this is that it won't
> allow a tracer to efficiently parse the "format information". Parsing C
> code is not as straightforward and compact as parsing a format string.

Sure, Parsing C code is not a good idea. I think each tracer can have
a lookup table to get a printf-style format corresponding to each
"regular" marking point. Maintaining this lookup table is not so hard,
because these "regular" marking points should be enough stable.

> However, Peter and you are about to convince me that an hybrid between
> the solution you propose here and the marker scheme could be used.
> 
> Your scheme could be used to declare the markers and probes
> (DEFINE_TRACE()) in header files. It would declare a static inline
> function called at the instrumentation site and a probe prototype
> that could then be used in the probe module to declare the probe that
> will have to be connected to the marker. This part would allow
> custom-made probes.
> 
> Within the tracer, we would declare custom-made probes for each of these
> DEFINE_TRACE statements and associate them with format strings. Because
> the probe has to match the prototype, type correctness is ensured. The
> format strings would at that point be the exact same as the current
> trace_mark() statements. The information passed to trace_mark() would be
> send for direct interpretation or serialization with only basic types
> available, similar to printk().

If the tracer including systemtap introduces above the lookup table,
that can translate "name(arguments)" to "format" easily, and can continue
to use its format string parser.

> We sould leave the trace_mark() statements available for people who want
> to add their own debug-style instrumentation to their kernel without
> having to add DEFINE_TRACE() statements and modify the tracer
> accordingly.

I agree, trace_mark() still useful for "non-regular" markers temporarily
inserted to the developing code by individual developers.

> I guess a bit of polishing will come with the implementation, which I
> plan to start soon.
> 
> Thanks!
> 
> Mathieu
> 

Thank you!

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-05  8:12         ` Peter Zijlstra
  2008-06-05 14:28           ` Masami Hiramatsu
@ 2008-06-12 13:53           ` Mathieu Desnoyers
  2008-06-12 14:27             ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Mathieu Desnoyers @ 2008-06-12 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hideo AOKI, mingo, Masami Hiramatsu, linux-kernel, Steven Rostedt,
	Frank Ch. Eigler

Hi Peter,

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2008-06-04 at 19:22 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> 
> > > So are you proposing something like:
> > > 
> > > static inline void 
> > > trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> > > {
> > > 	trace_mark(sched_switch, prev, next);
> > > }
> > > 
> > 
> > Not exactly. Something more along the lines of
> > 
> > static inline void 
> > trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> > {
> >   /* Internal tracers. */
> >   ftrace_sched_switch(prev, next);
> >   othertracer_sched_switch(prev, next);
> >   /*
> >    * System-wide tracing. Useful information is exported here.
> >    * Probes connecting to these markers are expected to only use the
> >    * information provided to them for data collection purpose. Type
> >    * casting pointers is discouraged.
> >    */
> > 	trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
> >     prev->pid, next->pid, prev->state);
> > }
> 
> Advantage of my method would be that ftrace (and othertracer) can use
> the same marker and doesn't need yet another hoook.
> 

Am I correct by saying that the method you propose completely removes
type checking between the instrumentation site and what the probes
expect ? If yes, this seems to be too fragile. Every time a marker would
change, one would have to audit _every_ probes, both in-kernel and in
modules. Adding type checking to the marker infrastructure makes
automatic detection of these changes possible.

> > > dropping the silly fmt string but using the multiplex of trace_mark, and
> > > then doing the stringify bit:
> > > 
> > >        "prev_pid %d next_pid %d prev_state %ld\n"
> > > 
> > > in the actual tracer?
> > > 
> > 
> > It would make much more sense to put this formatting information along
> > with the trace point (e.g. in a a kernel/sched-trace.h header) rather
> > that to hide it in a tracer (loadable module) because this information
> > is an interface to the trace point.
> 
> I'm not sure - it seems to me it should be part of the tracer because
> its a detail/subset of the actual data - rendering it useless for others
> who'd like a different set.
> 

If it ends up elsewhere, then we have to ensure type correctness in some
way.

> > > IMHO the 'type safety' of the fmt string is over-rated, since it cannot
> > > distinguish between a task_struct * or a bio *, both are a pointers -
> > > and half arsed type safely is worse than no type safety.
> > > 
> > 
> > I totally agree with you that not having the capacity to inspect pointer
> > types is a problem for tracers which wants to receive the "raw" pointer
> > and deal with the data they need like big boys. On the other hand, it
> > requires them to be closely tied to the kernel internals and therefore
> > it makes sense to call them directly from the tracing site, thus
> > bypassing the marker format string.
> > 
> > However, letting the marker specify the data format so a tracer could
> > format it into a memory buffer (in a binary or text format, depending on
> > the implementation) or so that a tool like systemtap can use this
> > identified information without having to be closely tied to the kernel
> > makes sense to me.
> 
> So s-tap is meant to parse this sting and interpret the varargs without
> being closely tied to the kernel? - Somehow that doesn't make me feel
> warm and fuzzy. That not only ties userspace to the information present
> in the marker, but to the actual string as well.
> 
> The stronger you make this bind the less I like it.
> 

Well, the string contains each field name and type. Therefore, SystemTAP
can hook on a marker and parse the string looking for some elements by
passing a NULL format string upon probe registration. Alternatively, it
can provide the exact format string expected when it registers its probe
to the marker and a check will be done to verify that the format string
passed along with the registered probe matches the marker format string.

Also, about what you said earlier in this thread :
"Regular trace points can be custom made; this has the advantages that
it raises the implementation barrier and hopefully that encourages some
thought in the process. It also avoid the code from growing into
something that looks like someone had a long night of debugging."

Before it has been moved to the markers, LTTng was once designed with
custom-made code to save the trace information through custom hooks. To
help maintainers instrument their own subsystem and do the right choice
without being a tracing expert, we created a code generator which
generated this custom code for each trace point given a description of
the trace points. It turned out that keeping this duplicate list of
trace points was cumbersome and that the generated code did eat a lot of
instruction cache. This is why to turned to markers, so we could re-use
a common infrastructure to serialize the data into trace buffers. We
turned to the marker format string to allow the types to serialize to be
parsed efficiently by the tracer. I strongly recommend not to declare
the types associated with a kernel trace point in two unrelated
locations without type checking in-between them (e.g. trace_mark in
kernel code, string in the tracer module), because it would then become
harder to track consistency when the code changes.

However, I would not be against an hybrid of Masami's proposal and
current markers, which I will propose in reply to his email.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 13:53           ` Mathieu Desnoyers
@ 2008-06-12 14:27             ` Peter Zijlstra
  2008-06-12 15:53               ` Frank Ch. Eigler
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-12 14:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Hideo AOKI, mingo, Masami Hiramatsu, linux-kernel, Steven Rostedt,
	Frank Ch. Eigler

On Thu, 2008-06-12 at 09:53 -0400, Mathieu Desnoyers wrote:
> Hi Peter,
> 
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2008-06-04 at 19:22 -0400, Mathieu Desnoyers wrote:
> > > * Peter Zijlstra (peterz@infradead.org) wrote:
> > 
> > > > So are you proposing something like:
> > > > 
> > > > static inline void 
> > > > trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> > > > {
> > > > 	trace_mark(sched_switch, prev, next);
> > > > }
> > > > 
> > > 
> > > Not exactly. Something more along the lines of
> > > 
> > > static inline void 
> > > trace_sched_switch(struct task_struct *prev, struct task_struct *next)
> > > {
> > >   /* Internal tracers. */
> > >   ftrace_sched_switch(prev, next);
> > >   othertracer_sched_switch(prev, next);
> > >   /*
> > >    * System-wide tracing. Useful information is exported here.
> > >    * Probes connecting to these markers are expected to only use the
> > >    * information provided to them for data collection purpose. Type
> > >    * casting pointers is discouraged.
> > >    */
> > > 	trace_mark(kernel_sched_switch, "prev_pid %d next_pid %d prev_state %ld",
> > >     prev->pid, next->pid, prev->state);
> > > }
> > 
> > Advantage of my method would be that ftrace (and othertracer) can use
> > the same marker and doesn't need yet another hoook.
> > 
> 
> Am I correct by saying that the method you propose completely removes
> type checking between the instrumentation site and what the probes
> expect ? If yes, this seems to be too fragile. Every time a marker would
> change, one would have to audit _every_ probes, both in-kernel and in
> modules. Adding type checking to the marker infrastructure makes
> automatic detection of these changes possible.

would be as simple as:

 git grep sched_switch

every time someone changes trace_sched_switch() arguments. Doesn't seem
too hard, you could even make checkpatch remind you to do that if it
sees a change to a trace_* function.

The down-side of runtime type checking (of which Masami's proposal is
the best so far), is that you'll still not find the breakage until
someone actually tries to use a tracer - so you'll still need the above.

> > > > dropping the silly fmt string but using the multiplex of trace_mark, and
> > > > then doing the stringify bit:
> > > > 
> > > >        "prev_pid %d next_pid %d prev_state %ld\n"
> > > > 
> > > > in the actual tracer?
> > > > 
> > > 
> > > It would make much more sense to put this formatting information along
> > > with the trace point (e.g. in a a kernel/sched-trace.h header) rather
> > > that to hide it in a tracer (loadable module) because this information
> > > is an interface to the trace point.
> > 
> > I'm not sure - it seems to me it should be part of the tracer because
> > its a detail/subset of the actual data - rendering it useless for others
> > who'd like a different set.
> > 
> 
> If it ends up elsewhere, then we have to ensure type correctness in some
> way.

Sure, idealy we'd want compile time type safety. We'd want
trace_sched_switch()'s arguments to match trace_sched_switch_handler()'s
arguments, and a compile time error if this is not the case.

However - we cannot seem to get that. Runtime type safety just doesn't
help this case.

But the point I was making here is that:

  trace_sched_switch(prev->pid, next->pid, next->state)

could be useless for some other tracer who'd want:

  trace_sched_switch(prev->vruntime, next->vruntime)

Also, the ->pid stuff isn't even alive on the normal code path, so
adding that to the marker also bloats the code generated there.

So by using the marker:

  trace_sched_switch(prev, next)

We can have various tracers that display different information and avoid
livelyness issues.

> > > > IMHO the 'type safety' of the fmt string is over-rated, since it cannot
> > > > distinguish between a task_struct * or a bio *, both are a pointers -
> > > > and half arsed type safely is worse than no type safety.
> > > > 
> > > 
> > > I totally agree with you that not having the capacity to inspect pointer
> > > types is a problem for tracers which wants to receive the "raw" pointer
> > > and deal with the data they need like big boys. On the other hand, it
> > > requires them to be closely tied to the kernel internals and therefore
> > > it makes sense to call them directly from the tracing site, thus
> > > bypassing the marker format string.
> > > 
> > > However, letting the marker specify the data format so a tracer could
> > > format it into a memory buffer (in a binary or text format, depending on
> > > the implementation) or so that a tool like systemtap can use this
> > > identified information without having to be closely tied to the kernel
> > > makes sense to me.
> > 
> > So s-tap is meant to parse this sting and interpret the varargs without
> > being closely tied to the kernel? - Somehow that doesn't make me feel
> > warm and fuzzy. That not only ties userspace to the information present
> > in the marker, but to the actual string as well.
> > 
> > The stronger you make this bind the less I like it.
> > 
> 
> Well, the string contains each field name and type. Therefore, SystemTAP
> can hook on a marker and parse the string looking for some elements by
> passing a NULL format string upon probe registration. Alternatively, it
> can provide the exact format string expected when it registers its probe
> to the marker and a check will be done to verify that the format string
> passed along with the registered probe matches the marker format string.

Yes, I get that, its one of the ugliest things I've met in this whole
marker story. Why can't stap not insert a normal trace handler that
extracts the information from prev/next it wants?

> Also, about what you said earlier in this thread :
> "Regular trace points can be custom made; this has the advantages that
> it raises the implementation barrier and hopefully that encourages some
> thought in the process. It also avoid the code from growing into
> something that looks like someone had a long night of debugging."
> 
> Before it has been moved to the markers, LTTng was once designed with
> custom-made code to save the trace information through custom hooks. To
> help maintainers instrument their own subsystem and do the right choice
> without being a tracing expert,

>  we created a code generator which
> generated this custom code for each trace point given a description of
> the trace points.

>  It turned out that keeping this duplicate list of
> trace points was cumbersome and that the generated code did eat a lot of
> instruction cache. 

Well, your last proposal of static inline functions basically returns
thereto. So what was cumbersome about it?

The I$ issue is unfortunate indeed - but it seems to be the price to pay
for compile time type safety.

As for that code-generator, that seems a sane idea, esp if the input
file is simply a regular C header file with trace point definitions.

> This is why to turned to markers, so we could re-use
> a common infrastructure to serialize the data into trace buffers. We
> turned to the marker format string to allow the types to serialize to be
> parsed efficiently by the tracer. I strongly recommend not to declare
> the types associated with a kernel trace point in two unrelated
> locations without type checking in-between them (e.g. trace_mark in
> kernel code, string in the tracer module), because it would then become
> harder to track consistency when the code changes.

I see the value of trace_mark() in debugging sessions, but merging these
things is like merging the resulting code file after a printk debugging
session.

> However, I would not be against an hybrid of Masami's proposal and
> current markers, which I will propose in reply to his email.

Ah - I'm looking forward..


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 14:27             ` Peter Zijlstra
@ 2008-06-12 15:53               ` Frank Ch. Eigler
  2008-06-12 16:16                 ` Masami Hiramatsu
  2008-06-12 16:53                 ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Frank Ch. Eigler @ 2008-06-12 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Hideo AOKI, mingo, Masami Hiramatsu,
	linux-kernel, Steven Rostedt

Hi -

On Thu, Jun 12, 2008 at 04:27:03PM +0200, Peter Zijlstra wrote:
> [...]
> > Well, the string contains each field name and type. Therefore, SystemTAP
> > can hook on a marker and parse the string looking for some elements by
> > passing a NULL format string upon probe registration. Alternatively, it
> > can provide the exact format string expected when it registers its probe
> > to the marker and a check will be done to verify that the format string
> > passed along with the registered probe matches the marker format string.
> 
> Yes, I get that, its one of the ugliest things I've met in this whole
> marker story. Why can't stap not insert a normal trace handler that
> extracts the information from prev/next it wants? [...]

Think this through.  How should systemtap (or another user-space
separate-compiled tool like lttng) do this exactly?

(a) rely on debugging information?  Even assuming we could get proper
    anchors (PC addresses or unambiguous type names) to start
    searching dwarf data, we lose a key attractions of markers: that
    it can robustly transfer data *without* dwarf data kept around.

(b) rely on hand-written C code (prototypes, pointer derefrencing
    wrappers) distributed with systemtap?  Not only would this be a
    brittle maintenance pain in the form of cude duplication, but then
    errors in it couldn't even be detected until the final C
    compilation stage.  That would make a lousy user experience.

(c) have systemtap try to parse the mhiramat-proposed "(struct
    task_struct * next, struct task_struct * prev)" format strings?
    Then we're asking systemtap to parse potentially general C type
    expressions, find the kernel headers that declare the types?
    Parse available subfields?  That seems too much to ask for.

(d) or another way?


- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 15:53               ` Frank Ch. Eigler
@ 2008-06-12 16:16                 ` Masami Hiramatsu
  2008-06-12 16:43                   ` Frank Ch. Eigler
  2008-06-12 16:53                 ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-12 16:16 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Peter Zijlstra, Mathieu Desnoyers, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

Hi Frank,

Frank Ch. Eigler wrote:
> Hi -
> 
> On Thu, Jun 12, 2008 at 04:27:03PM +0200, Peter Zijlstra wrote:
>> [...]
>>> Well, the string contains each field name and type. Therefore, SystemTAP
>>> can hook on a marker and parse the string looking for some elements by
>>> passing a NULL format string upon probe registration. Alternatively, it
>>> can provide the exact format string expected when it registers its probe
>>> to the marker and a check will be done to verify that the format string
>>> passed along with the registered probe matches the marker format string.
>> Yes, I get that, its one of the ugliest things I've met in this whole
>> marker story. Why can't stap not insert a normal trace handler that
>> extracts the information from prev/next it wants? [...]
> 
> Think this through.  How should systemtap (or another user-space
> separate-compiled tool like lttng) do this exactly?
> 
> (a) rely on debugging information?  Even assuming we could get proper
>     anchors (PC addresses or unambiguous type names) to start
>     searching dwarf data, we lose a key attractions of markers: that
>     it can robustly transfer data *without* dwarf data kept around.
> 
> (b) rely on hand-written C code (prototypes, pointer derefrencing
>     wrappers) distributed with systemtap?  Not only would this be a
>     brittle maintenance pain in the form of cude duplication, but then
>     errors in it couldn't even be detected until the final C
>     compilation stage.  That would make a lousy user experience.
> 
> (c) have systemtap try to parse the mhiramat-proposed "(struct
>     task_struct * next, struct task_struct * prev)" format strings?
>     Then we're asking systemtap to parse potentially general C type
>     expressions, find the kernel headers that declare the types?
>     Parse available subfields?  That seems too much to ask for.
> 
> (d) or another way?

use a lookup table. we can expect that the marking points which
regularly inserted in the upstream kernel are stable(not so
frequently change). In that case, we can easily maintain
a lookup table which has pairs of format strings like as
"sched_switch(struct task_struct * next, struct task_struct * prev)":"next %p prev %p"
out of tree. Thus, you can use the printf-style format parser.

This actually is a kind of duplication, but in this way,
I think we can detect errors before generating C code, and
easily add lookup pairs of format strings.
(additionally, we can choose %s or %p for "char *" ;-))

> 
> 
> - FChE

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 16:16                 ` Masami Hiramatsu
@ 2008-06-12 16:43                   ` Frank Ch. Eigler
  2008-06-12 16:56                     ` Peter Zijlstra
  2008-06-12 17:05                     ` Masami Hiramatsu
  0 siblings, 2 replies; 30+ messages in thread
From: Frank Ch. Eigler @ 2008-06-12 16:43 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

Hi -

On Thu, Jun 12, 2008 at 12:16:35PM -0400, Masami Hiramatsu wrote:
> [...]
> > Think this through.  How should systemtap (or another user-space
> > separate-compiled tool like lttng) do this exactly?
> > [...]
> > (d) or another way?
> 
> use a lookup table. we can expect that the marking points which
> regularly inserted in the upstream kernel are stable(not so
> frequently change). In that case, we can easily maintain
> a lookup table which has pairs of format strings like as
> "sched_switch(struct task_struct * next, struct task_struct * prev)":"next %p prev %p"
> out of tree. Thus, you can use the printf-style format parser.

That's an interesting idea, but errors in this table would themselves
only be caught at C compilation time.  Worse, it does nothing helpful
for actually pulling out the next/prev fields of interest.  Remember,
real tracing users don't care so much about the task_struct pointers,
but about observable things like PIDs.  Systemtap would be back to the
debuginfo or C-header-guessing/parsing job (or embedded-C, yuck).

This is another reason why markers are a nice solution.  They allow
passing of actual useful values: not just the %p pointers but the most
interesting derived values (e.g., prev->pid).  And they do this
*efficiently* - in out-of-line code that imposes no measurable
overhead in the normal case..

- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 16:43                   ` Frank Ch. Eigler
@ 2008-06-12 16:56                     ` Peter Zijlstra
  2008-06-12 22:10                       ` Mathieu Desnoyers
  2008-06-12 17:05                     ` Masami Hiramatsu
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-12 16:56 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

On Thu, 2008-06-12 at 12:43 -0400, Frank Ch. Eigler wrote:

> This is another reason why markers are a nice solution.  They allow
> passing of actual useful values: not just the %p pointers but the most
> interesting derived values (e.g., prev->pid). 

Useful to whoem? stap isn't the holy grail of tracing and certainly not
the only consumer of trace points, so restricting the information to
just what stap needs is actually making trace points less useful.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 16:56                     ` Peter Zijlstra
@ 2008-06-12 22:10                       ` Mathieu Desnoyers
  0 siblings, 0 replies; 30+ messages in thread
From: Mathieu Desnoyers @ 2008-06-12 22:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frank Ch. Eigler, Masami Hiramatsu, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2008-06-12 at 12:43 -0400, Frank Ch. Eigler wrote:
> 
> > This is another reason why markers are a nice solution.  They allow
> > passing of actual useful values: not just the %p pointers but the most
> > interesting derived values (e.g., prev->pid). 
> 
> Useful to whoem? stap isn't the holy grail of tracing and certainly not
> the only consumer of trace points, so restricting the information to
> just what stap needs is actually making trace points less useful.
> 

LTTng also currently relies on the markers identifying useful data and
ideally won't contain any tracepoint-specific code to extract subfields
from structures.

By the way, a point that should be clarified is that putting a 
"prev->pid" field in a marker does not make this variable live unless
the marker is activated. When disabled, the marker site jumps over the
stack setup, including any pointer dereference.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 16:43                   ` Frank Ch. Eigler
  2008-06-12 16:56                     ` Peter Zijlstra
@ 2008-06-12 17:05                     ` Masami Hiramatsu
  2008-06-12 17:48                       ` Frank Ch. Eigler
  1 sibling, 1 reply; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-12 17:05 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Peter Zijlstra, Mathieu Desnoyers, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

Hi,

Frank Ch. Eigler wrote:
> Hi -
> 
> On Thu, Jun 12, 2008 at 12:16:35PM -0400, Masami Hiramatsu wrote:
>> [...]
>>> Think this through.  How should systemtap (or another user-space
>>> separate-compiled tool like lttng) do this exactly?
>>> [...]
>>> (d) or another way?
>> use a lookup table. we can expect that the marking points which
>> regularly inserted in the upstream kernel are stable(not so
>> frequently change). In that case, we can easily maintain
>> a lookup table which has pairs of format strings like as
>> "sched_switch(struct task_struct * next, struct task_struct * prev)":"next %p prev %p"
>> out of tree. Thus, you can use the printf-style format parser.
> 
> That's an interesting idea, but errors in this table would themselves
> only be caught at C compilation time.

Hmm, why would you think so?
I think if we can't find corresponding entry from the lookup table,
it becomes an error.

>  Worse, it does nothing helpful
> for actually pulling out the next/prev fields of interest.  Remember,
> real tracing users don't care so much about the task_struct pointers,
> but about observable things like PIDs.  Systemtap would be back to the
> debuginfo or C-header-guessing/parsing job (or embedded-C, yuck).

Yeah, but that is same as previous marker. It depends on what parameter
the kernel pass to the marker. I mean, the parameter issue is not an
issue of the marker framework, but a discussion point of marking points.

> This is another reason why markers are a nice solution.  They allow
> passing of actual useful values: not just the %p pointers but the most
> interesting derived values (e.g., prev->pid).  And they do this
> *efficiently* - in out-of-line code that imposes no measurable
> overhead in the normal case..

Even if you use trace_mark() markers, you have to post a kernel patch
which passes the prev->pid to the marking point and to discuss it.
for example,
DEFINE_TRACE(sched_switch, (int prev_pid, int next_pid), prev_pid, next_pid)

But it might not so general, we have to discuss what parameters are enough
good for each marking point.

> 
> 
> - FChE

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 17:05                     ` Masami Hiramatsu
@ 2008-06-12 17:48                       ` Frank Ch. Eigler
  2008-06-12 19:34                         ` Masami Hiramatsu
  0 siblings, 1 reply; 30+ messages in thread
From: Frank Ch. Eigler @ 2008-06-12 17:48 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

Hi -

On Thu, Jun 12, 2008 at 01:05:52PM -0400, Masami Hiramatsu wrote:
> [...]
> >> "sched_switch(struct task_struct * next, struct task_struct * prev)":"next %p prev %p"
> >> out of tree. Thus, you can use the printf-style format parser.
> > 
> > That's an interesting idea, but errors in this table would themselves
> > only be caught at C compilation time.

> Hmm, why would you think so?  I think if we can't find corresponding
> entry from the lookup table, it becomes an error.

Sure, but if the entry exists but is wrong, we'd emit C code that
won't compile.


> [...] Even if you use trace_mark() markers, you have to post a
> kernel patch which passes the prev->pid to the marking point and to
> discuss it.  for example,
> DEFINE_TRACE(sched_switch, (int prev_pid, int next_pid), prev_pid, next_pid)

(If it were up to me, I would add the task pointers too, which
debuginfo-less systemtap could ignore but other tracers may use.)


> But it might not so general, we have to discuss what parameters are
> enough good for each marking point.

That's exactly what the "lttng instrumentation markers" threads from
the recent past had started.


- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 17:48                       ` Frank Ch. Eigler
@ 2008-06-12 19:34                         ` Masami Hiramatsu
  2008-06-13  4:19                           ` Takashi Nishiie
  0 siblings, 1 reply; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-12 19:34 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Peter Zijlstra, Mathieu Desnoyers, Hideo AOKI, mingo,
	linux-kernel, Steven Rostedt

Hi Frank,

Frank Ch. Eigler wrote:
> Hi -
> 
> On Thu, Jun 12, 2008 at 01:05:52PM -0400, Masami Hiramatsu wrote:
>> [...]
>>>> "sched_switch(struct task_struct * next, struct task_struct * prev)":"next %p prev %p"
>>>> out of tree. Thus, you can use the printf-style format parser.
>>> That's an interesting idea, but errors in this table would themselves
>>> only be caught at C compilation time.
> 
>> Hmm, why would you think so?  I think if we can't find corresponding
>> entry from the lookup table, it becomes an error.
> 
> Sure, but if the entry exists but is wrong, we'd emit C code that
> won't compile.

I think if someone changes the trace point in the kernel,
Module.markers is also changed.

ex.)
 DEFINE_TRACE(sched_switch, (struct task_struct * next, struct task_struct * prev),
	     next, prev);

if someone changes above to below,

 DEFINE_TRACE(sched_switch, (int prev_pid, int next_pid), prev_pid, next_pid);

Module.markers also change like

 sched_switch	vmlinux	(struct task_struct * next, struct task_struct * prev)

to

 sched_switch	vmlinux	(int prev_pid, int next_pid)

In this case, the below entry never matches to new Module.markers.

"sched_switch(struct task_struct * next, struct task_struct * prev)":"next %p prev %p"

Thus, we can find an error.
However, of cause, we should take care of the task_struct changes.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Kernel marker has no performance impact on ia64.
  2008-06-12 19:34                         ` Masami Hiramatsu
@ 2008-06-13  4:19                           ` Takashi Nishiie
  2008-06-13 18:02                             ` Masami Hiramatsu
  0 siblings, 1 reply; 30+ messages in thread
From: Takashi Nishiie @ 2008-06-13  4:19 UTC (permalink / raw)
  To: 'Masami Hiramatsu', 'Frank Ch. Eigler'
  Cc: 'Peter Zijlstra', 'Mathieu Desnoyers',
	'Hideo AOKI', mingo, linux-kernel,
	'Steven Rostedt'

Hi,Hiramatsu

 Hiramatsu wrote:

>I think if someone changes the trace point in the kernel,
>Module.markers is also changed.
>
>ex.)
> DEFINE_TRACE(sched_switch, (struct task_struct * next, struct task_struct
* prev),
>	     next, prev);
>
>if someone changes above to below,
>
> DEFINE_TRACE(sched_switch, (int prev_pid, int next_pid), prev_pid,
next_pid);
>
>Module.markers also change like
>
> sched_switch	vmlinux	(struct task_struct * next, struct task_struct *
prev)
>
>to
>
> sched_switch	vmlinux	(int prev_pid, int next_pid)
>
>In this case, the below entry never matches to new Module.markers.
>
>"sched_switch(struct task_struct * next, struct task_struct * prev)":"next
%p prev %p"
>
>Thus, we can find an error.
>However, of cause, we should take care of the task_struct changes.

  The one expected of markers with SystemTap and LTTng is only 
different the foundation of this discussion. 

SystemTap:
  Because the output format is basically defined with user script in 
systemtap script, "format information" is unnecessary. It can do 
nothing but guess from "format information" in present markers though 
the type of the argument is indispensable for SystemTap on the other 
hand. If the type of an accurate argument is understood, correct 
processing can be executed. Moreover, being able to refer to the 
member of the structure if the argument is a pointer of the structure 
becomes possible, and it is convenient. 

LTTng:
  Because the output format is not originally defined in LTTng, "format 
information" is indispensable. 

  I think that I only have to add the item that shows the type of an 
accurate argument if the format of "Module.markers" is changed. 

Thank you,

--
Takashi Nishiie

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-13  4:19                           ` Takashi Nishiie
@ 2008-06-13 18:02                             ` Masami Hiramatsu
  2008-06-16  2:58                               ` Takashi Nishiie
  0 siblings, 1 reply; 30+ messages in thread
From: Masami Hiramatsu @ 2008-06-13 18:02 UTC (permalink / raw)
  To: Takashi Nishiie
  Cc: 'Frank Ch. Eigler', 'Peter Zijlstra',
	'Mathieu Desnoyers', 'Hideo AOKI', mingo,
	linux-kernel, 'Steven Rostedt'

Hi Takashi,

Thank you for comment.
Takashi Nishiie wrote:
> Hi,Hiramatsu
> 
>  Hiramatsu wrote:
> 
>> I think if someone changes the trace point in the kernel,
>> Module.markers is also changed.
>>
>> ex.)
>> DEFINE_TRACE(sched_switch, (struct task_struct * next, struct task_struct
> * prev),
>> 	     next, prev);
>>
>> if someone changes above to below,
>>
>> DEFINE_TRACE(sched_switch, (int prev_pid, int next_pid), prev_pid,
> next_pid);
>> Module.markers also change like
>>
>> sched_switch	vmlinux	(struct task_struct * next, struct task_struct *
> prev)
>> to
>>
>> sched_switch	vmlinux	(int prev_pid, int next_pid)
>>
>> In this case, the below entry never matches to new Module.markers.
>>
>> "sched_switch(struct task_struct * next, struct task_struct * prev)":"next
> %p prev %p"
>> Thus, we can find an error.
>> However, of cause, we should take care of the task_struct changes.
> 
>   The one expected of markers with SystemTap and LTTng is only 
> different the foundation of this discussion. 
> 
> SystemTap:
>   Because the output format is basically defined with user script in 
> systemtap script, "format information" is unnecessary. It can do 
> nothing but guess from "format information" in present markers though 
> the type of the argument is indispensable for SystemTap on the other 
> hand. If the type of an accurate argument is understood, correct 
> processing can be executed. Moreover, being able to refer to the 
> member of the structure if the argument is a pointer of the structure 
> becomes possible, and it is convenient. 

It's not so easy for systemtap... for example, if a marker pass 'char *',
how can we do it? it might have to record as a string, or might have to
record as a pointer. Obviously, printf-style information includes a kind of
'semantic' information. So, I think the 'lookup table' is good.

> LTTng:
>   Because the output format is not originally defined in LTTng, "format 
> information" is indispensable. 

Sure, however, LTTng can have original "format information" itself.

>   I think that I only have to add the item that shows the type of an 
> accurate argument if the format of "Module.markers" is changed. 

I think those 'semantic' type information (which informs you what
is this value and how this value should be recorded) should be
defined by each tracer, because what information they want to
extract from arguments are strongly depends on their use cases.

Actually, that requires tracers to pay maintenance costs, but
it's not so big if the "regular" marking point is enough stable.

Regards,

> 
> Thank you,
> 
> --
> Takashi Nishiie
> 
> 
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Kernel marker has no performance impact on ia64.
  2008-06-13 18:02                             ` Masami Hiramatsu
@ 2008-06-16  2:58                               ` Takashi Nishiie
  0 siblings, 0 replies; 30+ messages in thread
From: Takashi Nishiie @ 2008-06-16  2:58 UTC (permalink / raw)
  To: 'Masami Hiramatsu'
  Cc: 'Frank Ch. Eigler', 'Peter Zijlstra',
	'Mathieu Desnoyers', 'Hideo AOKI', mingo,
	linux-kernel, 'Steven Rostedt'

Hi.

Takashi Nishiie wrote:
>>   I think that I only have to add the item that shows the type of an 
>> accurate argument if the format of "Module.markers" is changed. 
>
Masami Hiramatsu wrote:
>I think those 'semantic' type information (which informs you what
>is this value and how this value should be recorded) should be
>defined by each tracer, because what information they want to
>extract from arguments are strongly depends on their use cases.
>
>Actually, that requires tracers to pay maintenance costs, but
>it's not so big if the "regular" marking point is enough stable.

  Certainly, because it only has to be able to acquire name of the 
point that wants to be traced, value that can be referred, the type, 
and those three items if interchangeability with an existing tool is 
not thought, it might not have to stick to a detailed thing. 

  By the way, 'pr_debug' and 'dev_debug' and 'DEBUGP', etc. might 
unite handling by replacing it with kernel markers. 

Thank you,

--
Takashi Nishiie




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 15:53               ` Frank Ch. Eigler
  2008-06-12 16:16                 ` Masami Hiramatsu
@ 2008-06-12 16:53                 ` Peter Zijlstra
  2008-06-12 17:38                   ` Frank Ch. Eigler
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-12 16:53 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Mathieu Desnoyers, Hideo AOKI, mingo, Masami Hiramatsu,
	linux-kernel, Steven Rostedt

On Thu, 2008-06-12 at 11:53 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Thu, Jun 12, 2008 at 04:27:03PM +0200, Peter Zijlstra wrote:
> > [...]
> > > Well, the string contains each field name and type. Therefore, SystemTAP
> > > can hook on a marker and parse the string looking for some elements by
> > > passing a NULL format string upon probe registration. Alternatively, it
> > > can provide the exact format string expected when it registers its probe
> > > to the marker and a check will be done to verify that the format string
> > > passed along with the registered probe matches the marker format string.
> > 
> > Yes, I get that, its one of the ugliest things I've met in this whole
> > marker story. Why can't stap not insert a normal trace handler that
> > extracts the information from prev/next it wants? [...]
> 
> Think this through.  How should systemtap (or another user-space
> separate-compiled tool like lttng) do this exactly?

lttng has trace handlers to write data into some buffer, right?
The trace point need not be concerned with which data, nor into what
buffer.

> (a) rely on debugging information?  Even assuming we could get proper
>     anchors (PC addresses or unambiguous type names) to start
>     searching dwarf data, we lose a key attractions of markers: that
>     it can robustly transfer data *without* dwarf data kept around.

Perhaps you can ship a reduced set of dwarf info tailored to contain the
bits you need, like where to find ->pid in struct task_struct. Or you
can hook into the lttng tracer.

> (b) rely on hand-written C code (prototypes, pointer derefrencing
>     wrappers) distributed with systemtap?  Not only would this be a
>     brittle maintenance pain in the form of cude duplication, but then
>     errors in it couldn't even be detected until the final C
>     compilation stage.  That would make a lousy user experience.

Not really sure what you mean here - I throught compile time errors were
the goal?!

> (c) have systemtap try to parse the mhiramat-proposed "(struct
>     task_struct * next, struct task_struct * prev)" format strings?
>     Then we're asking systemtap to parse potentially general C type
>     expressions, find the kernel headers that declare the types?
>     Parse available subfields?  That seems too much to ask for.

tcc and sourcefs sound like way fun ;-)

> (d) or another way?

Get your own tracer in kernel - that provides exactly what stap needs?



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 16:53                 ` Peter Zijlstra
@ 2008-06-12 17:38                   ` Frank Ch. Eigler
  2008-06-13 11:01                     ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Frank Ch. Eigler @ 2008-06-12 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Hideo AOKI, mingo, Masami Hiramatsu,
	linux-kernel, Steven Rostedt

Hi -

On Thu, Jun 12, 2008 at 06:53:41PM +0200, Peter Zijlstra wrote:
> [...]
> > Think this through.  How should systemtap (or another user-space
> > separate-compiled tool like lttng) do this exactly?
> 
> lttng has trace handlers to write data into some buffer, right?
> The trace point need not be concerned with which data, nor into what
> buffer.

The "which data" is definitely a trace point concern.  It communicates
from the developer to users as to what values are likely to be of
interest.

> > (a) rely on debugging information?  Even assuming we could get proper
> >     anchors (PC addresses or unambiguous type names) to start
> >     searching dwarf data, we lose a key attractions of markers: that
> >     it can robustly transfer data *without* dwarf data kept around.
> 
> Perhaps you can ship a reduced set of dwarf info [...]

"I" don't ship or generate dwarf data.  Distributors do.

> > (b) rely on hand-written C code (prototypes, pointer derefrencing
> >     wrappers) distributed with systemtap?  Not only would this be a
> >     brittle maintenance pain in the form of cude duplication, but then
> >     errors in it couldn't even be detected until the final C
> >     compilation stage.  That would make a lousy user experience.
>
> Not really sure what you mean here - I throught compile time errors
> were the goal?!

For us, it's too late.  In systemtap, we try to give people useful
information when they make mistakes.  For probes of whatever sort, we
want to know the available data types and names, while just starting
to process their script, so that we can check types and suggest
alternatives.  C code compilation is quite some way removed and is
supposed to be a systemtap internal implementation detail.

> > (c) have systemtap try to parse the mhiramat-proposed "(struct
> >     task_struct * next, struct task_struct * prev)" format strings?
> >     Then we're asking systemtap to parse potentially general C type
> >     expressions, find the kernel headers that declare the types?
> >     Parse available subfields?  That seems too much to ask for.
> 
> tcc and sourcefs sound like way fun ;-)

Really...

> > (d) or another way?
> 
> Get your own tracer in kernel - that provides exactly what stap needs?

You are missing that (a) this is the point of markers - to allow the
the gajillion tracers a single place per event to hook through, and
(b) we would like to leave to subsystem developers and/or end-users as
to what data should be available.  We don't want to get into the
middle of it.

> > This is another reason why markers are a nice solution.  They allow
> > passing of actual useful values: not just the %p pointers but the most
> > interesting derived values (e.g., prev->pid). 
> 
> Useful to whoem? stap isn't the holy grail of tracing and certainly not
> the only consumer of trace points, so restricting the information to
> just what stap needs is actually making trace points less useful.

Only you are talking about "restricting".  I am talking about
thoughtfully *adding* simple scalar values that most tracing type
tools will generally want, so that they don't have to perform heroic
measures like ship their own dwarf subsets or parse kernel source
code.  

Which scalars?  Well, it depends on the subsystem and event, and you
can review here several discussions about specific lttng
instrumentation markers to see how they generally go.

- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-12 17:38                   ` Frank Ch. Eigler
@ 2008-06-13 11:01                     ` Peter Zijlstra
  2008-06-13 14:17                       ` Frank Ch. Eigler
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2008-06-13 11:01 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Mathieu Desnoyers, Hideo AOKI, mingo, Masami Hiramatsu,
	linux-kernel, Steven Rostedt

On Thu, 2008-06-12 at 13:38 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Thu, Jun 12, 2008 at 06:53:41PM +0200, Peter Zijlstra wrote:
> > [...]
> > > Think this through.  How should systemtap (or another user-space
> > > separate-compiled tool like lttng) do this exactly?
> > 
> > lttng has trace handlers to write data into some buffer, right?
> > The trace point need not be concerned with which data, nor into what
> > buffer.
> 
> The "which data" is definitely a trace point concern.  It communicates
> from the developer to users as to what values are likely to be of
> interest.

But that is tracer specific - I might write a scheduler tracer that
looks at the quality of scheduling decisions and thus wants to look at
the virtual timeline variables and the scheduling class of the tasks
involved.

That's a whole different context, but the trace points are the same. Are
you saying trace points are not to allow me to do that?

> > > (a) rely on debugging information?  Even assuming we could get proper
> > >     anchors (PC addresses or unambiguous type names) to start
> > >     searching dwarf data, we lose a key attractions of markers: that
> > >     it can robustly transfer data *without* dwarf data kept around.
> > 
> > Perhaps you can ship a reduced set of dwarf info [...]
> 
> "I" don't ship or generate dwarf data.  Distributors do.

That's ignoring the point - 'someone' could generate reduced debug info
to allow you to easily get what you want.

> > > (b) rely on hand-written C code (prototypes, pointer derefrencing
> > >     wrappers) distributed with systemtap?  Not only would this be a
> > >     brittle maintenance pain in the form of cude duplication, but then
> > >     errors in it couldn't even be detected until the final C
> > >     compilation stage.  That would make a lousy user experience.
> >
> > Not really sure what you mean here - I throught compile time errors
> > were the goal?!
> 
> For us, it's too late.  In systemtap, we try to give people useful
> information when they make mistakes.  For probes of whatever sort, we
> want to know the available data types and names, while just starting
> to process their script, so that we can check types and suggest
> alternatives.  C code compilation is quite some way removed and is
> supposed to be a systemtap internal implementation detail.

Sounds like you have an incomplete Native-Interface system then. You
should be able to match a native function description back to your
script language.

> > > (c) have systemtap try to parse the mhiramat-proposed "(struct
> > >     task_struct * next, struct task_struct * prev)" format strings?
> > >     Then we're asking systemtap to parse potentially general C type
> > >     expressions, find the kernel headers that declare the types?
> > >     Parse available subfields?  That seems too much to ask for.
> > 
> > tcc and sourcefs sound like way fun ;-)
> 
> Really...

Yeah, wouldn't it be cool if the kernel came with an embedded compiler
and a filesystem that included its exact source code? Together with the
entry instrumentation site and dynamic jump patches you can do really
weird stuff... /me dreams on :-)

> > > (d) or another way?
> > 
> > Get your own tracer in kernel - that provides exactly what stap needs?
> 
> You are missing that (a) this is the point of markers - to allow the
> the gajillion tracers a single place per event to hook through, and
> (b) we would like to leave to subsystem developers and/or end-users as
> to what data should be available.  We don't want to get into the
> middle of it.

I think a) and b) contradict each other, you cannot cater for all
tracers and limit the data they can use.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel marker has no performance impact on ia64.
  2008-06-13 11:01                     ` Peter Zijlstra
@ 2008-06-13 14:17                       ` Frank Ch. Eigler
  0 siblings, 0 replies; 30+ messages in thread
From: Frank Ch. Eigler @ 2008-06-13 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Hideo AOKI, mingo, Masami Hiramatsu,
	linux-kernel, Steven Rostedt

Hi -

On Fri, Jun 13, 2008 at 01:01:12PM +0200, Peter Zijlstra wrote:
> [...]
> > > The trace point need not be concerned with which data, nor into what
> > > buffer.
> > 
> > The "which data" is definitely a trace point concern.  It communicates
> > from the developer to users as to what values are likely to be of
> > interest.
> 
> But that is tracer specific - I might write a scheduler tracer that
> looks at the quality of scheduling decisions and thus wants to look at
> the virtual timeline variables and the scheduling class of the tasks
> involved.

> That's a whole different context, but the trace points are the same. Are
> you saying trace points are not to allow me to do that?

Of course markers can allow you to do that.  You can add two markers
in one spot if you really want to, one with far more details than the
other.  You can still write your own "tracer" kernel-side code to
attach to that if you wish.  But you might find that you don't have
to, if a more general tool can get you the same data (and more).

(A separate issue, brought up some number of times here, has been
whether even detailed debugging-oriented markers would be worth
keeping in the code.  I've argued that, yes, it's worthwhile, since
they cost nearly nothing, and in exchange would allow your
users/customers to gather the same debugging data for bug reports, in
situ on a live system.)

> > > > (a) rely on debugging information?  Even assuming we could get proper
> > > >     anchors (PC addresses or unambiguous type names) to start
> > > >     searching dwarf data, we lose a key attractions of markers: that
> > > >     it can robustly transfer data *without* dwarf data kept around.
> > > 
> > > Perhaps you can ship a reduced set of dwarf info [...]
> > 
> > "I" don't ship or generate dwarf data.  Distributors do.
> 
> That's ignoring the point - 'someone' could generate reduced debug info
> to allow you to easily get what you want.

But is there clear benefit to blocking potential incremental progress,
waiting for newer technology?  What if one can transition smoothly
once godot does show up?

> [...]
> > > Get your own tracer in kernel - that provides exactly what stap needs?
> > 
> > You are missing that (a) this is the point of markers - to allow the
> > the gajillion tracers a single place per event to hook through, and
> > (b) we would like to leave to subsystem developers and/or end-users as
> > to what data should be available.  We don't want to get into the
> > middle of it.
> 
> I think a) and b) contradict each other, you cannot cater for all
> tracers and limit the data they can use.

We are not talking about "limiting".  Markers are about identifying
the large intersection of interest (events + data) amongst multiple
tracing-type tools, and letting them share an single efficient hooking
mechanism for getting at that.  No one said it must to be the solitary
tracing-hook mechanism.

My (b) point was that systemtap per se does not want to specify what
subset of data should be available to an end user -- so we don't want
systemtap-specific markers or hooks or whatever.  We want to expose
whatever the developer deemed appropriate.  I merely lobby that this
set should include some values that tools can use without heroic
measures.

- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2008-06-16  2:58 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-02 22:12 Kernel marker has no performance impact on ia64 Hideo AOKI
2008-06-02 22:32 ` Peter Zijlstra
2008-06-02 23:21   ` Mathieu Desnoyers
2008-06-03  6:07     ` Takashi Nishiie
2008-06-04  4:58     ` Masami Hiramatsu
2008-06-04 23:26       ` Mathieu Desnoyers
2008-06-04 23:40         ` Masami Hiramatsu
2008-06-04 22:27     ` Peter Zijlstra
2008-06-04 23:22       ` Mathieu Desnoyers
2008-06-05  8:12         ` Peter Zijlstra
2008-06-05 14:28           ` Masami Hiramatsu
2008-06-12 14:04             ` Mathieu Desnoyers
2008-06-12 15:31               ` Masami Hiramatsu
2008-06-12 13:53           ` Mathieu Desnoyers
2008-06-12 14:27             ` Peter Zijlstra
2008-06-12 15:53               ` Frank Ch. Eigler
2008-06-12 16:16                 ` Masami Hiramatsu
2008-06-12 16:43                   ` Frank Ch. Eigler
2008-06-12 16:56                     ` Peter Zijlstra
2008-06-12 22:10                       ` Mathieu Desnoyers
2008-06-12 17:05                     ` Masami Hiramatsu
2008-06-12 17:48                       ` Frank Ch. Eigler
2008-06-12 19:34                         ` Masami Hiramatsu
2008-06-13  4:19                           ` Takashi Nishiie
2008-06-13 18:02                             ` Masami Hiramatsu
2008-06-16  2:58                               ` Takashi Nishiie
2008-06-12 16:53                 ` Peter Zijlstra
2008-06-12 17:38                   ` Frank Ch. Eigler
2008-06-13 11:01                     ` Peter Zijlstra
2008-06-13 14:17                       ` Frank Ch. Eigler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox