[PATCH] perf_counter: extensible perf_counter

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] perf_counter: extensible perf_counter_attr
@ 2009-06-08 17:25 Peter Zijlstra
  2009-06-08 19:02 ` Corey Ashford
  2009-06-09  4:17 ` Paul Mackerras
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2009-06-08 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Corey Ashford, linux-kernel

Allow extending the perf_counter_attr structure by linking extended
structures to it.

Also, should we grow the directly reserved space in the structure a
little more?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_counter.h |    7 +++++++
 kernel/perf_counter.c        |    1 +
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 3586df8..781d8ce 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -175,6 +175,13 @@ struct perf_counter_attr {
 	__u32			__reserved_3;
 
 	__u64			__reserved_4;
+
+	struct perf_counter_attr_ext *ext_attrs;
+};
+
+struct perf_counter_attr_ext {
+	struct perf_counter_attr_ext 	*next;
+	__u64				perf_attr_ext_type;
 };
 
 /*
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 5eacaaf..606f662 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -3467,6 +3467,7 @@ perf_counter_alloc(struct perf_counter_attr *attr,
 
 	counter->cpu		= cpu;
 	counter->attr		= *attr;
+	counter->attr.ext_attrs	= NULL;
 	counter->group_leader	= group_leader;
 	counter->pmu		= NULL;
 	counter->ctx		= ctx;


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 17:25 [PATCH] perf_counter: extensible perf_counter_attr Peter Zijlstra
@ 2009-06-08 19:02 ` Corey Ashford
  2009-06-08 19:51   ` Peter Zijlstra
  2009-06-09  4:17 ` Paul Mackerras
  1 sibling, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-08 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

Hi Peter,

Peter Zijlstra wrote:
> Allow extending the perf_counter_attr structure by linking extended
> structures to it.
> 
> Also, should we grow the directly reserved space in the structure a
> little more?
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/perf_counter.h |    7 +++++++
>  kernel/perf_counter.c        |    1 +
>  2 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index 3586df8..781d8ce 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -175,6 +175,13 @@ struct perf_counter_attr {
>  	__u32			__reserved_3;
> 
>  	__u64			__reserved_4;
> +
> +	struct perf_counter_attr_ext *ext_attrs;
> +};
> +
> +struct perf_counter_attr_ext {
> +	struct perf_counter_attr_ext 	*next;
> +	__u64				perf_attr_ext_type;
>  };

Let's say I want to extend the attributes by four 64-bit quantities... from the 
above definition, I'd need four additional records chained together, right?  How 
about something like this instead:

struct perf_counter_attr_ext {
     __u32	num_perf_attrs;
     __u64 	*perf_attr_ext;
};

Another alternative would be to place these two fields into perf_counter_attr 
and so eliminate the extra level of indirection.


Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 19:02 ` Corey Ashford
@ 2009-06-08 19:51   ` Peter Zijlstra
  2009-06-08 21:18     ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2009-06-08 19:51 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

On Mon, 2009-06-08 at 12:02 -0700, Corey Ashford wrote:
> Hi Peter,
> 
> Peter Zijlstra wrote:
> > Allow extending the perf_counter_attr structure by linking extended
> > structures to it.
> > 
> > Also, should we grow the directly reserved space in the structure a
> > little more?
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  include/linux/perf_counter.h |    7 +++++++
> >  kernel/perf_counter.c        |    1 +
> >  2 files changed, 8 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> > index 3586df8..781d8ce 100644
> > --- a/include/linux/perf_counter.h
> > +++ b/include/linux/perf_counter.h
> > @@ -175,6 +175,13 @@ struct perf_counter_attr {
> >  	__u32			__reserved_3;
> > 
> >  	__u64			__reserved_4;
> > +
> > +	struct perf_counter_attr_ext *ext_attrs;
> > +};
> > +
> > +struct perf_counter_attr_ext {
> > +	struct perf_counter_attr_ext 	*next;
> > +	__u64				perf_attr_ext_type;
> >  };
> 
> Let's say I want to extend the attributes by four 64-bit quantities... from the 
> above definition, I'd need four additional records chained together, right?  How 
> about something like this instead:

Ah, the idea was to do something like:

struct perf_counter_attr_feature {
	struct perf_counter_attr_ext	head;
	... more stuff ...
};

and set head.perf_attr_ext_type = PERF_ATTR_EXT_FEATURE




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 19:51   ` Peter Zijlstra
@ 2009-06-08 21:18     ` Corey Ashford
  2009-06-08 21:23       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-08 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

Peter Zijlstra wrote:
> On Mon, 2009-06-08 at 12:02 -0700, Corey Ashford wrote:
>> Hi Peter,
>>
>> Peter Zijlstra wrote:
>>> Allow extending the perf_counter_attr structure by linking extended
>>> structures to it.
>>>
>>> Also, should we grow the directly reserved space in the structure a
>>> little more?
>>>
>>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>> ---
>>>  include/linux/perf_counter.h |    7 +++++++
>>>  kernel/perf_counter.c        |    1 +
>>>  2 files changed, 8 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
>>> index 3586df8..781d8ce 100644
>>> --- a/include/linux/perf_counter.h
>>> +++ b/include/linux/perf_counter.h
>>> @@ -175,6 +175,13 @@ struct perf_counter_attr {
>>>  	__u32			__reserved_3;
>>>
>>>  	__u64			__reserved_4;
>>> +
>>> +	struct perf_counter_attr_ext *ext_attrs;
>>> +};
>>> +
>>> +struct perf_counter_attr_ext {
>>> +	struct perf_counter_attr_ext 	*next;
>>> +	__u64				perf_attr_ext_type;
>>>  };
>> Let's say I want to extend the attributes by four 64-bit quantities... from the 
>> above definition, I'd need four additional records chained together, right?  How 
>> about something like this instead:
> 
> Ah, the idea was to do something like:
> 
> struct perf_counter_attr_feature {
> 	struct perf_counter_attr_ext	head;
> 	... more stuff ...
> };
> 
> and set head.perf_attr_ext_type = PERF_ATTR_EXT_FEATURE


So you'd do something like this then when assigning to the perf_counter_attr struct:

struct perf_counter_attr pca;
...
struct perf_counter_attr_feature feature;

...
pca.ext_attrs = (struct perf_counter_attr_ext *)&feature;
-or-
pca.ext_attrs = &feature.head; /* secretly know that there's data that lies past 
the attr struct header */

Am I understanding this correctly?

- Corey


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 21:18     ` Corey Ashford
@ 2009-06-08 21:23       ` Peter Zijlstra
  2009-06-08 21:29         ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2009-06-08 21:23 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
> pca.ext_attrs = &feature.head; /* secretly know that there's data that lies past 
> the attr struct header */

Right, except its not so very secret since we have the type field
telling us.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 21:23       ` Peter Zijlstra
@ 2009-06-08 21:29         ` Corey Ashford
  2009-06-08 21:50           ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-08 21:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel



Peter Zijlstra wrote:
> On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
>> pca.ext_attrs = &feature.head; /* secretly know that there's data that lies past 
>> the attr struct header */
> 
> Right, except its not so very secret since we have the type field
> telling us.

I see.  Thanks for the explanation.  Looks like a good plan to me!

- Corey


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 21:29         ` Corey Ashford
@ 2009-06-08 21:50           ` Ingo Molnar
  2009-06-09  0:50             ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-06-08 21:50 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> Peter Zijlstra wrote:
>> On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
>>> pca.ext_attrs = &feature.head; /* secretly know that there's data 
>>> that lies past the attr struct header */
>>
>> Right, except its not so very secret since we have the type field
>> telling us.
>
> I see.  Thanks for the explanation.  Looks like a good plan to me!

i think we'd have a much simpler implementation by changing 
__reserved_1 to attr_size. When the kernel adds new attributes, the 
size will increase - old user-space will use the old size which the 
kernel detects and adopts to (by zeroing out that attribute space).

That way we'll always have a nice flat attributes structure, with no 
quirky chaining that has field-dependent data types ...

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 21:50           ` Ingo Molnar
@ 2009-06-09  0:50             ` Corey Ashford
  2009-06-09  6:51               ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-09  0:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

Ingo Molnar wrote:
> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> 
>> Peter Zijlstra wrote:
>>> On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
>>>> pca.ext_attrs = &feature.head; /* secretly know that there's data 
>>>> that lies past the attr struct header */
>>> Right, except its not so very secret since we have the type field
>>> telling us.
>> I see.  Thanks for the explanation.  Looks like a good plan to me!
> 
> i think we'd have a much simpler implementation by changing 
> __reserved_1 to attr_size. When the kernel adds new attributes, the 
> size will increase - old user-space will use the old size which the 
> kernel detects and adopts to (by zeroing out that attribute space).
> 
> That way we'll always have a nice flat attributes structure, with no 
> quirky chaining that has field-dependent data types ...
> 
> 	Ingo

If I understand you correctly, you would simply make perf_counter_attr larger 
every time you want to add a new attribute.  Users using the new attributes 
would call sys_perf_counter_open with a larger attr_size value.

What about arch-dependent attributes?  Would you want to place them all in the 
perf_counter_attr struct?  I suppose this could be done by #include'ing an 
arch-specific .h file.

- Corey


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-08 17:25 [PATCH] perf_counter: extensible perf_counter_attr Peter Zijlstra
  2009-06-08 19:02 ` Corey Ashford
@ 2009-06-09  4:17 ` Paul Mackerras
  2009-06-09  6:53   ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread
From: Paul Mackerras @ 2009-06-09  4:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Corey Ashford, linux-kernel

Peter Zijlstra writes:

> Allow extending the perf_counter_attr structure by linking extended
> structures to it.

...

> @@ -175,6 +175,13 @@ struct perf_counter_attr {
>  	__u32			__reserved_3;
>  
>  	__u64			__reserved_4;
> +
> +	struct perf_counter_attr_ext *ext_attrs;
> +};

Since this is a user-visible structure, you've just introduced an ABI
difference between 32-bit and 64-bit processes, which we've managed to
avoid so far.  Assuming that is that you intend eventually to use the
value that userspace puts there, which you don't at the moment.  Was
that your intention?

Paul.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09  0:50             ` Corey Ashford
@ 2009-06-09  6:51               ` Ingo Molnar
  2009-06-09  8:13                 ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-06-09  6:51 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> If I understand you correctly, you would simply make 
> perf_counter_attr larger every time you want to add a new 
> attribute.  Users using the new attributes would call 
> sys_perf_counter_open with a larger attr_size value.

Yes, exactly. Basically ABIs in Linux only get extended (never 
shrunk and never changed) so it's not like we ever want to (or can) 
shrink the size of the structure or change its semantics.

Each future extension gives the structure a new, unique size - which 
also acts as an 'ABI version' identifier, in a pretty robust way. We 
check this 'ABI version' (the structure size) in the kernel code so 
it's not just a passive 'version field' thing.

Here are the various compatibility variations:

 - same-version kernel and user-space: they both use the same 
   attr_size value and support the full set of features.

 - old user-space running on new kernel: works fine, as the kernel 
   will do a short copy and zero out the remaining attributes.

 - new user-space running on old kernel: the kernel returns -ENOTSUP 
   and user-space has a choice to refuse to run cleanly - or, if an 
   old ABI version is widespread, might chose to utilize the old,
   smaller attribute structure size field (at the cost of not using
   new attribute features, obviously).

( Additional detail: in the size mismatch failure case the kernel 
  should write back the supported size into attr_size, so that 
  user-space knows which precise ABI variant it deals with on the 
  kernel side. )

This kind of ABI maintenance method has a number of substantial 
advantages:

 - It is very compatible (see above)

 - It is extensible easily and in an unlimited way - we just extend 
   the structure size.

 - It is very clean on the kernel side and the user side as well,
   because we just have a single attribute structure.

 - It makes the ABI 'version' field an _active_ component of 
   functionality - so there is no way for subtle breakages to slip 
   in.

 - New attributes are prime-time members of the attribute structure, 
   not second-class citizens that first have to be read in via an
   elaborate chaining mechanism at extra cost.

[ If only all our syscall ABIs used this technique :-) It would be 
  so much easier to extend syscalls cleanly - without having to go 
  through the expensive and time-consuming process to add new
  syscalls. ]

> What about arch-dependent attributes?  Would you want to place 
> them all in the perf_counter_attr struct?  I suppose this could be 
> done by #include'ing an arch-specific .h file.

What arch-dependent attributes are you thinking about? In the 
perfcounters subsystem we want to support PMU and other performance 
analysis features in a way that makes it possible for all 
architectures to make use of them.

So 'arch dependent attributes' per se are bad and against the 
perfcounters design. "Generic perfcounter feature only supported by 
a single architecture initially" is better.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09  4:17 ` Paul Mackerras
@ 2009-06-09  6:53   ` Ingo Molnar
  2009-06-09  9:58     ` Paul Mackerras
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-06-09  6:53 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Corey Ashford, linux-kernel


* Paul Mackerras <paulus@samba.org> wrote:

> Peter Zijlstra writes:
> 
> > Allow extending the perf_counter_attr structure by linking extended
> > structures to it.
> 
> ...
> 
> > @@ -175,6 +175,13 @@ struct perf_counter_attr {
> >  	__u32			__reserved_3;
> >  
> >  	__u64			__reserved_4;
> > +
> > +	struct perf_counter_attr_ext *ext_attrs;
> > +};
> 
> Since this is a user-visible structure, you've just introduced an 
> ABI difference between 32-bit and 64-bit processes, which we've 
> managed to avoid so far. [...]

This is a good point too - handling pointers in ABIs is possible but 
should be avoided as much as possible: it creates the need to 
introduce a compat_sys_perf_counter_open() and doubles the syscall 
table complexity.

Lets do the s/__reserved_1/attr_size ABI i suggested (i outlined 
various properties of it in the previous mail). Agreed?

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09  6:51               ` Ingo Molnar
@ 2009-06-09  8:13                 ` Corey Ashford
  2009-06-09 11:53                   ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-09  8:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

Ingo Molnar wrote:
> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
>
>   
>> If I understand you correctly, you would simply make 
>> perf_counter_attr larger every time you want to add a new 
>> attribute.  Users using the new attributes would call 
>> sys_perf_counter_open with a larger attr_size value.
>>     
>
> Yes, exactly. Basically ABIs in Linux only get extended (never 
> shrunk and never changed) so it's not like we ever want to (or can) 
> shrink the size of the structure or change its semantics.
>
> Each future extension gives the structure a new, unique size - which 
> also acts as an 'ABI version' identifier, in a pretty robust way. We 
> check this 'ABI version' (the structure size) in the kernel code so 
> it's not just a passive 'version field' thing.
>
> Here are the various compatibility variations:
>
>  - same-version kernel and user-space: they both use the same 
>    attr_size value and support the full set of features.
>
>  - old user-space running on new kernel: works fine, as the kernel 
>    will do a short copy and zero out the remaining attributes.
>
>  - new user-space running on old kernel: the kernel returns -ENOTSUP 
>    and user-space has a choice to refuse to run cleanly - or, if an 
>    old ABI version is widespread, might chose to utilize the old,
>    smaller attribute structure size field (at the cost of not using
>    new attribute features, obviously).
>
> ( Additional detail: in the size mismatch failure case the kernel 
>   should write back the supported size into attr_size, so that 
>   user-space knows which precise ABI variant it deals with on the 
>   kernel side. )
>
> This kind of ABI maintenance method has a number of substantial 
> advantages:
>
>  - It is very compatible (see above)
>
>  - It is extensible easily and in an unlimited way - we just extend 
>    the structure size.
>
>  - It is very clean on the kernel side and the user side as well,
>    because we just have a single attribute structure.
>
>  - It makes the ABI 'version' field an _active_ component of 
>    functionality - so there is no way for subtle breakages to slip 
>    in.
>
>  - New attributes are prime-time members of the attribute structure, 
>    not second-class citizens that first have to be read in via an
>    elaborate chaining mechanism at extra cost.
>
> [ If only all our syscall ABIs used this technique :-) It would be 
>   so much easier to extend syscalls cleanly - without having to go 
>   through the expensive and time-consuming process to add new
>   syscalls. ]
>   
Thanks for the detailed explanation :)

>> What about arch-dependent attributes?  Would you want to place 
>> them all in the perf_counter_attr struct?  I suppose this could be 
>> done by #include'ing an arch-specific .h file.
>>     
>
> What arch-dependent attributes are you thinking about? In the 
> perfcounters subsystem we want to support PMU and other performance 
> analysis features in a way that makes it possible for all 
> architectures to make use of them.
>
> So 'arch dependent attributes' per se are bad and against the 
> perfcounters design. "Generic perfcounter feature only supported by 
> a single architecture initially" is better.
>   
Well, I think Intel has PEBS and AMD has some similar mechanism.  I 
would guess that at some point you would want to provide access to those 
PMU features via these attributes.  Since these mechanisms are very 
chip-specific, I don't think you would want to try to create an 
arch-independent interface to them.  There may be future mechanisms that 
only make sense on one particular chip design, and would therefore not 
be a candidate for wider use, but would still make sense to provide some 
support for that mechanism via the attributes.

Did you have some different plan for PEBS  (etc.) ?

- Corey


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09  6:53   ` Ingo Molnar
@ 2009-06-09  9:58     ` Paul Mackerras
  0 siblings, 0 replies; 19+ messages in thread
From: Paul Mackerras @ 2009-06-09  9:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Corey Ashford, linux-kernel

Ingo Molnar writes:

> * Paul Mackerras <paulus@samba.org> wrote:

> > Since this is a user-visible structure, you've just introduced an 
> > ABI difference between 32-bit and 64-bit processes, which we've 
> > managed to avoid so far. [...]
> 
> This is a good point too - handling pointers in ABIs is possible but 
> should be avoided as much as possible: it creates the need to 
> introduce a compat_sys_perf_counter_open() and doubles the syscall 
> table complexity.
> 
> Lets do the s/__reserved_1/attr_size ABI i suggested (i outlined 
> various properties of it in the previous mail). Agreed?

Yep, sounds good.

We also need a way to allow access to random machine-specific features
that might not be supported in a generic way, such as the instruction
matching CAM on POWER4/PPC970, or (apparently) PEBS.  That's what I
had intended exclusive groups + the extra_config_len field to be used
for, but Peter removed extra_config_len...

Paul.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09  8:13                 ` Corey Ashford
@ 2009-06-09 11:53                   ` Ingo Molnar
  2009-06-09 16:44                     ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-06-09 11:53 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel


* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

>> So 'arch dependent attributes' per se are bad and against the 
>> perfcounters design. "Generic perfcounter feature only supported 
>> by a single architecture initially" is better.
>   
> Well, I think Intel has PEBS and AMD has some similar mechanism.  
> I would guess that at some point you would want to provide access 
> to those PMU features via these attributes.  Since these 
> mechanisms are very chip-specific, I don't think you would want to 
> try to create an arch-independent interface to them.  There may be 
> future mechanisms that only make sense on one particular chip 
> design, and would therefore not be a candidate for wider use, but 
> would still make sense to provide some support for that mechanism 
> via the attributes.
>
> Did you have some different plan for PEBS (etc.) ?

I think PEBS is best supported by a generic abstraction. Something 
like this: it's basically a special sampling format, that generates 
a record of:

	struct pt_regs regs;
	__u64 insn_latency; /* optional */
	__u64 data_address; /* optional */

this is pretty generic.

The raw CPU records have a CPU specific format, and they have to be 
demultiplexed anyway (on Nehalem, which can have up to four separate 
PEBS counters - but each output into the same DS area), so the 
lowlevel arch code converts the CPU record into the above generic 
sample record when it copies it into the mmap pages. It's a quick 
copy so no big deal performance-wise.

( Details:

   - there might be some additional complications from sampling 
     32-bit contexts, but that too is a mostly low level detail that 
     gets hidden.

   - we might use a tiny bit more compact registers structure than
     struct pt_regs. OTOH it's a well-known structure so it makes 
     sense to standardize on it, even if the CPU doesnt sample all 
     registers.
)

Can you see desirable PEBS-alike PMU features that cannot be 
expressed via such means?

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09 11:53                   ` Ingo Molnar
@ 2009-06-09 16:44                     ` Corey Ashford
  2009-06-09 22:00                       ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-09 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Corey Ashford, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

Thanks for your reply, Ingo.

Ingo Molnar wrote:
> I think PEBS is best supported by a generic abstraction. Something 
> like this: it's basically a special sampling format, that generates 
> a record of:
>
> 	struct pt_regs regs;
> 	__u64 insn_latency; /* optional */
> 	__u64 data_address; /* optional */
>
> this is pretty generic.
>
> The raw CPU records have a CPU specific format, and they have to be 
> demultiplexed anyway (on Nehalem, which can have up to four separate 
> PEBS counters - but each output into the same DS area), so the 
> lowlevel arch code converts the CPU record into the above generic 
> sample record when it copies it into the mmap pages. It's a quick 
> copy so no big deal performance-wise.
>
> ( Details:
>
>    - there might be some additional complications from sampling 
>      32-bit contexts, but that too is a mostly low level detail that 
>      gets hidden.
>
>    - we might use a tiny bit more compact registers structure than
>      struct pt_regs. OTOH it's a well-known structure so it makes 
>      sense to standardize on it, even if the CPU doesnt sample all 
>      registers.
> )
>
>   
I see, so that's how you'd return the data.  How would a user specify 
that they want to use PEBS?

Another observation is that you'd need some sort of bit vector, or at 
the least a document, that describes which registers are valid in the 
pt_regs struct.

> Can you see desirable PEBS-alike PMU features that cannot be 
> expressed via such means?
>
> 	Ingo
Power PMU's provide some fairly complex features, such as a thresholding 
mechanism which is used for marking instructions, and also there's an 
Instruction Matching CAM which can be used to mark only on certain 
instruction types.  Since these features are present only on Power, I'm 
not sure it makes sense to go to the trouble of abstracting them for use 
on other arch/chip designs.

- Corey

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09 16:44                     ` Corey Ashford
@ 2009-06-09 22:00                       ` Ingo Molnar
  2009-06-09 23:16                         ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2009-06-09 22:00 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel


* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> Thanks for your reply, Ingo.
>
> Ingo Molnar wrote:
>> I think PEBS is best supported by a generic abstraction. Something  
>> like this: it's basically a special sampling format, that generates a 
>> record of:
>>
>> 	struct pt_regs regs;
>> 	__u64 insn_latency; /* optional */
>> 	__u64 data_address; /* optional */
>>
>> this is pretty generic.
>>
>> The raw CPU records have a CPU specific format, and they have to be  
>> demultiplexed anyway (on Nehalem, which can have up to four separate  
>> PEBS counters - but each output into the same DS area), so the  
>> lowlevel arch code converts the CPU record into the above generic  
>> sample record when it copies it into the mmap pages. It's a quick copy 
>> so no big deal performance-wise.
>>
>> ( Details:
>>
>>    - there might be some additional complications from sampling      
>> 32-bit contexts, but that too is a mostly low level detail that      
>> gets hidden.
>>
>>    - we might use a tiny bit more compact registers structure than
>>      struct pt_regs. OTOH it's a well-known structure so it makes      
>> sense to standardize on it, even if the CPU doesnt sample all      
>> registers.
>> )
>>
>>   
> I see, so that's how you'd return the data.  How would a user 
> specify that they want to use PEBS?

They wouldnt need to. PEBS would result in more precise samples when 
certain counters are used - and transparently so.

The non-PEBS NMI samples are pretty accurate to begin with, so it's 
not a _major_ difference in quality.

PEBS would also auto-activate when a user wants to sample say the 
instruction latency as well - and on non-PEBS hardware the platform 
code would refuse to open the counter.

PEBS could also be used to reduce the number of NMIs needed - so 
it's a transparent speedup as well.

> Another observation is that you'd need some sort of bit vector, or 
> at the least a document, that describes which registers are valid 
> in the pt_regs struct.

Initially i think we should use PEBS to transparently enhance 
regular IP samples.

What would we use the other registers for? Many registers will have 
stack context dependencies so the PEBS data cannot really be 
analyzed in any meaningful way after the fact.

(except perhaps for the narrow purpose of debugging)

>> Can you see desirable PEBS-alike PMU features that cannot be expressed 
>> via such means?
>
> Power PMU's provide some fairly complex features, such as a 
> thresholding mechanism which is used for marking instructions, and 
> also there's an Instruction Matching CAM which can be used to mark 
> only on certain instruction types.  Since these features are 
> present only on Power, I'm not sure it makes sense to go to the 
> trouble of abstracting them for use on other arch/chip designs.

Could you please describe the low level semantics more accurately - 
at least of the ones you find to be the most useful in practice?

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09 22:00                       ` Ingo Molnar
@ 2009-06-09 23:16                         ` Corey Ashford
  2009-06-10  0:14                           ` Paul Mackerras
  0 siblings, 1 reply; 19+ messages in thread
From: Corey Ashford @ 2009-06-09 23:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel



Ingo Molnar wrote:
> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> 
>> Thanks for your reply, Ingo.
>>
>> Ingo Molnar wrote:
>>> I think PEBS is best supported by a generic abstraction. Something  
>>> like this: it's basically a special sampling format, that generates a 
>>> record of:
>>>
>>> 	struct pt_regs regs;
>>> 	__u64 insn_latency; /* optional */
>>> 	__u64 data_address; /* optional */
>>>
>>> this is pretty generic.
>>>
>>> The raw CPU records have a CPU specific format, and they have to be  
>>> demultiplexed anyway (on Nehalem, which can have up to four separate  
>>> PEBS counters - but each output into the same DS area), so the  
>>> lowlevel arch code converts the CPU record into the above generic  
>>> sample record when it copies it into the mmap pages. It's a quick copy 
>>> so no big deal performance-wise.
>>>
>>> ( Details:
>>>
>>>    - there might be some additional complications from sampling      
>>> 32-bit contexts, but that too is a mostly low level detail that      
>>> gets hidden.
>>>
>>>    - we might use a tiny bit more compact registers structure than
>>>      struct pt_regs. OTOH it's a well-known structure so it makes      
>>> sense to standardize on it, even if the CPU doesnt sample all      
>>> registers.
>>> )
>>>
>>>   
>> I see, so that's how you'd return the data.  How would a user 
>> specify that they want to use PEBS?
> 
> They wouldnt need to. PEBS would result in more precise samples when 
> certain counters are used - and transparently so.
> 
> The non-PEBS NMI samples are pretty accurate to begin with, so it's 
> not a _major_ difference in quality.
> 
> PEBS would also auto-activate when a user wants to sample say the 
> instruction latency as well - and on non-PEBS hardware the platform 
> code would refuse to open the counter.
> 
> PEBS could also be used to reduce the number of NMIs needed - so 
> it's a transparent speedup as well.
> 
>> Another observation is that you'd need some sort of bit vector, or 
>> at the least a document, that describes which registers are valid 
>> in the pt_regs struct.
> 
> Initially i think we should use PEBS to transparently enhance 
> regular IP samples.
> 
> What would we use the other registers for? Many registers will have 
> stack context dependencies so the PEBS data cannot really be 
> analyzed in any meaningful way after the fact.

Well, you were the one that offered up the idea of using pt_regs as an 
abstraction.  For the case of PEBS, I think 98% of it would be emptry since the 
sampling mechanism can get a PC and maybe a data address (?).

> 
> (except perhaps for the narrow purpose of debugging)
> 
>>> Can you see desirable PEBS-alike PMU features that cannot be expressed 
>>> via such means?
>> Power PMU's provide some fairly complex features, such as a 
>> thresholding mechanism which is used for marking instructions, and 
>> also there's an Instruction Matching CAM which can be used to mark 
>> only on certain instruction types.  Since these features are 
>> present only on Power, I'm not sure it makes sense to go to the 
>> trouble of abstracting them for use on other arch/chip designs.
> 
> Could you please describe the low level semantics more accurately - 
> at least of the ones you find to be the most useful in practice?

Ok, some disclosure here: we have not yet supported either of these features in 
the Power PMU's in any open source code (and perhaps the proprietary code too, 
but I don't know about that).  Since these features are described in an IBM 
proprietary document, I can't describe how they work here, except to say that 
they are present in the chip.

I realize this is not very useful to come here and complain there's no support 
for something I can't describe!  (maybe later or maybe never... not sure)

But in general I do think there should be some sort of provision in the 
interface for some chip-specific features for which fully abstracting their 
function doesn't make a lot of sense.  At some point, if we find that there's a 
good reason to support these features in open source code (and thereby divulging 
how they work), it would be nice if there was an avenue for adding such support 
to PCL.

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-09 23:16                         ` Corey Ashford
@ 2009-06-10  0:14                           ` Paul Mackerras
  2009-06-10 22:06                             ` Corey Ashford
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Mackerras @ 2009-06-10  0:14 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

Corey Ashford writes:

> Ok, some disclosure here: we have not yet supported either of these features in 
> the Power PMU's in any open source code (and perhaps the proprietary code too, 
> but I don't know about that).  Since these features are described in an IBM 
> proprietary document, I can't describe how they work here, except to say that 
> they are present in the chip.

Actually, the public PPC970FX user manual has a chapter on the
performance monitor unit which describes both thresholding and the
instruction matching CAM, among other things.

I think we can support thresholding using higher-order bits of the
event code when the low bits == PM_THRESH_TIMEO.  We can only have one
PM_THRESH_TIMEO event on the PMU at any given time, so there can't be
any conflict over the thresholder settings.

The IMC is more problematic, but we can't do much with it on POWER5
and later processors anyway, due to various things being accessible
only in hypervisor mode, so I have been ignoring it. :)

Paul.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf_counter: extensible perf_counter_attr
  2009-06-10  0:14                           ` Paul Mackerras
@ 2009-06-10 22:06                             ` Corey Ashford
  0 siblings, 0 replies; 19+ messages in thread
From: Corey Ashford @ 2009-06-10 22:06 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel

Paul Mackerras wrote:
> Corey Ashford writes:
> 
>> Ok, some disclosure here: we have not yet supported either of these features in 
>> the Power PMU's in any open source code (and perhaps the proprietary code too, 
>> but I don't know about that).  Since these features are described in an IBM 
>> proprietary document, I can't describe how they work here, except to say that 
>> they are present in the chip.
> 
> Actually, the public PPC970FX user manual has a chapter on the
> performance monitor unit which describes both thresholding and the
> instruction matching CAM, among other things.

Ah, good point!

Anyone interested can find the 970FX (G5) manual here: www.cs.lth.se/EDA116/G5.pdf
It's supposed to be on the IBM web site too, but for some reason, the link is 
broken (reported!).

> 
> I think we can support thresholding using higher-order bits of the
> event code when the low bits == PM_THRESH_TIMEO.  We can only have one
> PM_THRESH_TIMEO event on the PMU at any given time, so there can't be
> any conflict over the thresholder settings.
> 
> The IMC is more problematic, but we can't do much with it on POWER5
> and later processors anyway, due to various things being accessible
> only in hypervisor mode, so I have been ignoring it. :)
> 
> Paul.

Ug.  Well, there's always bare-metal Linux :)

- Corey


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-06-10 22:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-06-08 17:25 [PATCH] perf_counter: extensible perf_counter_attr Peter Zijlstra
2009-06-08 19:02 ` Corey Ashford
2009-06-08 19:51   ` Peter Zijlstra
2009-06-08 21:18     ` Corey Ashford
2009-06-08 21:23       ` Peter Zijlstra
2009-06-08 21:29         ` Corey Ashford
2009-06-08 21:50           ` Ingo Molnar
2009-06-09  0:50             ` Corey Ashford
2009-06-09  6:51               ` Ingo Molnar
2009-06-09  8:13                 ` Corey Ashford
2009-06-09 11:53                   ` Ingo Molnar
2009-06-09 16:44                     ` Corey Ashford
2009-06-09 22:00                       ` Ingo Molnar
2009-06-09 23:16                         ` Corey Ashford
2009-06-10  0:14                           ` Paul Mackerras
2009-06-10 22:06                             ` Corey Ashford
2009-06-09  4:17 ` Paul Mackerras
2009-06-09  6:53   ` Ingo Molnar
2009-06-09  9:58     ` Paul Mackerras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox