perf_counters issue with PERF_SAMPLE

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* perf_counters issue with PERF_SAMPLE_GROUP
@ 2009-08-11 15:41 stephane eranian
  2009-08-11 16:05 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: stephane eranian @ 2009-08-11 15:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

Hi,

It seems to me there is a problem with the group counter values
when you use PERF_SAMPLE_GROUP. The counts are bogus
for all events.

Test case is pretty simple:
    - single group, 2 events
    - sampling on PERF_COUNT_HW_CYCLES
    - other event is PERF_COUNT_HW_CYCLES
    - leader has SAMPLE_IP|SAMPLE_GROUP
    - no inheritance
    - single thread
    - using sampling in one shot mode with PERF_COUNTER_IOC_REFRESH
    - all events but leader start with disabled = 0 (i.e., enabled)
    - sampling period is 240000000 (cycles)

Notification 1: ip=0x401300  39100608 PERF_COUNT_HW_CPU_CYCLES (12)
Notification 2: ip=0x401300  17991616 PERF_COUNT_HW_CPU_CYCLES (12)
Notification 3: ip=0x401300  17981248 PERF_COUNT_HW_CPU_CYCLES (12)
Notification 4: ip=0x401300  9409478912 PERF_COUNT_HW_CPU_CYCLES (12)

I would expect the value for the 2nd event to be close to 240000000.
But instead,
it is going up and down. The IP, nr and id (12) fields are correct, so
the parsing of
the buffer is correct. This is with the latest from Linus's 2.6.31-rc5.

Related to PERF_SAMPLE_GROUP, I believe there is some information missing.
You need to provide the TIMING information because in the case of SAMPLE_GROUP
you'd like to be able to scale the values of the counters you are
collecting. And you
need the timing at the moment, the sample was recorded not later.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-11 15:41 perf_counters issue with PERF_SAMPLE_GROUP stephane eranian
@ 2009-08-11 16:05 ` Peter Zijlstra
  2009-08-11 19:40   ` stephane eranian
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-08-11 16:05 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Tue, 2009-08-11 at 17:41 +0200, stephane eranian wrote:
> Hi,
> 
> It seems to me there is a problem with the group counter values
> when you use PERF_SAMPLE_GROUP. The counts are bogus
> for all events.
> 
> Test case is pretty simple:
>     - single group, 2 events
>     - sampling on PERF_COUNT_HW_CYCLES
>     - other event is PERF_COUNT_HW_CYCLES
>     - leader has SAMPLE_IP|SAMPLE_GROUP
>     - no inheritance
>     - single thread
>     - using sampling in one shot mode with PERF_COUNTER_IOC_REFRESH
>     - all events but leader start with disabled = 0 (i.e., enabled)
>     - sampling period is 240000000 (cycles)
> 
> Notification 1: ip=0x401300  39100608 PERF_COUNT_HW_CPU_CYCLES (12)
> Notification 2: ip=0x401300  17991616 PERF_COUNT_HW_CPU_CYCLES (12)
> Notification 3: ip=0x401300  17981248 PERF_COUNT_HW_CPU_CYCLES (12)
> Notification 4: ip=0x401300  9409478912 PERF_COUNT_HW_CPU_CYCLES (12)
> 
> I would expect the value for the 2nd event to be close to 240000000.
> But instead,
> it is going up and down. The IP, nr and id (12) fields are correct, so
> the parsing of
> the buffer is correct. This is with the latest from Linus's 2.6.31-rc5.

Could have broken somewhere along the line, the group stuff doesn't get
tested a lot, if at all.

perf used to have some support for it, not sure what the current state
is.

You seem to have forgotten to append your test.c though :-)

> Related to PERF_SAMPLE_GROUP, I believe there is some information missing.
> You need to provide the TIMING information because in the case of SAMPLE_GROUP
> you'd like to be able to scale the values of the counters you are
> collecting. And you
> need the timing at the moment, the sample was recorded not later.

Right, so something like the below, possibly complemented with having
PERF_COUNTER_IOC_RESET also reset the run-times?

---
 include/linux/perf_counter.h |    3 +++
 kernel/perf_counter.c        |   20 ++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 2b36afe..44a056b 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -365,10 +365,13 @@ enum perf_event_type {
 	 * 	{ u64			period;   } && PERF_SAMPLE_PERIOD
 	 *
 	 *	{ u64			nr;
+	 *	  { u64		time_enabled; } && PERF_FORMAT_ENABLED
+	 *	  { u64		time_running; } && PERF_FORMAT_RUNNING
 	 *	  { u64 id, val; }	cnt[nr];  } && PERF_SAMPLE_GROUP
 	 *
 	 *	{ u64			nr,
 	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
+	 *
 	 *	{ u32			size;
 	 *	  char                  data[size];}&& PERF_SAMPLE_RAW
 	 * };
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index e26d2fc..e61e701 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -2636,6 +2636,7 @@ void perf_counter_output(struct perf_counter *counter, int nmi,
 {
 	int ret;
 	u64 sample_type = counter->attr.sample_type;
+	u64 read_format = counter->attr.read_format;
 	struct perf_output_handle handle;
 	struct perf_event_header header;
 	u64 ip;
@@ -2703,6 +2704,10 @@ void perf_counter_output(struct perf_counter *counter, int nmi,
 	if (sample_type & PERF_SAMPLE_GROUP) {
 		header.size += sizeof(u64) +
 			counter->nr_siblings * sizeof(group_entry);
+		if (read_format & PERF_FORMAT_ENABLED)
+			header.size += sizeof(u64);
+		if (read_format & PERF_FORMAT_RUNNING)
+			header.size += sizeof(u64);
 	}
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
@@ -2765,9 +2770,20 @@ void perf_counter_output(struct perf_counter *counter, int nmi,
 	 */
 	if (sample_type & PERF_SAMPLE_GROUP) {
 		struct perf_counter *leader, *sub;
-		u64 nr = counter->nr_siblings;
+		u64 val;
+
+		val = counter->nr_siblings;
+		perf_output_put(&handle, val);
 
-		perf_output_put(&handle, nr);
+		if (read_format & PERF_FORMAT_ENABLED) {
+			val = counter->total_time_enabled;
+			perf_output_put(&handle, val);
+		}
+
+		if (read_format & PERF_FORMAT_RUNNING) {
+			val = counter->total_time_running;
+			perf_output_put(&handle, val);
+		}
 
 		leader = counter->group_leader;
 		list_for_each_entry(sub, &leader->sibling_list, list_entry) {



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-11 16:05 ` Peter Zijlstra
@ 2009-08-11 19:40   ` stephane eranian
  2009-08-11 20:55     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: stephane eranian @ 2009-08-11 19:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Tue, Aug 11, 2009 at 6:05 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> On Tue, 2009-08-11 at 17:41 +0200, stephane eranian wrote:
>> Hi,
>>
>> It seems to me there is a problem with the group counter values
>> when you use PERF_SAMPLE_GROUP. The counts are bogus
>> for all events.
>>
>> Test case is pretty simple:
>>     - single group, 2 events
>>     - sampling on PERF_COUNT_HW_CYCLES
>>     - other event is PERF_COUNT_HW_CYCLES
>>     - leader has SAMPLE_IP|SAMPLE_GROUP
>>     - no inheritance
>>     - single thread
>>     - using sampling in one shot mode with PERF_COUNTER_IOC_REFRESH
>>     - all events but leader start with disabled = 0 (i.e., enabled)
>>     - sampling period is 240000000 (cycles)
>>
>> Notification 1: ip=0x401300  39100608 PERF_COUNT_HW_CPU_CYCLES (12)
>> Notification 2: ip=0x401300  17991616 PERF_COUNT_HW_CPU_CYCLES (12)
>> Notification 3: ip=0x401300  17981248 PERF_COUNT_HW_CPU_CYCLES (12)
>> Notification 4: ip=0x401300  9409478912 PERF_COUNT_HW_CPU_CYCLES (12)
>>
>> I would expect the value for the 2nd event to be close to 240000000.
>> But instead,
>> it is going up and down. The IP, nr and id (12) fields are correct, so
>> the parsing of
>> the buffer is correct. This is with the latest from Linus's 2.6.31-rc5.
>
> Could have broken somewhere along the line, the group stuff doesn't get
> tested a lot, if at all.
>
> perf used to have some support for it, not sure what the current state
> is.
>
> You seem to have forgotten to append your test.c though :-)
>
Can't send you the program because it uses extra bits and pieces
which are hard to remove. Otherwise I would have send it already.
But I think it boils down to the following piece of code in
perf_counter_output():
                leader = counter->group_leader;
                list_for_each_entry(sub, &leader->sibling_list, list_entry) {
                        if (sub != counter)
                                sub->pmu->read(sub);

                        group_entry.id = primary_counter_id(sub);
                        group_entry.counter = atomic64_read(&sub->count);

                        perf_output_put(&handle, group_entry);
                }


>> Related to PERF_SAMPLE_GROUP, I believe there is some information missing.
>> You need to provide the TIMING information because in the case of SAMPLE_GROUP
>> you'd like to be able to scale the values of the counters you are
>> collecting. And you
>> need the timing at the moment, the sample was recorded not later.
>
> Right, so something like the below, possibly complemented with having
> PERF_COUNTER_IOC_RESET also reset the run-times?
>
Yes, but don't you have a namespace issue between PERF_FORMAT_* and
PERF_SAMPLE_* in the patch below? I would think you want to keep them separate.

I am also wondering about why one would want one timing value and not the other.
In other words, why not group them under a single name. But maybe it is harder
to return more than one u64 per PERF_FORMAT?

> ---
>  include/linux/perf_counter.h |    3 +++
>  kernel/perf_counter.c        |   20 ++++++++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index 2b36afe..44a056b 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -365,10 +365,13 @@ enum perf_event_type {
>         *      { u64                   period;   } && PERF_SAMPLE_PERIOD
>         *
>         *      { u64                   nr;
> +        *        { u64         time_enabled; } && PERF_FORMAT_ENABLED
> +        *        { u64         time_running; } && PERF_FORMAT_RUNNING
>         *        { u64 id, val; }      cnt[nr];  } && PERF_SAMPLE_GROUP
>         *
>         *      { u64                   nr,
>         *        u64                   ips[nr];  } && PERF_SAMPLE_CALLCHAIN
> +        *
>         *      { u32                   size;
>         *        char                  data[size];}&& PERF_SAMPLE_RAW
>         * };
> diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
> index e26d2fc..e61e701 100644
> --- a/kernel/perf_counter.c
> +++ b/kernel/perf_counter.c
> @@ -2636,6 +2636,7 @@ void perf_counter_output(struct perf_counter *counter, int nmi,
>  {
>        int ret;
>        u64 sample_type = counter->attr.sample_type;
> +       u64 read_format = counter->attr.read_format;
>        struct perf_output_handle handle;
>        struct perf_event_header header;
>        u64 ip;
> @@ -2703,6 +2704,10 @@ void perf_counter_output(struct perf_counter *counter, int nmi,
>        if (sample_type & PERF_SAMPLE_GROUP) {
>                header.size += sizeof(u64) +
>                        counter->nr_siblings * sizeof(group_entry);
> +               if (read_format & PERF_FORMAT_ENABLED)
> +                       header.size += sizeof(u64);
> +               if (read_format & PERF_FORMAT_RUNNING)
> +                       header.size += sizeof(u64);
>        }
>
>        if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> @@ -2765,9 +2770,20 @@ void perf_counter_output(struct perf_counter *counter, int nmi,
>         */
>        if (sample_type & PERF_SAMPLE_GROUP) {
>                struct perf_counter *leader, *sub;
> -               u64 nr = counter->nr_siblings;
> +               u64 val;
> +
> +               val = counter->nr_siblings;
> +               perf_output_put(&handle, val);
>
> -               perf_output_put(&handle, nr);
> +               if (read_format & PERF_FORMAT_ENABLED) {
> +                       val = counter->total_time_enabled;
> +                       perf_output_put(&handle, val);
> +               }
> +
> +               if (read_format & PERF_FORMAT_RUNNING) {
> +                       val = counter->total_time_running;
> +                       perf_output_put(&handle, val);
> +               }
>
>                leader = counter->group_leader;
>                list_for_each_entry(sub, &leader->sibling_list, list_entry) {
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-11 19:40   ` stephane eranian
@ 2009-08-11 20:55     ` Peter Zijlstra
  2009-08-11 21:08       ` stephane eranian
  2009-08-12 20:54       ` stephane eranian
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-08-11 20:55 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Tue, 2009-08-11 at 21:40 +0200, stephane eranian wrote:

> > You seem to have forgotten to append your test.c though :-)
> >
> Can't send you the program because it uses extra bits and pieces
> which are hard to remove. Otherwise I would have send it already.

Those other bits aren't open source? tskk :-)

> But I think it boils down to the following piece of code in
> perf_counter_output():
>                 leader = counter->group_leader;
>                 list_for_each_entry(sub, &leader->sibling_list, list_entry) {
>                         if (sub != counter)
>                                 sub->pmu->read(sub);
> 
>                         group_entry.id = primary_counter_id(sub);
>                         group_entry.counter = atomic64_read(&sub->count);
> 
>                         perf_output_put(&handle, group_entry);
>                 }

Well, likely, but nothing obviously wrong stands out there, so now I get
to write a reproduces to see what's going wrong.

> >> Related to PERF_SAMPLE_GROUP, I believe there is some information missing.
> >> You need to provide the TIMING information because in the case of SAMPLE_GROUP
> >> you'd like to be able to scale the values of the counters you are
> >> collecting. And you
> >> need the timing at the moment, the sample was recorded not later.
> >
> > Right, so something like the below, possibly complemented with having
> > PERF_COUNTER_IOC_RESET also reset the run-times?
> >
> Yes, but don't you have a namespace issue between PERF_FORMAT_* and
> PERF_SAMPLE_* in the patch below? I would think you want to keep them separate.

Maybe, otoh we've consistently used it whenever exposing the timing
data.

> I am also wondering about why one would want one timing value and not the other.
> In other words, why not group them under a single name. But maybe it is harder
> to return more than one u64 per PERF_FORMAT?

Not really, Paul did it like that initially and we've been consistently
doing it like that -- changing it now is a bit late.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-11 20:55     ` Peter Zijlstra
@ 2009-08-11 21:08       ` stephane eranian
  2009-08-12  8:32         ` Peter Zijlstra
  2009-08-12 20:54       ` stephane eranian
  1 sibling, 1 reply; 13+ messages in thread
From: stephane eranian @ 2009-08-11 21:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Tue, Aug 11, 2009 at 10:55 PM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> On Tue, 2009-08-11 at 21:40 +0200, stephane eranian wrote:
>
>> > You seem to have forgotten to append your test.c though :-)
>> >
>> Can't send you the program because it uses extra bits and pieces
>> which are hard to remove. Otherwise I would have send it already.
>
> Those other bits aren't open source? tskk :-)
>

You don't know me well!
They will be but they are not ready yet.

>> But I think it boils down to the following piece of code in
>> perf_counter_output():
>>                 leader = counter->group_leader;
>>                 list_for_each_entry(sub, &leader->sibling_list, list_entry) {
>>                         if (sub != counter)
>>                                 sub->pmu->read(sub);
>>
>>                         group_entry.id = primary_counter_id(sub);
>>                         group_entry.counter = atomic64_read(&sub->count);
>>
>>                         perf_output_put(&handle, group_entry);
>>                 }
>
> Well, likely, but nothing obviously wrong stands out there, so now I get
> to write a reproduces to see what's going wrong.
>

>> >> Related to PERF_SAMPLE_GROUP, I believe there is some information missing.
>> >> You need to provide the TIMING information because in the case of SAMPLE_GROUP
>> >> you'd like to be able to scale the values of the counters you are
>> >> collecting. And you
>> >> need the timing at the moment, the sample was recorded not later.
>> >
>> > Right, so something like the below, possibly complemented with having
>> > PERF_COUNTER_IOC_RESET also reset the run-times?
>> >
>> Yes, but don't you have a namespace issue between PERF_FORMAT_* and
>> PERF_SAMPLE_* in the patch below? I would think you want to keep them separate.
>
> Maybe, otoh we've consistently used it whenever exposing the timing
> data.
>
How do you make sure the bits used by the PERF_FORMAT_* stuff does not collide
with PERF_SAMPLE_*?

>> I am also wondering about why one would want one timing value and not the other.
>> In other words, why not group them under a single name. But maybe it is harder
>> to return more than one u64 per PERF_FORMAT?
>
> Not really, Paul did it like that initially and we've been consistently
> doing it like that -- changing it now is a bit late.
>

But aren't you going to change the cpu, pid target stuff we discussed a couple
of weeks ago anyway?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-11 21:08       ` stephane eranian
@ 2009-08-12  8:32         ` Peter Zijlstra
  2009-08-12  9:02           ` Ingo Molnar
  2009-08-13 23:49           ` Paul Mackerras
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2009-08-12  8:32 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Tue, 2009-08-11 at 23:08 +0200, stephane eranian wrote:

> > Not really, Paul did it like that initially and we've been consistently
> > doing it like that -- changing it now is a bit late.
> >
> 
> But aren't you going to change the cpu, pid target stuff we discussed a couple
> of weeks ago anyway?

Right, I'd like to, but Ingo doesn't. I haven't heard back from Paul on
this.

If we're going to do that we might as well fold these two FORMAT bits
into one or something.




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-12  8:32         ` Peter Zijlstra
@ 2009-08-12  9:02           ` Ingo Molnar
  2009-08-12 12:22             ` stephane eranian
  2009-08-13 23:49           ` Paul Mackerras
  1 sibling, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2009-08-12  9:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2009-08-11 at 23:08 +0200, stephane eranian wrote:
> 
> > But aren't you going to change the cpu, pid target stuff we 
> > discussed a couple of weeks ago anyway?
> 
> Right, I'd like to, but Ingo doesn't. I haven't heard back from 
> Paul on this.

Not sure we want to change it. Mixing PID and CPU into the same 
space disallows the simultaneous application of both. I.e. right now 
we allow 3 models:

 - PID-ish
 - CPU-ish
 - PID and CPU [say measure CPU#2 component of an inherited workload.]

Also, i dont really see the use-cases for new targets. (i've seen a 
few mentioned but none seemed valid) What new targets do people have 
in mind?

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-12  9:02           ` Ingo Molnar
@ 2009-08-12 12:22             ` stephane eranian
  2009-08-13  9:46               ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: stephane eranian @ 2009-08-12 12:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Wed, Aug 12, 2009 at 11:02 AM, Ingo Molnar<mingo@elte.hu> wrote:
>
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
>> On Tue, 2009-08-11 at 23:08 +0200, stephane eranian wrote:
>>
>> > But aren't you going to change the cpu, pid target stuff we
>> > discussed a couple of weeks ago anyway?
>>
>> Right, I'd like to, but Ingo doesn't. I haven't heard back from
>> Paul on this.
>
> Not sure we want to change it. Mixing PID and CPU into the same
> space disallows the simultaneous application of both. I.e. right now
> we allow 3 models:
>
>  - PID-ish
>  - CPU-ish
>  - PID and CPU [say measure CPU#2 component of an inherited workload.]
>
How useful is that last model, especially why only one CPU?

> Also, i dont really see the use-cases for new targets. (i've seen a
> few mentioned but none seemed valid) What new targets do people have
> in mind?

I seem to recall people mentioned:
   1- CPU socket, e.g., uncore PMU
   2- chipset
   3- GPU

I can see 1/ being indirectly achievable by specifying a CPU.
But the others are uncorrelated to either a CPU or thread.
I have already seen requests for accessing chipsets, and
seems GPU are around the corner now.

Why do you think those would be invalid targets given
the goal of this API?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-12 12:22             ` stephane eranian
@ 2009-08-13  9:46               ` Ingo Molnar
  2009-08-13 19:15                 ` stephane eranian
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2009-08-13  9:46 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

* stephane eranian <eranian@googlemail.com> wrote:

> On Wed, Aug 12, 2009 at 11:02 AM, Ingo Molnar<mingo@elte.hu> wrote:
>
> > Not sure we want to change it. Mixing PID and CPU into the same 
> > space disallows the simultaneous application of both. I.e. right 
> > now we allow 3 models:
> >
> >  - PID-ish
> >  - CPU-ish
> >  - PID and CPU [say measure CPU#2 component of an inherited workload.]
>
> How useful is that last model, especially why only one CPU?

It's somewhat useful: say on an inherited workload one could 'cut 
out' just a single CPU worth of samples.

Or a tool could implement a more scalable sampling model: say on a 
quad core CPU one could have four counters in an inherited workload:

  cycles:cpu0
  cycles:cpu1
  cycles:cpu2
  cycles:cpu3

... and depending on which CPU a sub-process or sub-thread is 
running on, would the according (nicely per cpu local) sampling 
buffer be used.

> > Also, i dont really see the use-cases for new targets. (i've 
> > seen a few mentioned but none seemed valid) What new targets do 
> > people have in mind?
> 
> I seem to recall people mentioned:
>    1- CPU socket, e.g., uncore PMU
>    2- chipset
>    3- GPU
> 
> I can see 1/ being indirectly achievable by specifying a CPU.

Correct.

( Note, it's not just indirectly achievable as a side-effect - for 
  example the Intel uncore PMU has a target CPU irq-mask, so it 
  makes sense to allow the specification of the specific CPU we are
  measuring on as well. The physical socket maps from the CPU. )

> [...] But the others are uncorrelated to either a CPU or thread. I 
> have already seen requests for accessing chipsets, and seems GPU 
> are around the corner now.
> 
> Why do you think those would be invalid targets given the goal of 
> this API?

No.

Chipset and GPU measurements are very much possible via perfcounters 
as well - but that does not require the touching of the pid,cpu 
target parameters to sys_perf_counter_open().

I think the confusion in this discussion comes from the fact that 
there are two different types of 'targets':

The first type of target, the <pid,cpu> target is a _scheduling_, 
task management abstraction. Adding a chipset ID or GPU ID to that 
makes little sense! Tasks dont get scheduled on a 'chipset' - to 
each task the chipset looks like an external entity.

The second type of target is the 'event source itself'. (and it's 
not really a target but a source.)

A chipset or GPU should be abstracted via an _event source_ 
abstraction. We've got wide possibilities to do that, and we already 
abstract a fair amount of non-CPU-sourced events that way: say we 
have irq tracepoint counters:

  aldebaran:~> perf list 2>&1 | grep irq
    irq:irq_handler_entry                      [Tracepoint event]

irqs come from the chipset, so in an (unintended) way perfcounters 
already instruments the chipset today.

So yes, both chipset and GPU sampling is very much possible, and it 
does not require the tweaking of the syscall target parameters - 
each CPU has a typically symmetric view on it.

Note that there's overlap: a CPU can be an event source and a 
scheduling target as well. I think some of the confusion in 
terminology comes from that.

To support chipset or GPU sampling, the perf_type_id and/or the 
struct perf_counter_attr space can be extended.

We'd have to see the patches to decide the best way forward - it's 
difficult to argue this hypothetically, as there are so many 
solutions (with different levels of complexity and utility) to 
expose chipsets and GPUs to perfcounters. In any case, the counter 
scheduling target parameters dont need to be touched for them.

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-13  9:46               ` Ingo Molnar
@ 2009-08-13 19:15                 ` stephane eranian
  2009-08-13 23:13                   ` Corey Ashford
  0 siblings, 1 reply; 13+ messages in thread
From: stephane eranian @ 2009-08-13 19:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

On Thu, Aug 13, 2009 at 11:46 AM, Ingo Molnar<mingo@elte.hu> wrote:
>
> * stephane eranian <eranian@googlemail.com> wrote:
>
>> On Wed, Aug 12, 2009 at 11:02 AM, Ingo Molnar<mingo@elte.hu> wrote:
>>
>> > Not sure we want to change it. Mixing PID and CPU into the same
>> > space disallows the simultaneous application of both. I.e. right
>> > now we allow 3 models:
>> >
>> >  - PID-ish
>> >  - CPU-ish
>> >  - PID and CPU [say measure CPU#2 component of an inherited workload.]
>>
>> How useful is that last model, especially why only one CPU?
>
> It's somewhat useful: say on an inherited workload one could 'cut
> out' just a single CPU worth of samples.
>
> Or a tool could implement a more scalable sampling model: say on a
> quad core CPU one could have four counters in an inherited workload:
>
>  cycles:cpu0
>  cycles:cpu1
>  cycles:cpu2
>  cycles:cpu3
>

What would you be able to do with that kind of information?
How would you modify the program to improve performance?

> ... and depending on which CPU a sub-process or sub-thread is
> running on, would the according (nicely per cpu local) sampling
> buffer be used.
>
>> > Also, i dont really see the use-cases for new targets. (i've
>> > seen a few mentioned but none seemed valid) What new targets do
>> > people have in mind?
>>
>> I seem to recall people mentioned:
>>    1- CPU socket, e.g., uncore PMU
>>    2- chipset
>>    3- GPU
>>
>> I can see 1/ being indirectly achievable by specifying a CPU.
>
> Correct.
>
> ( Note, it's not just indirectly achievable as a side-effect - for
>  example the Intel uncore PMU has a target CPU irq-mask, so it
>  makes sense to allow the specification of the specific CPU we are
>  measuring on as well. The physical socket maps from the CPU. )
>
I know that because I used it to support uncore on perfmon.
Some people have argued, though, that it could be interesting to
interrupt all cores at once on uncore overflow. That's one way to know
where all cores are at the same point in time. I wonder how you could
support that.


>> [...] But the others are uncorrelated to either a CPU or thread. I
>> have already seen requests for accessing chipsets, and seems GPU
>> are around the corner now.
>>
>> Why do you think those would be invalid targets given the goal of
>> this API?
>
> No.
>
> Chipset and GPU measurements are very much possible via perfcounters
> as well - but that does not require the touching of the pid,cpu
> target parameters to sys_perf_counter_open().
>
> I think the confusion in this discussion comes from the fact that
> there are two different types of 'targets':
>
> The first type of target, the <pid,cpu> target is a _scheduling_,
> task management abstraction. Adding a chipset ID or GPU ID to that
> makes little sense! Tasks dont get scheduled on a 'chipset' - to
> each task the chipset looks like an external entity.
>
> The second type of target is the 'event source itself'. (and it's
> not really a target but a source.)
>
> A chipset or GPU should be abstracted via an _event source_
> abstraction. We've got wide possibilities to do that, and we already
> abstract a fair amount of non-CPU-sourced events that way: say we
> have irq tracepoint counters:
>
>  aldebaran:~> perf list 2>&1 | grep irq
>    irq:irq_handler_entry                      [Tracepoint event]
>
> irqs come from the chipset, so in an (unintended) way perfcounters
> already instruments the chipset today.
>
> So yes, both chipset and GPU sampling is very much possible, and it
> does not require the tweaking of the syscall target parameters -
> each CPU has a typically symmetric view on it.
>

Except there can be many GPUs, I/O devices and other pieces of
hardware with PMU-like capabilities in a single system. In that case,
you need to be able to name them: I want to measure GPUcycles on
GPU0. When you are down at that level, you don't really care about
the CPU or thread. So what would you pass for those in that case?

> Note that there's overlap: a CPU can be an event source and a
> scheduling target as well. I think some of the confusion in
> terminology comes from that.
>
> To support chipset or GPU sampling, the perf_type_id and/or the
> struct perf_counter_attr space can be extended.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-13 19:15                 ` stephane eranian
@ 2009-08-13 23:13                   ` Corey Ashford
  0 siblings, 0 replies; 13+ messages in thread
From: Corey Ashford @ 2009-08-13 23:13 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel



stephane eranian wrote:
> On Thu, Aug 13, 2009 at 11:46 AM, Ingo Molnar<mingo@elte.hu> wrote:
>> * stephane eranian <eranian@googlemail.com> wrote:
>>
>>> On Wed, Aug 12, 2009 at 11:02 AM, Ingo Molnar<mingo@elte.hu> wrote:
[snip]
>>
>> So yes, both chipset and GPU sampling is very much possible, and it
>> does not require the tweaking of the syscall target parameters -
>> each CPU has a typically symmetric view on it.
>>
> 
> Except there can be many GPUs, I/O devices and other pieces of
> hardware with PMU-like capabilities in a single system. In that case,
> you need to be able to name them: I want to measure GPUcycles on
> GPU0. When you are down at that level, you don't really care about
> the CPU or thread. So what would you pass for those in that case?
> 
>> Note that there's overlap: a CPU can be an event source and a
>> scheduling target as well. I think some of the confusion in
>> terminology comes from that.
>>
>> To support chipset or GPU sampling, the perf_type_id and/or the
>> struct perf_counter_attr space can be extended.
>>

One solution might be to split up the cpu argument to sys_perf_counter_open into 
two fields, for example:

union {
     int   cpu;
     __u32 pmu_type   : 12,   /* 0 = CPU, 1 = Socket, 2 = GPU,
                                 3-4095 = arch-defined ... */
           pmu_number : 20;   /* when pmu_type is 0, 0 = CPU0, 1 = CPU1, etc. */
}

If cpu is not equal to -1, then it would be interpreted as the union above.  For 
the most common case where you are monitoring CPU's, you just pass the cpu 
number in without worrying about the upper bits (as long as the cpu number is 
less than 2^20!).

As time goes on, I think we will see more chips with fancy integrated GPU's and 
other types of accelerators, as well as other off-chip processors/accelerators, 
that developers will want to monitor.

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-12  8:32         ` Peter Zijlstra
  2009-08-12  9:02           ` Ingo Molnar
@ 2009-08-13 23:49           ` Paul Mackerras
  1 sibling, 0 replies; 13+ messages in thread
From: Paul Mackerras @ 2009-08-13 23:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Robert Richter, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

Peter Zijlstra writes:

> On Tue, 2009-08-11 at 23:08 +0200, stephane eranian wrote:
> 
> > > Not really, Paul did it like that initially and we've been consistently
> > > doing it like that -- changing it now is a bit late.
> > >
> > 
> > But aren't you going to change the cpu, pid target stuff we discussed a couple
> > of weeks ago anyway?
> 
> Right, I'd like to, but Ingo doesn't. I haven't heard back from Paul on
> this.

Sorry, got distracted.  What are we proposing to change exactly?

Paul.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: perf_counters issue with PERF_SAMPLE_GROUP
  2009-08-11 20:55     ` Peter Zijlstra
  2009-08-11 21:08       ` stephane eranian
@ 2009-08-12 20:54       ` stephane eranian
  1 sibling, 0 replies; 13+ messages in thread
From: stephane eranian @ 2009-08-12 20:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

Hi,

Another thing that occurred to me  while using PERF_SAMPLE_GROUP
is about that unique ID which you get out of read() with PERF_FORMAT_ID.

That number if generated by the kernel and it is a globally incrementing
64-bit counter.

An application may not get subsequent numbers even though it is issuing
perf_open_counter() in sequence. Another applications may cause the
number to change.

I believe this is not very convenient because imagine in your signal handler
you parse the GROUP and you want to relate id to the event. You need
to have a lookup table.

I believe it would be more convenient if the tool could pass a 64-bit number
itself when the event is created. It would get it back as part of
PERF_SAMPLE_GROUP.
The id could then be more related to the tool's data structure, no need for a
lookup table, overall more efficient. If I recall, this is what
epoll() can do and
this is very convenient.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-08-13 23:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-11 15:41 perf_counters issue with PERF_SAMPLE_GROUP stephane eranian
2009-08-11 16:05 ` Peter Zijlstra
2009-08-11 19:40   ` stephane eranian
2009-08-11 20:55     ` Peter Zijlstra
2009-08-11 21:08       ` stephane eranian
2009-08-12  8:32         ` Peter Zijlstra
2009-08-12  9:02           ` Ingo Molnar
2009-08-12 12:22             ` stephane eranian
2009-08-13  9:46               ` Ingo Molnar
2009-08-13 19:15                 ` stephane eranian
2009-08-13 23:13                   ` Corey Ashford
2009-08-13 23:49           ` Paul Mackerras
2009-08-12 20:54       ` stephane eranian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox