public inbox for linux-kernel@vger.kernel.org
* performance counter 20% error finding retired instruction count
@ 2009-06-24 13:59 Vince Weaver
  2009-06-24 15:10 ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2009-06-24 13:59 UTC (permalink / raw)
  To: linux-kernel

Hello

As an aside, is it time to set up a dedicated Performance Counters
for Linux mailing list?   (Hereafter referred to as p10c7l to avoid
confusion with the other implementations that have already taken
all the good abbreviated forms of the concept).  If/when the 
infrastructure appears in a released kernel, there's going to be a lot of 
chatter by people who use performance counters and suddenly find they are 
stuck with a huge step backwards in functionality.  And asking Fortran 
programmers to provide kernel patches probably won't be a productive 
response.  But I digress.


I was trying to get an exact retired instruction count from p10c7l.
I am using the test million.s, available here
  ( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
It should count exactly one million instructions.

Tests with valgrind and qemu show that it does.

Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
all give the proper result:

tobler:~% pfmon -e retired_instructions ./million
1000002 RETIRED_INSTRUCTIONS

    ( it is 1,000,002 +/-2 because on most x86 architectures the retired
      instruction count includes any hardware interrupts that might
      happen at the time.  It would be a great feature if p10c7l
      could add some way of gathering a per-process hardware
      interrupt count statistic to help quantify that).

Yet with perf on the same Athlon32 machine (using
kernel 2.6.30-03984-g45e3e19) gives:

tobler:~% perf stat ./million

  Performance counter stats for './million':

        1.519366  task-clock-ticks     #       0.835 CPU utilization factor
               3  context-switches     #       0.002 M/sec
               0  CPU-migrations       #       0.000 M/sec
              53  page-faults          #       0.035 M/sec
         2483822  cycles               #    1634.775 M/sec
         1240849  instructions         #     816.689 M/sec # 0.500 per cycle
          612685  cache-references     #     403.250 M/sec
            3564  cache-misses         #       2.346 M/sec

  Wall-clock time elapsed:     1.819226 msecs

Running multiple times gives:
    1240849
    1257312
    1242313

That's a varying error of at least 20%, and it isn't even 
consistent.  Is this because of sampling?  The documentation doesn't 
really warn about this as far as I can tell.

Thanks for any help resolving this problem

Vince



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-24 13:59 performance counter 20% error finding retired instruction count Vince Weaver
@ 2009-06-24 15:10 ` Ingo Molnar
  2009-06-25  2:12   ` Vince Weaver
  2009-06-26 18:22   ` Vince Weaver
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-24 15:10 UTC (permalink / raw)
  To: Vince Weaver, Peter Zijlstra, Paul Mackerras; +Cc: linux-kernel


* Vince Weaver <vince@deater.net> wrote:

> Hello
>
> As an aside, is it time to set up a dedicated Performance Counters
> for Linux mailing list?   (Hereafter referred to as p10c7l to avoid
> confusion with the other implementations that have already taken
> all the good abbreviated forms of the concept).

('perfcounters' is the name of the subsystem/feature and it's 
unique.)

> [...]  If/when the infrastructure appears in a released kernel, 
> there's going to be a lot of chatter by people who use performance 
> counters and suddenly find they are stuck with a huge step 
> backwards in functionality.  And asking Fortran programmers to 
> provide kernel patches probably won't be a productive response.  
> But I digress.
>
> I was trying to get an exact retired instruction count from 
> p10c7l. I am using the test million.s, available here
>
>  ( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
>
> It should count exactly one million instructions.
>
> Tests with valgrind and qemu show that it does.
>
> Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
> all give the proper result:
>
> tobler:~% pfmon -e retired_instructions ./million
> 1000002 RETIRED_INSTRUCTIONS
>
>    ( it is 1,000,002 +/-2 because on most x86 architectures the retired
>      instruction count includes any hardware interrupts that might
>      happen at the time.  It would be a great feature if p10c7l
>      could add some way of gathering a per-process hardware
>      interrupt count statistic to help quantify that).
>
> Yet with perf on the same Athlon32 machine (using
> kernel 2.6.30-03984-g45e3e19) gives:
>
> tobler:~% perf stat ./million
>
>  Performance counter stats for './million':
>
>        1.519366  task-clock-ticks     #       0.835 CPU utilization factor
>               3  context-switches     #       0.002 M/sec
>               0  CPU-migrations       #       0.000 M/sec
>              53  page-faults          #       0.035 M/sec
>         2483822  cycles               #    1634.775 M/sec
>         1240849  instructions         #     816.689 M/sec # 0.500 per cycle
>          612685  cache-references     #     403.250 M/sec
>            3564  cache-misses         #       2.346 M/sec
>
>  Wall-clock time elapsed:     1.819226 msecs
>
> Running multiple times gives:
>    1240849
>    1257312
>    1242313
>
> That's a varying error of at least 20%, and it isn't even 
> consistent.  Is this because of sampling?  The documentation 
> doesn't really warn about this as far as I can tell.
>
> Thanks for any help resolving this problem

Thanks for the question! There's still gaps in the documentation so 
let me explain the basics here:

'perf stat' counts the true cost of executing the command in 
question, including the costs of:

   fork()ing the task
   exec()-ing it
   the ELF loader resolving dynamic symbols
   the app hitting various pagefaults that instantiate its pagetables

etc.

Those operations are pretty 'noisy' on a typical CPU, with lots of 
cache effects, so the noise you see is real.

You can eliminate much of the noise by only counting user-space 
instructions, as much of the command startup cost is in 
kernel-space.

Running your test-app that way can be done the following way:

 $ perf stat --repeat 10 -e 0:1:u ./million

 Performance counter stats for './million' (10 runs):

        1002106  instructions           ( +-   0.015% )

    0.000599029  seconds time elapsed.

( note the --repeat feature of perf stat - it does a loop of command 
  executions and observes the noise and displays it. )

Those ~2100 instructions are executed by your app: as the ELF 
dynamic loader starts up your test-app.

If you have some tool that reports less than that then that tool is 
not being truthful about the true overhead of your application.

Also note that applications that only execute 1 million instructions 
are very, very rare - a modern CPU can execute billions of 
instructions per second, per core.

So i usually test a reference app that is more realistic, that 
executes 1 billion instructions:

 $ perf stat --repeat 10 -e 0:1:u ./loop_1b_instructions

 Performance counter stats for './loop_1b_instructions' (10 runs):

     1000079797  instructions           ( +-   0.000% )

    0.239947420  seconds time elapsed.

the noise there is very low. (despite 240 milliseconds still being a 
very short runtime)

Hope this helps - thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-24 15:10 ` Ingo Molnar
@ 2009-06-25  2:12   ` Vince Weaver
  2009-06-25  6:50     ` Peter Zijlstra
  2009-06-26 18:22   ` Vince Weaver
  1 sibling, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2009-06-25  2:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

On Wed, 24 Jun 2009, Ingo Molnar wrote:
> * Vince Weaver <vince@deater.net> wrote:
>
> Those ~2100 instructions are executed by your app: as the ELF
> dynamic loader starts up your test-app.
>
> If you have some tool that reports less than that then that tool is
> not being truthful about the true overhead of your application.

I wanted the instruction count of the application, not the loader.
If I wanted the overhead of the loader too, then I would have specified 
it.  I don't think it has anything to do with tools being "less than 
truthful".  I notice perf doesn't seem to include its own overheads into 
the count.

> Also note that applications that only execute 1 million instructions
> are very, very rare - a modern CPU can execute billions of
> instructions, per second, per core.

Yes, I know that.

As I hope you know, the chip designers offer no guarantees with any of the 
performance counters.  So before you can use them, you have to validate 
them a bit to make sure they are returning expected results.  Hence the 
need for microbenchmarks, one of which I used as an example.

You have to be careful with performance counters.  For example, on Pentium 
4, the retired instruction counter will have as much as 2% error on some 
of the spec2k benchmarks because the "fldcw" instruction counts as two 
instructions instead of one.

This kind of difference is important when doing validation work, and can't 
just be swept under the rug with "if you use bigger programs it doesn't 
matter".

It's also nice to be able to skip the loader overhead, as the loader can 
change from system to system and makes it hard to compare counters across 
various machines.  Though it sounds like the perf utility isn't going to 
be supporting this anytime soon.

Vince

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-25  2:12   ` Vince Weaver
@ 2009-06-25  6:50     ` Peter Zijlstra
  2009-06-25  9:13       ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2009-06-25  6:50 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Ingo Molnar, Paul Mackerras, linux-kernel

On Wed, 2009-06-24 at 22:12 -0400, Vince Weaver wrote:
> 
> It's also nice to be able to skip the loader overhead, as the loader can 
> change from system to system and makes it hard to compare counters across 
> various machines.  Though it sounds like the perf utility isn't going to 
> be supporting this anytime soon.

Feel free to contribute such if you think its important.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-25  6:50     ` Peter Zijlstra
@ 2009-06-25  9:13       ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-25  9:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Vince Weaver, Paul Mackerras, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2009-06-24 at 22:12 -0400, Vince Weaver wrote:
> > 
> > It's also nice to be able to skip the loader overhead, as the 
> > loader can change from system to system and makes it hard to 
> > compare counters across various machines.  Though it sounds like 
> > the perf utility isn't going to be supporting this anytime soon.
> 
> Feel free to contribute such if you think its important.

I'd be glad to review and test any resulting patches from Vince - 
and/or help out with pointers on where to start, and help out if 
there are any roadblocks along the way.

The kernel side bits can be found in v2.6.31-rc1, in 
kernel/perf_counter.c, include/linux/perf_counter.h and 
arch/x86/kernel/cpu/perf_counter.c. We tried to keep the code as 
hackable as possible.

The tooling bits can be found in tools/perf/ in the kernel repo. 
builtin-stat.c contains the 'perf stat' bits.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-24 15:10 ` Ingo Molnar
  2009-06-25  2:12   ` Vince Weaver
@ 2009-06-26 18:22   ` Vince Weaver
  2009-06-26 19:12     ` Peter Zijlstra
  2009-06-26 19:23     ` Vince Weaver
  1 sibling, 2 replies; 27+ messages in thread
From: Vince Weaver @ 2009-06-26 18:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

On Wed, 24 Jun 2009, Ingo Molnar wrote:
> * Vince Weaver <vince@deater.net> wrote:
>
> Those ~2100 instructions are executed by your app: as the ELF
> dynamic loader starts up your test-app.
>
> If you have some tool that reports less than that then that tool is
> not being truthful about the true overhead of your application.

Wait a second... my application is a statically linked binary.  There is 
no ELF dynamic loader involved at all.

On further investigation, all of the overhead comes _entirely_ from the 
perf utility.  This is overhead and instructions that would not occur when 
not using the perf utility.

From the best I can tell digging through the perf sources, the performance 
counters are set up and started in userspace, but instead of doing an 
immediate clone/exec, thousands of instructions worth of other stuff is 
done by perf in between.

The "pfmon" util, plus linux-user simulators like qemu and valgrind, do 
things properly.  perf can't, it seems, and this appears to be a limitation 
of the new performance counter infrastructure.


Vince

PS.  Why is the perf code littered with so many __MINGW32__ defines?
      Should this be in the kernel tree?  It makes the code really hard
      to follow.  Are there plans to port perf to Windows?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-26 18:22   ` Vince Weaver
@ 2009-06-26 19:12     ` Peter Zijlstra
  2009-06-27  5:32       ` Ingo Molnar
  2009-06-26 19:23     ` Vince Weaver
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2009-06-26 19:12 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Ingo Molnar, Paul Mackerras, linux-kernel

On Fri, 2009-06-26 at 14:22 -0400, Vince Weaver wrote:
> On Wed, 24 Jun 2009, Ingo Molnar wrote:
> > * Vince Weaver <vince@deater.net> wrote:
> >
> > Those ~2100 instructions are executed by your app: as the ELF
> > dynamic loader starts up your test-app.
> >
> > If you have some tool that reports less than that then that tool is
> > not being truthful about the true overhead of your application.
> 
> Wait a second... my application is a statically linked binary.  There is 
> no ELF dynamic loader involved at all.
> 
> On further investigation, all of the overhead comes _entirely_ from the 
> perf utility.  This is overhead and instructions that would not occur when 
> not using the perf utility.
> 
> From the best I can tell digging through the perf sources, the performance 
> counters are set up and started in userspace, but instead of doing an 
> immediate clone/exec, thousands of instructions worth of other stuff is 
> done by perf in between.
> 
> The "pfmon" util, plus linux-user simulators like qemu and valgrind, do 
> things properly.  perf can't, it seems, and this appears to be a 
> limitation of the new performance counter infrastructure.

perf can do it just fine, all you need is a will to touch ptrace().
Nothing in the perf counter design is limiting this to work.

I just can't really be bothered by this tiny and mostly constant offset,
esp if the cost is risking braindamage from touching ptrace(), but if
you think otherwise (and make the ptrace bit optional) I'm more than
willing to merge the patch.

> PS.  Why is the perf code littered with so many __MINGW32__ defines?
>       Should this be in the kernel tree?  It makes the code really hard
>       to follow.  Are there plans to port perf to Windows?

Comes straight from the git sources.. and littered might be a bit much,
I count only 11.

# git grep MING tools/perf | wc -l
11

But yeah, that might want cleaning up.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-26 18:22   ` Vince Weaver
  2009-06-26 19:12     ` Peter Zijlstra
@ 2009-06-26 19:23     ` Vince Weaver
  2009-06-27  6:04       ` performance counter ~0.4% " Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2009-06-26 19:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

On Fri, 26 Jun 2009, Vince Weaver wrote:

> From the best I can tell digging through the perf sources, the performance 
> counters are set up and started in userspace, but instead of doing an 
> immediate clone/exec, thousands of instructions worth of other stuff is done 
> by perf in between.

and for the curious, wondering how a simple
   prctl(COUNTERS_ENABLE);
   fork()
   execvp()

can cause 6000+ instructions of non-deterministic execution, it turns out 
that perf is dynamically linked.  So it has to spend 5000+ cycles in 
ld-linux.so resolving the execvp() symbol before it can actually exec.

So when trying to get accurate profiles of simple statically linked 
programs, you still have to put up with the dynamic loader overhead 
because of the way perf is designed.  nice.

Vince

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
  2009-06-26 19:12     ` Peter Zijlstra
@ 2009-06-27  5:32       ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-27  5:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Vince Weaver, Paul Mackerras, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > PS.  Why is the perf code littered with so many __MINGW32__ defines?
> >       Should this be in the kernel tree?  It makes the code really hard
> >       to follow.  Are there plans to port perf to Windows?
> 
> Comes straight from the git sources.. and littered might be a bit 
> much, I count only 11.
> 
> # git grep MING tools/perf | wc -l
> 11
> 
> But yeah, that might want cleaning up.

Indeed. I removed those bits - thanks Vince for reporting it!

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-26 19:23     ` Vince Weaver
@ 2009-06-27  6:04       ` Ingo Molnar
  2009-06-27  6:44         ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
  2009-06-27  6:48         ` performance counter ~0.4% error finding retired instruction count Paul Mackerras
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-27  6:04 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel


* Vince Weaver <vince@deater.net> wrote:

> On Fri, 26 Jun 2009, Vince Weaver wrote:
>
>> From the best I can tell digging through the perf sources, the 
>> performance counters are set up and started in userspace, but instead 
>> of doing an immediate clone/exec, thousands of instructions worth of 
>> other stuff is done by perf in between.
>
> and for the curious, wondering how a simple
>
>   prctl(COUNTERS_ENABLE);
>   fork()
>   execvp()
>
> can cause 6000+ instructions of non-deterministic execution, it 
> turns out that perf is dynamically linked.  So it has to spend 
> 5000+ cycles in ld-linux.so resolving the execvp() symbol before 
> it can actually exec.

I measured 2000, but generally a few thousand cycles per invocation 
sounds about right.

That is in the 0.0001% measurement overhead range (per 'perf stat' 
invocation) for any realistic app that does something worth 
measuring - and even with a worst-case 'cheapest app' case it is in 
the 0.2-0.4% range.

Besides, you compare perfcounters to perfmon (which you seem to be a 
contributor of), while in reality perfmon has much, much worse (and 
unfixable, because designed-in) measurement overhead.

So why are you criticising perfcounters for a 5000 cycles 
measurement overhead while perfmon has huge, _hundreds of millions_ 
of cycles measurement overhead (per second) for various realistic 
workloads? [ In fact in one of the scheduler-tests perfmon has a 
whopping measurement overhead of _nine billion_ cycles, it increased 
total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]

Why are you using a double standard here?

Here are some numbers to put the 5000 cycles startup cost into 
perspective. For example the default startup costs of even the 
simplest Linux binaries (/bin/true):

 titan:~> perf stat /bin/true

  Performance counter stats for '/bin/true':

       0.811328  task-clock-msecs     #      1.002 CPUs 
              1  context-switches     #      0.001 M/sec
              1  CPU-migrations       #      0.001 M/sec
            180  page-faults          #      0.222 M/sec
        1267713  cycles               #   1562.516 M/sec
         733772  instructions         #      0.579 IPC  
          26261  cache-references     #     32.368 M/sec
            531  cache-misses         #      0.654 M/sec

    0.000809407  seconds time elapsed

5000/1267713 cycles is in the 0.4% range. Run any app that actually 
does something beyond starting up, an app which has a chance to get 
a decent cache footprint and gets into steady state so that it gets 
stable properties that can be measured reliably - and you'll get 
into the billions of cycles range or more - at which point a few 
thousand cycles is in the 0.0001% measurement overhead range.

Compare this to the intrinsic noise of the cycles metric for a 
benchmark like hackbench:

 titan:~> perf stat -r 2 -e 0:0 -- ~/hackbench 10
 Time: 0.448
 Time: 0.447

  Performance counter stats for '/home/mingo/hackbench 10' (2 runs):

     2661715310  cycles                 ( +-   0.588% )

    0.480153304  seconds time elapsed   ( +-   0.549% )

The noise in this (very short) hackbench run above was 15 _million_ 
cycles. See how small a few thousand cycles are?

If the 5 thousand cycles measurement overhead _still_ matters to you 
under such circumstances then by all means please submit the patches 
to improve it.  Despite your claims, this is totally fixable within the 
current perfcounters design; Peter outlined the steps for how to 
solve it, and you can utilize ptrace if you want to.

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-27  6:04       ` performance counter ~0.4% " Ingo Molnar
@ 2009-06-27  6:44         ` Ingo Molnar
  2009-06-29 18:25           ` Vince Weaver
  2009-06-27  6:48         ` performance counter ~0.4% error finding retired instruction count Paul Mackerras
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2009-06-27  6:44 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


* Ingo Molnar <mingo@elte.hu> wrote:

> Besides, you compare perfcounters to perfmon (which you seem to be 
> a contributor of), while in reality perfmon has much, much worse 
> (and unfixable, because designed-in) measurement overhead.
> 
> So why are you criticising perfcounters for a 5000 cycles 
> measurement overhead while perfmon has huge, _hundreds of 
> millions_ of cycles measurement overhead (per second) for various 
> realistic workloads? [ In fact in one of the scheduler-tests 
> perfmon has a whopping measurement overhead of _nine billion_ 
> cycles, it increased total runtime of the workload from 3.3 
> seconds to 6.6 seconds. (!) ]

Here are the more detailed perfmon/pfmon measurement overhead 
numbers.

Test system is a "Intel Core2 E6800 @ 2.93GHz", 1 GB of RAM, default 
Fedora install.

I've measured two workloads:

    hackbench.c         # messaging server benchmark
    test-1m-pipes.c     # does 1 million pipe ops, similar to lat_pipe

v2.6.28+perfmon patches (v3, full):

    ./hackbench 10
    0.496400985  seconds time elapsed   ( +-   1.699% )

    pfmon --follow-fork --aggregate-results ./hackbench 10
    0.580812999  seconds time elapsed   ( +-   2.233% )

I.e. this workload runs 17% slower under pfmon, the measurement 
overhead is about 1.45 billion cycles.
 
Furthermore, when running a 'pipe latency benchmark', an app that 
does one million pipe reads and writes between two tasks (source 
code attached below), i measured the following perfmon/pfmon 
overhead:

    ./pipe-test-1m
    3.344280347  seconds time elapsed   ( +-   0.361% )

    pfmon --follow-fork --aggregate-results ./pipe-test-1m
    6.508737983  seconds time elapsed   ( +-   0.243% )

That's an about 94% measurement overhead, or about 9.2 _billion_ 
cycles overhead on this test-system.

These perfmon/pfmon overhead figures are consistently reproducible, 
and they happen on other test-systems as well, and with other 
workloads as well. Basically for any app that involves task creation 
or context-switching, perfmon adds considerable runtime overhead - 
well beyond the overhead of perfcounters.

	Ingo

-----------------{ pipe-test-1m.c }-------------------->

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>

#define LOOPS 1000000

int main (void)
{
	int pipe_1[2], pipe_2[2];
	int m = 0, i;

	pipe(pipe_1);
	pipe(pipe_2);

	if (!fork()) {
		for (i = 0; i < LOOPS; i++) {
			read(pipe_1[0], &m, sizeof(int));
			write(pipe_2[1], &m, sizeof(int));
		}
	} else {
		for (i = 0; i < LOOPS; i++) {
			write(pipe_1[1], &m, sizeof(int));
			read(pipe_2[0], &m, sizeof(int));
		}
		wait(NULL);	/* reap the child */
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-27  6:04       ` performance counter ~0.4% " Ingo Molnar
  2009-06-27  6:44         ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
@ 2009-06-27  6:48         ` Paul Mackerras
  2009-06-27 17:28           ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Mackerras @ 2009-06-27  6:48 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Vince Weaver, Peter Zijlstra, linux-kernel

Ingo Molnar writes:

> I measured 2000, but generally a few thousand cycles per invocation 
> sounds about right.

We could actually do a bit better than we do, fairly easily.  We could
attach the counters to the child after the fork instead of the parent
before the fork, using a couple of pipes for synchronization.  And
there's probably a way to get the dynamic linker to resolve the execvp
call early in the child so we avoid that overhead.  I think we should
be able to get the overhead down to tens of userspace instructions
without doing anything unnatural.

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-27  6:48         ` performance counter ~0.4% error finding retired instruction count Paul Mackerras
@ 2009-06-27 17:28           ` Ingo Molnar
  2009-06-29  2:12             ` Paul Mackerras
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2009-06-27 17:28 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Vince Weaver, Peter Zijlstra, linux-kernel


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> 
> > I measured 2000, but generally a few thousand cycles per 
> > invocation sounds about right.
> 
> We could actually do a bit better than we do, fairly easily.  We 
> could attach the counters to the child after the fork instead of 
> the parent before the fork, using a couple of pipes for 
> synchronization.  And there's probably a way to get the dynamic 
> linker to resolve the execvp call early in the child so we avoid 
> that overhead.  I think we should be able to get the overhead down 
> to tens of userspace instructions without doing anything 
> unnatural.

Definitely so.

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-27 17:28           ` Ingo Molnar
@ 2009-06-29  2:12             ` Paul Mackerras
  2009-06-29  2:13               ` Paul Mackerras
  2009-06-29  3:48               ` Ingo Molnar
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Mackerras @ 2009-06-29  2:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

I can think of three ways to eliminate the PLT resolver overhead on
execvp:

(1) Do execvp on a non-executable file first to get execvp resolved:

	char tmpname[16];
	int fd;
	char *args[2];

	strcpy(tmpname, "/tmp/perfXXXXXX");
	fd = mkstemp(tmpname);
	if (fd >= 0) {
		args[0] = tmpname;
		args[1] = NULL;
		execvp(tmpname, args);
		close(fd);
		unlink(tmpname);
	}
	enable_counters();
	execvp(prog, argv);

(2) Look up execvp in glibc and call it directly:

	int (*execptr)(const char *, char *const []);

	execptr = dlsym(RTLD_NEXT, "execvp");
	enable_counters();
	(*execptr)(prog, argv);

(3) Resolve the executable path ourselves and then invoke the execve
system call directly:

	char *execpath;

	execpath = search_path(getenv("PATH"), prog);
	enable_counters();
	syscall(__NR_execve, execpath, argv, envp);

(4) Same as (1), but rely on "" being an invalid program name for
execvp:

	execvp("", argv);
	enable_counters();
	execvp(prog, argv);

What do you guys think?  Does any of these appeal more than the
others?  I'm leaning towards (4) myself.

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-29  2:12             ` Paul Mackerras
@ 2009-06-29  2:13               ` Paul Mackerras
  2009-06-29  3:48               ` Ingo Molnar
  1 sibling, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2009-06-29  2:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

Paul Mackerras writes:

> I can think of three ways to eliminate the PLT resolver overhead on
> execvp:

s/three/four/, obviously - I thought of the 4th while I was writing
the mail.

Paul.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-29  2:12             ` Paul Mackerras
  2009-06-29  2:13               ` Paul Mackerras
@ 2009-06-29  3:48               ` Ingo Molnar
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29  3:48 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Peter Zijlstra, linux-kernel


* Paul Mackerras <paulus@samba.org> wrote:

> I can think of three ways to eliminate the PLT resolver overhead on
> execvp:
> 
> (1) Do execvp on a non-executable file first to get execvp resolved:
> 
> 	char tmpname[16];
> 	int fd;
> 	char *args[2];
> 
> 	strcpy(tmpname, "/tmp/perfXXXXXX");
> 	fd = mkstemp(tmpname);
> 	if (fd >= 0) {
> 		args[0] = tmpname;
> 		args[1] = NULL;
> 		execvp(tmpname, args);
> 		close(fd);
> 		unlink(tmpname);
> 	}
> 	enable_counters();
> 	execvp(prog, argv);
> 
> (2) Look up execvp in glibc and call it directly:
> 
> 	int (*execptr)(const char *, char *const []);
> 
> 	execptr = dlsym(RTLD_NEXT, "execvp");
> 	enable_counters();
> 	(*execptr)(prog, argv);
> 
> (3) Resolve the executable path ourselves and then invoke the execve
> system call directly:
> 
> 	char *execpath;
> 
> 	execpath = search_path(getenv("PATH"), prog);
> 	enable_counters();
> 	syscall(__NR_execve, execpath, argv, envp);
> 
> (4) Same as (1), but rely on "" being an invalid program name for
> execvp:
> 
> 	execvp("", argv);
> 	enable_counters();
> 	execvp(prog, argv);
> 
> What do you guys think?  Does any of these appeal more than the
> others?  I'm leaning towards (4) myself.

(4) looks convincingly elegant.

We could also do (5): a one-shot counters-disabled ptrace run of the 
target, then enable-counters-in-target + ptrace-detach after the 
first stop.

	Ingo


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-27  6:44         ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
@ 2009-06-29 18:25           ` Vince Weaver
  2009-06-29 21:02             ` Ingo Molnar
                               ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Vince Weaver @ 2009-06-29 18:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

Hello

> Ingo Molnar <mingo@elte.hu> wrote:
>> Vince Weaver <vince@deater.net> wrote:

> That is in the 0.0001% measurement overhead range (per 'perf stat' 
> invocation) for any realistic app that does something worth 
> measuring

I'm just curious about this "app worth measuring" idea.

Do you intend for performance counters to simply be "oprofile done right"
or do you intend it to be a generic way of exposing performance counters 
to userspace?

For the research my co-workers and I are currently working on, the former 
is uninteresting.  If we wanted oprofile, we'd use it.

What matters for us is getting very exact counts of counters on programs 
that are being run as deterministically as possible.  This includes 
very small programs, and counts like retired_instructions, load/store 
ratios, uop_counts, etc.

This may be uninteresting to you, but it is important to us.  Hence my 
interest in the capabilities of the infrastructure finally getting merged 
into the kernel.

> Besides, you compare perfcounters to perfmon

what else should I be comparing it to?

> (which you seem to be a contributor of)

is that not allowed?

> workloads? [ In fact in one of the scheduler-tests perfmon has a 
> whopping measurement overhead of _nine billion_ cycles, it increased 
> total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]

I'm sure the perfmon2 people would welcome any patches you have to fix 
this problem.

As I said, I am looking for aggregate counts for deterministic programs. 
Compared to the overheads of 50x for DBI-based tools like Valgrind, or 
1000x for "cycle-accurate" simulators, even an overhead of 2x really 
isn't that bad.

Counting cycles or time is always a dangerous thing when performance 
counters are involved.  Things as trivial as the compiler, object link-order, 
length of the executable name, number of environment variables, number of 
ELF auxiliary vectors, etc., can all vastly change what results you get. 
I'd recommend the following paper for more details:

   "Producing wrong data without doing anything obviously wrong"
   by Mytkowicz et al.
   http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf


> If the 5 thousand cycles measurement overhead _still_ matters to you 
> under such circumstances then by all means please submit the patches 
> to improve it. Despite your claims this is totally fixable with the 
> current perfcounters design, Peter outlined the steps of how to 
> solve it, you can utilize ptrace if you want to.

Is it really "totally" fixable?  I don't just mean getting the overhead 
from ~3000 down to ~100, I mean down to zero.

> Here are the more detailed perfmon/pfmon measurement overhead
> numbers.
>
> ...
>
> I.e. this workload runs 17% slower under pfmon, the measurement
> overhead is about 1.45 billion cycles.
>
> ..
>
> That's an about 94% measurement overhead, or about 9.2 _billion_
> cycles overhead on this test-system.

I'm more interested in very CPU-intensive benchmarks.  I ran some 
experiments with gcc and equake from the spec2k benchmark suite.

This is on a 32-bit AMD Athlon(tm) XP 2000+ machine


gcc.200 (spec2k)

+ 2.6.30-03984-g45e3e19, configured with perf counters disabled

    108.44s +/- 0.7

+ 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --

    109.17s +/- 0.7

*** For a slowdown of about 0.6%

+ 2.6.29.5 (unpatched)

   115.31s +/- 0.5

+ 2.6.29.5 with perfmon2 patches applied,  pfmon -e retired_instructions,cpu_clk_unhalted

   115.62s +/- 0.5

** For a slowdown of about 0.2%

So in this case perfmon2 had less overhead, though the difference is 
small enough to be lost in the noise.  Why the 2.6.30-git kernel 
seems to be much faster on this hardware, I don't know.


equake (spec2k)

+ 2.6.30-03984-g45e3e19, configured with perf counters disabled

    392.77s +/- 1.5

+ 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --

    393.45s +/- 0.7

*** For a slowdown of about 0.17%

+ 2.6.29.5 (unpatched)

   429.25s +/- 1.7

+ 2.6.29.5 with perfmon2 patches applied,  pfmon -e retired_instructions,cpu_clk_unhalted

   428.91s +/- 0.8

** For a _speedup_ of about 0.08%

So again the difference in overheads is in the noise.  Again I am not sure 
why 2.6.30-git is so much faster on this hardware.

As for counter results, in this case retired instructions:

gcc.200
   perf:  72,618,643,132 +/- 8million
   pfmon: 72,618,519,792 +/- 5million

equake
   perf:  144,952,319,472 +/- 8000
   pfmon: 144,952,327,906 +/-  500

So in the equake case you can easily see that the few thousand instruction 
overhead from perf can show up even on long-running programs.

In any case, the point I am trying to make is that perf counters are used 
by a wide variety of people in a wide variety of ways, with lots of 
different performance/accuracy tradeoffs.  Don't limit the API just 
because you can't envision a use for certain features.

Vince




* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-29 18:25           ` Vince Weaver
@ 2009-06-29 21:02             ` Ingo Molnar
  2009-07-02 21:07               ` Vince Weaver
  2009-06-29 23:46             ` [patch] perf_counter: Add enable-on-exec attribute Ingo Molnar
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29 21:02 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


* Vince Weaver <vince@deater.net> wrote:

>> If the 5 thousand cycles measurement overhead _still_ matters to 
>> you under such circumstances then by all means please submit the 
>> patches to improve it. Despite your claims this is totally 
>> fixable with the current perfcounters design, Peter outlined the 
>> steps of how to solve it, you can utilize ptrace if you want to.
>
> Is it really "totally" fixable?  I don't just mean getting the 
> overhead from ~3000 down to ~100, I mean down to zero.

The thing is, not even pfmon gets it down to zero:

  pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
  1000001 INSTRUCTIONS_RETIRED

So ... do you take the hardliner purist view and consider it crap 
due to that imprecision, or do you take the pragmatist view of also 
considering the relative relevance of any imperfection? ;-)

	Ingo


* [patch] perf_counter: Add enable-on-exec attribute
  2009-06-29 18:25           ` Vince Weaver
  2009-06-29 21:02             ` Ingo Molnar
@ 2009-06-29 23:46             ` Ingo Molnar
  2009-06-29 23:55             ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
  2009-06-30  0:05             ` Ingo Molnar
  3 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29 23:46 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


* Vince Weaver <vince@deater.net> wrote:

>> If the 5 thousand cycles measurement overhead _still_ matters to 
>> you under such circumstances then by all means please submit the 
>> patches to improve it. Despite your claims this is totally 
>> fixable with the current perfcounters design, Peter outlined the 
>> steps of how to solve it, you can utilize ptrace if you want to.
>
> Is it really "totally" fixable?  I don't just mean getting the 
> overhead from ~3000 down to ~100, I mean down to zero.

Yes, it's truly very easy to get exactly the same output as pfmon, 
for the 'million.s' test app you posted:

  titan:~> perf stat -e 0:1:u ./million

   Performance counter stats for './million':

          1000001  instructions            

      0.000489736  seconds time elapsed

See the small patch below.

( Note that this approach does not use ptrace, hence it can be used
  to measure debuggers too. ptrace attach has the limitation of
  being exclusive - no task can be attached to twice. perfmon used
  ptrace attach, which limited its capabilities unreasonably. )

The question was really not whether we can do it - but whether we 
want to do it. I have no strong feelings either way - because as i 
told you in my first mail, all the other noise sources in the system 
dominate the metrics far more than this very small constant startup 
offset.

And the thing is, as a perfmon contributor i assume you have 
experience in these matters. Had you taken a serious, unbiased look 
at perfcounters, and had this problem truly bothered you personally, 
you could have come up with a similar patch yourself as well, while 
only spending a fraction of the energies you are putting into these 
emails. Instead you ignored our technical arguments, you refused to 
touch the code and you went on rambling against how perfcounters 
supposedly cannot solve this problem. Not very productive IMO.

	Ingo

---------------->
Subject: perf_counter: Add enable-on-exec attribute
From: Ingo Molnar <mingo@elte.hu>
Date: Mon Jun 29 22:05:11 CEST 2009

Add another attribute variant: attr.enable_on_exec.

The purpose is to allow the auto-enabling of such counters
on exec(), to measure exec()-ed workloads precisely, from
the first to the last instruction.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/exec.c                    |    3 +--
 include/linux/perf_counter.h |    5 ++++-
 kernel/perf_counter.c        |   39 ++++++++++++++++++++++++++++++++++++---
 tools/perf/builtin-stat.c    |    5 +++--
 4 files changed, 44 insertions(+), 8 deletions(-)

Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c
+++ linux/fs/exec.c
@@ -996,8 +996,7 @@ int flush_old_exec(struct linux_binprm *
 	 * Flush performance counters when crossing a
 	 * security domain:
 	 */
-	if (!get_dumpable(current->mm))
-		perf_counter_exit_task(current);
+	perf_counter_exec(current);
 
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
Index: linux/include/linux/perf_counter.h
===================================================================
--- linux.orig/include/linux/perf_counter.h
+++ linux/include/linux/perf_counter.h
@@ -179,8 +179,9 @@ struct perf_counter_attr {
 				comm	       :  1, /* include comm data     */
 				freq           :  1, /* use freq, not period  */
 				inherit_stat   :  1, /* per task counts       */
+				enable_on_exec :  1, /* enable on exec        */
 
-				__reserved_1   : 52;
+				__reserved_1   : 51;
 
 	__u32			wakeup_events;	/* wakeup every n events */
 	__u32			__reserved_2;
@@ -712,6 +713,7 @@ static inline void perf_counter_mmap(str
 
 extern void perf_counter_comm(struct task_struct *tsk);
 extern void perf_counter_fork(struct task_struct *tsk);
+extern void perf_counter_exec(struct task_struct *tsk);
 
 extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
 
@@ -752,6 +754,7 @@ perf_swcounter_event(u32 event, u64 nr, 
 static inline void perf_counter_mmap(struct vm_area_struct *vma)	{ }
 static inline void perf_counter_comm(struct task_struct *tsk)		{ }
 static inline void perf_counter_fork(struct task_struct *tsk)		{ }
+static inline void perf_counter_exec(struct task_struct *tsk)		{ }
 static inline void perf_counter_init(void)				{ }
 #endif
 
Index: linux/kernel/perf_counter.c
===================================================================
--- linux.orig/kernel/perf_counter.c
+++ linux/kernel/perf_counter.c
@@ -903,6 +903,9 @@ static void perf_counter_enable(struct p
 	struct perf_counter_context *ctx = counter->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (counter->attr.enable_on_exec)
+		return;
+
 	if (!task) {
 		/*
 		 * Enable the counter on the cpu that it's on
@@ -2856,6 +2859,32 @@ void perf_counter_fork(struct task_struc
 	perf_counter_fork_event(&fork_event);
 }
 
+void perf_counter_exec(struct task_struct *task)
+{
+	struct perf_counter_context *ctx;
+	struct perf_counter *counter;
+
+	if (!get_dumpable(task->mm)) {
+		perf_counter_exit_task(task);
+		return;
+	}
+
+	if (!task->perf_counter_ctxp)
+		return;
+
+	rcu_read_lock();
+	ctx = task->perf_counter_ctxp;
+	if (ctx) {
+		list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+			if (counter->attr.enable_on_exec) {
+				counter->attr.enable_on_exec = 0;
+				__perf_counter_enable(counter);
+			}
+		}
+	}
+	rcu_read_unlock();
+}
+
 /*
  * comm tracking
  */
@@ -4064,10 +4093,14 @@ inherit_counter(struct perf_counter *par
 	 * not its attr.disabled bit.  We hold the parent's mutex,
 	 * so we won't race with perf_counter_{en, dis}able_family.
 	 */
-	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
-		child_counter->state = PERF_COUNTER_STATE_INACTIVE;
-	else
+	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE) {
+		if (child_counter->attr.enable_on_exec)
+			child_counter->state = PERF_COUNTER_STATE_OFF;
+		else
+			child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+	} else {
 		child_counter->state = PERF_COUNTER_STATE_OFF;
+	}
 
 	if (parent_counter->attr.freq)
 		child_counter->hw.sample_period = parent_counter->hw.sample_period;
Index: linux/tools/perf/builtin-stat.c
===================================================================
--- linux.orig/tools/perf/builtin-stat.c
+++ linux/tools/perf/builtin-stat.c
@@ -116,8 +116,9 @@ static void create_perf_stat_counter(int
 					fd[cpu][counter], strerror(errno));
 		}
 	} else {
-		attr->inherit	= inherit;
-		attr->disabled	= 1;
+		attr->inherit		= inherit;
+		attr->disabled		= 1;
+		attr->enable_on_exec	= 1;
 
 		fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
 		if (fd[0][counter] < 0 && verbose)



* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-29 18:25           ` Vince Weaver
  2009-06-29 21:02             ` Ingo Molnar
  2009-06-29 23:46             ` [patch] perf_counter: Add enable-on-exec attribute Ingo Molnar
@ 2009-06-29 23:55             ` Ingo Molnar
  2009-06-30  0:05             ` Ingo Molnar
  3 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29 23:55 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


* Vince Weaver <vince@deater.net> wrote:

>> Besides, you compare perfcounters to perfmon
>
> what else should I be comparing it to?
>
>> (which you seem to be a contributor of)
>
> is that not allowed?

Here's the full, uncropped sentence i wrote:

 " Besides, you compare perfcounters to perfmon (which you seem to 
   be a contributor of), while in reality perfmon has much, much 
   worse (and unfixable, because designed-in) measurement overhead. "

Where i question the blatant hypocrisy of bringing up perfmon as a 
good example while in reality perfmon has far worse measurement 
overhead than perfcounters, for a wide range of workloads.

As far as i can see you didn't answer my questions: why are you 
dismissing perfcounters for a minor, once-per-startup measurement 
offset (which is entirely fixable - see the patch i sent), while you 
generously allow perfmon to have a serious, 90% measurement overhead 
amounting to billions of instructions of overhead per second, for 
certain workloads?

	Ingo


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-29 18:25           ` Vince Weaver
                               ` (2 preceding siblings ...)
  2009-06-29 23:55             ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
@ 2009-06-30  0:05             ` Ingo Molnar
  3 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-30  0:05 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


* Vince Weaver <vince@deater.net> wrote:

>> workloads? [ In fact in one of the scheduler-tests perfmon has a 
>> whopping measurement overhead of _nine billion_ cycles, it 
>> increased total runtime of the workload from 3.3 seconds to 6.6 
>> seconds. (!) ]
>
> I'm sure the perfmon2 people would welcome any patches you have to 
> fix this problem.

I think this flaw of perfmon is unfixable, because perfmon (by 
design) uses a _way_ too low-level, way too opaque and 
structure-less abstraction for the PMU, which disallows the kind of 
high-level optimizations that perfcounters can do.

We weren't silent about this - to the contrary.  Last November Thomas 
and I _did_ take a good look at the perfmon patches (we are maintaining 
the code areas affected by perfmon), we saw that it has unfixable 
problems, and we came up with objections and later on with 
patches that fix these problems: the perfcounters subsystem.

>> That's an about 94% measurement overhead, or about 9.2 _billion_ 
>> cycles overhead on this test-system.
>
> I'm more interested in very CPU-intensive benchmarks.  I ran some 
> experiments with gcc and equake from the spec2k benchmark suite.

The workloads i cited are _all_ 100% CPU-intensive benchmarks:

 - hackbench
 - loop-pipe-1-million

But i could add 'lat_tcp localhost', 'bw_tcp localhost' or sysbench 
to the list - all show very significant overhead under perfmon. 
These are all important workloads and important benchmarks. A kernel 
based performance analysis facility that is any good must handle 
them transparently.

	Ingo


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-29 21:02             ` Ingo Molnar
@ 2009-07-02 21:07               ` Vince Weaver
  2009-07-03  7:58                 ` Ingo Molnar
  2009-07-03 18:31                 ` Andi Kleen
  0 siblings, 2 replies; 27+ messages in thread
From: Vince Weaver @ 2009-07-02 21:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


sorry for the delay in responding, was away

On Mon, 29 Jun 2009, Ingo Molnar wrote:
>
> * Vince Weaver <vince@deater.net> wrote:
>
>>> If the 5 thousand cycles measurement overhead _still_ matters to
>>> you under such circumstances then by all means please submit the
>>> patches to improve it. Despite your claims this is totally
>>> fixable with the current perfcounters design, Peter outlined the
>>> steps of how to solve it, you can utilize ptrace if you want to.
>>
>> Is it really "totally" fixable?  I don't just mean getting the
>> overhead from ~3000 down to ~100, I mean down to zero.
>
> The thing is, not even pfmon gets it down to zero:
>
>  pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
>  1000001 INSTRUCTIONS_RETIRED
>
> So ... do you take the hardliner purist view and consider it crap
> due to that imprecision, or do you take the pragmatist view of also
> considering the relative relevance of any imperfection? ;-)

as I said in a previous post, on most x86 chips the instructions_retired
counter also includes any hardware interrupts that occur during the 
process runtime.  So any clock interrupts, etc, show up as an extra 
instruction.  So on the "million" benchmark, it's usually +/- 2 extra 
instructions.

It looks like support might be added to perfcounters to track these 
hardware interrupt stats per-process, which would be great, as it's been 
really hard to quantify that currently.

In any case, it looks like the changes to make perf have lower overhead 
have been merged, which makes me happy.  Thank you.

Vince



* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-02 21:07               ` Vince Weaver
@ 2009-07-03  7:58                 ` Ingo Molnar
  2009-07-03 21:43                   ` Vince Weaver
  2009-07-03 18:31                 ` Andi Kleen
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2009-07-03  7:58 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


* Vince Weaver <vince@deater.net> wrote:

> On Mon, 29 Jun 2009, Ingo Molnar wrote:
>>
>> * Vince Weaver <vince@deater.net> wrote:
>>
>>>> If the 5 thousand cycles measurement overhead _still_ matters to
>>>> you under such circumstances then by all means please submit the
>>>> patches to improve it. Despite your claims this is totally
>>>> fixable with the current perfcounters design, Peter outlined the
>>>> steps of how to solve it, you can utilize ptrace if you want to.
>>>
>>> Is it really "totally" fixable?  I don't just mean getting the
>>> overhead from ~3000 down to ~100, I mean down to zero.
>>
>> The thing is, not even pfmon gets it down to zero:
>>
>>  pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
>>  1000001 INSTRUCTIONS_RETIRED
>>
>> So ... do you take the hardliner purist view and consider it crap
>> due to that imprecision, or do you take the pragmatist view of also
>> considering the relative relevance of any imperfection? ;-)
>
> as I said in a previous post, on most x86 chips the 
> instructions_retired counter also includes any hardware interrupts 
> that occur during the process runtime.  So any clock interrupts, 
> etc, show up as an extra instruction.  So on the "million" 
> benchmark, it's usually +/- 2 extra instructions.

yeah. But it has nothing to do with the function you are measuring, 
right?

My general point is really that what matters is the statistical 
validity of the end result. I don't think you ever disagreed with 
that point - you just seem to have a lower noise acceptance 
threshold ;-)

> It looks like support might be added to perfcounters to track 
> these hardware interrupt stats per-process, which would be great, 
> as it's been really hard to quantify that currently.

Yeah. There's a patch-set in the works that attempts to do something 
in this area - see these mails on lkml:

    perf_counter: Add Generalized Hardware interrupt support

Right now they are just convenience wrappers around CPU model 
specific hw events - but we could extend the whole thing with 
software counters as well and isolate per IRQ vector events and 
counts, by adding a callback to do_IRQ().

That would give a mixture of hardware and software counter based IRQ 
instrumentation features that looks quite compelling. Any comments 
on what features/capabilities you'd like to see in this area?

> In any case, it looks like the changes to make perf have lower 
> overhead have been merged, which makes me happy.  Thank you.

You are welcome :)

Btw., perfcounters still has no support for older Intel CPUs such as 
P3's and P2's - and they have pretty sane PMUs - so if you have such 
a machine (which your perfmon contribution suggests you might 
have/had) and are interested it would be nice to get support for 
them. P4 support is interesting too but more challenging.

	Ingo


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-02 21:07               ` Vince Weaver
  2009-07-03  7:58                 ` Ingo Molnar
@ 2009-07-03 18:31                 ` Andi Kleen
  2009-07-03 21:25                   ` Vince Weaver
  1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-07-03 18:31 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, linux-kernel,
	Mike Galbraith

Vince Weaver <vince@deater.net> writes:
>
> as I said in a previous post, on most x86 chips the instructions_retired
> counter also includes any hardware interrupts that occur during the
> process runtime.

On the other hand afaik near all chips have interrupt performance counter
events.

So if you're willing to waste one of the variable counter registers 
you can always count those and then correct based on the other count.

But the question is of course whether it's worth it; the error should 
be really small.  Also, you could always lose a few cycles occasionally 
to other "random" events, which can happen too.

>  So any clock interrupts, etc, show up as an extra
> instruction.  So on the "million" benchmark, it's usually +/- 2 extra
> instructions.

1-2 error in a million doesn't sound like a catastrophic problem.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-03 18:31                 ` Andi Kleen
@ 2009-07-03 21:25                   ` Vince Weaver
  2009-07-03 23:40                     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2009-07-03 21:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, linux-kernel,
	Mike Galbraith


> Vince Weaver <vince@deater.net> writes:
>>
>> as I said in a previous post, on most x86 chips the instructions_retired
>> counter also includes any hardware interrupts that occur during the
>> process runtime.
>
> On the other hand afaik near all chips have interrupt performance counter
> events.

I guess by "near all" you mean "only AMD"?  The AMD event also has some 
oddities, as it seems to report page faults and other events 
that don't really match up with the excess instruction count.  I must 
admit it's been a while since I've looked at that particular counter.

> But the question is of course if it's worth it, the error should
> be really small. Also you could always lose a few cycles occasionally
> in other "random" events, which can happen too.

> 1-2 error in a million doesn't sound like a catastrophic problem.

well, it's basically at least HZ extra instructions per second of 
benchmark runtime, and unfortunately it's non-deterministic because it 
also depends on keyboard/network/USB/etc. interrupts that may by chance 
happen while your program is running.

For me, it's the determinism that matters.  Not overhead, not runtime, 
not "oh it doesn't matter, it's small".  For a deterministic benchmark I 
want to get as close to the same value every run as possible.  I admit 
it might not always be possible to get the same result, but the 
closer the better.  This might not match up with the way 
kernel hackers use perf counters, but it is important for the work I am 
doing.

Vince


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-03  7:58                 ` Ingo Molnar
@ 2009-07-03 21:43                   ` Vince Weaver
  0 siblings, 0 replies; 27+ messages in thread
From: Vince Weaver @ 2009-07-03 21:43 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith


On Fri, 3 Jul 2009, Ingo Molnar wrote:
> That would give a mixture of hardware and software counter based IRQ
> instrumentation features that looks quite compelling. Any comments
> on what features/capabilities you'd like to see in this area?

I'm mainly interested in just an aggregate total of "this many interrupts 
occurred".  It wouldn't even need to be separated out by type or number. 
I don't know if the metric would be useful to anyone else.  I tried to 
hack this up a long time ago, to have the result reported with rusage()
but never got anywhere with it.

> Btw., perfcounters still has no support for older Intel CPUs such as
> P3's and P2's - and they have pretty sane PMUs - so if you have such
> a machine (which your perfmon contribution suggests you might
> have/had) and are interested it would be nice to get support for
> them. P4 support is interesting too but more challenging.

I was indeed the one who got perfmon2 running on Pentium Pro, Pentium II, 
and MIPS R12k.  For all those though there was an existing PMU driver and 
I just added the appropriate "case" statements to enable support, and then 
provided an updated list of available counters to the userspace utility. 
The only real kernel hacking involved was the week spent tracking down a 
hard-to-debug interrupt issue on the MIPS machine.

Unfortunately I think writing PMU drivers is a bit beyond me, for the 
amount of time I have.  Especially as the relevant machines I have are 
located in relatively inaccessible locations (and PMU mistakes can lock up 
the machines) plus it can take the better part of a day to compile 2.6 
kernels on some of those machines.

Vince


* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-03 21:25                   ` Vince Weaver
@ 2009-07-03 23:40                     ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2009-07-03 23:40 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Andi Kleen, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	linux-kernel, Mike Galbraith

On Fri, Jul 03, 2009 at 05:25:32PM -0400, Vince Weaver wrote:
> >Vince Weaver <vince@deater.net> writes:
> >>
> >>as I said in a previous post, on most x86 chips the instructions_retired
> >>counter also includes any hardware interrupts that occur during the
> >>process runtime.
> >
> >On the other hand afaik near all chips have interrupt performance counter
> >events.
> 
> I guess by "near all" you mean "only AMD"?  The AMD event also has some 

Intel CPUs typically have HW_INT.RX event.  AMD has a similar event.

> well, it's basically at least HZ extra instructions per however many 
> seconds your benchmark runs, and unfortunately it's non-deterministic 
> because it depends on keyboard/network/usb/etc interrupts too that may by 
> chance happen while your program is running.
> 
> For me, it's the determinism that matters.  Not overhead, not runtime not 

To be honest I don't think you'll ever be fully deterministic.  Modern 
computers and operating systems are just too complex, with too 
many (often unpredictable) things going on in the background.  In my own 
experience even simulators (which are much more stable than 
real hardware) are not fully deterministic.  You'll always run 
into problems.

If you need 100% determinism, use a simple microcontroller.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


end of thread, other threads:[~2009-07-03 23:40 UTC | newest]

Thread overview: 27+ messages
2009-06-24 13:59 performance counter 20% error finding retired instruction count Vince Weaver
2009-06-24 15:10 ` Ingo Molnar
2009-06-25  2:12   ` Vince Weaver
2009-06-25  6:50     ` Peter Zijlstra
2009-06-25  9:13       ` Ingo Molnar
2009-06-26 18:22   ` Vince Weaver
2009-06-26 19:12     ` Peter Zijlstra
2009-06-27  5:32       ` Ingo Molnar
2009-06-26 19:23     ` Vince Weaver
2009-06-27  6:04       ` performance counter ~0.4% " Ingo Molnar
2009-06-27  6:44         ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
2009-06-29 18:25           ` Vince Weaver
2009-06-29 21:02             ` Ingo Molnar
2009-07-02 21:07               ` Vince Weaver
2009-07-03  7:58                 ` Ingo Molnar
2009-07-03 21:43                   ` Vince Weaver
2009-07-03 18:31                 ` Andi Kleen
2009-07-03 21:25                   ` Vince Weaver
2009-07-03 23:40                     ` Andi Kleen
2009-06-29 23:46             ` [patch] perf_counter: Add enable-on-exec attribute Ingo Molnar
2009-06-29 23:55             ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
2009-06-30  0:05             ` Ingo Molnar
2009-06-27  6:48         ` performance counter ~0.4% error finding retired instruction count Paul Mackerras
2009-06-27 17:28           ` Ingo Molnar
2009-06-29  2:12             ` Paul Mackerras
2009-06-29  2:13               ` Paul Mackerras
2009-06-29  3:48               ` Ingo Molnar
