* performance counter 20% error finding retired instruction count
@ 2009-06-24 13:59 Vince Weaver
2009-06-24 15:10 ` Ingo Molnar
0 siblings, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2009-06-24 13:59 UTC (permalink / raw)
To: linux-kernel
Hello
As an aside, is it time to set up a dedicated Performance Counters
for Linux mailing list? (Hereafter referred to as p10c7l to avoid
confusion with the other implementations that have already taken
all the good abbreviated forms of the concept). If/when the
infrastructure appears in a released kernel, there's going to be a lot of
chatter by people who use performance counters and suddenly find they are
stuck with a huge step backwards in functionality. And asking Fortran
programmers to provide kernel patches probably won't be a productive
response. But I digress.
I was trying to get an exact retired instruction count from p10c7l.
I am using the test million.s, available here
( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
It should count exactly one million instructions.
Tests with valgrind and qemu show that it does.
Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
all give the proper result:
tobler:~% pfmon -e retired_instructions ./million
1000002 RETIRED_INSTRUCTIONS
( it is 1,000,002 +/- 2 because on most x86 architectures the retired
instruction count includes any hardware interrupts that happen
during the run. It would be a great feature if p10c7l could add
some way of gathering a per-process hardware interrupt count
statistic to help quantify that. )
Yet perf on the same Athlon32 machine (using
kernel 2.6.30-03984-g45e3e19) gives:
tobler:~% perf stat ./million
Performance counter stats for './million':
1.519366 task-clock-ticks # 0.835 CPU utilization factor
3 context-switches # 0.002 M/sec
0 CPU-migrations # 0.000 M/sec
53 page-faults # 0.035 M/sec
2483822 cycles # 1634.775 M/sec
1240849 instructions # 816.689 M/sec # 0.500 per cycle
612685 cache-references # 403.250 M/sec
3564 cache-misses # 2.346 M/sec
Wall-clock time elapsed: 1.819226 msecs
Running multiple times gives:
1240849
1257312
1242313
That is an error of at least 20%, and it isn't even consistent
from run to run. Is this because of sampling? As far as I can tell
the documentation doesn't really warn about this.
Thanks for any help resolving this problem.
Vince
^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: performance counter 20% error finding retired instruction count
From: Ingo Molnar @ 2009-06-24 15:10 UTC (permalink / raw)
To: Vince Weaver, Peter Zijlstra, Paul Mackerras; +Cc: linux-kernel

* Vince Weaver <vince@deater.net> wrote:

> Hello
>
> As an aside, is it time to set up a dedicated Performance Counters
> for Linux mailing list? (Hereafter referred to as p10c7l to avoid
> confusion with the other implementations that have already taken
> all the good abbreviated forms of the concept).

('perfcounters' is the name of the subsystem/feature and it's unique.)

> [...] If/when the infrastructure appears in a released kernel,
> there's going to be a lot of chatter by people who use performance
> counters and suddenly find they are stuck with a huge step
> backwards in functionality. And asking Fortran programmers to
> provide kernel patches probably won't be a productive response.
> But I digress.
>
> I was trying to get an exact retired instruction count from
> p10c7l. I am using the test million.s, available here
>
> ( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
>
> It should count exactly one million instructions.
>
> Tests with valgrind and qemu show that it does.
>
> Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
> all give the proper result:
>
>   tobler:~% pfmon -e retired_instructions ./million
>   1000002 RETIRED_INSTRUCTIONS
>
> ( it is 1,000,002 +/- 2 because on most x86 architectures the
> retired instruction count includes any hardware interrupts that
> happen during the run. It would be a great feature if p10c7l could
> add some way of gathering a per-process hardware interrupt count
> statistic to help quantify that. )
>
> Yet perf on the same Athlon32 machine (using
> kernel 2.6.30-03984-g45e3e19) gives:
>
>   tobler:~% perf stat ./million
>
>   Performance counter stats for './million':
>
>       1.519366  task-clock-ticks   #    0.835 CPU utilization factor
>              3  context-switches   #    0.002 M/sec
>              0  CPU-migrations     #    0.000 M/sec
>             53  page-faults        #    0.035 M/sec
>        2483822  cycles             # 1634.775 M/sec
>        1240849  instructions       #  816.689 M/sec # 0.500 per cycle
>         612685  cache-references   #  403.250 M/sec
>           3564  cache-misses       #    2.346 M/sec
>
>   Wall-clock time elapsed: 1.819226 msecs
>
> Running multiple times gives:
>   1240849
>   1257312
>   1242313
>
> That is an error of at least 20%, and it isn't even consistent
> from run to run. Is this because of sampling? The documentation
> doesn't really warn about this as far as I can tell.
>
> Thanks for any help resolving this problem.

Thanks for the question! There are still gaps in the documentation,
so let me explain the basics here.

'perf stat' counts the true cost of executing the command in
question, including the costs of:

  - fork()ing the task
  - exec()-ing it
  - the ELF loader resolving dynamic symbols
  - the app hitting various pagefaults that instantiate its
    pagetables
  - etc.

Those operations are pretty 'noisy' on a typical CPU, with lots of
cache effects, so the noise you see is real.

You can eliminate much of the noise by only counting user-space
instructions, as much of the command startup cost is in
kernel-space. Running your test-app that way can be done the
following way:

  $ perf stat --repeat 10 -e 0:1:u ./million

  Performance counter stats for './million' (10 runs):

        1002106  instructions            ( +- 0.015% )

    0.000599029  seconds time elapsed.

( Note the --repeat feature of perf stat - it does a loop of
  command executions, observes the noise and displays it. )

Those ~2100 instructions are executed by your app: as the ELF
dynamic loader starts up your test-app.

If you have some tool that reports less than that, then that tool is
not being truthful about the true overhead of your application.

Also note that applications that only execute 1 million instructions
are very, very rare - a modern CPU can execute billions of
instructions, per second, per core. So i usually test a reference
app that is more realistic, one that executes 1 billion
instructions:

  $ perf stat --repeat 10 -e 0:1:u ./loop_1b_instructions

  Performance counter stats for './loop_1b_instructions' (10 runs):

     1000079797  instructions            ( +- 0.000% )

    0.239947420  seconds time elapsed.

The noise there is very low, despite 230 milliseconds still being a
very short runtime.

Hope this helps - thanks,

	Ingo
* Re: performance counter 20% error finding retired instruction count
From: Vince Weaver @ 2009-06-25 2:12 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

On Wed, 24 Jun 2009, Ingo Molnar wrote:

> Those ~2100 instructions are executed by your app: as the ELF
> dynamic loader starts up your test-app.
>
> If you have some tool that reports less than that, then that tool
> is not being truthful about the true overhead of your application.

I wanted the instruction count of the application, not the loader.
If I wanted the overhead of the loader too, then I would have
specified it. I don't think it has anything to do with tools being
"less than truthful". I notice perf doesn't seem to include its own
overheads in the count.

> Also note that applications that only execute 1 million
> instructions are very, very rare - a modern CPU can execute
> billions of instructions, per second, per core.

Yes, I know that. As I hope you know, the chip designers offer no
guarantees with any of the performance counters. So before you can
use them, you have to validate them a bit to make sure they are
returning expected results. Hence the need for microbenchmarks, one
of which I used as an example.

You have to be careful with performance counters. For example, on
Pentium 4, the retired instruction counter will have as much as 2%
error on some of the spec2k benchmarks, because the "fldcw"
instruction counts as two instructions instead of one. This kind of
difference is important when doing validation work, and can't just
be swept under the rug with "if you use bigger programs it doesn't
matter".

It's also nice to be able to skip the loader overhead, as the loader
can change from system to system and makes it hard to compare
counters across various machines. Though it sounds like the perf
utility isn't going to be supporting this anytime soon.

Vince
* Re: performance counter 20% error finding retired instruction count
From: Peter Zijlstra @ 2009-06-25 6:50 UTC (permalink / raw)
To: Vince Weaver; +Cc: Ingo Molnar, Paul Mackerras, linux-kernel

On Wed, 2009-06-24 at 22:12 -0400, Vince Weaver wrote:
>
> It's also nice to be able to skip the loader overhead, as the
> loader can change from system to system and makes it hard to
> compare counters across various machines. Though it sounds like
> the perf utility isn't going to be supporting this anytime soon.

Feel free to contribute such a patch if you think it's important.
* Re: performance counter 20% error finding retired instruction count
From: Ingo Molnar @ 2009-06-25 9:13 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Vince Weaver, Paul Mackerras, linux-kernel

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2009-06-24 at 22:12 -0400, Vince Weaver wrote:
> >
> > It's also nice to be able to skip the loader overhead, as the
> > loader can change from system to system and makes it hard to
> > compare counters across various machines. Though it sounds like
> > the perf utility isn't going to be supporting this anytime soon.
>
> Feel free to contribute such a patch if you think it's important.

I'd be glad to review and test any resulting patches from Vince,
and/or help out with pointers on where to start, and to help if
there are any roadblocks along the way.

The kernel-side bits can be found in v2.6.31-rc1, in
kernel/perf_counter.c, include/linux/perf_counter.h and
arch/x86/kernel/cpu/perf_counter.c. We tried to keep the code as
hackable as possible.

The tooling bits can be found in tools/perf/ in the kernel repo.
builtin-stat.c contains the 'perf stat' bits.

Thanks,

	Ingo
* Re: performance counter 20% error finding retired instruction count
From: Vince Weaver @ 2009-06-26 18:22 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

On Wed, 24 Jun 2009, Ingo Molnar wrote:

> Those ~2100 instructions are executed by your app: as the ELF
> dynamic loader starts up your test-app.
>
> If you have some tool that reports less than that, then that tool
> is not being truthful about the true overhead of your application.

Wait a second... my application is a statically linked binary. There
is no ELF dynamic loader involved at all.

On further investigation, all of the overhead comes _entirely_ from
the perf utility. This is overhead, and these are instructions, that
would not occur when not using the perf utility.

From the best I can tell digging through the perf sources, the
performance counters are set up and started in userspace, but
instead of doing an immediate clone/exec, thousands of instructions
worth of other work is done by perf in between.

The "perfmon" util, plus linux-user simulators like qemu and
valgrind, do things properly. perf can't, it seems, and it seems to
be a limitation of the new performance counter infrastructure.

Vince

PS. Why is the perf code littered with many __MINGW32__ defines?
Should this be in the kernel tree? It makes the code really hard to
follow. Are there plans to port perf to windows?
* Re: performance counter 20% error finding retired instruction count
From: Peter Zijlstra @ 2009-06-26 19:12 UTC (permalink / raw)
To: Vince Weaver; +Cc: Ingo Molnar, Paul Mackerras, linux-kernel

On Fri, 2009-06-26 at 14:22 -0400, Vince Weaver wrote:

> Wait a second... my application is a statically linked binary.
> There is no ELF dynamic loader involved at all.
>
> On further investigation, all of the overhead comes _entirely_
> from the perf utility. This is overhead and instructions that
> would not occur when not using the perf utility.
>
> From the best I can tell digging through the perf sources, the
> performance counters are set up and started in userspace, but
> instead of doing an immediate clone/exec, thousands of
> instructions worth of other work is done by perf in between.
>
> The "perfmon" util, plus linux-user simulators like qemu and
> valgrind, do things properly. perf can't, it seems, and it seems
> to be a limitation of the new performance counter infrastructure.

perf can do it just fine; all you need is a will to touch ptrace().
Nothing in the perf counter design limits this from working.

I just can't really be bothered by this tiny and mostly constant
offset, especially if the cost is risking braindamage from touching
ptrace(), but if you think otherwise (and make the ptrace bit
optional) I'm more than willing to merge the patch.

> PS. Why is the perf code littered with many __MINGW32__ defines?
> Should this be in the kernel tree? It makes the code really hard
> to follow. Are there plans to port perf to windows?

Comes straight from the git sources.. and "littered" might be a bit
much, I count only 11:

  # git grep MING tools/perf | wc -l
  11

But yeah, that might want cleaning up.
* Re: performance counter 20% error finding retired instruction count
From: Ingo Molnar @ 2009-06-27 5:32 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Vince Weaver, Paul Mackerras, linux-kernel

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > PS. Why is the perf code littered with many __MINGW32__ defines?
> > Should this be in the kernel tree? It makes the code really hard
> > to follow. Are there plans to port perf to windows?
>
> Comes straight from the git sources.. and "littered" might be a
> bit much, I count only 11:
>
>   # git grep MING tools/perf | wc -l
>   11
>
> But yeah, that might want cleaning up.

Indeed. I removed those bits - thanks Vince for reporting it!

	Ingo
* Re: performance counter 20% error finding retired instruction count
From: Vince Weaver @ 2009-06-26 19:23 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

On Fri, 26 Jun 2009, Vince Weaver wrote:

> From the best I can tell digging through the perf sources, the
> performance counters are set up and started in userspace, but
> instead of doing an immediate clone/exec, thousands of
> instructions worth of other work is done by perf in between.

And for the curious, wondering how a simple

  prctl(COUNTERS_ENABLE);
  fork()
  execvp()

can cause 6000+ instructions of non-deterministic execution: it
turns out that perf is dynamically linked. So it has to spend 5000+
cycles in ld-linux.so resolving the execvp() symbol before it can
actually execvp().

So when trying to get accurate profiles of simple statically linked
programs, you still have to put up with the dynamic loader overhead,
because of the way perf is designed. Nice.

Vince
* Re: performance counter ~0.4% error finding retired instruction count
From: Ingo Molnar @ 2009-06-27 6:04 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel

* Vince Weaver <vince@deater.net> wrote:

> On Fri, 26 Jun 2009, Vince Weaver wrote:
>
>> From the best I can tell digging through the perf sources, the
>> performance counters are set up and started in userspace, but
>> instead of doing an immediate clone/exec, thousands of
>> instructions worth of other work is done by perf in between.
>
> And for the curious, wondering how a simple
>
>   prctl(COUNTERS_ENABLE);
>   fork()
>   execvp()
>
> can cause 6000+ instructions of non-deterministic execution: it
> turns out that perf is dynamically linked. So it has to spend
> 5000+ cycles in ld-linux.so resolving the execvp() symbol before
> it can actually execvp().

I measured 2000, but generally a few thousand cycles per invocation
sounds about right.

That is in the 0.0001% measurement overhead range (per 'perf stat'
invocation) for any realistic app that does something worth
measuring - and even in the worst-case 'cheapest app' case it is in
the 0.2-0.4% range.

Besides, you compare perfcounters to perfmon (which you seem to be a
contributor of), while in reality perfmon has much, much worse (and
unfixable, because designed-in) measurement overhead.

So why are you criticising perfcounters for a 5000-cycle measurement
overhead while perfmon has huge, _hundreds of millions_ of cycles of
measurement overhead (per second) for various realistic workloads?

[ In fact in one of the scheduler-tests perfmon has a whopping
  measurement overhead of _nine billion_ cycles: it increased the
  total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]

Why are you using a double standard here?

Here are some numbers to put the 5000-cycle startup cost into
perspective. For example the default startup cost of even the
simplest of Linux binaries (/bin/true):

  titan:~> perf stat /bin/true

  Performance counter stats for '/bin/true':

      0.811328  task-clock-msecs   #    1.002 CPUs
             1  context-switches   #    0.001 M/sec
             1  CPU-migrations     #    0.001 M/sec
           180  page-faults        #    0.222 M/sec
       1267713  cycles             # 1562.516 M/sec
        733772  instructions       #    0.579 IPC
         26261  cache-references   #   32.368 M/sec
           531  cache-misses       #    0.654 M/sec

   0.000809407  seconds time elapsed

5000/1267713 cycles is in the 0.4% range.

Run any app that actually does something beyond starting up - an app
which has a chance to get a decent cache footprint and reaches a
steady state, so that it has stable properties that can be measured
reliably - and you'll get into the billions of cycles range or more,
at which point a few thousand cycles is in the 0.0001% measurement
overhead range.

Compare this to the intrinsic noise of the cycles metric for a
benchmark like hackbench:

  titan:~> perf stat -r 2 -e 0:0 -- ~/hackbench 10
  Time: 0.448
  Time: 0.447

  Performance counter stats for '/home/mingo/hackbench 10' (2 runs):

    2661715310  cycles                  ( +- 0.588% )

   0.480153304  seconds time elapsed    ( +- 0.549% )

The noise in this (very short) hackbench run above was 15 _million_
cycles. See how small a few thousand cycles are?

If the 5 thousand cycles of measurement overhead _still_ matter to
you under such circumstances, then by all means please submit the
patches to improve it. Despite your claims this is totally fixable
with the current perfcounters design: Peter outlined the steps of
how to solve it, and you can utilize ptrace if you want to.

	Ingo
* [numbers] perfmon/pfmon overhead of 17%-94%
From: Ingo Molnar @ 2009-06-27 6:44 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

* Ingo Molnar <mingo@elte.hu> wrote:

> Besides, you compare perfcounters to perfmon (which you seem to be
> a contributor of), while in reality perfmon has much, much worse
> (and unfixable, because designed-in) measurement overhead.
>
> So why are you criticising perfcounters for a 5000-cycle
> measurement overhead while perfmon has huge, _hundreds of
> millions_ of cycles of measurement overhead (per second) for
> various realistic workloads? [ In fact in one of the
> scheduler-tests perfmon has a whopping measurement overhead of
> _nine billion_ cycles: it increased the total runtime of the
> workload from 3.3 seconds to 6.6 seconds. (!) ]

Here are the more detailed perfmon/pfmon measurement overhead
numbers. Test system is an "Intel Core2 E6800 @ 2.93GHz", 1 GB of
RAM, default Fedora install. I've measured two workloads:

  hackbench.c       # messaging server benchmark
  test-1m-pipes.c   # does 1 million pipe ops, similar to lat_pipe

v2.6.28 + perfmon patches (v3, full):

  ./hackbench 10
    0.496400985  seconds time elapsed   ( +- 1.699% )

  pfmon --follow-fork --aggregate-results ./hackbench 10
    0.580812999  seconds time elapsed   ( +- 2.233% )

I.e. this workload runs 17% slower under pfmon; the measurement
overhead is about 1.45 billion cycles.

Furthermore, when running a 'pipe latency benchmark' - an app that
does one million pipe reads and writes between two tasks (source
code attached below) - I measured the following perfmon/pfmon
overhead:

  ./pipe-test-1m
    3.344280347  seconds time elapsed   ( +- 0.361% )

  pfmon --follow-fork --aggregate-results ./pipe-test-1m
    6.508737983  seconds time elapsed   ( +- 0.243% )

That's about 94% measurement overhead, or about 9.2 _billion_ cycles
of overhead on this test-system.

These perfmon/pfmon overhead figures are consistently reproducible,
they happen on other test-systems as well, and with other workloads
as well. Basically, for any app that involves task creation or
context-switching, perfmon adds considerable runtime overhead - well
beyond the overhead of perfcounters.

	Ingo

-----------------{ pipe-test-1m.c }-------------------->

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>

#define LOOPS 1000000

int main(void)
{
	int pipe_1[2], pipe_2[2];
	int m = 0, i;

	pipe(pipe_1);
	pipe(pipe_2);

	if (!fork()) {
		for (i = 0; i < LOOPS; i++) {
			read(pipe_1[0], &m, sizeof(int));
			write(pipe_2[1], &m, sizeof(int));
		}
	} else {
		for (i = 0; i < LOOPS; i++) {
			write(pipe_1[1], &m, sizeof(int));
			read(pipe_2[0], &m, sizeof(int));
		}
	}

	return 0;
}
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
From: Vince Weaver @ 2009-06-29 18:25 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

Hello

* Ingo Molnar <mingo@elte.hu> wrote:

> That is in the 0.0001% measurement overhead range (per 'perf stat'
> invocation) for any realistic app that does something worth
> measuring

I'm just curious about this "app worth measuring" idea. Do you
intend for performance counters to simply be "oprofile done right",
or do you intend it to be a generic way of exposing performance
counters to userspace?

For the research my co-workers and I are currently working on, the
former is uninteresting. If we wanted oprofile, we'd use it. What
matters for us is getting very exact counts of counters on programs
that are being run as deterministically as possible. This includes
very small programs, and counts like retired instructions,
load/store ratios, uop counts, etc. This may be uninteresting to
you, but it is important to us. Hence my interest in the
capabilities of the infrastructure finally getting merged into the
kernel.

> Besides, you compare perfcounters to perfmon

What else should I be comparing it to?

> (which you seem to be a contributor of)

Is that not allowed?

> workloads? [ In fact in one of the scheduler-tests perfmon has a
> whopping measurement overhead of _nine billion_ cycles: it
> increased the total runtime of the workload from 3.3 seconds to
> 6.6 seconds. (!) ]

I'm sure the perfmon2 people would welcome any patches you have to
fix this problem.

As I said, I am looking for aggregate counts for deterministic
programs. Compared to the overheads of 50x for DBI-based tools like
Valgrind, or 1000x for "cycle-accurate" simulations, even an
overhead of 2x really isn't that bad.

Counting cycles or time is always a dangerous thing when performance
counters are involved. Things as trivial as the compiler, object
link order, length of the executable name, number of environment
variables, number of ELF auxiliary vectors, etc, can all vastly
change the results you get. I'd recommend the following paper for
more details:

  "Producing wrong data without doing anything obviously wrong"
  by Mytkowicz et al.
  http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf

> If the 5 thousand cycles of measurement overhead _still_ matter to
> you under such circumstances, then by all means please submit the
> patches to improve it. Despite your claims this is totally fixable
> with the current perfcounters design: Peter outlined the steps of
> how to solve it, and you can utilize ptrace if you want to.

Is it really "totally" fixable? I don't just mean getting the
overhead from ~3000 down to ~100, I mean down to zero.

> Here are the more detailed perfmon/pfmon measurement overhead
> numbers.
>
> ...
>
> I.e. this workload runs 17% slower under pfmon; the measurement
> overhead is about 1.45 billion cycles.
>
> ...
>
> That's about 94% measurement overhead, or about 9.2 _billion_
> cycles of overhead on this test-system.

I'm more interested in very CPU-intensive benchmarks. I ran some
experiments with gcc and equake from the spec2k benchmark suite.
This is on a 32-bit AMD Athlon(tm) XP 2000+ machine.

gcc.200 (spec2k)

  + 2.6.30-03984-g45e3e19, configured with perf counters disabled
      108.44s +/- 0.7
  + 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --
      109.17s +/- 0.7
    *** For a slowdown of about 0.6%

  + 2.6.29.5 (unpatched)
      115.31s +/- 0.5
  + 2.6.29.5 with perfmon2 patches applied,
    pfmon -e retired_instructions,cpu_clk_unhalted
      115.62s +/- 0.5
    *** For a slowdown of about 0.2%

So in this case perfmon2 had less overhead, though the overhead is
so small as to be lost in the noise. Why the 2.6.30-git kernel seems
to be much faster on this hardware, I don't know.

equake (spec2k)

  + 2.6.30-03984-g45e3e19, configured with perf counters disabled
      392.77s +/- 1.5
  + 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --
      393.45s +/- 0.7
    *** For a slowdown of about 0.17%

  + 2.6.29.5 (unpatched)
      429.25s +/- 1.7
  + 2.6.29.5 with perfmon2 patches applied,
    pfmon -e retired_instructions,cpu_clk_unhalted
      428.91s +/- 0.8
    *** For a _speedup_ of about 0.08%

So again the difference in overheads is in the noise. Again, I am
not sure why 2.6.30-git is so much faster on this hardware.

As for counter results, in this case retired instructions:

  gcc.200
    perf:    72,618,643,132 +/- 8 million
    pfmon:   72,618,519,792 +/- 5 million

  equake
    perf:   144,952,319,472 +/- 8000
    pfmon:  144,952,327,906 +/- 500

So in the equake case you can easily see that the few-thousand
instruction overhead from perf can show up even on long-running
programs.

In any case, the point I am trying to make is that performance
counters are used by a wide variety of people in a wide variety of
ways, with lots of different performance/accuracy tradeoffs. Don't
limit the API just because you can't envision a use for certain
features.

Vince
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
From: Ingo Molnar @ 2009-06-29 21:02 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

* Vince Weaver <vince@deater.net> wrote:

>> If the 5 thousand cycles of measurement overhead _still_ matter
>> to you under such circumstances, then by all means please submit
>> the patches to improve it. Despite your claims this is totally
>> fixable with the current perfcounters design: Peter outlined the
>> steps of how to solve it, and you can utilize ptrace if you want
>> to.
>
> Is it really "totally" fixable? I don't just mean getting the
> overhead from ~3000 down to ~100, I mean down to zero.

The thing is, not even pfmon gets it down to zero:

  pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
  1000001 INSTRUCTIONS_RETIRED

So ... do you take the hardliner purist view and consider it crap
due to that imprecision, or do you take the pragmatist view of also
considering the relative relevance of any imperfection? ;-)

	Ingo
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
From: Vince Weaver @ 2009-07-02 21:07 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

Sorry for the delay in responding, I was away.

On Mon, 29 Jun 2009, Ingo Molnar wrote:

> The thing is, not even pfmon gets it down to zero:
>
>   pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
>   1000001 INSTRUCTIONS_RETIRED
>
> So ... do you take the hardliner purist view and consider it crap
> due to that imprecision, or do you take the pragmatist view of
> also considering the relative relevance of any imperfection? ;-)

As I said in a previous post, on most x86 chips the
instructions_retired counter also includes any hardware interrupts
that occur during the process's runtime. So any clock interrupts,
etc, show up as an extra instruction. On the "million" benchmark
that is usually +/- 2 extra instructions.

It looks like support might be added to perfcounters to track these
hardware interrupt stats per-process, which would be great, as they
have been really hard to quantify so far.

In any case, it looks like the changes to make perf have lower
overhead have been merged, which makes me happy. Thank you.

Vince
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-02 21:07 ` Vince Weaver
@ 2009-07-03  7:58   ` Ingo Molnar
  2009-07-03 21:43     ` Vince Weaver
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2009-07-03 7:58 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

* Vince Weaver <vince@deater.net> wrote:

> as I said in a previous post, on most x86 chips the
> instructions_retired counter also includes any hardware interrupts
> that occur during the process runtime. So any clock interrupts,
> etc, show up as an extra instruction. So on the "million"
> benchmark, it's usually +/- 2 extra instructions.

Yeah. But it has nothing to do with the function you are measuring,
right? My general point is really that what matters is the
statistical validity of the end result. I don't think you ever
disagreed with that point - you just seem to have a lower
noise-acceptance threshold ;-)

> It looks like support might be added to perfcounters to track
> these hardware interrupt stats per-process, which would be great,
> as it's been really hard to quantify that currently.

Yeah. There's a patch-set in the works that attempts to do something
in this area - see these mails on lkml:

    perf_counter: Add Generalized Hardware interrupt support

Right now they are just convenience wrappers around CPU-model-specific
hw events - but we could extend the whole thing with software counters
as well and isolate per-IRQ-vector events and counts, by adding a
callback to do_IRQ(). That would give a mixture of hardware and
software counter based IRQ instrumentation features that looks quite
compelling. Any comments on what features/capabilities you'd like to
see in this area?

> In any case, it looks like the changes to make perf have lower
> overhead have been merged, which makes me happy. Thank you.

You are welcome :)

Btw., perfcounters still has no support for older Intel CPUs such as
P3's and P2's - and they have pretty sane PMUs - so if you have such a
machine (which your perfmon contribution suggests you might have/had)
and are interested, it would be nice to get support for them. P4
support is interesting too, but more challenging.

	Ingo
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-03  7:58 ` Ingo Molnar
@ 2009-07-03 21:43   ` Vince Weaver
  0 siblings, 0 replies; 27+ messages in thread
From: Vince Weaver @ 2009-07-03 21:43 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

On Fri, 3 Jul 2009, Ingo Molnar wrote:

> That would give a mixture of hardware and software counter based IRQ
> instrumentation features that looks quite compelling. Any comments
> on what features/capabilities you'd like to see in this area?

I'm mainly interested in just an aggregate total of "this many
interrupts occurred". It wouldn't even need to be separated out by
type or number. I don't know if the metric would be useful to anyone
else. I tried to hack this up a long time ago, to have the result
reported with rusage(), but never got anywhere with it.

> Btw., perfcounters still has no support for older Intel CPUs such as
> P3's and P2's - and they have pretty sane PMUs - so if you have such a
> machine (which your perfmon contribution suggests you might have/had)
> and are interested, it would be nice to get support for them. P4
> support is interesting too, but more challenging.

I was indeed the one who got perfmon2 running on the Pentium Pro,
Pentium II, and MIPS R12k. In all of those cases, though, there was an
existing PMU driver; I just added the appropriate "case" statements to
enable support, and then provided an updated list of available
counters to the userspace utility. The only real kernel hacking
involved was the week spent tracking down a hard-to-debug interrupt
issue on the MIPS machine.

Unfortunately I think writing PMU drivers is a bit beyond me, for the
amount of time I have. Especially as the relevant machines I have are
located in relatively inaccessible locations (and PMU mistakes can
lock up the machines), plus it can take the better part of a day to
compile 2.6 kernels on some of those machines.

Vince
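[Editor's note: until per-process interrupt accounting of the kind Vince asks for exists, a rough system-wide proxy can be read from /proc/interrupts. The sketch below is not from the thread; total_interrupts() is a name invented here. It sums every per-CPU count, so sampling it before and after a measured run bounds how many interrupts could have inflated a retired-instruction count.]

```c
#include <ctype.h>
#include <stdio.h>

/* Sum every per-CPU interrupt count in /proc/interrupts.
 * Returns -1 if the file cannot be read.  The delta between two
 * samples bounds the interrupt contamination of a measurement
 * taken in between (system-wide, not per-process). */
static long long total_interrupts(void)
{
	FILE *f = fopen("/proc/interrupts", "r");
	long long total = 0;
	char line[4096];

	if (!f)
		return -1;

	/* Skip the header line listing the CPUs. */
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}

	while (fgets(line, sizeof(line), f)) {
		char *p = line;

		/* Skip the "NN:" / "LOC:" label at the start of each row. */
		while (*p && *p != ':')
			p++;
		if (*p == ':')
			p++;

		/* Sum the per-CPU columns; stop at the first non-numeric
		 * field (the chip name / description). */
		for (;;) {
			long long v = 0;

			while (*p == ' ' || *p == '\t')
				p++;
			if (!isdigit((unsigned char)*p))
				break;
			while (isdigit((unsigned char)*p))
				v = v * 10 + (*p++ - '0');
			total += v;
		}
	}
	fclose(f);
	return total;
}
```

Since the counts are monotonic, two consecutive samples give a nondecreasing pair, which is exactly what makes the delta usable as an upper bound.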
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-02 21:07 ` Vince Weaver
  2009-07-03  7:58 ` Ingo Molnar
@ 2009-07-03 18:31 ` Andi Kleen
  2009-07-03 21:25   ` Vince Weaver
  1 sibling, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2009-07-03 18:31 UTC (permalink / raw)
To: Vince Weaver
Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

Vince Weaver <vince@deater.net> writes:
>
> as I said in a previous post, on most x86 chips the instructions_retired
> counter also includes any hardware interrupts that occur during the
> process runtime.

On the other hand afaik near all chips have interrupt performance
counter events. So if you're willing to waste one of the variable
counter registers you can always count those and then correct based
on the other count.

But the question is of course whether it's worth it; the error should
be really small. Also, you can always lose a few cycles occasionally
to other "random" events.

> So any clock interrupts, etc, show up as an extra
> instruction. So on the "million" benchmark, it's usually +/- 2 extra
> instructions.

A 1-2 error in a million doesn't sound like a catastrophic problem.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-03 18:31 ` Andi Kleen
@ 2009-07-03 21:25   ` Vince Weaver
  2009-07-03 23:40     ` Andi Kleen
  0 siblings, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2009-07-03 21:25 UTC (permalink / raw)
To: Andi Kleen
Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

> Vince Weaver <vince@deater.net> writes:
>>
>> as I said in a previous post, on most x86 chips the instructions_retired
>> counter also includes any hardware interrupts that occur during the
>> process runtime.
>
> On the other hand afaik near all chips have interrupt performance
> counter events.

I guess by "near all" you mean "only AMD"? The AMD event also has
some oddities: it seems to report things like page faults and other
events that don't really match up with the excess instruction count.
I must admit it's been a while since I've looked at that particular
counter.

> But the question is of course whether it's worth it; the error should
> be really small. Also, you can always lose a few cycles occasionally
> to other "random" events.

> A 1-2 error in a million doesn't sound like a catastrophic problem.

Well, it's basically at least HZ extra instructions for every second
your benchmark runs, and unfortunately it's non-deterministic, because
it also depends on keyboard/network/usb/etc. interrupts that may by
chance happen while your program is running.

For me, it's the determinism that matters. Not overhead, not runtime,
not "oh it doesn't matter, it's small". For a deterministic benchmark
I want to get as close to the same value every run as possible. I
admit it might not be possible to always get the same result, but the
closer the better.

This might not match up with the way kernel hackers use perf counters,
but it is important for the work I am doing.

Vince
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-07-03 21:25 ` Vince Weaver
@ 2009-07-03 23:40   ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2009-07-03 23:40 UTC (permalink / raw)
To: Vince Weaver
Cc: Andi Kleen, Ingo Molnar, Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

On Fri, Jul 03, 2009 at 05:25:32PM -0400, Vince Weaver wrote:
> >Vince Weaver <vince@deater.net> writes:
> >>
> >>as I said in a previous post, on most x86 chips the instructions_retired
> >>counter also includes any hardware interrupts that occur during the
> >>process runtime.
> >
> >On the other hand afaik near all chips have interrupt performance
> >counter events.
>
> I guess by "near all" you mean "only AMD"? The AMD event also has some

Intel CPUs typically have a HW_INT.RX event. AMD has a similar event.

> Well, it's basically at least HZ extra instructions for every second
> your benchmark runs, and unfortunately it's non-deterministic, because
> it also depends on keyboard/network/usb/etc. interrupts that may by
> chance happen while your program is running.
>
> For me, it's the determinism that matters. Not overhead, not runtime,

To be honest, I don't think you'll ever be fully deterministic.
Modern computers and operating systems are just too complex, with too
many (often unpredictable) things going on in the background. In my
own experience even simulators (which are much more stable than real
hardware) are not fully deterministic; you'll always run into
problems. If you need 100% determinism, use a simple microcontroller.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
* [patch] perf_counter: Add enable-on-exec attribute
  2009-06-29 18:25 ` Vince Weaver
  2009-06-29 21:02 ` Ingo Molnar
@ 2009-06-29 23:46 ` Ingo Molnar
  2009-06-29 23:55 ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
  2009-06-30  0:05 ` Ingo Molnar
  3 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29 23:46 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

* Vince Weaver <vince@deater.net> wrote:

>> If the 5 thousand cycles measurement overhead _still_ matters to
>> you under such circumstances then by all means please submit the
>> patches to improve it. Despite your claims this is totally
>> fixable with the current perfcounters design; Peter outlined the
>> steps of how to solve it, and you can utilize ptrace if you want to.
>
> Is it really "totally" fixable? I don't just mean getting the
> overhead from ~3000 down to ~100, I mean down to zero.

Yes, it's truly very easy to get exactly the same output as pfmon,
for the 'million.s' test app you posted:

  titan:~> perf stat -e 0:1:u ./million

   Performance counter stats for './million':

         1000001  instructions

    0.000489736  seconds time elapsed

See the small patch below.

( Note that this approach does not use ptrace, hence it can be used
  to measure debuggers too. ptrace attach has the limitation of being
  exclusive - no task can be attached to twice. perfmon used ptrace
  attach, which limited its capabilities unreasonably. )

The question was really not whether we can do it - but whether we
want to do it. I have no strong feelings either way, because as I
told you in my first mail, all the other noise sources in the system
dominate the metrics far more than this very small constant startup
offset.

And the thing is, as a perfmon contributor I assume you have
experience in these matters.
Had you taken a serious, unbiased look at perfcounters, and had this
problem truly bothered you personally, you could have come up with a
similar patch yourself as well, while spending only a fraction of the
energy you are putting into these emails. Instead you ignored our
technical arguments, you refused to touch the code, and you went on
rambling about how perfcounters supposedly cannot solve this problem.
Not very productive IMO.

	Ingo

---------------->
Subject: perf_counter: Add enable-on-exec attribute
From: Ingo Molnar <mingo@elte.hu>
Date: Mon Jun 29 22:05:11 CEST 2009

Add another attribute variant: attr.enable_on_exec. The purpose is to
allow the auto-enabling of such counters on exec(), to measure
exec()-ed workloads precisely, from the first to the last instruction.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/exec.c                    |    3 +--
 include/linux/perf_counter.h |    5 ++++-
 kernel/perf_counter.c        |   39 ++++++++++++++++++++++++++++++++++++---
 tools/perf/builtin-stat.c    |    5 +++--
 4 files changed, 44 insertions(+), 8 deletions(-)

Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c
+++ linux/fs/exec.c
@@ -996,8 +996,7 @@ int flush_old_exec(struct linux_binprm *
 	 * Flush performance counters when crossing a
 	 * security domain:
 	 */
-	if (!get_dumpable(current->mm))
-		perf_counter_exit_task(current);
+	perf_counter_exec(current);
 
 	/* An exec changes our domain. We are no longer part of the
 	   thread group */

Index: linux/include/linux/perf_counter.h
===================================================================
--- linux.orig/include/linux/perf_counter.h
+++ linux/include/linux/perf_counter.h
@@ -179,8 +179,9 @@ struct perf_counter_attr {
 				comm           : 1, /* include comm data    */
 				freq           : 1, /* use freq, not period */
 				inherit_stat   : 1, /* per task counts      */
+				enable_on_exec : 1, /* enable on exec       */
 
-				__reserved_1   : 52;
+				__reserved_1   : 51;
 
 	__u32			wakeup_events;	/* wakeup every n events */
 	__u32			__reserved_2;
@@ -712,6 +713,7 @@ static inline void perf_counter_mmap(str
 
 extern void perf_counter_comm(struct task_struct *tsk);
 extern void perf_counter_fork(struct task_struct *tsk);
+extern void perf_counter_exec(struct task_struct *tsk);
 
 extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
 
@@ -752,6 +754,7 @@ perf_swcounter_event(u32 event, u64 nr,
 static inline void perf_counter_mmap(struct vm_area_struct *vma)	{ }
 static inline void perf_counter_comm(struct task_struct *tsk)		{ }
 static inline void perf_counter_fork(struct task_struct *tsk)		{ }
+static inline void perf_counter_exec(struct task_struct *tsk)		{ }
 
 static inline void perf_counter_init(void)				{ }

Index: linux/kernel/perf_counter.c
===================================================================
--- linux.orig/kernel/perf_counter.c
+++ linux/kernel/perf_counter.c
@@ -903,6 +903,9 @@ static void perf_counter_enable(struct p
 	struct perf_counter_context *ctx = counter->ctx;
 	struct task_struct *task = ctx->task;
 
+	if (counter->attr.enable_on_exec)
+		return;
+
 	if (!task) {
 		/*
 		 * Enable the counter on the cpu that it's on
@@ -2856,6 +2859,32 @@ void perf_counter_fork(struct task_struc
 	perf_counter_fork_event(&fork_event);
 }
 
+void perf_counter_exec(struct task_struct *task)
+{
+	struct perf_counter_context *ctx;
+	struct perf_counter *counter;
+
+	if (!get_dumpable(task->mm)) {
+		perf_counter_exit_task(task);
+		return;
+	}
+
+	if (!task->perf_counter_ctxp)
+		return;
+
+	rcu_read_lock();
+	ctx = task->perf_counter_ctxp;
+	if (ctx) {
+		list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+			if (counter->attr.enable_on_exec) {
+				counter->attr.enable_on_exec = 0;
+				__perf_counter_enable(counter);
+			}
+		}
+	}
+	rcu_read_unlock();
+}
+
 /*
  * comm tracking
  */
@@ -4064,10 +4093,14 @@ inherit_counter(struct perf_counter *par
 	 * not its attr.disabled bit. We hold the parent's mutex,
 	 * so we won't race with perf_counter_{en, dis}able_family.
 	 */
-	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
-		child_counter->state = PERF_COUNTER_STATE_INACTIVE;
-	else
+	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE) {
+		if (child_counter->attr.enable_on_exec)
+			child_counter->state = PERF_COUNTER_STATE_OFF;
+		else
+			child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+	} else {
 		child_counter->state = PERF_COUNTER_STATE_OFF;
+	}
 
 	if (parent_counter->attr.freq)
 		child_counter->hw.sample_period = parent_counter->hw.sample_period;

Index: linux/tools/perf/builtin-stat.c
===================================================================
--- linux.orig/tools/perf/builtin-stat.c
+++ linux/tools/perf/builtin-stat.c
@@ -116,8 +116,9 @@ static void create_perf_stat_counter(int
 				  fd[cpu][counter], strerror(errno));
 		}
 	} else {
-		attr->inherit	= inherit;
-		attr->disabled	= 1;
+		attr->inherit        = inherit;
+		attr->disabled       = 1;
+		attr->enable_on_exec = 1;
 
 		fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
 		if (fd[0][counter] < 0 && verbose)
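[Editor's note: this attribute survived into the modern perf_event_open() interface as perf_event_attr.enable_on_exec. The sketch below is not part of the patch; count_exec_insns() is a name invented here, and the crude usleep() used to win the attach-before-exec race is a placeholder for real synchronization. It counts an exec'ed child's user-space instructions from its first instruction, returning -1 when perf events are unavailable (permissions, seccomp, missing PMU).]

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <string.h>
#include <unistd.h>

/* Count retired user-space instructions of an exec'ed child, using
 * enable_on_exec so counting starts exactly at the exec().
 * Returns the count, 0 if the attach lost the race with the exec,
 * or -1 if perf events are unavailable on this system. */
static long long count_exec_insns(const char *prog)
{
	struct perf_event_attr attr;
	long long count = -1;
	int fd, status;
	pid_t child;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		/* Crude synchronization: give the parent a moment to
		 * attach.  enable_on_exec keeps the counter off until
		 * this exec succeeds. */
		usleep(50000);
		execlp(prog, prog, (char *)NULL);
		_exit(127);
	}

	memset(&attr, 0, sizeof(attr));
	attr.size           = sizeof(attr);
	attr.type           = PERF_TYPE_HARDWARE;
	attr.config         = PERF_COUNT_HW_INSTRUCTIONS;
	attr.disabled       = 1;
	attr.enable_on_exec = 1;	/* the bit added by the patch above */
	attr.exclude_kernel = 1;
	attr.exclude_hv     = 1;

	/* perf_event_open has no glibc wrapper; invoke it directly. */
	fd = syscall(__NR_perf_event_open, &attr, child, -1, -1, 0);

	waitpid(child, &status, 0);

	if (fd < 0)
		return -1;
	if (read(fd, &count, sizeof(count)) != sizeof(count))
		count = -1;
	close(fd);
	return count;
}
```

A real tool would replace the usleep() with the pipe handshake discussed elsewhere in this thread, so the child cannot exec before the counter is in place.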
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-29 18:25 ` Vince Weaver
  2009-06-29 21:02 ` Ingo Molnar
  2009-06-29 23:46 ` [patch] perf_counter: Add enable-on-exec attribute Ingo Molnar
@ 2009-06-29 23:55 ` Ingo Molnar
  2009-06-30  0:05 ` Ingo Molnar
  3 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29 23:55 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

* Vince Weaver <vince@deater.net> wrote:

>> Besides, you compare perfcounters to perfmon
>
> what else should I be comparing it to?
>
>> (which you seem to be a contributor of)
>
> is that not allowed?

Here's the full, uncropped sentence I wrote:

  " Besides, you compare perfcounters to perfmon (which you seem to
    be a contributor of), while in reality perfmon has much, much
    worse (and unfixable, because designed-in) measurement
    overhead. "

Where I question the blatant hypocrisy of bringing up perfmon as a
good example, while in reality perfmon has far worse measurement
overhead than perfcounters, for a wide range of workloads.

As far as I can see you didn't answer my questions: why are you
dismissing perfcounters for a minor, once-per-startup measurement
offset (which is entirely fixable - see the patch I sent), while you
generously allow perfmon to have serious, 90% measurement overhead,
amounting to billions of instructions of overhead per second, for
certain workloads?

	Ingo
* Re: [numbers] perfmon/pfmon overhead of 17%-94%
  2009-06-29 18:25 ` Vince Weaver
  ` (2 preceding siblings ...)
  2009-06-29 23:55 ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
@ 2009-06-30  0:05 ` Ingo Molnar
  3 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-30 0:05 UTC (permalink / raw)
To: Vince Weaver; +Cc: Peter Zijlstra, Paul Mackerras, linux-kernel, Mike Galbraith

* Vince Weaver <vince@deater.net> wrote:

>> workloads? [ In fact in one of the scheduler-tests perfmon has a
>> whopping measurement overhead of _nine billion_ cycles, it
>> increased total runtime of the workload from 3.3 seconds to 6.6
>> seconds. (!) ]
>
> I'm sure the perfmon2 people would welcome any patches you have to
> fix this problem.

I think this flaw of perfmon is unfixable, because perfmon (by design)
uses a _way_ too low-level and way too opaque, structure-less
abstraction for the PMU, which disallows the kind of high-level
optimizations that perfcounters can do.

We weren't silent about this - to the contrary. Last November Thomas
and I _did_ take a good look at the perfmon patches (we are
maintaining the code areas affected by perfmon), we saw that it has
unfixable problems, and we came up with objections - and later on
came up with patches that fix these problems: the perfcounters
subsystem.

>> That's an about 94% measurement overhead, or about 9.2 _billion_
>> cycles overhead on this test-system.
>
> I'm more interested in very CPU-intensive benchmarks. I ran some
> experiments with gcc and equake from the spec2k benchmark suite.

The workloads I cited are _all_ 100% CPU-intensive benchmarks:

 - hackbench
 - loop-pipe-1-million

But I could add 'lat_tcp localhost', 'bw_tcp localhost' or sysbench
to the list - all show very significant overhead under perfmon.

These are all important workloads and important benchmarks. A
kernel-based performance analysis facility that is any good must
handle them transparently.

	Ingo
* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-27  6:04 ` performance counter ~0.4% " Ingo Molnar
  2009-06-27  6:44 ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
@ 2009-06-27  6:48 ` Paul Mackerras
  2009-06-27 17:28   ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Mackerras @ 2009-06-27 6:48 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Vince Weaver, Peter Zijlstra, linux-kernel

Ingo Molnar writes:

> I measured 2000, but generally a few thousand cycles per invocation
> sounds about right.

We could actually do a bit better than we do, fairly easily. We could
attach the counters to the child after the fork instead of the parent
before the fork, using a couple of pipes for synchronization. And
there's probably a way to get the dynamic linker to resolve the execvp
call early in the child so we avoid that overhead. I think we should
be able to get the overhead down to tens of userspace instructions
without doing anything unnatural.

Paul.
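[Editor's note: the fork-then-attach scheme Paul describes can be sketched as below. This is not code from the thread; run_synced() is a name invented here, and attach_counters_to() is an empty placeholder for the real perf attach step. The child blocks on a pipe until the parent has set up its counters, then execs.]

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder: in the real tool, counters would be attached to 'pid'
 * here, while the child is still parked on the pipe. */
static void attach_counters_to(pid_t pid)
{
	(void)pid;
}

/* Fork, attach counters to the stopped-on-pipe child, then release it
 * to exec 'prog'.  Returns the child's exit status, or -1 on error. */
static int run_synced(const char *prog)
{
	int go[2], status;
	pid_t child;

	if (pipe(go) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;

	if (child == 0) {
		char byte;

		close(go[1]);
		/* Block until the parent says the counters are in place. */
		if (read(go[0], &byte, 1) != 1)
			_exit(126);
		close(go[0]);
		execlp(prog, prog, (char *)NULL);
		_exit(127);	/* exec failed */
	}

	close(go[0]);
	attach_counters_to(child);	/* child is still waiting */
	write(go[1], "g", 1);		/* release the child */
	close(go[1]);

	waitpid(child, &status, 0);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the counters go on the child after the fork, none of the parent's fork/bookkeeping instructions land in the measurement; only the exec and the workload itself remain inside the counted window.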
* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-27  6:48 ` performance counter ~0.4% error finding retired instruction count Paul Mackerras
@ 2009-06-27 17:28 ` Ingo Molnar
  2009-06-29  2:12   ` Paul Mackerras
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2009-06-27 17:28 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Vince Weaver, Peter Zijlstra, linux-kernel

* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
>
>> I measured 2000, but generally a few thousand cycles per
>> invocation sounds about right.
>
> We could actually do a bit better than we do, fairly easily. We
> could attach the counters to the child after the fork instead of
> the parent before the fork, using a couple of pipes for
> synchronization. And there's probably a way to get the dynamic
> linker to resolve the execvp call early in the child so we avoid
> that overhead. I think we should be able to get the overhead down
> to tens of userspace instructions without doing anything
> unnatural.

Definitely so.

	Ingo
* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-27 17:28 ` Ingo Molnar
@ 2009-06-29  2:12 ` Paul Mackerras
  2009-06-29  2:13   ` Paul Mackerras
  2009-06-29  3:48   ` Ingo Molnar
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Mackerras @ 2009-06-29 2:12 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

I can think of three ways to eliminate the PLT resolver overhead on
execvp:

(1) Do execvp on a non-executable file first to get execvp resolved:

	char tmpname[16];
	int fd;
	char *args[2];

	strcpy(tmpname, "/tmp/perfXXXXXX");
	fd = mkstemp(tmpname);
	if (fd >= 0) {
		args[0] = tmpname;
		args[1] = NULL;
		execvp(tmpname, args);	/* fails: file isn't executable */
		close(fd);
		unlink(tmpname);
	}
	enable_counters();
	execvp(prog, argv);

(2) Look up execvp in glibc and call it directly:

	int (*execptr)(const char *, char *const []);

	execptr = dlsym(RTLD_NEXT, "execvp");
	enable_counters();
	(*execptr)(prog, argv);

(3) Resolve the executable path ourselves and then invoke the execve
    system call directly:

	char *execpath;

	execpath = search_path(getenv("PATH"), prog);
	enable_counters();
	syscall(__NR_execve, execpath, argv, envp);

(4) Same as (1), but rely on "" being an invalid program name for
    execvp:

	execvp("", argv);
	enable_counters();
	execvp(prog, argv);

What do you guys think? Does any of these appeal more than the
others? I'm leaning towards (4) myself.

Paul.
* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-29  2:12 ` Paul Mackerras
@ 2009-06-29  2:13 ` Paul Mackerras
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2009-06-29 2:13 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

Paul Mackerras writes:

> I can think of three ways to eliminate the PLT resolver overhead on
> execvp:

s/three/four/, obviously - I thought of the 4th while I was writing
the mail.

Paul.
* Re: performance counter ~0.4% error finding retired instruction count
  2009-06-29  2:12 ` Paul Mackerras
  2009-06-29  2:13 ` Paul Mackerras
@ 2009-06-29  3:48 ` Ingo Molnar
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-06-29 3:48 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Peter Zijlstra, linux-kernel

* Paul Mackerras <paulus@samba.org> wrote:

> I can think of three ways to eliminate the PLT resolver overhead on
> execvp:
>
> (1) Do execvp on a non-executable file first to get execvp resolved:
>
> 	char tmpname[16];
> 	int fd;
> 	char *args[2];
>
> 	strcpy(tmpname, "/tmp/perfXXXXXX");
> 	fd = mkstemp(tmpname);
> 	if (fd >= 0) {
> 		args[0] = tmpname;
> 		args[1] = NULL;
> 		execvp(tmpname, args);
> 		close(fd);
> 		unlink(tmpname);
> 	}
> 	enable_counters();
> 	execvp(prog, argv);
>
> (2) Look up execvp in glibc and call it directly:
>
> 	int (*execptr)(const char *, char *const []);
>
> 	execptr = dlsym(RTLD_NEXT, "execvp");
> 	enable_counters();
> 	(*execptr)(prog, argv);
>
> (3) Resolve the executable path ourselves and then invoke the execve
>     system call directly:
>
> 	char *execpath;
>
> 	execpath = search_path(getenv("PATH"), prog);
> 	enable_counters();
> 	syscall(__NR_execve, execpath, argv, envp);
>
> (4) Same as (1), but rely on "" being an invalid program name for
>     execvp:
>
> 	execvp("", argv);
> 	enable_counters();
> 	execvp(prog, argv);
>
> What do you guys think? Does any of these appeal more than the
> others? I'm leaning towards (4) myself.

(4) looks convincingly elegant.

We could also do (5): a one-shot counters-disabled ptrace run of the
target, then enable-counters-in-target + ptrace-detach after the
first stop.

	Ingo
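[Editor's note: a minimal demonstration of option (4), not from the thread; exec_with_warm_plt() is a name invented here. execvp("") fails immediately (ENOENT on glibc), but the failing call already forces the dynamic linker to resolve the execvp PLT entry, so the second, real execvp runs without the resolver overhead inside the measured window.]

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that warms the execvp PLT entry with a failing call
 * before exec'ing the real program.  Returns the child's exit status,
 * or -1 on error. */
static int exec_with_warm_plt(const char *prog)
{
	int status;
	pid_t child;

	child = fork();
	if (child < 0)
		return -1;

	if (child == 0) {
		char *argv[] = { (char *)prog, NULL };

		execvp("", argv);	/* always fails; resolves the PLT */
		/* enable_counters() would go here in the real tool
		 * (placeholder - not implemented in this sketch) */
		execvp(prog, argv);
		_exit(127);		/* real exec failed too */
	}

	waitpid(child, &status, 0);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The appeal of (4) over (1)-(3) is that it needs no temp files, no dlsym, and no hand-rolled PATH search; the deliberately failing call is the entire trick.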
end of thread, other threads: [~2009-07-03 23:40 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-24 13:59 performance counter 20% error finding retired instruction count Vince Weaver
2009-06-24 15:10 ` Ingo Molnar
2009-06-25  2:12 ` Vince Weaver
2009-06-25  6:50 ` Peter Zijlstra
2009-06-25  9:13 ` Ingo Molnar
2009-06-26 18:22 ` Vince Weaver
2009-06-26 19:12 ` Peter Zijlstra
2009-06-27  5:32 ` Ingo Molnar
2009-06-26 19:23 ` Vince Weaver
2009-06-27  6:04 ` performance counter ~0.4% " Ingo Molnar
2009-06-27  6:44 ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
2009-06-29 18:25 ` Vince Weaver
2009-06-29 21:02 ` Ingo Molnar
2009-07-02 21:07 ` Vince Weaver
2009-07-03  7:58 ` Ingo Molnar
2009-07-03 21:43 ` Vince Weaver
2009-07-03 18:31 ` Andi Kleen
2009-07-03 21:25 ` Vince Weaver
2009-07-03 23:40 ` Andi Kleen
2009-06-29 23:46 ` [patch] perf_counter: Add enable-on-exec attribute Ingo Molnar
2009-06-29 23:55 ` [numbers] perfmon/pfmon overhead of 17%-94% Ingo Molnar
2009-06-30  0:05 ` Ingo Molnar
2009-06-27  6:48 ` performance counter ~0.4% error finding retired instruction count Paul Mackerras
2009-06-27 17:28 ` Ingo Molnar
2009-06-29  2:12 ` Paul Mackerras
2009-06-29  2:13 ` Paul Mackerras
2009-06-29  3:48 ` Ingo Molnar