* Perf event for Wall-time based sampling?
@ 2014-09-18 12:32 Milian Wolff
  2014-09-18 13:23 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 25+ messages in thread
From: Milian Wolff @ 2014-09-18 12:32 UTC (permalink / raw)
To: linux-perf-users

Hello,

is it somehow possible to use perf based on some kernel timer? I'd like to get
an overview of where a userspace application is spending time, both on-CPU as
well as waiting off-CPU - e.g. something similar to using GDB as a poor-man's
profiler, regularly interrupting the process and investigating the callgraphs.
This is quite efficient for a high-level overview when you want to figure out
where time is spent, unrelated to how it was actually spent (cpu, thread locks,
io wait, ...).

E.g. what event would I use for a simple application like this:

~~~~~~~~~~~~~~
#include <unistd.h>

int main()
{
    sleep(10);
    return 0;
}
~~~~~~~~~~~~~~

Which perf event would show me that most of the time is spent sleeping? I
tried something like this to no avail:

$ perf record --call-graph dwarf -e cpu-clock -F 100 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB perf.data (~304 samples) ]
perf report --stdio
Error:
The perf.data file has no samples!
# To display the perf.data header info, please use --header/--header-only options.

I read https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times
and tried it out. The result is odd, as I get the "same" backtrace multiple
times, all with 100% cost:

~~~~~~~~~~~~~~~~~~~~~~~~~~
100.00%   0.00%         0  a.out  libc-2.19.so       [.] __GI___libc_nanosleep
          |
          --- __GI___libc_nanosleep

100.00%   0.00%         0  a.out  [kernel.kallsyms]  [k] system_call_fastpath
          |
          --- system_call_fastpath
              __GI___libc_nanosleep

100.00%   0.00%         0  a.out  [kernel.kallsyms]  [k] sys_nanosleep
          |
          --- sys_nanosleep
              system_call_fastpath
              __GI___libc_nanosleep

100.00%   0.00%         0  a.out  [kernel.kallsyms]  [k] hrtimer_nanosleep
          |
          --- hrtimer_nanosleep
              sys_nanosleep
              system_call_fastpath
              __GI___libc_nanosleep
~~~~~~~~~~~~~~~~~~~~~~~~~~

And generally, this would *only* profile sleep time and would ignore the
on-CPU time (and maybe thread waits) and so forth.

Is there a technical reason why it is not possible to use a plain timer as
a sampling event? If I'm not mistaken, then Intel VTune actually uses a
similar technique for its simple profiling modes which can already give
extremely useful data - both to find CPU hotspots as well as locks&waits.

Bye
--
Milian Wolff
mail@milianw.de
http://milianw.de

^ permalink raw reply	[flat|nested] 25+ messages in thread
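For context: cpu-clock and task-clock only tick while the task is actually running on a CPU, so a process that spends its life inside sleep() produces essentially no samples with them - which is why the record above came back empty. The sleep-time recipe from the wiki page referenced above is roughly the following sketch (it needs root and a kernel with schedstats available; on newer kernels the sysctl in the first line must be enabled first):

~~~~~~~~~~~~~~
# Record sleep/switch tracepoints with callchains, then let perf inject fold
# the sched_stat data into the sched_switch samples so report can weight them:
echo 1 > /proc/sys/kernel/sched_schedstats          # newer kernels only; older ones just need CONFIG_SCHEDSTATS
perf record -e sched:sched_stat_sleep -e sched:sched_switch \
            -e sched:sched_process_exit -g -o perf.data.raw ./a.out
perf inject -v -s -i perf.data.raw -o perf.data
perf report --stdio -i perf.data
~~~~~~~~~~~~~~

This only covers the off-CPU half of the question. The wall of identical 100% entries in the paste above looks like the (then new) --children default of perf report at work: every frame up the chain gets the full inclusive cost, a point that comes up again later in the thread.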
* Re: Perf event for Wall-time based sampling? 2014-09-18 12:32 Perf event for Wall-time based sampling? Milian Wolff @ 2014-09-18 13:23 ` Arnaldo Carvalho de Melo 2014-09-18 13:41 ` Milian Wolff 0 siblings, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 13:23 UTC (permalink / raw) To: Milian Wolff; +Cc: linux-perf-users Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > Hello, > > is it somehow possible to use perf based on some kernel timer? I'd like to get Try with tracepoints or with probe points combined with callchains instead of using a hardware counter. - Arnaldo > an overview of where a userspace application is spending time, both on-CPU as > well as waiting off-CPU. E.g. something similar to using GDB as a poor-mans > profiler and regularly interrupting the process and investigating the > callgraphs. This is quite efficient for a high-level overview when you want to > figure out where time is spent, unrelated to how it was actually spent (cpu, > thread locks, io wait, ...). > > E.g. what event would I use for a simple application like this: > > ~~~~~~~~~~~~~~ > #include <unistd.h> > > int main() > { > sleep(10); > return 0; > } > ~~~~~~~~~~~~~~ > > Which perf event would show me that most of the time is spent sleeping? I > tried something like this to no avail: > > $ perf record --call-graph dwarf -e cpu-clock -F 100 ./a.out > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.007 MB perf.data (~304 samples) ] > perf report --stdio > Error: > The perf.data file has no samples! > # To display the perf.data header info, please use --header/--header-only > options. > > I read https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times > and tried it out. The result is odd, as I get the "same" backtrace multiple > times, all with 100% cost: > > ~~~~~~~~~~~~~~~~~~~~~~~~~~ > 100.00% 0.00% 0 a.out libc-2.19.so [.] > __GI___libc_nanosleep > | > --- __GI___libc_nanosleep > > 100.00% 0.00% 0 a.out [kernel.kallsyms] [k] > system_call_fastpath > | > --- system_call_fastpath > __GI___libc_nanosleep > > 100.00% 0.00% 0 a.out [kernel.kallsyms] [k] > sys_nanosleep > | > --- sys_nanosleep > system_call_fastpath > __GI___libc_nanosleep > > 100.00% 0.00% 0 a.out [kernel.kallsyms] [k] > hrtimer_nanosleep > | > --- hrtimer_nanosleep > sys_nanosleep > system_call_fastpath > __GI___libc_nanosleep > ~~~~~~~~~~~~~~~~~~~~~~~~~~ > > And generally, this would *only* profile sleep time and would ignore the on- > CPU time (and maybe thread waits) and so forth. > > Is there a technical reason on why it is not possible to use a plain timer as > a sampling event? If I'm not mistaken, then Intel VTune actually uses a > similar technique for its simple profiling modes which can already give > extremely useful data - both to find CPU hotspots as well as locks&waits. > > Bye > -- > Milian Wolff > mail@milianw.de > http://milianw.de > -- > To unsubscribe from this list: send the line "unsubscribe linux-perf-users" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling?
  2014-09-18 13:23 ` Arnaldo Carvalho de Melo
@ 2014-09-18 13:41 ` Milian Wolff
  2014-09-18 14:51   ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 25+ messages in thread
From: Milian Wolff @ 2014-09-18 13:41 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote:
> Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu:
> > Hello,
> >
> > is it somehow possible to use perf based on some kernel timer? I'd like to
> > get
> Try with tracepoints or with probe points combined with callchains
> instead of using a hardware counter.

Hey Arnaldo,

where would you add such tracepoints? Or what tracepoint would you use? And
what is the difference between tracepoints and probe points (I'm only aware
of `perf probe`)?

Bye
--
Milian Wolff
mail@milianw.de
http://milianw.de

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-18 13:41 ` Milian Wolff @ 2014-09-18 14:51 ` Arnaldo Carvalho de Melo 2014-09-18 15:26 ` Milian Wolff 0 siblings, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 14:51 UTC (permalink / raw) To: Milian Wolff; +Cc: linux-perf-users Em Thu, Sep 18, 2014 at 03:41:20PM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > > > is it somehow possible to use perf based on some kernel timer? I'd like to > > > get > > Try with tracepoints or with probe points combined with callchains > > instead of using a hardware counter. > where would you add such tracepoints? Or what tracepoint would you use? And > what is the difference between tracepoints and probe points (I'm only aware of > `perf probe`). tracepoints are places in the kernel (and in userspace as well, came later tho) where developers put in place for later tracing. They are super optimized, so have a lower cost than 'probe points' that you can put in place using 'perf probe'. To see the tracepoints, or any other available event in your system, use 'perf list'. The debugfs filesystem will need to be mounted, but that will be transparently done if the user has enough privileges. For instance, here are some tracepoints that you may want to use: [root@zoo ~]# perf list sched:* sched:sched_kthread_stop [Tracepoint event] sched:sched_kthread_stop_ret [Tracepoint event] sched:sched_wakeup [Tracepoint event] sched:sched_wakeup_new [Tracepoint event] sched:sched_switch [Tracepoint event] sched:sched_migrate_task [Tracepoint event] sched:sched_process_free [Tracepoint event] sched:sched_process_exit [Tracepoint event] sched:sched_wait_task [Tracepoint event] sched:sched_process_wait [Tracepoint event] sched:sched_process_fork [Tracepoint event] sched:sched_process_exec [Tracepoint event] sched:sched_stat_wait [Tracepoint event] sched:sched_stat_sleep [Tracepoint event] sched:sched_stat_iowait [Tracepoint event] sched:sched_stat_blocked [Tracepoint event] sched:sched_stat_runtime [Tracepoint event] sched:sched_pi_setprio [Tracepoint event] sched:sched_move_numa [Tracepoint event] sched:sched_stick_numa [Tracepoint event] sched:sched_swap_numa [Tracepoint event] sched:sched_wake_idle_without_ipi [Tracepoint event] [root@zoo ~] All tracepoints, for the system scheduler. You can ask for all of them to be collected, togethed with callchains, using: [root@zoo ~]# perf record -a -e sched:* -g usleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.294 MB perf.data (~56528 samples) ] [root@zoo ~]# This collected system wide samples for all the scheduler tracepoints for 1 microsecond. Use perf report and it will tell you how many of those tracepoints were hit during this short system wide session, and will allow you to traverse its callchains. I would recommend that you take a look at Brendan Greggs _excellent_ tutorials at: http://www.brendangregg.com/perf.html He will explain all this in way more detail than I briefly skimmed above. :-) - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
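To make the tracepoint vs. probe point distinction concrete: tracepoints are static hooks compiled into the kernel, while probe points are created on demand with perf probe on arbitrary functions. A hedged sketch - the function names are just examples, and symbol availability depends on your kernel and on installed debuginfo:

~~~~~~~~~~~~~~
# Dynamic kernel probe on a function, then sample it with callchains:
perf probe --add do_sys_open
perf record -e probe:do_sys_open -ag -- sleep 5
perf report --stdio

# Dynamic userspace probe (uprobe) in a library works the same way:
perf probe -x /usr/lib64/libc.so.6 malloc
perf record -e probe_libc:malloc -g ./a.out

# Clean up the probes when done:
perf probe --del '*'
~~~~~~~~~~~~~~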
* Re: Perf event for Wall-time based sampling? 2014-09-18 14:51 ` Arnaldo Carvalho de Melo @ 2014-09-18 15:26 ` Milian Wolff 2014-09-18 15:57 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-18 15:26 UTC (permalink / raw) To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 03:41:20PM +0200, Milian Wolff escreveu: > > On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote: > > > Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > > > > is it somehow possible to use perf based on some kernel timer? I'd > > > > like to > > > > get > > > > > > Try with tracepoints or with probe points combined with callchains > > > instead of using a hardware counter. > > > > where would you add such tracepoints? Or what tracepoint would you use? > > And > > what is the difference between tracepoints and probe points (I'm only > > aware of `perf probe`). > > tracepoints are places in the kernel (and in userspace as well, came > later tho) where developers put in place for later tracing. > > They are super optimized, so have a lower cost than 'probe points' that > you can put in place using 'perf probe'. > > To see the tracepoints, or any other available event in your system, use > 'perf list'. > > The debugfs filesystem will need to be mounted, but that will be > transparently done if the user has enough privileges. Thanks for the quick rundown Arnaldo! Sadly, that much I "knew" already, yet am not able to understand how to use for my purpose. > For instance, here are some tracepoints that you may want to use: > > [root@zoo ~]# perf list sched:* <snip> I tried this: # a.out is the result of compiling `int main { sleep(1); return 0; }`: perf record -e sched:* --call-graph dwarf ./a.out perf report -g graph --stdio # the result can be found here: http://paste.kde.org/pflkskwrf How do I have to interpret this? a) This is not Wall-clock profiling, no? It just grabs a callgraph whenever one of the sched* events occurs, none of these events will occur, say, every X ms. b) The callgraphs are really strange, imo. Different traces are printed with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% traces in sched:sched_wakeup. c) Most of the traces point into the kernel, how can I hide these traces and only concentrate on the user-space? Do I have to grep manually for [.] ? I tried something like `perf report --parent "main"` but that makes no difference. > I would recommend that you take a look at Brendan Greggs _excellent_ > tutorials at: > > http://www.brendangregg.com/perf.html > > He will explain all this in way more detail than I briefly skimmed > above. :-) I did that already, but Brendan and the other available Perf documentation mostly concentrates on performance issues in the Kernel. I'm interested purely in the user space. Perf record with one of the hardware PMU events works nicely in that case, but one cannot use it to find locks&waits similar to what VTune offers. Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
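On question c): for sampling events the :u modifier restricts the samples themselves to user mode, and on the report side the view can be narrowed to the DSOs of interest. A sketch, reusing the DSO names from the paste above (tracepoint events still fire in kernel context, so for those only the report-side filtering applies):

~~~~~~~~~~~~~~
# Only take samples while executing user-space code:
perf record --call-graph dwarf -e cycles:u ./a.out

# Or filter at report time, keeping only entries from the binary and libc:
perf report -g graph --dsos=a.out,libc-2.19.so --stdio
~~~~~~~~~~~~~~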
* Re: Perf event for Wall-time based sampling? 2014-09-18 15:26 ` Milian Wolff @ 2014-09-18 15:57 ` Arnaldo Carvalho de Melo 2014-09-18 16:37 ` Milian Wolff 0 siblings, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 15:57 UTC (permalink / raw) To: Milian Wolff; +Cc: linux-perf-users, Namhyung Kim, Ingo Molnar Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 03:41:20PM +0200, Milian Wolff escreveu: > > > On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote: > > > > Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > > > > > is it somehow possible to use perf based on some kernel timer? I'd > > > > > like to get > > > > Try with tracepoints or with probe points combined with callchains > > > > instead of using a hardware counter. > > > > > > where would you add such tracepoints? Or what tracepoint would you use? > > > And > > > what is the difference between tracepoints and probe points (I'm only > > > aware of `perf probe`). > > > > tracepoints are places in the kernel (and in userspace as well, came > > later tho) where developers put in place for later tracing. > > > > They are super optimized, so have a lower cost than 'probe points' that > > you can put in place using 'perf probe'. > > > > To see the tracepoints, or any other available event in your system, use > > 'perf list'. > > > > The debugfs filesystem will need to be mounted, but that will be > > transparently done if the user has enough privileges. > > Thanks for the quick rundown Arnaldo! Sadly, that much I "knew" already, yet > am not able to understand how to use for my purpose. > > > For instance, here are some tracepoints that you may want to use: > > > > [root@zoo ~]# perf list sched:* > <snip> > I tried this: > # a.out is the result of compiling `int main { sleep(1); return 0; }`: > perf record -e sched:* --call-graph dwarf ./a.out > perf report -g graph --stdio > # the result can be found here: http://paste.kde.org/pflkskwrf > How do I have to interpret this? > a) This is not Wall-clock profiling, no? It just grabs a callgraph whenever > one of the sched* events occurs, none of these events will occur, say, every X > ms. > b) The callgraphs are really strange, imo. Different traces are printed with > the same cost, which sounds wrong, no? See e.g. the multiple 44.44% traces in > sched:sched_wakeup. Try using --no-children in the 'report' command line. > c) Most of the traces point into the kernel, how can I hide these traces and > only concentrate on the user-space? Do I have to grep manually for [.] ? I Oh well, for userspace you need to be aware of how callchains are collected, i.e. if your binaries and libraries use -fno-omit-frame-pointer, because if they do you will not get callchains going into userspace, so you will need to specifically ask for 'DWARF' callchains, from 'perf record' documentation: --call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI - Call Frame Information) as the method to collect the information used to show the call graphs. In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the libunwind library) should be used instead. This has to be made automated, i.e. 
the tooling needs to figure out that the binaries used do use %bp for optimization and automagically collect DWARF, but till then, one needs to know about such issues and deal with it. User space support is something that as you see, is still rough, we need people like you trying it, but while it is rough, people tend to avoid it... :-\ > tried something like `perf report --parent "main"` but that makes no > difference. > > > I would recommend that you take a look at Brendan Greggs _excellent_ > > tutorials at: > > > > http://www.brendangregg.com/perf.html > > > > He will explain all this in way more detail than I briefly skimmed > > above. :-) > > I did that already, but Brendan and the other available Perf documentation > mostly concentrates on performance issues in the Kernel. I'm interested purely > in the user space. Perf record with one of the hardware PMU events works > nicely in that case, but one cannot use it to find locks&waits similar to what > VTune offers. Humm, yeah, you need to figure out how to solve your issue, what I tried was to show what kinds of building blocks you could use to build what you need, but no, there is no ready to use tool for this, that I am aware of. For instance, you need to collect scheduler events, then do some scripting, perhaps using perl or python, perhaps using the scripting support that is built into perf already, but yeah, not documented. You could try starting with: [root@ssdandy ~]# perf script -l List of available trace scripts: net_dropmonitor display a table of dropped frames failed-syscalls-by-pid [comm] system-wide failed syscalls, by pid syscall-counts [comm] system-wide syscall counts netdev-times [tx] [rx] [dev=] [debug] display a process of packet and processing time sctop [comm] [interval] syscall top syscall-counts-by-pid [comm] system-wide syscall counts, by pid sched-migration sched migration overview event_analyzing_sample analyze all perf samples futex-contention futext contention measurement failed-syscalls [comm] system-wide failed syscalls workqueue-stats workqueue stats (ins/exe/create/destroy) wakeup-latency system-wide min/max/avg wakeup latency rw-by-file <comm> r/w activity for a program, by file rw-by-pid system-wide r/w activity rwtop [interval] system-wide r/w top [root@ssdandy ~]# They will all have a 'perf record' phase that will collect some tracepoints into a perf.data file, then some perl/python scripts to handle the events and get to some conclusion or generate some tables, etc. Haven't tested these in a while, wouldn't be surprised if some of them bitrotted. - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
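For reference on the frame-pointer point: "fp" callchains break when binaries are built with -fomit-frame-pointer (the default at -O2 on x86-64 gcc); building with -fno-omit-frame-pointer restores them, and DWARF unwinding is the workaround when rebuilding is not an option. A sketch of both record modes, assuming the sleep test from the start of the thread:

~~~~~~~~~~~~~~
# Cheap frame-pointer callchains - only reliable if the whole stack of
# binaries and libraries keeps the frame pointer:
gcc -O2 -g -fno-omit-frame-pointer test.c -o a.out
perf record --call-graph fp ./a.out

# DWARF callchains - work with frame-pointer-less builds, at the cost of
# copying a chunk of stack per sample and unwinding it post-mortem:
perf record --call-graph dwarf,16384 ./a.out
~~~~~~~~~~~~~~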
* Re: Perf event for Wall-time based sampling? 2014-09-18 15:57 ` Arnaldo Carvalho de Melo @ 2014-09-18 16:37 ` Milian Wolff 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-18 16:37 UTC (permalink / raw) To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users, Namhyung Kim, Ingo Molnar On Thursday 18 September 2014 12:57:45 Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: > > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: <snip> > > b) The callgraphs are really strange, imo. Different traces are printed > > with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% > > traces in sched:sched_wakeup. > > Try using --no-children in the 'report' command line. Nice, this is very useful. Many thanks! > > c) Most of the traces point into the kernel, how can I hide these traces > > and only concentrate on the user-space? Do I have to grep manually for > > [.] ? I > Oh well, for userspace you need to be aware of how callchains are > collected, i.e. if your binaries and libraries use > -fno-omit-frame-pointer, because if they do you will not get callchains > going into userspace, so you will need to specifically ask for 'DWARF' > callchains, from 'perf record' documentation: I'm actually aware of that and I did add that option to my initial record call, sorry for not being clear here. <snip> > This has to be made automated, i.e. the tooling needs to figure out that > the binaries used do use %bp for optimization and automagically collect > DWARF, but till then, one needs to know about such issues and deal with > it. That would indeed be very welcome. There are multiple "defaults" in perf which I find highly confusing. The --no-children above e.g. could/should probably be the default, no? Similar, I find it extremely irritating that `perf report -g` defaults to `-g fractal` and not `-g graph`. 100% foo 70% bar 70% asdf 30% lalala 30% baz is much harder to interpret than 100% foo 70% bar 49% asdf 21% lalala 30% baz especially for more involved call chains. It took me quite some time to become aware of the ability to pass `-g graph` to get the desired output. KCacheGrind e.g. also defaults to something similar to `-g graph` and only optionally allows the user to get the "relative to parent" cost of `-g fractal`. > User space support is something that as you see, is still rough, we need > people like you trying it, but while it is rough, people tend to avoid > it... :-\ Yes. But already perf is extremely useful and I use it a lot. I'm also actively educating people about using it more. I've talked about it at last year's Akademy and Qt Developer Days, and again this year at a profiling workshop at Akademy. Please keep up the good work! > > tried something like `perf report --parent "main"` but that makes no > > difference. > > > > > I would recommend that you take a look at Brendan Greggs _excellent_ > > > tutorials at: > > > > > > http://www.brendangregg.com/perf.html > > > > > > He will explain all this in way more detail than I briefly skimmed > > > above. :-) > > > > I did that already, but Brendan and the other available Perf documentation > > mostly concentrates on performance issues in the Kernel. I'm interested > > purely in the user space. Perf record with one of the hardware PMU events > > works nicely in that case, but one cannot use it to find locks&waits > > similar to what VTune offers. 
> > Humm, yeah, you need to figure out how to solve your issue, what I tried > was to show what kinds of building blocks you could use to build what > you need, but no, there is no ready to use tool for this, that I am > aware of. > > For instance, you need to collect scheduler events, then do some > scripting, perhaps using perl or python, perhaps using the scripting > support that is built into perf already, but yeah, not documented. And also lacking the ability to get callgraphs, if I'm not mistaken. This is crucial for my undertaking. Or has this been added in the meantime? <snip> This was also why I asked my initial question, which I want to repeat once more: Is there a technical reason to not offer a "timer" software event to perf? I'm a complete layman when it comes to Kernel internals, but from a user point of view this would be awesome: perf record --call-graph dwarf -e sw-timer -F 100 someapplication This command would then create a timer in the kernel with a 100Hz frequency. Whenever it fires, the callgraphs of all threads in $someapplication are sampled and written to perf.data. Is this technically not feasible? Or is it simply not implemented? I'm experimenting with a libunwind based profiler, and with some ugly signal hackery I can now grab backtraces by sending my application SIGUSR1. Based on that, I can probably create a profiling tool that fits my needs. I just wonder why one cannot do the same with perf. Thanks for your time -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
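Until something like the proposed sw-timer event exists, one hedged approximation is to record an on-CPU clock event together with the context-switch tracepoint in a single session: task-clock covers the running phases, and every switch-out carries a callchain showing where the thread went to sleep. It is not a fixed-frequency wall-clock sampler (a long sleep still contributes only one sample), but it does combine the on-CPU and the blocking views:

~~~~~~~~~~~~~~
# task-clock samples while running; sched_switch fires (period 1) on every
# context switch, so each blocking point gets at least one callchain:
perf record --call-graph dwarf -e task-clock -e sched:sched_switch ./a.out
perf report -g graph --stdio
~~~~~~~~~~~~~~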
* Re: Perf event for Wall-time based sampling? 2014-09-18 16:37 ` Milian Wolff @ 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 2014-09-18 19:31 ` Arnaldo Carvalho de Melo ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 19:17 UTC (permalink / raw) To: Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 12:57:45 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: > > > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > <snip> > > > b) The callgraphs are really strange, imo. Different traces are printed > > > with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% > > > traces in sched:sched_wakeup. > > Try using --no-children in the 'report' command line. > Nice, this is very useful. Many thanks! npo > > > c) Most of the traces point into the kernel, how can I hide these traces > > > and only concentrate on the user-space? Do I have to grep manually for > > > [.] ? I > > Oh well, for userspace you need to be aware of how callchains are > > collected, i.e. if your binaries and libraries use > > -fno-omit-frame-pointer, because if they do you will not get callchains > > going into userspace, so you will need to specifically ask for 'DWARF' > > callchains, from 'perf record' documentation: > I'm actually aware of that and I did add that option to my initial record > call, sorry for not being clear here. > <snip> > > This has to be made automated, i.e. the tooling needs to figure out that > > the binaries used do use %bp for optimization and automagically collect > > DWARF, but till then, one needs to know about such issues and deal with > > it. > > That would indeed be very welcome. There are multiple "defaults" in perf which > I find highly confusing. The --no-children above e.g. could/should probably be > the default, no? Similar, I find it extremely irritating that `perf report -g` It was, this is something we've actually been discussing recently: the change that made --children be the default mode. That is why I added Namhyung and Ingo to the CC list, so that they become aware of more reaction to this change. > defaults to `-g fractal` and not `-g graph`. > > 100% foo > 70% bar > 70% asdf > 30% lalala > 30% baz > > is much harder to interpret than > > 100% foo > 70% bar > 49% asdf > 21% lalala > 30% baz But the question then is if this is configurable, if not that would be a first step, i.e. making this possible via some ~/.perfconfig change. Later we could advocate changing the default. Or perhaps provide some "skins", i.e. config files that could be sourced into ~/.perfconfig so that perf mimics the decisions of other profilers, with which people are used to. Kinda like making mutt behave like pine (as I did a long time ago), even if just for a while, till one gets used to the "superior" default way of doing things of the new tool :-) > especially for more involved call chains. It took me quite some time to become > aware of the ability to pass `-g graph` to get the desired output. KCacheGrind > e.g. also defaults to something similar to `-g graph` and only optionally > allows the user to get the "relative to parent" cost of `-g fractal`. > > > User space support is something that as you see, is still rough, we need > > people like you trying it, but while it is rough, people tend to avoid > > it... :-\ > > Yes. 
But already perf is extremely useful and I use it a lot. I'm also > actively educating people about using it more. I've talked about it at last > year's Akademy and Qt Developer Days, and again this year at a profiling > workshop at Akademy. Please keep up the good work! Thanks a lot for doing that! > > > tried something like `perf report --parent "main"` but that makes no > > > difference. > > > > I would recommend that you take a look at Brendan Greggs _excellent_ > > > > tutorials at: > > > > http://www.brendangregg.com/perf.html > > > > He will explain all this in way more detail than I briefly skimmed > > > > above. :-) > > > I did that already, but Brendan and the other available Perf documentation > > > mostly concentrates on performance issues in the Kernel. I'm interested > > > purely in the user space. Perf record with one of the hardware PMU events > > > works nicely in that case, but one cannot use it to find locks&waits > > > similar to what VTune offers. > > Humm, yeah, you need to figure out how to solve your issue, what I tried > > was to show what kinds of building blocks you could use to build what > > you need, but no, there is no ready to use tool for this, that I am > > aware of. > > For instance, you need to collect scheduler events, then do some > > scripting, perhaps using perl or python, perhaps using the scripting > > support that is built into perf already, but yeah, not documented. > And also lacking the ability to get callgraphs, if I'm not mistaken. This is > crucial for my undertaking. Or has this been added in the meantime? I guess it was: commit 57608cfd8827a74237d264a197722e2c99f72da4 Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> Date: Thu Jul 10 13:50:56 2014 +0200 perf script: Provide additional sample information on generic events To python scripts, including pid, tid, and cpu for which the event was recorded. At the moment, the pointer to the sample struct is passed to scripts, which seems to be of little use. The patch puts this information in dictionaries for easy access by Python scripts. commit 0f5f5bcd112292f14b75750dde7461463bb1c7bb Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> Date: Thu Jul 10 13:50:51 2014 +0200 perf script: Add callchain to generic and tracepoint events This provides valuable information for tracing performance problems. Since this change alters the interface for the python scripts, also adjust the script generation and the provided scripts. > <snip> > > This was also why I asked my initial question, which I want to repeat once > more: Is there a technical reason to not offer a "timer" software event to > perf? I'm a complete layman when it comes to Kernel internals, but from a user > point of view this would be awesome: > perf record --call-graph dwarf -e sw-timer -F 100 someapplication > This command would then create a timer in the kernel with a 100Hz frequency. > Whenever it fires, the callgraphs of all threads in $someapplication are > sampled and written to perf.data. Is this technically not feasible? Or is it > simply not implemented? > I'm experimenting with a libunwind based profiler, and with some ugly signal > hackery I can now grab backtraces by sending my application SIGUSR1. Based on Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to your app with the frequency you want, and then hook a 'perf probe' into your signal... /me tries some stuff, will get back with results... > that, I can probably create a profiling tool that fits my needs. 
I just wonder > why one cannot do the same with perf. - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
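With those two commits in place, the built-in script generator gives a quick way to post-process tracepoint samples plus their callchains without writing a script from scratch. A sketch of the workflow (the generated handler names and arguments come from perf itself):

~~~~~~~~~~~~~~
perf record -g -e sched:sched_switch ./a.out
perf script -g python      # writes perf-script.py with one handler per recorded event
# edit perf-script.py: with the patches above each handler also receives the callchain
perf script -s perf-script.py
~~~~~~~~~~~~~~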
* Re: Perf event for Wall-time based sampling? 2014-09-18 19:17 ` Arnaldo Carvalho de Melo @ 2014-09-18 19:31 ` Arnaldo Carvalho de Melo 2014-09-18 20:17 ` David Ahern 2014-09-19 5:59 ` Namhyung Kim 2 siblings, 0 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 19:31 UTC (permalink / raw) To: Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Thu, Sep 18, 2014 at 04:17:13PM -0300, Arnaldo Carvalho de Melo escreveu: > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > > This was also why I asked my initial question, which I want to repeat once > > more: Is there a technical reason to not offer a "timer" software event to > > perf? I'm a complete layman when it comes to Kernel internals, but from a user > > point of view this would be awesome: > > > perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > This command would then create a timer in the kernel with a 100Hz frequency. > > Whenever it fires, the callgraphs of all threads in $someapplication are > > sampled and written to perf.data. Is this technically not feasible? Or is it > > simply not implemented? What you want is a sysrq-t that goes all the way to userspace, right? Try: echo t > /proc/sysrq dmesg i.e. probably you want that when a thread goes to sleep, we can take a sample with callchains, no? With a 'ping 127.0.0.1' running, I get this for its process: [19255.244315] ping S ffff88043e213100 0 6754 6548 0x00000080 [19255.244316] ffff880231693a80 0000000000000082 ffff88040ffb1940 0000000000013100 [19255.244317] ffff880231693fd8 0000000000013100 ffff880423c88000 ffff88040ffb1940 [19255.244318] ffffffff81efb1c0 ffff880231693ab0 00000001012131d1 ffffffff81efb1c0 [19255.244319] Call Trace: [19255.244321] [<ffffffff816b93a9>] schedule+0x29/0x70 [19255.244322] [<ffffffff816bbf91>] schedule_timeout+0x151/0x270 [19255.244325] [<ffffffff815ac606>] __skb_recv_datagram+0x426/0x4d0 [19255.244328] [<ffffffff815ac6e2>] skb_recv_datagram+0x32/0x40 [19255.244330] [<ffffffff8161a1ad>] raw_recvmsg+0x7d/0x1b0 [19255.244331] [<ffffffff8162961c>] inet_recvmsg+0x6c/0x80 [19255.244333] [<ffffffff8159d42a>] sock_recvmsg+0x9a/0xd0 [19255.244345] [<ffffffff8159e1a2>] __sys_recvmsg+0x42/0x80 [19255.244346] [<ffffffff8159e1f2>] SyS_recvmsg+0x12/0x20 [19255.244348] [<ffffffff816bd412>] system_call_fastpath+0x16/0x1b No? Trying to collect dwarf callchains after installing iputils-debuginfo (where /bin/ping symtab is)... > > I'm experimenting with a libunwind based profiler, and with some ugly signal > > hackery I can now grab backtraces by sending my application SIGUSR1. Based on > > Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > your app with the frequency you want, and then hook a 'perf probe' into > your signal... /me tries some stuff, will get back with results... > > > that, I can probably create a profiling tool that fits my needs. I just wonder > > why one cannot do the same with perf. > > - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
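For anyone trying the sysrq route: the trigger file is /proc/sysrq-trigger, and the 't' command dumps the kernel stack of every task to the kernel log - it never crosses into userspace, which is exactly the gap the rest of this thread is about. A sketch:

~~~~~~~~~~~~~~
echo 1 > /proc/sys/kernel/sysrq     # enable sysrq if it is restricted
echo t > /proc/sysrq-trigger        # dump all tasks' kernel stacks
dmesg | tail -n 200                 # the traces land in the kernel log
~~~~~~~~~~~~~~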
* Re: Perf event for Wall-time based sampling? 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 2014-09-18 19:31 ` Arnaldo Carvalho de Melo @ 2014-09-18 20:17 ` David Ahern 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-19 5:59 ` Namhyung Kim 2 siblings, 1 reply; 25+ messages in thread From: David Ahern @ 2014-09-18 20:17 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: >> This was also why I asked my initial question, which I want to repeat once >> >more: Is there a technical reason to not offer a "timer" software event to >> >perf? I'm a complete layman when it comes to Kernel internals, but from a user >> >point of view this would be awesome: > >> >perf record --call-graph dwarf -e sw-timer -F 100 someapplication > >> >This command would then create a timer in the kernel with a 100Hz frequency. >> >Whenever it fires, the callgraphs of all threads in $someapplication are >> >sampled and written to perf.data. Is this technically not feasible? Or is it >> >simply not implemented? >> >I'm experimenting with a libunwind based profiler, and with some ugly signal >> >hackery I can now grab backtraces by sending my application SIGUSR1. Based on > Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > your app with the frequency you want, and then hook a 'perf probe' into > your signal... /me tries some stuff, will get back with results... > Current profiling options with perf require the process to be running. What Milian want is to grab samples every timer expiration even if process is not running. Any limitations that would prevent doing this with a sw event? e.g, mimic task-clock just don't disable the timer when the task is scheduled out. David ^ permalink raw reply [flat|nested] 25+ messages in thread
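The difference David describes is easy to see with perf stat: task-clock accounts only on-CPU time, so for the ten-second sleeper from the start of the thread it reports a few milliseconds of task-clock against roughly ten seconds of elapsed time (a sketch; exact numbers will vary):

~~~~~~~~~~~~~~
perf stat -e task-clock ./a.out
# task-clock shows only the few ms spent entering and leaving sleep(),
# while "seconds time elapsed" shows the full ~10 s of wall time.
~~~~~~~~~~~~~~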
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:17 ` David Ahern @ 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-18 20:39 ` David Ahern ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 20:36 UTC (permalink / raw) To: David Ahern Cc: Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > >>This was also why I asked my initial question, which I want to repeat once > >>>more: Is there a technical reason to not offer a "timer" software event to > >>>perf? I'm a complete layman when it comes to Kernel internals, but from a user > >>>point of view this would be awesome: > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > >>>This command would then create a timer in the kernel with a 100Hz frequency. > >>>Whenever it fires, the callgraphs of all threads in $someapplication are > >>>sampled and written to perf.data. Is this technically not feasible? Or is it > >>>simply not implemented? > >>>I'm experimenting with a libunwind based profiler, and with some ugly signal > >>>hackery I can now grab backtraces by sending my application SIGUSR1. Based on > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > >your app with the frequency you want, and then hook a 'perf probe' into > >your signal... /me tries some stuff, will get back with results... > Current profiling options with perf require the process to be running. What Ok, so you want to see what is the wait channel and unwind the stack from there? Is that the case? I.e. again, a sysrq-t equivalent? > Milian want is to grab samples every timer expiration even if process is not > running. What for? And by "grab samples" you want to know where it is waiting for something, together with its callchain? > Any limitations that would prevent doing this with a sw event? e.g, mimic > task-clock just don't disable the timer when the task is scheduled out. I'm just trying to figure out if what people want is a complete backtrace of all threads in a process, no matter what they are doing, at some given time, i.e. at "sysrq-t" time, is that the case? - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:36 ` Arnaldo Carvalho de Melo @ 2014-09-18 20:39 ` David Ahern 2014-09-19 8:11 ` Milian Wolff 2014-09-19 14:17 ` David Ahern 2 siblings, 0 replies; 25+ messages in thread From: David Ahern @ 2014-09-18 20:39 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/18/14, 2:36 PM, Arnaldo Carvalho de Melo wrote: > I'm just trying to figure out if what people want is a complete > backtrace of all threads in a process, no matter what they are doing, at > some given time, i.e. at "sysrq-t" time, is that the case? I think so. That's what poor-man's profiler is doing -- attach gdb every N seconds, dump stack. In perf terminology generate a sample and include the callchain. David ^ permalink raw reply [flat|nested] 25+ messages in thread
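For reference, the gdb-based poor-man's profiler David mentions is essentially this loop (a sketch; the target name is an example, and every attach briefly stops the process):

~~~~~~~~~~~~~~
pid=$(pidof someapp)                # example target
for i in $(seq 1 100); do
    gdb -batch -p "$pid" \
        -ex "set pagination off" \
        -ex "thread apply all bt"
    sleep 0.1
done > stacks.txt
# Post-process stacks.txt by counting identical (partial) stacks.
~~~~~~~~~~~~~~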
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-18 20:39 ` David Ahern @ 2014-09-19 8:11 ` Milian Wolff 2014-09-19 9:08 ` Milian Wolff 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 2014-09-19 14:17 ` David Ahern 2 siblings, 2 replies; 25+ messages in thread From: Milian Wolff @ 2014-09-19 8:11 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > >>This was also why I asked my initial question, which I want to repeat > > >>once > > >> > > >>>more: Is there a technical reason to not offer a "timer" software event > > >>>to > > >>>perf? I'm a complete layman when it comes to Kernel internals, but from > > >>>a user point of view this would be awesome: > > >>> > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > >>> > > >>>This command would then create a timer in the kernel with a 100Hz > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > >>>$someapplication are sampled and written to perf.data. Is this > > >>>technically not feasible? Or is it simply not implemented? > > >>>I'm experimenting with a libunwind based profiler, and with some ugly > > >>>signal hackery I can now grab backtraces by sending my application > > >>>SIGUSR1. Based on> > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > >your app with the frequency you want, and then hook a 'perf probe' into > > >your signal... /me tries some stuff, will get back with results... That is actually a very good idea. With the more powerful scripting abilities in perf now, that could/should do the job indeed. I'll also try this out. > > Current profiling options with perf require the process to be running. > > What > > Ok, so you want to see what is the wait channel and unwind the stack > from there? Is that the case? I.e. again, a sysrq-t equivalent? Hey again :) If I'm not mistaken, I could not yet bring my point across. The final goal is to profile both, wait time _and_ CPU time, combined! By sampling a userspace program with a constant frequency and some statistics one can get extremely useful information out of the data. I want to use perf to automate the process that Mike Dunlavey explains here: http://stackoverflow.com/a/378024/35250 So yes, I want to see the wait channel and unwind from there, if the process is waiting. Otherwise just do the normal unwinding you'd do when any other perf event occurs. > > Milian want is to grab samples every timer expiration even if process is > > not running. > > What for? And by "grab samples" you want to know where it is waiting for > something, together with its callchain? See above. If I can sample the callchain every N ms, preferrably in a per- thread manner, I can create tools which find the slowest userspace functions. This is based on "inclusive" cost, which is made up out of CPU time _and_ wait time combined. With it one will find all of the following: - CPU hotspots - IO wait time - lock contention in a multi-threaded application > > Any limitations that would prevent doing this with a sw event? e.g, mimic > > task-clock just don't disable the timer when the task is scheduled out. 
> > I'm just trying to figure out if what people want is a complete > backtrace of all threads in a process, no matter what they are doing, at > some given time, i.e. at "sysrq-t" time, is that the case? Yes, that sounds correct to me. I tried it out on my system, but the output in dmesg is far too large and I could not find the call stack information of my test application while it slept. Looking at the other backtraces I see there, I'm not sure whether sysrq-t only outputs a kernel-backtrace? Or maybe it's because I'm missing debug symbols for these applications? [167576.049113] mysqld S 0000000000000000 0 17023 1 0x00000000 [167576.049115] ffff8801222a7ad0 0000000000000082 ffff8800aa8c9460 0000000000014580 [167576.049117] ffff8801222a7fd8 0000000000014580 ffff8800aa8c9460 ffff8801222a7a18 [167576.049119] ffffffff812a2640 ffff8800bb975c40 ffff880237cd4ef0 ffff8801222a7b20 [167576.049121] Call Trace: [167576.049124] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 [167576.049126] [<ffffffff810b65b4>] ? add_wait_queue+0x44/0x50 [167576.049128] [<ffffffff8152ceb9>] schedule+0x29/0x70 [167576.049130] [<ffffffff8152c4ad>] schedule_hrtimeout_range_clock+0x3d/0x60 [167576.049132] [<ffffffff8152c4e3>] schedule_hrtimeout_range+0x13/0x20 [167576.049134] [<ffffffff811d5589>] poll_schedule_timeout+0x49/0x70 [167576.049136] [<ffffffff811d6d5e>] do_sys_poll+0x4be/0x570 [167576.049138] [<ffffffff81155880>] ? __alloc_pages_nodemask+0x180/0xc20 [167576.049140] [<ffffffff8108e3ae>] ? alloc_pid+0x2e/0x4c0 [167576.049142] [<ffffffff811d5700>] ? poll_select_copy_remaining+0x150/0x150 [167576.049145] [<ffffffff810bd1f0>] ? cpuacct_charge+0x50/0x60 [167576.049147] [<ffffffff811a5de3>] ? kmem_cache_alloc+0x143/0x170 [167576.049149] [<ffffffff8108e3ae>] ? alloc_pid+0x2e/0x4c0 [167576.049151] [<ffffffff810a7d08>] ? __enqueue_entity+0x78/0x80 [167576.049153] [<ffffffff810addee>] ? enqueue_entity+0x24e/0xaa0 [167576.049155] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 [167576.049158] [<ffffffff810ae74d>] ? enqueue_task_fair+0x10d/0x5b0 [167576.049160] [<ffffffff81045c3d>] ? native_smp_send_reschedule+0x4d/0x70 [167576.049162] [<ffffffff8109e740>] ? resched_task+0xc0/0xd0 [167576.049165] [<ffffffff8109fff7>] ? check_preempt_curr+0x57/0x90 [167576.049167] [<ffffffff810a2e2b>] ? wake_up_new_task+0x11b/0x1c0 [167576.049169] [<ffffffff8106d5c6>] ? do_fork+0x136/0x3f0 [167576.049171] [<ffffffff811df235>] ? __fget_light+0x25/0x70 [167576.049173] [<ffffffff811d6f37>] SyS_poll+0x97/0x120 [167576.049176] [<ffffffff81530d69>] ? system_call_fastpath+0x16/0x1b [167576.049178] [<ffffffff81530d69>] system_call_fastpath+0x16/0x1b I'm interested in the the other side of the fence. The backtrace could/should stop at system_call_fastpath. Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
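For the wait-time half on its own, perf sched already aggregates scheduling delay and run time per task from the same sched tracepoints. It does not give userspace callchains, but it is a useful cross-check while the combined wall-clock view discussed here does not exist (a sketch):

~~~~~~~~~~~~~~
perf sched record -- ./a.out        # records the sched:* tracepoints
perf sched latency                  # per-task runtime, switch count, max delay
~~~~~~~~~~~~~~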
* Re: Perf event for Wall-time based sampling? 2014-09-19 8:11 ` Milian Wolff @ 2014-09-19 9:08 ` Milian Wolff 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 25+ messages in thread From: Milian Wolff @ 2014-09-19 9:08 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 10:11:21 Milian Wolff wrote: > On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > > >>This was also why I asked my initial question, which I want to repeat > > > >>once > > > >> > > > >>>more: Is there a technical reason to not offer a "timer" software > > > >>>event > > > >>>to > > > >>>perf? I'm a complete layman when it comes to Kernel internals, but > > > >>>from > > > >>>a user point of view this would be awesome: > > > >>> > > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > >>> > > > >>>This command would then create a timer in the kernel with a 100Hz > > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > > >>>$someapplication are sampled and written to perf.data. Is this > > > >>>technically not feasible? Or is it simply not implemented? > > > >>>I'm experimenting with a libunwind based profiler, and with some ugly > > > >>>signal hackery I can now grab backtraces by sending my application > > > >>>SIGUSR1. Based on> > > > > > > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > > >your app with the frequency you want, and then hook a 'perf probe' into > > > >your signal... /me tries some stuff, will get back with results... > > That is actually a very good idea. With the more powerful scripting > abilities in perf now, that could/should do the job indeed. I'll also try > this out. Indeed, this does not seem to work as intended: ~~~~~~~~~~~~~~~~~~ #include <csignal> extern "C" { void signalHandler(int sig) { } } struct Pace { Pace() { signal(SIGUSR1, &signalHandler); } }; static Pace initializeLibPace; ~~~~~~~~~~~~~~~~~~~ # as user, run: g++ -shared -fPIC -o libpace.so libpace.cpp LD_PRELOAD=$(readlink -f libpace.so) somebinary while true; do killall -s SIGUSR1 somebinary; done # as root, run: perf probe -x libpace.so signalHandler perf record --call-graph dwarf -e probe_libpace:signalHandler -p $(pidof yourApp) perf report -g graph --no-children Nice! But there are multiple issues with this approach, compared to potentially offering it in perf directly: 1) you need root/elevated access to not only create the probe, but also to use it then during record. this also requires you to to attach to the debuggee, as you don't want to run most user-space apps as root. while runtime-attaching is a powerful feature, it is cumbersome to use for the simple case 2) all of the issues of signal handling. Try the above on my initial test application: int main() { sleep(10); return 0; } After sending it the first SIGUSR1, the sleep will return. So we change the behavior of the application by profiling it, which is of course very bad. Sure, most of the time such sleeps are run in a loop, but even then one will influence the runtime behavior. And you will never find the spurious "sleep(100)" that you forgot in a debug session in your application... 3) What about multiple threads? 
One could potentially get it working by overwriting pthread_create/exit via LD_PRELOAD and propagating a custom signal via pthread_kill to all running threads when the external SIGUSR1 comes in. So, I hope this shows you all that having such a feature in perf directly would be a good thing. -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
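One small refinement to the driver loop above: the unbounded `while true; do killall ...; done` delivers signals as fast as the shell can spin, which by itself perturbs the target. Pacing it gives the fixed sampling frequency the whole idea is after (a sketch, roughly 100 Hz; "somebinary" is the example target from the mail above):

~~~~~~~~~~~~~~
pid=$(pidof somebinary)
while sleep 0.01; do                # ~100 signals per second
    kill -USR1 "$pid" || break      # stop once the target exits
done
~~~~~~~~~~~~~~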
* Re: Perf event for Wall-time based sampling? 2014-09-19 8:11 ` Milian Wolff 2014-09-19 9:08 ` Milian Wolff @ 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 2014-09-19 15:04 ` David Ahern 2014-09-19 15:05 ` Milian Wolff 1 sibling, 2 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-19 14:47 UTC (permalink / raw) To: Milian Wolff Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Fri, Sep 19, 2014 at 10:11:21AM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > > >>This was also why I asked my initial question, which I want to repeat > > > >>once > > > >> > > > >>>more: Is there a technical reason to not offer a "timer" software event > > > >>>to > > > >>>perf? I'm a complete layman when it comes to Kernel internals, but from > > > >>>a user point of view this would be awesome: > > > >>> > > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > >>> > > > >>>This command would then create a timer in the kernel with a 100Hz > > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > > >>>$someapplication are sampled and written to perf.data. Is this > > > >>>technically not feasible? Or is it simply not implemented? > > > >>>I'm experimenting with a libunwind based profiler, and with some ugly > > > >>>signal hackery I can now grab backtraces by sending my application > > > >>>SIGUSR1. Based on> > > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > > >your app with the frequency you want, and then hook a 'perf probe' into > > > >your signal... /me tries some stuff, will get back with results... > > That is actually a very good idea. With the more powerful scripting abilities > in perf now, that could/should do the job indeed. I'll also try this out. Now that the need for getting a backtrace from existing threads, a-la using ptrace via gdb to attach to it and then traverse its stack to provide that backtrace, I think we need to do something on the perf infrastructure in the kernel to do that, i.e. somehow signal the perf kernel part that we want a backtrace for some specific thread. Not at event time, but at some arbitrary time, be it creating an event that, as you suggested, will create a timer and then when that timer fires will use (parts of the) mechanism used by ptrace. But in the end we need a mechanism to ask for backtraces for existing, sleeping, threads. For waits in the future, we just need to ask for the tracepoints where waits take place and with the current infrastructure we can get most/all of what we need, no? > > > Current profiling options with perf require the process to be running. > > > What > > Ok, so you want to see what is the wait channel and unwind the stack > > from there? Is that the case? I.e. again, a sysrq-t equivalent? > Hey again :) > If I'm not mistaken, I could not yet bring my point across. The final goal is Right, but the discussion so far was to see if the existing kernel infrastructure would allow us to write such a tool. > to profile both, wait time _and_ CPU time, combined! By sampling a userspace > program with a constant frequency and some statistics one can get extremely > useful information out of the data. I want to use perf to automate the process > that Mike Dunlavey explains here: http://stackoverflow.com/a/378024/35250 Will read. 
> So yes, I want to see the wait channel and unwind from there, if the process > is waiting. Otherwise just do the normal unwinding you'd do when any other > perf event occurs. Ok, we need two mechanisms to get that, one for existing, sleeping threads at the time we start monitoring (we don't have that now other than freezing the process, looking for threads that were waiting, then using ptrace and asking for that backtrace), and another for threads that will sleep/wait _after_ we start monitoring. > > > Milian want is to grab samples every timer expiration even if process is > > > not running. > > > > What for? And by "grab samples" you want to know where it is waiting for > > something, together with its callchain? > > See above. If I can sample the callchain every N ms, preferrably in a per- Do you need to take periodic samples of the callchain? Or only when you wait for something? For things that happen at such a high freq, yeah, just sampling would be better, but you're thinking about slow things, so multiple samples for something waiting and waiting is not needed, just when it starts waiting, right? > thread manner, I can create tools which find the slowest userspace functions. > This is based on "inclusive" cost, which is made up out of CPU time _and_ wait > time combined. With it one will find all of the following: > > - CPU hotspots > - IO wait time > - lock contention in a multi-threaded application > > > Any limitations that would prevent doing this with a sw event? e.g, mimic > > > task-clock just don't disable the timer when the task is scheduled out. > > I'm just trying to figure out if what people want is a complete > > backtrace of all threads in a process, no matter what they are doing, at > > some given time, i.e. at "sysrq-t" time, is that the case? > Yes, that sounds correct to me. I tried it out on my system, but the output in > dmesg is far too large and I could not find the call stack information of my > test application while it slept. Looking at the other backtraces I see there, > I'm not sure whether sysrq-t only outputs a kernel-backtrace? Or maybe it's sysrq-t is just for the kernel, that is why I said that you seemed to want it to cross into userspace. > because I'm missing debug symbols for these applications? > > [167576.049113] mysqld S 0000000000000000 0 17023 1 > 0x00000000 > [167576.049115] ffff8801222a7ad0 0000000000000082 ffff8800aa8c9460 > 0000000000014580 > [167576.049117] ffff8801222a7fd8 0000000000014580 ffff8800aa8c9460 > ffff8801222a7a18 > [167576.049119] ffffffff812a2640 ffff8800bb975c40 ffff880237cd4ef0 > ffff8801222a7b20 > [167576.049121] Call Trace: > [167576.049124] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 > [167576.049126] [<ffffffff810b65b4>] ? add_wait_queue+0x44/0x50 > [167576.049128] [<ffffffff8152ceb9>] schedule+0x29/0x70 > [167576.049130] [<ffffffff8152c4ad>] schedule_hrtimeout_range_clock+0x3d/0x60 > [167576.049132] [<ffffffff8152c4e3>] schedule_hrtimeout_range+0x13/0x20 > [167576.049134] [<ffffffff811d5589>] poll_schedule_timeout+0x49/0x70 > [167576.049136] [<ffffffff811d6d5e>] do_sys_poll+0x4be/0x570 > [167576.049138] [<ffffffff81155880>] ? __alloc_pages_nodemask+0x180/0xc20 > [167576.049140] [<ffffffff8108e3ae>] ? alloc_pid+0x2e/0x4c0 > [167576.049142] [<ffffffff811d5700>] ? poll_select_copy_remaining+0x150/0x150 > [167576.049145] [<ffffffff810bd1f0>] ? cpuacct_charge+0x50/0x60 > [167576.049147] [<ffffffff811a5de3>] ? kmem_cache_alloc+0x143/0x170 > [167576.049149] [<ffffffff8108e3ae>] ? 
alloc_pid+0x2e/0x4c0 > [167576.049151] [<ffffffff810a7d08>] ? __enqueue_entity+0x78/0x80 > [167576.049153] [<ffffffff810addee>] ? enqueue_entity+0x24e/0xaa0 > [167576.049155] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 > [167576.049158] [<ffffffff810ae74d>] ? enqueue_task_fair+0x10d/0x5b0 > [167576.049160] [<ffffffff81045c3d>] ? native_smp_send_reschedule+0x4d/0x70 > [167576.049162] [<ffffffff8109e740>] ? resched_task+0xc0/0xd0 > [167576.049165] [<ffffffff8109fff7>] ? check_preempt_curr+0x57/0x90 > [167576.049167] [<ffffffff810a2e2b>] ? wake_up_new_task+0x11b/0x1c0 > [167576.049169] [<ffffffff8106d5c6>] ? do_fork+0x136/0x3f0 > [167576.049171] [<ffffffff811df235>] ? __fget_light+0x25/0x70 > [167576.049173] [<ffffffff811d6f37>] SyS_poll+0x97/0x120 > [167576.049176] [<ffffffff81530d69>] ? system_call_fastpath+0x16/0x1b > [167576.049178] [<ffffffff81530d69>] system_call_fastpath+0x16/0x1b > > I'm interested in the the other side of the fence. The backtrace could/should > stop at system_call_fastpath. Right, adding callchains to 'perf trace' and using it with --duration may provide a first approximation for threads that will start waiting after 'perf trace' starts, I guess. - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
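[For reference, the GDB-based poor-man's-profiler approach that this thread wants perf to automate can be scripted directly; a minimal sketch, assuming gdb is installed and is allowed to ptrace the target, with aggregation of the dumped stacks left out:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Wall-clock sampler in the spirit of the "poor man's profiler": at a fixed
# rate, attach gdb to the target and dump the backtraces of *all* threads,
# whether they are on-CPU or blocked in poll/futex/nanosleep.
import subprocess
import sys
import time

def sample(pid):
    # "thread apply all bt" prints one backtrace per thread, including
    # sleeping ones -- exactly the per-sample view discussed in this thread.
    return subprocess.check_output(
        ["gdb", "-batch", "-ex", "set pagination 0",
         "-ex", "thread apply all bt", "-p", str(pid)],
        stderr=subprocess.STDOUT)

def main():
    pid = int(sys.argv[1])
    freq = float(sys.argv[2]) if len(sys.argv) > 2 else 10.0  # Hz
    while True:
        try:
            out = sample(pid)
        except subprocess.CalledProcessError:
            break  # target exited or attach failed
        sys.stdout.write(out.decode("utf-8", "replace"))
        sys.stdout.write("\n--- sample ---\n")
        time.sleep(1.0 / freq)

if __name__ == "__main__":
    main()
~~~~~~~~~~~~~~

[Attaching gdb is far too heavy for anything like -F 100 and it stops the target for every sample, which is precisely why a native timer-driven event that can record callchains of sleeping threads from inside the kernel is attractive.]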
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:47 ` Arnaldo Carvalho de Melo @ 2014-09-19 15:04 ` David Ahern 2014-09-19 15:05 ` Milian Wolff 1 sibling, 0 replies; 25+ messages in thread From: David Ahern @ 2014-09-19 15:04 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/19/14, 8:47 AM, Arnaldo Carvalho de Melo wrote: > Now that the need for getting a backtrace from existing threads, a-la > using ptrace via gdb to attach to it and then traverse its stack to > provide that backtrace, I think we need to do something on the perf > infrastructure in the kernel to do that, i.e. somehow signal the perf > kernel part that we want a backtrace for some specific thread. > > Not at event time, but at some arbitrary time, be it creating an event > that, as you suggested, will create a timer and then when that timer > fires will use (parts of the) mechanism used by ptrace. > > But in the end we need a mechanism to ask for backtraces for existing, > sleeping, threads. But perf already has that information -- stacktraces on a sched_switch. We already know where the task is when it is sleeping. > Ok, we need two mechanisms to get that, one for existing, sleeping > threads at the time we start monitoring (we don't have that now other > than freezing the process, looking for threads that were waiting, then > using ptrace and asking for that backtrace), and another for threads > that will sleep/wait_after_ we start monitoring. That's a missing piece -- callchains for sleeping threads when perf starts. >> >I'm interested in the the other side of the fence. The backtrace could/should >> >stop at system_call_fastpath. We talked about this last year - options to control stack depth obtained on a record. Most flexibility is to allow controls of user space stack and kernel. Here someone could set kstack depth to 0 to get only userspace stack. > Right, adding callchains to 'perf trace' and using it with --duration > may provide a first approximation for threads that will start waiting > after 'perf trace' starts, I guess. Talked about this one too. Integration of perf-script with perf-trace. So many features, so little time. David ^ permalink raw reply [flat|nested] 25+ messages in thread
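[The "where are the already-sleeping threads right now" snapshot can at least be approximated from procfs today, though only for the kernel side; a minimal sketch, assuming the kernel exposes /proc/<pid>/task/<tid>/stack (CONFIG_STACKTRACE) and the caller has the required privileges:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Dump the kernel-side stack of every thread of a process. This shows the
# wait channel and kernel frames of sleeping threads, but not the user-space
# part -- that is the piece that still needs ptrace/gdb or new perf kernel
# support, as discussed above.
import glob
import sys

def dump_kernel_stacks(pid):
    for path in sorted(glob.glob("/proc/%d/task/*/stack" % pid)):
        tid = path.split("/")[4]
        try:
            with open(path) as f:
                stack = f.read()
        except (IOError, OSError):
            continue  # thread exited or permission denied
        sys.stdout.write("tid %s:\n%s\n" % (tid, stack))

if __name__ == "__main__":
    dump_kernel_stacks(int(sys.argv[1]))
~~~~~~~~~~~~~~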
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 2014-09-19 15:04 ` David Ahern @ 2014-09-19 15:05 ` Milian Wolff 1 sibling, 0 replies; 25+ messages in thread From: Milian Wolff @ 2014-09-19 15:05 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 11:47:47 Arnaldo Carvalho de Melo wrote: > Em Fri, Sep 19, 2014 at 10:11:21AM +0200, Milian Wolff escreveu: > > On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > > > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > > > >>This was also why I asked my initial question, which I want to > > > > >>repeat > > > > >>once > > > > >> > > > > >>>more: Is there a technical reason to not offer a "timer" software > > > > >>>event > > > > >>>to > > > > >>>perf? I'm a complete layman when it comes to Kernel internals, but > > > > >>>from > > > > >>>a user point of view this would be awesome: > > > > >>> > > > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > > >>> > > > > >>>This command would then create a timer in the kernel with a 100Hz > > > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > > > >>>$someapplication are sampled and written to perf.data. Is this > > > > >>>technically not feasible? Or is it simply not implemented? > > > > >>>I'm experimenting with a libunwind based profiler, and with some > > > > >>>ugly > > > > >>>signal hackery I can now grab backtraces by sending my application > > > > >>>SIGUSR1. Based on> > > > > > > > > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > > > >your app with the frequency you want, and then hook a 'perf probe' > > > > >into > > > > >your signal... /me tries some stuff, will get back with results... > > > > That is actually a very good idea. With the more powerful scripting > > abilities in perf now, that could/should do the job indeed. I'll also try > > this out. > > Now that the need for getting a backtrace from existing threads, a-la > using ptrace via gdb to attach to it and then traverse its stack to > provide that backtrace, I think we need to do something on the perf > infrastructure in the kernel to do that, i.e. somehow signal the perf > kernel part that we want a backtrace for some specific thread. > > Not at event time, but at some arbitrary time, be it creating an event > that, as you suggested, will create a timer and then when that timer > fires will use (parts of the) mechanism used by ptrace. > > But in the end we need a mechanism to ask for backtraces for existing, > sleeping, threads. Yes, such a capability would be tremendously helpful for writing profiling tools for userspace applications. It should also work for threads that are not sleeping though, or are all threads of a process frozen when a perf event fires anyways? > For waits in the future, we just need to ask for the tracepoints where > waits take place and with the current infrastructure we can get most/all > of what we need, no? See David's email, this is possible even now with the sched:* events. > > > > Current profiling options with perf require the process to be running. > > > > What > > > > > > Ok, so you want to see what is the wait channel and unwind the stack > > > from there? Is that the case? I.e. again, a sysrq-t equivalent? 
> > > > Hey again :) > > > > If I'm not mistaken, I could not yet bring my point across. The final goal > > is > Right, but the discussion so far was to see if the existing kernel > infrastructure would allow us to write such a tool. > > > to profile both, wait time _and_ CPU time, combined! By sampling a > > userspace program with a constant frequency and some statistics one can > > get extremely useful information out of the data. I want to use perf to > > automate the process that Mike Dunlavey explains here: > > http://stackoverflow.com/a/378024/35250 > Will read. > > > So yes, I want to see the wait channel and unwind from there, if the > > process is waiting. Otherwise just do the normal unwinding you'd do when > > any other perf event occurs. > > Ok, we need two mechanisms to get that, one for existing, sleeping > threads at the time we start monitoring (we don't have that now other > than freezing the process, looking for threads that were waiting, then > using ptrace and asking for that backtrace), and another for threads > that will sleep/wait _after_ we start monitoring. > > > > > Milian want is to grab samples every timer expiration even if process > > > > is > > > > not running. > > > > > > What for? And by "grab samples" you want to know where it is waiting for > > > something, together with its callchain? > > > > See above. If I can sample the callchain every N ms, preferrably in a per- > > Do you need to take periodic samples of the callchain? Or only when you > wait for something? > > For things that happen at such a high freq, yeah, just sampling would be > better, but you're thinking about slow things, so multiple samples for > something waiting and waiting is not needed, just when it starts > waiting, right? I need periodic samples. I'm not looking exclusively for wait time (that can be done already, see above). I'm looking for a generic overview of the userspace program. Where does it spent time? Without periodic samples, I cannot do any statistics. I think this should become clear when you read Mike Dunlavey's text that explains the GDB-based poor-mans-profiler (which is actually pretty helpful, despite the name). > > thread manner, I can create tools which find the slowest userspace > > functions. This is based on "inclusive" cost, which is made up out of CPU > > time _and_ wait> > > time combined. With it one will find all of the following: > > - CPU hotspots > > - IO wait time > > - lock contention in a multi-threaded application > > > > > > Any limitations that would prevent doing this with a sw event? e.g, > > > > mimic > > > > task-clock just don't disable the timer when the task is scheduled > > > > out. > > > > > > I'm just trying to figure out if what people want is a complete > > > backtrace of all threads in a process, no matter what they are doing, at > > > some given time, i.e. at "sysrq-t" time, is that the case? > > > > Yes, that sounds correct to me. I tried it out on my system, but the > > output in dmesg is far too large and I could not find the call stack > > information of my test application while it slept. Looking at the other > > backtraces I see there, I'm not sure whether sysrq-t only outputs a > > kernel-backtrace? Or maybe it's > sysrq-t is just for the kernel, that is why I said that you seemed to > want it to cross into userspace. Ah, ok. I misunderstood you. 
<snip> > Right, adding callchains to 'perf trace' and using it with --duration > may provide a first approximation for threads that will start waiting > after 'perf trace' starts, I guess. Yes, see also the other mail on that. This would be very helpful, but only partially related to the goal I have in mind here :) Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
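[The signal-driven variant discussed in this subthread is trivial to drive from the outside; a minimal sender, assuming the interesting part -- a 'perf probe' on the application's SIGUSR1 handler, or a handler that unwinds via libunwind -- is set up separately:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Deliver SIGUSR1 to the target at a fixed wall-clock rate so that whatever
# is hooked to that signal fires periodically, independent of whether the
# target is running or blocked.
import os
import signal
import sys
import time

def main():
    pid = int(sys.argv[1])
    freq = float(sys.argv[2]) if len(sys.argv) > 2 else 100.0  # Hz
    period = 1.0 / freq
    try:
        while True:
            os.kill(pid, signal.SIGUSR1)
            time.sleep(period)
    except OSError:
        pass  # target exited

if __name__ == "__main__":
    main()
~~~~~~~~~~~~~~

[Two caveats: a process-directed signal is handled by only one eligible thread, so this does not give the per-thread view asked for above, and it perturbs the target (blocked syscalls can return EINTR unless handlers use SA_RESTART) -- both arguments for a proper timer-based sampling event in perf itself.]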
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-18 20:39 ` David Ahern 2014-09-19 8:11 ` Milian Wolff @ 2014-09-19 14:17 ` David Ahern 2014-09-19 14:39 ` Milian Wolff 2 siblings, 1 reply; 25+ messages in thread From: David Ahern @ 2014-09-19 14:17 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/18/14, 2:36 PM, Arnaldo Carvalho de Melo wrote: > Ok, so you want to see what is the wait channel and unwind the stack > from there? Is that the case? I.e. again, a sysrq-t equivalent? > >> Milian want is to grab samples every timer expiration even if process is not >> running. > > What for? And by "grab samples" you want to know where it is waiting for > something, together with its callchain? > >> Any limitations that would prevent doing this with a sw event? e.g, mimic >> task-clock just don't disable the timer when the task is scheduled out. > > I'm just trying to figure out if what people want is a complete > backtrace of all threads in a process, no matter what they are doing, at > some given time, i.e. at "sysrq-t" time, is that the case? This has been discussed before: http://thread.gmane.org/gmane.linux.kernel/1307306 Example on the perf wiki: https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times David ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:17 ` David Ahern @ 2014-09-19 14:39 ` Milian Wolff 2014-09-19 14:55 ` David Ahern 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-19 14:39 UTC (permalink / raw) To: David Ahern Cc: Arnaldo Carvalho de Melo, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 08:17:47 David Ahern wrote: > On 9/18/14, 2:36 PM, Arnaldo Carvalho de Melo wrote: > > Ok, so you want to see what is the wait channel and unwind the stack > > from there? Is that the case? I.e. again, a sysrq-t equivalent? > > > >> Milian want is to grab samples every timer expiration even if process is > >> not running. > > > > What for? And by "grab samples" you want to know where it is waiting for > > something, together with its callchain? > > > >> Any limitations that would prevent doing this with a sw event? e.g, mimic > >> task-clock just don't disable the timer when the task is scheduled out. > > > > I'm just trying to figure out if what people want is a complete > > backtrace of all threads in a process, no matter what they are doing, at > > some given time, i.e. at "sysrq-t" time, is that the case? > > This has been discussed before: > http://thread.gmane.org/gmane.linux.kernel/1307306 > > Example on the perf wiki: > https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times Right, I also mentioned this in my initial email. This is _not_ what I want though :) Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:39 ` Milian Wolff @ 2014-09-19 14:55 ` David Ahern 0 siblings, 0 replies; 25+ messages in thread From: David Ahern @ 2014-09-19 14:55 UTC (permalink / raw) To: Milian Wolff Cc: Arnaldo Carvalho de Melo, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/19/14, 8:39 AM, Milian Wolff wrote: >> This has been discussed before: >> > http://thread.gmane.org/gmane.linux.kernel/1307306 >> > >> > Example on the perf wiki: >> > https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times > Right, I also mentioned this in my initial email. This is _not_ what I want > though :) D'oh. missed that. And I did find a blurb on StackOverflow that distros are now disabling SCHEDSTATS so commands cannot rely on sched_stat tracepoints. Now I am glad I never converted my timehist command to rely on those. David ^ permalink raw reply [flat|nested] 25+ messages in thread
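[The SCHEDSTATS point is easy to check on a given machine before relying on the sched_stat_* tracepoints; a small sketch, assuming the distro ships its kernel config under /boot (other locations such as /proc/config.gz exist on some systems and are not handled here):]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Check whether the running kernel was built with CONFIG_SCHEDSTATS, which
# the sched_stat_* tracepoints used by the wiki's sleep-time recipe rely on.
import os

def schedstats_enabled():
    config = "/boot/config-%s" % os.uname()[2]  # e.g. /boot/config-3.16.2-1-ARCH
    try:
        with open(config) as f:
            return any(line.strip() == "CONFIG_SCHEDSTATS=y" for line in f)
    except (IOError, OSError):
        return None  # config not shipped here, result unknown

if __name__ == "__main__":
    print(schedstats_enabled())
~~~~~~~~~~~~~~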
* Re: Perf event for Wall-time based sampling? 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 2014-09-18 19:31 ` Arnaldo Carvalho de Melo 2014-09-18 20:17 ` David Ahern @ 2014-09-19 5:59 ` Namhyung Kim 2014-09-19 14:33 ` Arnaldo Carvalho de Melo 2014-09-22 7:56 ` Namhyung Kim 2 siblings, 2 replies; 25+ messages in thread From: Namhyung Kim @ 2014-09-19 5:59 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Ingo Molnar, Joseph Schuchart Hi Arnaldo and Millan, On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: >> On Thursday 18 September 2014 12:57:45 Arnaldo Carvalho de Melo wrote: >> > Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: >> > > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > >> <snip> > >> > > b) The callgraphs are really strange, imo. Different traces are printed >> > > with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% >> > > traces in sched:sched_wakeup. > >> > Try using --no-children in the 'report' command line. > >> Nice, this is very useful. Many thanks! > > npo > >> > > c) Most of the traces point into the kernel, how can I hide these traces >> > > and only concentrate on the user-space? Do I have to grep manually for >> > > [.] ? I > >> > Oh well, for userspace you need to be aware of how callchains are >> > collected, i.e. if your binaries and libraries use >> > -fno-omit-frame-pointer, because if they do you will not get callchains >> > going into userspace, so you will need to specifically ask for 'DWARF' >> > callchains, from 'perf record' documentation: > >> I'm actually aware of that and I did add that option to my initial record >> call, sorry for not being clear here. > >> <snip> > >> > This has to be made automated, i.e. the tooling needs to figure out that >> > the binaries used do use %bp for optimization and automagically collect >> > DWARF, but till then, one needs to know about such issues and deal with >> > it. >> >> That would indeed be very welcome. There are multiple "defaults" in perf which >> I find highly confusing. The --no-children above e.g. could/should probably be >> the default, no? Similar, I find it extremely irritating that `perf report -g` > > It was, this is something we've actually been discussing recently: the > change that made --children be the default mode. That is why I added > Namhyung and Ingo to the CC list, so that they become aware of more > reaction to this change. Yeah, we should rethink about changing the default now. Actually I'm okay with the change, Ingo what do you think? > >> defaults to `-g fractal` and not `-g graph`. >> >> 100% foo >> 70% bar >> 70% asdf >> 30% lalala >> 30% baz >> >> is much harder to interpret than >> >> 100% foo >> 70% bar >> 49% asdf >> 21% lalala >> 30% baz I also agree with you. :) > > But the question then is if this is configurable, if not that would be a > first step, i.e. making this possible via some ~/.perfconfig change. Yes, we have record.call-graph and top.call-graph config options now so adding a new report.call-graph option should not be difficult. However I think it'd be better being call-graph.XXX as it can be applied to all other subcommands transparently. What about like below? [call-graph] mode = dwarf dump-size = 8192 print-type = fractal order = callee threshold = 0.5 print-limit = 128 sort-key = function > > Later we could advocate changing the default. Or perhaps provide some > "skins", i.e. 
config files that could be sourced into ~/.perfconfig so > that perf mimics the decisions of other profilers, with which people are > used to. > > Kinda like making mutt behave like pine (as I did a long time ago), even > if just for a while, till one gets used to the "superior" default way of > doing things of the new tool :-) > >> especially for more involved call chains. It took me quite some time to become >> aware of the ability to pass `-g graph` to get the desired output. KCacheGrind >> e.g. also defaults to something similar to `-g graph` and only optionally >> allows the user to get the "relative to parent" cost of `-g fractal`. >> >> > User space support is something that as you see, is still rough, we need >> > people like you trying it, but while it is rough, people tend to avoid >> > it... :-\ >> >> Yes. But already perf is extremely useful and I use it a lot. I'm also >> actively educating people about using it more. I've talked about it at last >> year's Akademy and Qt Developer Days, and again this year at a profiling >> workshop at Akademy. Please keep up the good work! > > Thanks a lot for doing that! > >> > > tried something like `perf report --parent "main"` but that makes no >> > > difference. > >> > > > I would recommend that you take a look at Brendan Greggs _excellent_ >> > > > tutorials at: > >> > > > http://www.brendangregg.com/perf.html > >> > > > He will explain all this in way more detail than I briefly skimmed >> > > > above. :-) > >> > > I did that already, but Brendan and the other available Perf documentation >> > > mostly concentrates on performance issues in the Kernel. I'm interested >> > > purely in the user space. Perf record with one of the hardware PMU events >> > > works nicely in that case, but one cannot use it to find locks&waits >> > > similar to what VTune offers. > >> > Humm, yeah, you need to figure out how to solve your issue, what I tried >> > was to show what kinds of building blocks you could use to build what >> > you need, but no, there is no ready to use tool for this, that I am >> > aware of. I'm also *very* interest in collecting idle/wait info using perf. Looks like we can somehow use sched:* tracepoints but it requires root privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). With that restriction however, we might improve perf sched (or even plain perf record/report) to provide such info.. David may have an idea. :) Thanks, Namhyung > >> > For instance, you need to collect scheduler events, then do some >> > scripting, perhaps using perl or python, perhaps using the scripting >> > support that is built into perf already, but yeah, not documented. > >> And also lacking the ability to get callgraphs, if I'm not mistaken. This is >> crucial for my undertaking. Or has this been added in the meantime? > > I guess it was: > > commit 57608cfd8827a74237d264a197722e2c99f72da4 > Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> > Date: Thu Jul 10 13:50:56 2014 +0200 > > perf script: Provide additional sample information on generic events > > To python scripts, including pid, tid, and cpu for which the event > was recorded. > > At the moment, the pointer to the sample struct is passed to > scripts, which seems to be of little use. > > The patch puts this information in dictionaries for easy access by > Python scripts. 
> > commit 0f5f5bcd112292f14b75750dde7461463bb1c7bb > Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> > Date: Thu Jul 10 13:50:51 2014 +0200 > > perf script: Add callchain to generic and tracepoint events > > This provides valuable information for tracing performance problems. > > Since this change alters the interface for the python scripts, also > adjust the script generation and the provided scripts. > >> <snip> >> >> This was also why I asked my initial question, which I want to repeat once >> more: Is there a technical reason to not offer a "timer" software event to >> perf? I'm a complete layman when it comes to Kernel internals, but from a user >> point of view this would be awesome: > >> perf record --call-graph dwarf -e sw-timer -F 100 someapplication > >> This command would then create a timer in the kernel with a 100Hz frequency. >> Whenever it fires, the callgraphs of all threads in $someapplication are >> sampled and written to perf.data. Is this technically not feasible? Or is it >> simply not implemented? > >> I'm experimenting with a libunwind based profiler, and with some ugly signal >> hackery I can now grab backtraces by sending my application SIGUSR1. Based on > > Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > your app with the frequency you want, and then hook a 'perf probe' into > your signal... /me tries some stuff, will get back with results... > >> that, I can probably create a profiling tool that fits my needs. I just wonder >> why one cannot do the same with perf. > > - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 5:59 ` Namhyung Kim @ 2014-09-19 14:33 ` Arnaldo Carvalho de Melo 2014-09-19 14:53 ` Milian Wolff 2014-09-22 7:56 ` Namhyung Kim 1 sibling, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-19 14:33 UTC (permalink / raw) To: Namhyung Kim Cc: Milian Wolff, linux-perf-users, Ingo Molnar, Joseph Schuchart Em Fri, Sep 19, 2014 at 02:59:55PM +0900, Namhyung Kim escreveu: > Hi Arnaldo and Millan, > > On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > >> That would indeed be very welcome. There are multiple "defaults" in perf which > >> I find highly confusing. The --no-children above e.g. could/should probably be > >> the default, no? Similar, I find it extremely irritating that `perf report -g` > > It was, this is something we've actually been discussing recently: the > > change that made --children be the default mode. That is why I added > > Namhyung and Ingo to the CC list, so that they become aware of more > > reaction to this change. > Yeah, we should rethink about changing the default now. Actually I'm > okay with the change, Ingo what do you think? > >> defaults to `-g fractal` and not `-g graph`. > >> > >> 100% foo > >> 70% bar > >> 70% asdf > >> 30% lalala > >> 30% baz > >> is much harder to interpret than > >> 100% foo > >> 70% bar > >> 49% asdf > >> 21% lalala > >> 30% baz > I also agree with you. :) > > But the question then is if this is configurable, if not that would be a > > first step, i.e. making this possible via some ~/.perfconfig change. > Yes, we have record.call-graph and top.call-graph config options now so > adding a new report.call-graph option should not be difficult. However > I think it'd be better being call-graph.XXX as it can be applied to all No problem, with sourcing being supported in ~/.perfconfig, we can have as many #include call-graph.XXX as needed, multiple levels of includes and all. > other subcommands transparently. > What about like below? > [call-graph] > mode = dwarf > dump-size = 8192 > print-type = fractal > order = callee > threshold = 0.5 > print-limit = 128 > sort-key = function Milian, does this provide what you expect? How would we call this specific call-graph profile? Ideas on where to put this below tools/perf/? > > Later we could advocate changing the default. Or perhaps provide some > > "skins", i.e. config files that could be sourced into ~/.perfconfig so > > that perf mimics the decisions of other profilers, with which people are > > used to. > > Kinda like making mutt behave like pine (as I did a long time ago), even > > if just for a while, till one gets used to the "superior" default way of > > doing things of the new tool :-) > >> > > I did that already, but Brendan and the other available Perf documentation > >> > > mostly concentrates on performance issues in the Kernel. I'm interested > >> > > purely in the user space. Perf record with one of the hardware PMU events > >> > > works nicely in that case, but one cannot use it to find locks&waits > >> > > similar to what VTune offers. > >> > Humm, yeah, you need to figure out how to solve your issue, what I tried > >> > was to show what kinds of building blocks you could use to build what > >> > you need, but no, there is no ready to use tool for this, that I am > >> > aware of. > I'm also *very* interest in collecting idle/wait info using perf. 
Looks > like we can somehow use sched:* tracepoints but it requires root > privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). > With that restriction however, we might improve perf sched (or even > plain perf record/report) to provide such info.. David may have an > idea. :) So this is like, thinking just on userspace here for a change, using syscall entry as the wait time entry, then later, when we are at syscall exit time (another tracepoint) we subtract the time from syscall entry, and that is our waiting-for-the-kernel time, put a call-chain there and we can sort by that syscall time, i.e. a bit like 'perf trace' + callchains, no? I.e. if I want to see what syscalls are taking more than 50ms on my system: [root@zoo ProgramasRFB]# trace --duration 50 | head -10 132.962 (85.670 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 267.184 (134.110 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 328.303 (263.584 ms): chrome/17715 poll(ufds: 0x3ae9a95653f0, nfds: 6, timeout_msecs: 263) = 0 Timeout 380.623 (300.665 ms): xchat/17397 poll(ufds: 0x25397d0, nfds: 6, timeout_msecs: 300 ) = 0 Timeout 425.713 (250.214 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 545.482 (482.962 ms): chrome/18798 futex(uaddr: 0x7fff19d88624, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88590, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 547.938 (500.641 ms): chrome/18173 futex(uaddr: 0x7fe1c1ed796c, op: WAIT_BITSET|PRIV|CLKRT, val: 197751, utime: 0x7fe1c15b5a60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 547.969 (272.853 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 633.711 (500.637 ms): chrome/18829 futex(uaddr: 0x7fe1bdbb862c, op: WAIT_BITSET|PRIV|CLKRT, val: 89537, utime: 0x7fe1ba67ea60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 633.767 (85.697 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 [root@zoo ProgramasRFB]# But then want to filter this a bit more and exclude poll calls: [root@zoo ProgramasRFB]# trace --duration 50 -e \!poll | head -10 299.079 (221.049 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 304.830 (250.211 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 486.710 (187.180 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 555.125 (250.216 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 578.685 (500.638 ms): chrome/18829 futex(uaddr: 0x7fe1bdbb862c, op: WAIT_BITSET|PRIV|CLKRT, val: 89879, utime: 0x7fe1ba67ea60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 578.749 (91.940 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 800.123 (500.640 ms): chrome/18827 futex(uaddr: 0x7fe1b93aa7cc, op: WAIT_BITSET|PRIV|CLKRT, val: 89803, utime: 0x7fe1bd2b3a60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 800.152 (500.618 ms): chrome/18077 futex(uaddr: 0x7fe1c179e96c, op: WAIT_BITSET|PRIV|CLKRT, val: 198047, utime: 0x7fe1c1637a60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 800.165 (221.309 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: 
WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 805.419 (250.219 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 [root@zoo ProgramasRFB]# format: time-since-start (syscall duration): ... - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
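[The same enter/exit pairing can be prototyped outside 'perf trace', e.g. by recording raw_syscalls:sys_enter and raw_syscalls:sys_exit with callchains and post-processing 'perf script' output; a rough sketch, with the text field layout assumed (it varies between perf versions) and callchain/syscall-id decoding left out:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Pair raw_syscalls:sys_enter / sys_exit per thread and report the longest
# in-kernel ("waiting for the kernel") intervals, roughly what
# 'perf trace --duration' prints above. Feed it with: perf script | python thisfile.py
import re
import sys

# assumed 'perf script' line shape:  comm  tid [cpu] 12345.678901: raw_syscalls:sys_enter: ...
LINE = re.compile(r"^\s*(\S+)\s+(\d+)\s+.*?(\d+\.\d+):\s+raw_syscalls:sys_(enter|exit):")

def main(min_duration_ms=50.0):
    pending = {}  # tid -> (timestamp, comm) of the unmatched sys_enter
    for line in sys.stdin:
        m = LINE.match(line)
        if not m:
            continue  # callchain lines etc. would be handled here
        comm, tid, ts, kind = m.group(1), int(m.group(2)), float(m.group(3)), m.group(4)
        if kind == "enter":
            pending[tid] = (ts, comm)
        elif tid in pending:
            start, comm = pending.pop(tid)
            ms = (ts - start) * 1000.0
            if ms >= min_duration_ms:
                print("%16s/%d waited %.3f ms in a syscall" % (comm, tid, ms))

if __name__ == "__main__":
    main()
~~~~~~~~~~~~~~

[As Milian notes further down, this still only sees time spent inside syscalls, not the wall-clock view across on-CPU and off-CPU time.]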
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:33 ` Arnaldo Carvalho de Melo @ 2014-09-19 14:53 ` Milian Wolff 2014-09-19 15:50 ` Namhyung Kim 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-19 14:53 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Namhyung Kim, linux-perf-users, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 11:33:40 Arnaldo Carvalho de Melo wrote: > Em Fri, Sep 19, 2014 at 02:59:55PM +0900, Namhyung Kim escreveu: > > Hi Arnaldo and Millan, > > > > On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > > > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > > >> That would indeed be very welcome. There are multiple "defaults" in > > >> perf which I find highly confusing. The --no-children above e.g. > > >> could/should probably be the default, no? Similar, I find it extremely > > >> irritating that `perf report -g`> > > > > It was, this is something we've actually been discussing recently: the > > > change that made --children be the default mode. That is why I added > > > Namhyung and Ingo to the CC list, so that they become aware of more > > > reaction to this change. > > > > Yeah, we should rethink about changing the default now. Actually I'm > > okay with the change, Ingo what do you think? > > > > >> defaults to `-g fractal` and not `-g graph`. > > >> > > >> 100% foo > > >> > > >> 70% bar > > >> > > >> 70% asdf > > >> 30% lalala > > >> > > >> 30% baz > > >> > > >> is much harder to interpret than > > >> > > >> 100% foo > > >> > > >> 70% bar > > >> > > >> 49% asdf > > >> 21% lalala > > >> > > >> 30% baz > > > > I also agree with you. :) > > > > > But the question then is if this is configurable, if not that would be a > > > first step, i.e. making this possible via some ~/.perfconfig change. > > > > Yes, we have record.call-graph and top.call-graph config options now so > > adding a new report.call-graph option should not be difficult. However > > I think it'd be better being call-graph.XXX as it can be applied to all > > No problem, with sourcing being supported in ~/.perfconfig, we can have > as many #include call-graph.XXX as needed, multiple levels of includes > and all. > > > other subcommands transparently. > > > > What about like below? > > > > [call-graph] > > > > mode = dwarf > > dump-size = 8192 > > print-type = fractal > > order = callee > > threshold = 0.5 > > print-limit = 128 > > sort-key = function > > Milian, does this provide what you expect? How would we call this > specific call-graph profile? print-type should be graph, not fractal. Otherwise it sounds good to me. But how would one use it? I tried putting it into ~/.perfconfig, but apparently my 3.16.2-1-ARCH Perf does not support this feature yet? How/when would that config be used? As soon as one does "perf record -g" (for mode and dump-size) or "perf report" (for a perf.data with call graphs)? That would be very useful! > Ideas on where to put this below tools/perf/? Nope. > > > Later we could advocate changing the default. Or perhaps provide some > > > "skins", i.e. config files that could be sourced into ~/.perfconfig so > > > that perf mimics the decisions of other profilers, with which people are > > > used to. 
> > > > > > Kinda like making mutt behave like pine (as I did a long time ago), even > > > if just for a while, till one gets used to the "superior" default way of > > > doing things of the new tool :-) > > > > > >> > > I did that already, but Brendan and the other available Perf > > >> > > documentation > > >> > > mostly concentrates on performance issues in the Kernel. I'm > > >> > > interested > > >> > > purely in the user space. Perf record with one of the hardware PMU > > >> > > events > > >> > > works nicely in that case, but one cannot use it to find > > >> > > locks&waits > > >> > > similar to what VTune offers. > > >> > > > >> > Humm, yeah, you need to figure out how to solve your issue, what I > > >> > tried > > >> > was to show what kinds of building blocks you could use to build what > > >> > you need, but no, there is no ready to use tool for this, that I am > > >> > aware of. > > > > I'm also *very* interest in collecting idle/wait info using perf. Looks > > like we can somehow use sched:* tracepoints but it requires root > > privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). > > > > With that restriction however, we might improve perf sched (or even > > plain perf record/report) to provide such info.. David may have an > > idea. :) > > So this is like, thinking just on userspace here for a change, using > syscall entry as the wait time entry, then later, when we are at syscall > exit time (another tracepoint) we subtract the time from syscall entry, > and that is our waiting-for-the-kernel time, put a call-chain there and > we can sort by that syscall time, i.e. a bit like 'perf trace' + > callchains, no? > > I.e. if I want to see what syscalls are taking more than 50ms on my > system: > > [root@zoo ProgramasRFB]# trace --duration 50 | head -10 <snip> > > But then want to filter this a bit more and exclude poll calls: > > [root@zoo ProgramasRFB]# trace --duration 50 -e \!poll | head -10 <snip> This sounds useful. If it now would also print backtraces... :) But note how this would not find a userspace function that does 5 syscalls, each taking 10ms. And again, just sampling syscalls is not what I'm looking for. It is definitely a good tool to have at hand, but very specific already. If you want to get an overview on where your app is spending time, having a Wall-time overview is much more helpful as it will also show CPU hotspots or stuff unrelated to syscalls. Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
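[To spell out the relationship between the two call-graph print types using the numbers from this subthread: '-g fractal' shows each entry relative to its parent, '-g graph' relative to the total, so the absolute weight of a node is the product of the relative shares along its path. A toy conversion:]

~~~~~~~~~~~~~~
# Convert "fractal" (relative-to-parent) percentages into "graph" (absolute)
# percentages by multiplying along the path from the root.
# Numbers taken from the example quoted in this thread.
fractal_tree = ("foo", 1.00, [
    ("bar", 0.70, [
        ("asdf", 0.70, []),
        ("lalala", 0.30, []),
    ]),
    ("baz", 0.30, []),
])

def to_graph(node, parent_abs=1.0, depth=0):
    name, rel, children = node
    absolute = parent_abs * rel
    print("%s%5.1f%% %s" % ("  " * depth, 100.0 * absolute, name))
    for child in children:
        to_graph(child, absolute, depth + 1)

to_graph(fractal_tree)
# prints 100.0% foo / 70.0% bar / 49.0% asdf / 21.0% lalala / 30.0% baz
~~~~~~~~~~~~~~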
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:53 ` Milian Wolff @ 2014-09-19 15:50 ` Namhyung Kim 0 siblings, 0 replies; 25+ messages in thread From: Namhyung Kim @ 2014-09-19 15:50 UTC (permalink / raw) To: Milian Wolff Cc: Arnaldo Carvalho de Melo, linux-perf-users, Ingo Molnar, Joseph Schuchart 2014-09-19 (금), 16:53 +0200, Milian Wolff: > On Friday 19 September 2014 11:33:40 Arnaldo Carvalho de Melo wrote: > > Em Fri, Sep 19, 2014 at 02:59:55PM +0900, Namhyung Kim escreveu: > > > Hi Arnaldo and Millan, > > > > > > On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > > > > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > > > >> That would indeed be very welcome. There are multiple "defaults" in > > > >> perf which I find highly confusing. The --no-children above e.g. > > > >> could/should probably be the default, no? Similar, I find it extremely > > > >> irritating that `perf report -g`> > > > > > It was, this is something we've actually been discussing recently: the > > > > change that made --children be the default mode. That is why I added > > > > Namhyung and Ingo to the CC list, so that they become aware of more > > > > reaction to this change. > > > > > > Yeah, we should rethink about changing the default now. Actually I'm > > > okay with the change, Ingo what do you think? > > > > > > >> defaults to `-g fractal` and not `-g graph`. > > > >> > > > >> 100% foo > > > >> > > > >> 70% bar > > > >> > > > >> 70% asdf > > > >> 30% lalala > > > >> > > > >> 30% baz > > > >> > > > >> is much harder to interpret than > > > >> > > > >> 100% foo > > > >> > > > >> 70% bar > > > >> > > > >> 49% asdf > > > >> 21% lalala > > > >> > > > >> 30% baz > > > > > > I also agree with you. :) > > > > > > > But the question then is if this is configurable, if not that would be a > > > > first step, i.e. making this possible via some ~/.perfconfig change. > > > > > > Yes, we have record.call-graph and top.call-graph config options now so > > > adding a new report.call-graph option should not be difficult. However > > > I think it'd be better being call-graph.XXX as it can be applied to all > > > > No problem, with sourcing being supported in ~/.perfconfig, we can have > > as many #include call-graph.XXX as needed, multiple levels of includes > > and all. > > > > > other subcommands transparently. > > > > > > What about like below? > > > > > > [call-graph] > > > > > > mode = dwarf > > > dump-size = 8192 > > > print-type = fractal > > > order = callee > > > threshold = 0.5 > > > print-limit = 128 > > > sort-key = function > > > > Milian, does this provide what you expect? How would we call this > > specific call-graph profile? > > print-type should be graph, not fractal. Otherwise it sounds good to me. But > how would one use it? I tried putting it into ~/.perfconfig, but apparently my > 3.16.2-1-ARCH Perf does not support this feature yet? How/when would that > config be used? As soon as one does "perf record -g" (for mode and dump-size) > or "perf report" (for a perf.data with call graphs)? That would be very > useful! This is a proposal for new config items which can be included into the ~/.perfconfig file. You can change it as you want later and it should apply to default behavior of perf record -g and perf report. Thanks, Namhyung ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 5:59 ` Namhyung Kim 2014-09-19 14:33 ` Arnaldo Carvalho de Melo @ 2014-09-22 7:56 ` Namhyung Kim 1 sibling, 0 replies; 25+ messages in thread From: Namhyung Kim @ 2014-09-22 7:56 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Ingo Molnar, Joseph Schuchart, David Ahern On Fri, 19 Sep 2014 14:59:55 +0900, Namhyung Kim wrote: > I'm also *very* interest in collecting idle/wait info using perf. Looks > like we can somehow use sched:* tracepoints but it requires root > privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). > > With that restriction however, we might improve perf sched (or even > plain perf record/report) to provide such info.. David may have an > idea. :) I wrote a naive script called tasktime that uses task-clock and sched:sched_switch events to record and report task execution and wait time. Very lightly tested only.. The idea is that it counts time between context-switch and next task-clock event as an wait time. Callchain is searched with a pattern similar to the parent pattern to find the reason of the switch. Please have a look and give some comments! :) Thanks, Namhyung From ff8497d91277c25d8c2c7688b7c056b38f799614 Mon Sep 17 00:00:00 2001 From: Namhyung Kim <namhyung@kernel.org> Date: Mon, 22 Sep 2014 16:08:30 +0900 Subject: [PATCH] perf script: Introduce a new 'tasktime' python script The tasktime script record and report task execution (and wait) time using task-clock and sched:sched_switch events. When a task goes to a context switch it searches callchain to have a certain pattern (sys_.*|.*[^b][Ll]ock|page_fault) and prints with a "wait: " prefix. If it fails to find the pattern in the callchain, it prints 'idle' or 'context-switch' depends on the next task scheduled. Below is an example run of this script. You can see 'context-switch' overhead is around 0.8% of total execution time in this workload. Note that this script is not complete and posted only for discussion. 
$ perf script record tasktime -- make clean all $ perf script report tasktime | head -20 Overhead Command Shared Object Symbol -------- ---------------- -------------------- ---------------- 2.15% cc1 cc1 ht_lookup_with_hash 2.10% cc1 cc1 _cpp_lex_direct 1.78% cc1 libc-2.17.so _int_malloc 1.44% cc1 cc1 ggc_internal_alloc_stat 1.39% cc1 cc1 htab_find_slot_with_hash 0.85% cc1 cc1 bitmap_set_bit 0.84% cc1 cc1 lex_identifier 0.82% cc1 libc-2.17.so _int_free 0.80% cc1 <kernel> context-switch 0.57% cc1 [kernel.kallsyms] __do_page_fault 0.55% cc1 cc1 cpp_get_token_1 0.54% cc1 cc1 df_note_compute 0.53% cc1 cc1 htab_find_with_hash 0.53% cc1 cc1 for_each_rtx_1 0.50% cc1 libc-2.17.so malloc_consolidate 0.48% cc1 cc1 grokdeclarator 0.46% as libc-2.17.so __strncmp_sse42 0.46% as as 4260905 Signed-off-by: Namhyung Kim <namhyung@kernel.org> --- tools/perf/scripts/python/bin/tasktime-record | 3 + tools/perf/scripts/python/bin/tasktime-report | 3 + tools/perf/scripts/python/tasktime.py | 125 ++++++++++++++++++++++++++ 3 files changed, 131 insertions(+) create mode 100644 tools/perf/scripts/python/bin/tasktime-record create mode 100644 tools/perf/scripts/python/bin/tasktime-report create mode 100644 tools/perf/scripts/python/tasktime.py diff --git a/tools/perf/scripts/python/bin/tasktime-record b/tools/perf/scripts/python/bin/tasktime-record new file mode 100644 index 000000000000..5aebcef93e55 --- /dev/null +++ b/tools/perf/scripts/python/bin/tasktime-record @@ -0,0 +1,3 @@ +#!/bin/bash +perf record -gR -e task-clock -e sched:sched_switch $@ + diff --git a/tools/perf/scripts/python/bin/tasktime-report b/tools/perf/scripts/python/bin/tasktime-report new file mode 100644 index 000000000000..a21f8dea15e7 --- /dev/null +++ b/tools/perf/scripts/python/bin/tasktime-report @@ -0,0 +1,3 @@ +#!/bin/bash +# description: analyze task execution and wait time +perf script $@ -s "$PERF_EXEC_PATH"/scripts/python/tasktime.py diff --git a/tools/perf/scripts/python/tasktime.py b/tools/perf/scripts/python/tasktime.py new file mode 100644 index 000000000000..0d69d94cb31b --- /dev/null +++ b/tools/perf/scripts/python/tasktime.py @@ -0,0 +1,125 @@ +# perf script event handlers, generated by perf script -g python +# Licensed under the terms of the GNU GPL License version 2 + +# The common_* event handler fields are the most useful fields common to +# all events. They don't necessarily correspond to the 'common_*' fields +# in the format files. Those fields not available as handler params can +# be retrieved using Python functions of the form common_*(context). +# See the perf-trace-python Documentation for the list of available functions. 
+ +import os +import sys +import re + +sys.path.append(os.environ['PERF_EXEC_PATH'] + \ + '/scripts/python/Perf-Trace-Util/lib/Perf/Trace') + +from perf_trace_context import * +from Core import * + +time_table = {} +wait_table = [] +evcount = {} +sample_count = {} +wait_pattern = '' + +total_time = 0 + +def trace_begin(): + global wait_pattern + wait_pattern = re.compile(".*[^b][Ll]ock.*|sys_.*|page_fault") + +def trace_end(): + output = [(k,time_table[k]) for k in time_table] + output.sort(key=lambda e: e[1], reverse=True) + + print " Overhead Command Shared Object Symbol" + print " -------- ---------------- -------------------- ----------------" + for entry in output: + key = entry[0].split(',') + print " %6.2f%% %-16s %-20s %s" % \ + ((100.0 * entry[1] / total_time), key[0], key[1], key[2]) + + print + print "[event stats]" + print "Total time:", total_time + print "Number of events:" + for k in evcount: + print " ", k, evcount[k] + +def sched__sched_switch(event_name, context, common_cpu, + common_secs, common_nsecs, common_pid, common_comm, + common_callchain, prev_comm, prev_pid, prev_prio, prev_state, + next_comm, next_pid, next_prio): + + global total_time + if 'sched_switch' not in evcount: + evcount['sched_switch'] = 0 + evcount['sched_switch'] += 1 + + ts = common_secs * 1000000000 + common_nsecs + + if next_comm == 'swapper': + name = "idle" + else: + name = "context-switch" + dso = '<kernel>' + + if common_callchain: + for entry in common_callchain: + if 'sym' not in entry: + continue + symname = entry['sym']['name'] + if symname and wait_pattern.match(symname): + name = "wait: " + symname + dso = '<unknown>' + break + + key = ','.join([common_comm, dso, name]) + if key not in sample_count: + sample_count[key] = 0 + sample_count[key] += 1 + wait_table.insert(0, {'start': ts, 'end': 0, 'key': key, + 'prev': prev_comm, 'next': next_comm }) + +def process_event(param_dict): + global total_time + event = param_dict['ev_name'] + if event not in evcount: + evcount[event] = 0 + evcount[event] += 1 + + comm = param_dict['comm'] + if 'dso' in param_dict: + dso = os.path.basename(param_dict['dso']) + else: + dso = '<unknown>' + if 'symbol' in param_dict: + sym = param_dict['symbol'] + else: + sym = str(param_dict['sample']['ip']) + + key = ','.join([comm, dso, sym]) + if key not in time_table: + time_table[key] = 0 + time_table[key] += param_dict['sample']['period'] + total_time += param_dict['sample']['period'] + + if key not in sample_count: + sample_count[key] = 0 + sample_count[key] += 1 + + if len(wait_table) > 0 and wait_table[0]['end'] == 0: + ts_end = param_dict['sample']['time'] + wait_table[0]['end'] = ts_end - param_dict['sample']['period'] + waittime = wait_table[0]['end'] - wait_table[0]['start'] + if waittime <= 0: + waittime = param_dict['sample']['period'] / 2 + key = wait_table[0]['key'] + if key not in time_table: + time_table[key] = 0 + time_table[key] += waittime + total_time += waittime + +def trace_unhandled(event_name, context, event_fields_dict): + print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())]) -- 2.1.0 ^ permalink raw reply related [flat|nested] 25+ messages in thread
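[The wait accounting in the script above, spelled out in isolation: the interval between a sched_switch and the point where the next task-clock sample's period began is attributed as wait time to the reason found in the switch-out callchain. A toy illustration with made-up numbers, assuming task-clock periods are reported in nanoseconds:]

~~~~~~~~~~~~~~
# Toy illustration of tasktime's wait accounting (numbers are made up).
period_ns = 1000000          # task-clock fires every 1 ms while on-CPU
switch_ts = 5000000          # task switched out at t = 5 ms
next_sample_ts = 12000000    # first task-clock sample after it runs again

# the sample at 12 ms covers the last 1 ms of CPU time, so the thread must
# have been back on the CPU at ~11 ms; everything between the switch-out
# and that point counts as waiting:
wait_ns = (next_sample_ts - period_ns) - switch_ts
assert wait_ns == 6000000    # 6 ms attributed to e.g. "wait: sys_poll"
~~~~~~~~~~~~~~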