* Perf event for Wall-time based sampling?
@ 2014-09-18 12:32 Milian Wolff
  2014-09-18 13:23 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 25+ messages in thread
From: Milian Wolff @ 2014-09-18 12:32 UTC (permalink / raw)
To: linux-perf-users

Hello,

is it somehow possible to use perf based on some kernel timer? I'd like to get
an overview of where a userspace application is spending time, both on-CPU as
well as waiting off-CPU - e.g. something similar to using GDB as a poor-man's
profiler, regularly interrupting the process and investigating the callgraphs.
This is quite efficient for a high-level overview when you want to figure out
where time is spent, unrelated to how it was actually spent (cpu, thread locks,
io wait, ...).

E.g. what event would I use for a simple application like this:

~~~~~~~~~~~~~~
#include <unistd.h>

int main()
{
    sleep(10);
    return 0;
}
~~~~~~~~~~~~~~

Which perf event would show me that most of the time is spent sleeping? I
tried something like this to no avail:

$ perf record --call-graph dwarf -e cpu-clock -F 100 ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB perf.data (~304 samples) ]
perf report --stdio
Error:
The perf.data file has no samples!
# To display the perf.data header info, please use --header/--header-only options.

I read https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times
and tried it out. The result is odd, as I get the "same" backtrace multiple
times, all with 100% cost:

~~~~~~~~~~~~~~~~~~~~~~~~~~
100.00%   0.00%         0  a.out  libc-2.19.so       [.] __GI___libc_nanosleep
          |
          --- __GI___libc_nanosleep

100.00%   0.00%         0  a.out  [kernel.kallsyms]  [k] system_call_fastpath
          |
          --- system_call_fastpath
              __GI___libc_nanosleep

100.00%   0.00%         0  a.out  [kernel.kallsyms]  [k] sys_nanosleep
          |
          --- sys_nanosleep
              system_call_fastpath
              __GI___libc_nanosleep

100.00%   0.00%         0  a.out  [kernel.kallsyms]  [k] hrtimer_nanosleep
          |
          --- hrtimer_nanosleep
              sys_nanosleep
              system_call_fastpath
              __GI___libc_nanosleep
~~~~~~~~~~~~~~~~~~~~~~~~~~

And generally, this would *only* profile sleep time and would ignore the
on-CPU time (and maybe thread waits) and so forth.

Is there a technical reason why it is not possible to use a plain timer as
a sampling event? If I'm not mistaken, then Intel VTune actually uses a
similar technique for its simple profiling modes which can already give
extremely useful data - both to find CPU hotspots as well as locks&waits.

Bye
--
Milian Wolff
mail@milianw.de
http://milianw.de

^ permalink raw reply	[flat|nested] 25+ messages in thread
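For context: cpu-clock and task-clock only tick while the task is actually running on a CPU, so a process that spends its life inside sleep() produces essentially no samples with them - which is why the record above came back empty. The sleep-time recipe from the wiki page referenced above is roughly the following sketch (it needs root and a kernel with schedstats available; on newer kernels the sysctl in the first line must be enabled first):

~~~~~~~~~~~~~~
# Record sleep/switch tracepoints with callchains, then let perf inject fold
# the sched_stat data into the sched_switch samples so report can weight them:
echo 1 > /proc/sys/kernel/sched_schedstats          # newer kernels only; older ones just need CONFIG_SCHEDSTATS
perf record -e sched:sched_stat_sleep -e sched:sched_switch \
            -e sched:sched_process_exit -g -o perf.data.raw ./a.out
perf inject -v -s -i perf.data.raw -o perf.data
perf report --stdio -i perf.data
~~~~~~~~~~~~~~

This only covers the off-CPU half of the question. The wall of identical 100% entries in the paste above looks like the (then new) --children default of perf report at work: every frame up the chain gets the full inclusive cost, a point that comes up again later in the thread.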
* Re: Perf event for Wall-time based sampling? 2014-09-18 12:32 Perf event for Wall-time based sampling? Milian Wolff @ 2014-09-18 13:23 ` Arnaldo Carvalho de Melo 2014-09-18 13:41 ` Milian Wolff 0 siblings, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 13:23 UTC (permalink / raw) To: Milian Wolff; +Cc: linux-perf-users Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > Hello, > > is it somehow possible to use perf based on some kernel timer? I'd like to get Try with tracepoints or with probe points combined with callchains instead of using a hardware counter. - Arnaldo > an overview of where a userspace application is spending time, both on-CPU as > well as waiting off-CPU. E.g. something similar to using GDB as a poor-mans > profiler and regularly interrupting the process and investigating the > callgraphs. This is quite efficient for a high-level overview when you want to > figure out where time is spent, unrelated to how it was actually spent (cpu, > thread locks, io wait, ...). > > E.g. what event would I use for a simple application like this: > > ~~~~~~~~~~~~~~ > #include <unistd.h> > > int main() > { > sleep(10); > return 0; > } > ~~~~~~~~~~~~~~ > > Which perf event would show me that most of the time is spent sleeping? I > tried something like this to no avail: > > $ perf record --call-graph dwarf -e cpu-clock -F 100 ./a.out > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.007 MB perf.data (~304 samples) ] > perf report --stdio > Error: > The perf.data file has no samples! > # To display the perf.data header info, please use --header/--header-only > options. > > I read https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times > and tried it out. The result is odd, as I get the "same" backtrace multiple > times, all with 100% cost: > > ~~~~~~~~~~~~~~~~~~~~~~~~~~ > 100.00% 0.00% 0 a.out libc-2.19.so [.] > __GI___libc_nanosleep > | > --- __GI___libc_nanosleep > > 100.00% 0.00% 0 a.out [kernel.kallsyms] [k] > system_call_fastpath > | > --- system_call_fastpath > __GI___libc_nanosleep > > 100.00% 0.00% 0 a.out [kernel.kallsyms] [k] > sys_nanosleep > | > --- sys_nanosleep > system_call_fastpath > __GI___libc_nanosleep > > 100.00% 0.00% 0 a.out [kernel.kallsyms] [k] > hrtimer_nanosleep > | > --- hrtimer_nanosleep > sys_nanosleep > system_call_fastpath > __GI___libc_nanosleep > ~~~~~~~~~~~~~~~~~~~~~~~~~~ > > And generally, this would *only* profile sleep time and would ignore the on- > CPU time (and maybe thread waits) and so forth. > > Is there a technical reason on why it is not possible to use a plain timer as > a sampling event? If I'm not mistaken, then Intel VTune actually uses a > similar technique for its simple profiling modes which can already give > extremely useful data - both to find CPU hotspots as well as locks&waits. > > Bye > -- > Milian Wolff > mail@milianw.de > http://milianw.de > -- > To unsubscribe from this list: send the line "unsubscribe linux-perf-users" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling?
  2014-09-18 13:23 ` Arnaldo Carvalho de Melo
@ 2014-09-18 13:41 ` Milian Wolff
  2014-09-18 14:51   ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 25+ messages in thread
From: Milian Wolff @ 2014-09-18 13:41 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote:
> Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu:
> > Hello,
> >
> > is it somehow possible to use perf based on some kernel timer? I'd like to
> > get
> Try with tracepoints or with probe points combined with callchains
> instead of using a hardware counter.

Hey Arnaldo,

where would you add such tracepoints? Or what tracepoint would you use? And
what is the difference between tracepoints and probe points (I'm only aware
of `perf probe`)?

Bye
--
Milian Wolff
mail@milianw.de
http://milianw.de

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-18 13:41 ` Milian Wolff @ 2014-09-18 14:51 ` Arnaldo Carvalho de Melo 2014-09-18 15:26 ` Milian Wolff 0 siblings, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 14:51 UTC (permalink / raw) To: Milian Wolff; +Cc: linux-perf-users Em Thu, Sep 18, 2014 at 03:41:20PM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > > > is it somehow possible to use perf based on some kernel timer? I'd like to > > > get > > Try with tracepoints or with probe points combined with callchains > > instead of using a hardware counter. > where would you add such tracepoints? Or what tracepoint would you use? And > what is the difference between tracepoints and probe points (I'm only aware of > `perf probe`). tracepoints are places in the kernel (and in userspace as well, came later tho) where developers put in place for later tracing. They are super optimized, so have a lower cost than 'probe points' that you can put in place using 'perf probe'. To see the tracepoints, or any other available event in your system, use 'perf list'. The debugfs filesystem will need to be mounted, but that will be transparently done if the user has enough privileges. For instance, here are some tracepoints that you may want to use: [root@zoo ~]# perf list sched:* sched:sched_kthread_stop [Tracepoint event] sched:sched_kthread_stop_ret [Tracepoint event] sched:sched_wakeup [Tracepoint event] sched:sched_wakeup_new [Tracepoint event] sched:sched_switch [Tracepoint event] sched:sched_migrate_task [Tracepoint event] sched:sched_process_free [Tracepoint event] sched:sched_process_exit [Tracepoint event] sched:sched_wait_task [Tracepoint event] sched:sched_process_wait [Tracepoint event] sched:sched_process_fork [Tracepoint event] sched:sched_process_exec [Tracepoint event] sched:sched_stat_wait [Tracepoint event] sched:sched_stat_sleep [Tracepoint event] sched:sched_stat_iowait [Tracepoint event] sched:sched_stat_blocked [Tracepoint event] sched:sched_stat_runtime [Tracepoint event] sched:sched_pi_setprio [Tracepoint event] sched:sched_move_numa [Tracepoint event] sched:sched_stick_numa [Tracepoint event] sched:sched_swap_numa [Tracepoint event] sched:sched_wake_idle_without_ipi [Tracepoint event] [root@zoo ~] All tracepoints, for the system scheduler. You can ask for all of them to be collected, togethed with callchains, using: [root@zoo ~]# perf record -a -e sched:* -g usleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.294 MB perf.data (~56528 samples) ] [root@zoo ~]# This collected system wide samples for all the scheduler tracepoints for 1 microsecond. Use perf report and it will tell you how many of those tracepoints were hit during this short system wide session, and will allow you to traverse its callchains. I would recommend that you take a look at Brendan Greggs _excellent_ tutorials at: http://www.brendangregg.com/perf.html He will explain all this in way more detail than I briefly skimmed above. :-) - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
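To make the tracepoint vs. probe point distinction concrete: tracepoints are static hooks compiled into the kernel, while probe points are created on demand with perf probe on arbitrary functions. A hedged sketch - the function names are just examples, and symbol availability depends on your kernel and on installed debuginfo:

~~~~~~~~~~~~~~
# Dynamic kernel probe on a function, then sample it with callchains:
perf probe --add do_sys_open
perf record -e probe:do_sys_open -ag -- sleep 5
perf report --stdio

# Dynamic userspace probe (uprobe) in a library works the same way:
perf probe -x /usr/lib64/libc.so.6 malloc
perf record -e probe_libc:malloc -g ./a.out

# Clean up the probes when done:
perf probe --del '*'
~~~~~~~~~~~~~~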
* Re: Perf event for Wall-time based sampling? 2014-09-18 14:51 ` Arnaldo Carvalho de Melo @ 2014-09-18 15:26 ` Milian Wolff 2014-09-18 15:57 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-18 15:26 UTC (permalink / raw) To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 03:41:20PM +0200, Milian Wolff escreveu: > > On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote: > > > Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > > > > is it somehow possible to use perf based on some kernel timer? I'd > > > > like to > > > > get > > > > > > Try with tracepoints or with probe points combined with callchains > > > instead of using a hardware counter. > > > > where would you add such tracepoints? Or what tracepoint would you use? > > And > > what is the difference between tracepoints and probe points (I'm only > > aware of `perf probe`). > > tracepoints are places in the kernel (and in userspace as well, came > later tho) where developers put in place for later tracing. > > They are super optimized, so have a lower cost than 'probe points' that > you can put in place using 'perf probe'. > > To see the tracepoints, or any other available event in your system, use > 'perf list'. > > The debugfs filesystem will need to be mounted, but that will be > transparently done if the user has enough privileges. Thanks for the quick rundown Arnaldo! Sadly, that much I "knew" already, yet am not able to understand how to use for my purpose. > For instance, here are some tracepoints that you may want to use: > > [root@zoo ~]# perf list sched:* <snip> I tried this: # a.out is the result of compiling `int main { sleep(1); return 0; }`: perf record -e sched:* --call-graph dwarf ./a.out perf report -g graph --stdio # the result can be found here: http://paste.kde.org/pflkskwrf How do I have to interpret this? a) This is not Wall-clock profiling, no? It just grabs a callgraph whenever one of the sched* events occurs, none of these events will occur, say, every X ms. b) The callgraphs are really strange, imo. Different traces are printed with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% traces in sched:sched_wakeup. c) Most of the traces point into the kernel, how can I hide these traces and only concentrate on the user-space? Do I have to grep manually for [.] ? I tried something like `perf report --parent "main"` but that makes no difference. > I would recommend that you take a look at Brendan Greggs _excellent_ > tutorials at: > > http://www.brendangregg.com/perf.html > > He will explain all this in way more detail than I briefly skimmed > above. :-) I did that already, but Brendan and the other available Perf documentation mostly concentrates on performance issues in the Kernel. I'm interested purely in the user space. Perf record with one of the hardware PMU events works nicely in that case, but one cannot use it to find locks&waits similar to what VTune offers. Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
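On question c): for sampling events the :u modifier restricts the samples themselves to user mode, and on the report side the view can be narrowed to the DSOs of interest. A sketch, reusing the DSO names from the paste above (tracepoint events still fire in kernel context, so for those only the report-side filtering applies):

~~~~~~~~~~~~~~
# Only take samples while executing user-space code:
perf record --call-graph dwarf -e cycles:u ./a.out

# Or filter at report time, keeping only entries from the binary and libc:
perf report -g graph --dsos=a.out,libc-2.19.so --stdio
~~~~~~~~~~~~~~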
* Re: Perf event for Wall-time based sampling? 2014-09-18 15:26 ` Milian Wolff @ 2014-09-18 15:57 ` Arnaldo Carvalho de Melo 2014-09-18 16:37 ` Milian Wolff 0 siblings, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 15:57 UTC (permalink / raw) To: Milian Wolff; +Cc: linux-perf-users, Namhyung Kim, Ingo Molnar Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 03:41:20PM +0200, Milian Wolff escreveu: > > > On Thursday 18 September 2014 10:23:50 Arnaldo Carvalho de Melo wrote: > > > > Em Thu, Sep 18, 2014 at 02:32:10PM +0200, Milian Wolff escreveu: > > > > > is it somehow possible to use perf based on some kernel timer? I'd > > > > > like to get > > > > Try with tracepoints or with probe points combined with callchains > > > > instead of using a hardware counter. > > > > > > where would you add such tracepoints? Or what tracepoint would you use? > > > And > > > what is the difference between tracepoints and probe points (I'm only > > > aware of `perf probe`). > > > > tracepoints are places in the kernel (and in userspace as well, came > > later tho) where developers put in place for later tracing. > > > > They are super optimized, so have a lower cost than 'probe points' that > > you can put in place using 'perf probe'. > > > > To see the tracepoints, or any other available event in your system, use > > 'perf list'. > > > > The debugfs filesystem will need to be mounted, but that will be > > transparently done if the user has enough privileges. > > Thanks for the quick rundown Arnaldo! Sadly, that much I "knew" already, yet > am not able to understand how to use for my purpose. > > > For instance, here are some tracepoints that you may want to use: > > > > [root@zoo ~]# perf list sched:* > <snip> > I tried this: > # a.out is the result of compiling `int main { sleep(1); return 0; }`: > perf record -e sched:* --call-graph dwarf ./a.out > perf report -g graph --stdio > # the result can be found here: http://paste.kde.org/pflkskwrf > How do I have to interpret this? > a) This is not Wall-clock profiling, no? It just grabs a callgraph whenever > one of the sched* events occurs, none of these events will occur, say, every X > ms. > b) The callgraphs are really strange, imo. Different traces are printed with > the same cost, which sounds wrong, no? See e.g. the multiple 44.44% traces in > sched:sched_wakeup. Try using --no-children in the 'report' command line. > c) Most of the traces point into the kernel, how can I hide these traces and > only concentrate on the user-space? Do I have to grep manually for [.] ? I Oh well, for userspace you need to be aware of how callchains are collected, i.e. if your binaries and libraries use -fno-omit-frame-pointer, because if they do you will not get callchains going into userspace, so you will need to specifically ask for 'DWARF' callchains, from 'perf record' documentation: --call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI - Call Frame Information) as the method to collect the information used to show the call graphs. In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the libunwind library) should be used instead. This has to be made automated, i.e. 
the tooling needs to figure out that the binaries used do use %bp for optimization and automagically collect DWARF, but till then, one needs to know about such issues and deal with it. User space support is something that as you see, is still rough, we need people like you trying it, but while it is rough, people tend to avoid it... :-\ > tried something like `perf report --parent "main"` but that makes no > difference. > > > I would recommend that you take a look at Brendan Greggs _excellent_ > > tutorials at: > > > > http://www.brendangregg.com/perf.html > > > > He will explain all this in way more detail than I briefly skimmed > > above. :-) > > I did that already, but Brendan and the other available Perf documentation > mostly concentrates on performance issues in the Kernel. I'm interested purely > in the user space. Perf record with one of the hardware PMU events works > nicely in that case, but one cannot use it to find locks&waits similar to what > VTune offers. Humm, yeah, you need to figure out how to solve your issue, what I tried was to show what kinds of building blocks you could use to build what you need, but no, there is no ready to use tool for this, that I am aware of. For instance, you need to collect scheduler events, then do some scripting, perhaps using perl or python, perhaps using the scripting support that is built into perf already, but yeah, not documented. You could try starting with: [root@ssdandy ~]# perf script -l List of available trace scripts: net_dropmonitor display a table of dropped frames failed-syscalls-by-pid [comm] system-wide failed syscalls, by pid syscall-counts [comm] system-wide syscall counts netdev-times [tx] [rx] [dev=] [debug] display a process of packet and processing time sctop [comm] [interval] syscall top syscall-counts-by-pid [comm] system-wide syscall counts, by pid sched-migration sched migration overview event_analyzing_sample analyze all perf samples futex-contention futext contention measurement failed-syscalls [comm] system-wide failed syscalls workqueue-stats workqueue stats (ins/exe/create/destroy) wakeup-latency system-wide min/max/avg wakeup latency rw-by-file <comm> r/w activity for a program, by file rw-by-pid system-wide r/w activity rwtop [interval] system-wide r/w top [root@ssdandy ~]# They will all have a 'perf record' phase that will collect some tracepoints into a perf.data file, then some perl/python scripts to handle the events and get to some conclusion or generate some tables, etc. Haven't tested these in a while, wouldn't be surprised if some of them bitrotted. - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
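For reference on the frame-pointer point: "fp" callchains break when binaries are built with -fomit-frame-pointer (the default at -O2 on x86-64 gcc); building with -fno-omit-frame-pointer restores them, and DWARF unwinding is the workaround when rebuilding is not an option. A sketch of both record modes, assuming the sleep test from the start of the thread:

~~~~~~~~~~~~~~
# Cheap frame-pointer callchains - only reliable if the whole stack of
# binaries and libraries keeps the frame pointer:
gcc -O2 -g -fno-omit-frame-pointer test.c -o a.out
perf record --call-graph fp ./a.out

# DWARF callchains - work with frame-pointer-less builds, at the cost of
# copying a chunk of stack per sample and unwinding it post-mortem:
perf record --call-graph dwarf,16384 ./a.out
~~~~~~~~~~~~~~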
* Re: Perf event for Wall-time based sampling? 2014-09-18 15:57 ` Arnaldo Carvalho de Melo @ 2014-09-18 16:37 ` Milian Wolff 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-18 16:37 UTC (permalink / raw) To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users, Namhyung Kim, Ingo Molnar On Thursday 18 September 2014 12:57:45 Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: > > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: <snip> > > b) The callgraphs are really strange, imo. Different traces are printed > > with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% > > traces in sched:sched_wakeup. > > Try using --no-children in the 'report' command line. Nice, this is very useful. Many thanks! > > c) Most of the traces point into the kernel, how can I hide these traces > > and only concentrate on the user-space? Do I have to grep manually for > > [.] ? I > Oh well, for userspace you need to be aware of how callchains are > collected, i.e. if your binaries and libraries use > -fno-omit-frame-pointer, because if they do you will not get callchains > going into userspace, so you will need to specifically ask for 'DWARF' > callchains, from 'perf record' documentation: I'm actually aware of that and I did add that option to my initial record call, sorry for not being clear here. <snip> > This has to be made automated, i.e. the tooling needs to figure out that > the binaries used do use %bp for optimization and automagically collect > DWARF, but till then, one needs to know about such issues and deal with > it. That would indeed be very welcome. There are multiple "defaults" in perf which I find highly confusing. The --no-children above e.g. could/should probably be the default, no? Similar, I find it extremely irritating that `perf report -g` defaults to `-g fractal` and not `-g graph`. 100% foo 70% bar 70% asdf 30% lalala 30% baz is much harder to interpret than 100% foo 70% bar 49% asdf 21% lalala 30% baz especially for more involved call chains. It took me quite some time to become aware of the ability to pass `-g graph` to get the desired output. KCacheGrind e.g. also defaults to something similar to `-g graph` and only optionally allows the user to get the "relative to parent" cost of `-g fractal`. > User space support is something that as you see, is still rough, we need > people like you trying it, but while it is rough, people tend to avoid > it... :-\ Yes. But already perf is extremely useful and I use it a lot. I'm also actively educating people about using it more. I've talked about it at last year's Akademy and Qt Developer Days, and again this year at a profiling workshop at Akademy. Please keep up the good work! > > tried something like `perf report --parent "main"` but that makes no > > difference. > > > > > I would recommend that you take a look at Brendan Greggs _excellent_ > > > tutorials at: > > > > > > http://www.brendangregg.com/perf.html > > > > > > He will explain all this in way more detail than I briefly skimmed > > > above. :-) > > > > I did that already, but Brendan and the other available Perf documentation > > mostly concentrates on performance issues in the Kernel. I'm interested > > purely in the user space. Perf record with one of the hardware PMU events > > works nicely in that case, but one cannot use it to find locks&waits > > similar to what VTune offers. 
> > Humm, yeah, you need to figure out how to solve your issue, what I tried > was to show what kinds of building blocks you could use to build what > you need, but no, there is no ready to use tool for this, that I am > aware of. > > For instance, you need to collect scheduler events, then do some > scripting, perhaps using perl or python, perhaps using the scripting > support that is built into perf already, but yeah, not documented. And also lacking the ability to get callgraphs, if I'm not mistaken. This is crucial for my undertaking. Or has this been added in the meantime? <snip> This was also why I asked my initial question, which I want to repeat once more: Is there a technical reason to not offer a "timer" software event to perf? I'm a complete layman when it comes to Kernel internals, but from a user point of view this would be awesome: perf record --call-graph dwarf -e sw-timer -F 100 someapplication This command would then create a timer in the kernel with a 100Hz frequency. Whenever it fires, the callgraphs of all threads in $someapplication are sampled and written to perf.data. Is this technically not feasible? Or is it simply not implemented? I'm experimenting with a libunwind based profiler, and with some ugly signal hackery I can now grab backtraces by sending my application SIGUSR1. Based on that, I can probably create a profiling tool that fits my needs. I just wonder why one cannot do the same with perf. Thanks for your time -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
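Until something like the proposed sw-timer event exists, one hedged approximation is to record an on-CPU clock event together with the context-switch tracepoint in a single session: task-clock covers the running phases, and every switch-out carries a callchain showing where the thread went to sleep. It is not a fixed-frequency wall-clock sampler (a long sleep still contributes only one sample), but it does combine the on-CPU and the blocking views:

~~~~~~~~~~~~~~
# task-clock samples while running; sched_switch fires (period 1) on every
# context switch, so each blocking point gets at least one callchain:
perf record --call-graph dwarf -e task-clock -e sched:sched_switch ./a.out
perf report -g graph --stdio
~~~~~~~~~~~~~~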
* Re: Perf event for Wall-time based sampling? 2014-09-18 16:37 ` Milian Wolff @ 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 2014-09-18 19:31 ` Arnaldo Carvalho de Melo ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 19:17 UTC (permalink / raw) To: Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 12:57:45 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: > > > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > <snip> > > > b) The callgraphs are really strange, imo. Different traces are printed > > > with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% > > > traces in sched:sched_wakeup. > > Try using --no-children in the 'report' command line. > Nice, this is very useful. Many thanks! npo > > > c) Most of the traces point into the kernel, how can I hide these traces > > > and only concentrate on the user-space? Do I have to grep manually for > > > [.] ? I > > Oh well, for userspace you need to be aware of how callchains are > > collected, i.e. if your binaries and libraries use > > -fno-omit-frame-pointer, because if they do you will not get callchains > > going into userspace, so you will need to specifically ask for 'DWARF' > > callchains, from 'perf record' documentation: > I'm actually aware of that and I did add that option to my initial record > call, sorry for not being clear here. > <snip> > > This has to be made automated, i.e. the tooling needs to figure out that > > the binaries used do use %bp for optimization and automagically collect > > DWARF, but till then, one needs to know about such issues and deal with > > it. > > That would indeed be very welcome. There are multiple "defaults" in perf which > I find highly confusing. The --no-children above e.g. could/should probably be > the default, no? Similar, I find it extremely irritating that `perf report -g` It was, this is something we've actually been discussing recently: the change that made --children be the default mode. That is why I added Namhyung and Ingo to the CC list, so that they become aware of more reaction to this change. > defaults to `-g fractal` and not `-g graph`. > > 100% foo > 70% bar > 70% asdf > 30% lalala > 30% baz > > is much harder to interpret than > > 100% foo > 70% bar > 49% asdf > 21% lalala > 30% baz But the question then is if this is configurable, if not that would be a first step, i.e. making this possible via some ~/.perfconfig change. Later we could advocate changing the default. Or perhaps provide some "skins", i.e. config files that could be sourced into ~/.perfconfig so that perf mimics the decisions of other profilers, with which people are used to. Kinda like making mutt behave like pine (as I did a long time ago), even if just for a while, till one gets used to the "superior" default way of doing things of the new tool :-) > especially for more involved call chains. It took me quite some time to become > aware of the ability to pass `-g graph` to get the desired output. KCacheGrind > e.g. also defaults to something similar to `-g graph` and only optionally > allows the user to get the "relative to parent" cost of `-g fractal`. > > > User space support is something that as you see, is still rough, we need > > people like you trying it, but while it is rough, people tend to avoid > > it... :-\ > > Yes. 
But already perf is extremely useful and I use it a lot. I'm also > actively educating people about using it more. I've talked about it at last > year's Akademy and Qt Developer Days, and again this year at a profiling > workshop at Akademy. Please keep up the good work! Thanks a lot for doing that! > > > tried something like `perf report --parent "main"` but that makes no > > > difference. > > > > I would recommend that you take a look at Brendan Greggs _excellent_ > > > > tutorials at: > > > > http://www.brendangregg.com/perf.html > > > > He will explain all this in way more detail than I briefly skimmed > > > > above. :-) > > > I did that already, but Brendan and the other available Perf documentation > > > mostly concentrates on performance issues in the Kernel. I'm interested > > > purely in the user space. Perf record with one of the hardware PMU events > > > works nicely in that case, but one cannot use it to find locks&waits > > > similar to what VTune offers. > > Humm, yeah, you need to figure out how to solve your issue, what I tried > > was to show what kinds of building blocks you could use to build what > > you need, but no, there is no ready to use tool for this, that I am > > aware of. > > For instance, you need to collect scheduler events, then do some > > scripting, perhaps using perl or python, perhaps using the scripting > > support that is built into perf already, but yeah, not documented. > And also lacking the ability to get callgraphs, if I'm not mistaken. This is > crucial for my undertaking. Or has this been added in the meantime? I guess it was: commit 57608cfd8827a74237d264a197722e2c99f72da4 Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> Date: Thu Jul 10 13:50:56 2014 +0200 perf script: Provide additional sample information on generic events To python scripts, including pid, tid, and cpu for which the event was recorded. At the moment, the pointer to the sample struct is passed to scripts, which seems to be of little use. The patch puts this information in dictionaries for easy access by Python scripts. commit 0f5f5bcd112292f14b75750dde7461463bb1c7bb Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> Date: Thu Jul 10 13:50:51 2014 +0200 perf script: Add callchain to generic and tracepoint events This provides valuable information for tracing performance problems. Since this change alters the interface for the python scripts, also adjust the script generation and the provided scripts. > <snip> > > This was also why I asked my initial question, which I want to repeat once > more: Is there a technical reason to not offer a "timer" software event to > perf? I'm a complete layman when it comes to Kernel internals, but from a user > point of view this would be awesome: > perf record --call-graph dwarf -e sw-timer -F 100 someapplication > This command would then create a timer in the kernel with a 100Hz frequency. > Whenever it fires, the callgraphs of all threads in $someapplication are > sampled and written to perf.data. Is this technically not feasible? Or is it > simply not implemented? > I'm experimenting with a libunwind based profiler, and with some ugly signal > hackery I can now grab backtraces by sending my application SIGUSR1. Based on Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to your app with the frequency you want, and then hook a 'perf probe' into your signal... /me tries some stuff, will get back with results... > that, I can probably create a profiling tool that fits my needs. 
I just wonder > why one cannot do the same with perf. - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
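With those two commits in place, the built-in script generator gives a quick way to post-process tracepoint samples plus their callchains without writing a script from scratch. A sketch of the workflow (the generated handler names and arguments come from perf itself):

~~~~~~~~~~~~~~
perf record -g -e sched:sched_switch ./a.out
perf script -g python      # writes perf-script.py with one handler per recorded event
# edit perf-script.py: with the patches above each handler also receives the callchain
perf script -s perf-script.py
~~~~~~~~~~~~~~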
* Re: Perf event for Wall-time based sampling? 2014-09-18 19:17 ` Arnaldo Carvalho de Melo @ 2014-09-18 19:31 ` Arnaldo Carvalho de Melo 2014-09-18 20:17 ` David Ahern 2014-09-19 5:59 ` Namhyung Kim 2 siblings, 0 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 19:31 UTC (permalink / raw) To: Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Thu, Sep 18, 2014 at 04:17:13PM -0300, Arnaldo Carvalho de Melo escreveu: > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > > This was also why I asked my initial question, which I want to repeat once > > more: Is there a technical reason to not offer a "timer" software event to > > perf? I'm a complete layman when it comes to Kernel internals, but from a user > > point of view this would be awesome: > > > perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > This command would then create a timer in the kernel with a 100Hz frequency. > > Whenever it fires, the callgraphs of all threads in $someapplication are > > sampled and written to perf.data. Is this technically not feasible? Or is it > > simply not implemented? What you want is a sysrq-t that goes all the way to userspace, right? Try: echo t > /proc/sysrq dmesg i.e. probably you want that when a thread goes to sleep, we can take a sample with callchains, no? With a 'ping 127.0.0.1' running, I get this for its process: [19255.244315] ping S ffff88043e213100 0 6754 6548 0x00000080 [19255.244316] ffff880231693a80 0000000000000082 ffff88040ffb1940 0000000000013100 [19255.244317] ffff880231693fd8 0000000000013100 ffff880423c88000 ffff88040ffb1940 [19255.244318] ffffffff81efb1c0 ffff880231693ab0 00000001012131d1 ffffffff81efb1c0 [19255.244319] Call Trace: [19255.244321] [<ffffffff816b93a9>] schedule+0x29/0x70 [19255.244322] [<ffffffff816bbf91>] schedule_timeout+0x151/0x270 [19255.244325] [<ffffffff815ac606>] __skb_recv_datagram+0x426/0x4d0 [19255.244328] [<ffffffff815ac6e2>] skb_recv_datagram+0x32/0x40 [19255.244330] [<ffffffff8161a1ad>] raw_recvmsg+0x7d/0x1b0 [19255.244331] [<ffffffff8162961c>] inet_recvmsg+0x6c/0x80 [19255.244333] [<ffffffff8159d42a>] sock_recvmsg+0x9a/0xd0 [19255.244345] [<ffffffff8159e1a2>] __sys_recvmsg+0x42/0x80 [19255.244346] [<ffffffff8159e1f2>] SyS_recvmsg+0x12/0x20 [19255.244348] [<ffffffff816bd412>] system_call_fastpath+0x16/0x1b No? Trying to collect dwarf callchains after installing iputils-debuginfo (where /bin/ping symtab is)... > > I'm experimenting with a libunwind based profiler, and with some ugly signal > > hackery I can now grab backtraces by sending my application SIGUSR1. Based on > > Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > your app with the frequency you want, and then hook a 'perf probe' into > your signal... /me tries some stuff, will get back with results... > > > that, I can probably create a profiling tool that fits my needs. I just wonder > > why one cannot do the same with perf. > > - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
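For anyone trying the sysrq route: the trigger file is /proc/sysrq-trigger, and the 't' command dumps the kernel stack of every task to the kernel log - it never crosses into userspace, which is exactly the gap the rest of this thread is about. A sketch:

~~~~~~~~~~~~~~
echo 1 > /proc/sys/kernel/sysrq     # enable sysrq if it is restricted
echo t > /proc/sysrq-trigger        # dump all tasks' kernel stacks
dmesg | tail -n 200                 # the traces land in the kernel log
~~~~~~~~~~~~~~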
* Re: Perf event for Wall-time based sampling? 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 2014-09-18 19:31 ` Arnaldo Carvalho de Melo @ 2014-09-18 20:17 ` David Ahern 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-19 5:59 ` Namhyung Kim 2 siblings, 1 reply; 25+ messages in thread From: David Ahern @ 2014-09-18 20:17 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: >> This was also why I asked my initial question, which I want to repeat once >> >more: Is there a technical reason to not offer a "timer" software event to >> >perf? I'm a complete layman when it comes to Kernel internals, but from a user >> >point of view this would be awesome: > >> >perf record --call-graph dwarf -e sw-timer -F 100 someapplication > >> >This command would then create a timer in the kernel with a 100Hz frequency. >> >Whenever it fires, the callgraphs of all threads in $someapplication are >> >sampled and written to perf.data. Is this technically not feasible? Or is it >> >simply not implemented? >> >I'm experimenting with a libunwind based profiler, and with some ugly signal >> >hackery I can now grab backtraces by sending my application SIGUSR1. Based on > Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > your app with the frequency you want, and then hook a 'perf probe' into > your signal... /me tries some stuff, will get back with results... > Current profiling options with perf require the process to be running. What Milian want is to grab samples every timer expiration even if process is not running. Any limitations that would prevent doing this with a sw event? e.g, mimic task-clock just don't disable the timer when the task is scheduled out. David ^ permalink raw reply [flat|nested] 25+ messages in thread
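The difference David describes is easy to see with perf stat: task-clock accounts only on-CPU time, so for the ten-second sleeper from the start of the thread it reports a few milliseconds of task-clock against roughly ten seconds of elapsed time (a sketch; exact numbers will vary):

~~~~~~~~~~~~~~
perf stat -e task-clock ./a.out
# task-clock shows only the few ms spent entering and leaving sleep(),
# while "seconds time elapsed" shows the full ~10 s of wall time.
~~~~~~~~~~~~~~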
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:17 ` David Ahern @ 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-18 20:39 ` David Ahern ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-18 20:36 UTC (permalink / raw) To: David Ahern Cc: Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > >>This was also why I asked my initial question, which I want to repeat once > >>>more: Is there a technical reason to not offer a "timer" software event to > >>>perf? I'm a complete layman when it comes to Kernel internals, but from a user > >>>point of view this would be awesome: > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > >>>This command would then create a timer in the kernel with a 100Hz frequency. > >>>Whenever it fires, the callgraphs of all threads in $someapplication are > >>>sampled and written to perf.data. Is this technically not feasible? Or is it > >>>simply not implemented? > >>>I'm experimenting with a libunwind based profiler, and with some ugly signal > >>>hackery I can now grab backtraces by sending my application SIGUSR1. Based on > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > >your app with the frequency you want, and then hook a 'perf probe' into > >your signal... /me tries some stuff, will get back with results... > Current profiling options with perf require the process to be running. What Ok, so you want to see what is the wait channel and unwind the stack from there? Is that the case? I.e. again, a sysrq-t equivalent? > Milian want is to grab samples every timer expiration even if process is not > running. What for? And by "grab samples" you want to know where it is waiting for something, together with its callchain? > Any limitations that would prevent doing this with a sw event? e.g, mimic > task-clock just don't disable the timer when the task is scheduled out. I'm just trying to figure out if what people want is a complete backtrace of all threads in a process, no matter what they are doing, at some given time, i.e. at "sysrq-t" time, is that the case? - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:36 ` Arnaldo Carvalho de Melo @ 2014-09-18 20:39 ` David Ahern 2014-09-19 8:11 ` Milian Wolff 2014-09-19 14:17 ` David Ahern 2 siblings, 0 replies; 25+ messages in thread From: David Ahern @ 2014-09-18 20:39 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/18/14, 2:36 PM, Arnaldo Carvalho de Melo wrote: > I'm just trying to figure out if what people want is a complete > backtrace of all threads in a process, no matter what they are doing, at > some given time, i.e. at "sysrq-t" time, is that the case? I think so. That's what poor-man's profiler is doing -- attach gdb every N seconds, dump stack. In perf terminology generate a sample and include the callchain. David ^ permalink raw reply [flat|nested] 25+ messages in thread
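For reference, the gdb-based poor-man's profiler David mentions is essentially this loop (a sketch; the target name is an example, and every attach briefly stops the process):

~~~~~~~~~~~~~~
pid=$(pidof someapp)                # example target
for i in $(seq 1 100); do
    gdb -batch -p "$pid" \
        -ex "set pagination off" \
        -ex "thread apply all bt"
    sleep 0.1
done > stacks.txt
# Post-process stacks.txt by counting identical (partial) stacks.
~~~~~~~~~~~~~~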
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-18 20:39 ` David Ahern @ 2014-09-19 8:11 ` Milian Wolff 2014-09-19 9:08 ` Milian Wolff 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 2014-09-19 14:17 ` David Ahern 2 siblings, 2 replies; 25+ messages in thread From: Milian Wolff @ 2014-09-19 8:11 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > >>This was also why I asked my initial question, which I want to repeat > > >>once > > >> > > >>>more: Is there a technical reason to not offer a "timer" software event > > >>>to > > >>>perf? I'm a complete layman when it comes to Kernel internals, but from > > >>>a user point of view this would be awesome: > > >>> > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > >>> > > >>>This command would then create a timer in the kernel with a 100Hz > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > >>>$someapplication are sampled and written to perf.data. Is this > > >>>technically not feasible? Or is it simply not implemented? > > >>>I'm experimenting with a libunwind based profiler, and with some ugly > > >>>signal hackery I can now grab backtraces by sending my application > > >>>SIGUSR1. Based on> > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > >your app with the frequency you want, and then hook a 'perf probe' into > > >your signal... /me tries some stuff, will get back with results... That is actually a very good idea. With the more powerful scripting abilities in perf now, that could/should do the job indeed. I'll also try this out. > > Current profiling options with perf require the process to be running. > > What > > Ok, so you want to see what is the wait channel and unwind the stack > from there? Is that the case? I.e. again, a sysrq-t equivalent? Hey again :) If I'm not mistaken, I could not yet bring my point across. The final goal is to profile both, wait time _and_ CPU time, combined! By sampling a userspace program with a constant frequency and some statistics one can get extremely useful information out of the data. I want to use perf to automate the process that Mike Dunlavey explains here: http://stackoverflow.com/a/378024/35250 So yes, I want to see the wait channel and unwind from there, if the process is waiting. Otherwise just do the normal unwinding you'd do when any other perf event occurs. > > Milian want is to grab samples every timer expiration even if process is > > not running. > > What for? And by "grab samples" you want to know where it is waiting for > something, together with its callchain? See above. If I can sample the callchain every N ms, preferrably in a per- thread manner, I can create tools which find the slowest userspace functions. This is based on "inclusive" cost, which is made up out of CPU time _and_ wait time combined. With it one will find all of the following: - CPU hotspots - IO wait time - lock contention in a multi-threaded application > > Any limitations that would prevent doing this with a sw event? e.g, mimic > > task-clock just don't disable the timer when the task is scheduled out. 
> > I'm just trying to figure out if what people want is a complete > backtrace of all threads in a process, no matter what they are doing, at > some given time, i.e. at "sysrq-t" time, is that the case? Yes, that sounds correct to me. I tried it out on my system, but the output in dmesg is far too large and I could not find the call stack information of my test application while it slept. Looking at the other backtraces I see there, I'm not sure whether sysrq-t only outputs a kernel-backtrace? Or maybe it's because I'm missing debug symbols for these applications? [167576.049113] mysqld S 0000000000000000 0 17023 1 0x00000000 [167576.049115] ffff8801222a7ad0 0000000000000082 ffff8800aa8c9460 0000000000014580 [167576.049117] ffff8801222a7fd8 0000000000014580 ffff8800aa8c9460 ffff8801222a7a18 [167576.049119] ffffffff812a2640 ffff8800bb975c40 ffff880237cd4ef0 ffff8801222a7b20 [167576.049121] Call Trace: [167576.049124] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 [167576.049126] [<ffffffff810b65b4>] ? add_wait_queue+0x44/0x50 [167576.049128] [<ffffffff8152ceb9>] schedule+0x29/0x70 [167576.049130] [<ffffffff8152c4ad>] schedule_hrtimeout_range_clock+0x3d/0x60 [167576.049132] [<ffffffff8152c4e3>] schedule_hrtimeout_range+0x13/0x20 [167576.049134] [<ffffffff811d5589>] poll_schedule_timeout+0x49/0x70 [167576.049136] [<ffffffff811d6d5e>] do_sys_poll+0x4be/0x570 [167576.049138] [<ffffffff81155880>] ? __alloc_pages_nodemask+0x180/0xc20 [167576.049140] [<ffffffff8108e3ae>] ? alloc_pid+0x2e/0x4c0 [167576.049142] [<ffffffff811d5700>] ? poll_select_copy_remaining+0x150/0x150 [167576.049145] [<ffffffff810bd1f0>] ? cpuacct_charge+0x50/0x60 [167576.049147] [<ffffffff811a5de3>] ? kmem_cache_alloc+0x143/0x170 [167576.049149] [<ffffffff8108e3ae>] ? alloc_pid+0x2e/0x4c0 [167576.049151] [<ffffffff810a7d08>] ? __enqueue_entity+0x78/0x80 [167576.049153] [<ffffffff810addee>] ? enqueue_entity+0x24e/0xaa0 [167576.049155] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 [167576.049158] [<ffffffff810ae74d>] ? enqueue_task_fair+0x10d/0x5b0 [167576.049160] [<ffffffff81045c3d>] ? native_smp_send_reschedule+0x4d/0x70 [167576.049162] [<ffffffff8109e740>] ? resched_task+0xc0/0xd0 [167576.049165] [<ffffffff8109fff7>] ? check_preempt_curr+0x57/0x90 [167576.049167] [<ffffffff810a2e2b>] ? wake_up_new_task+0x11b/0x1c0 [167576.049169] [<ffffffff8106d5c6>] ? do_fork+0x136/0x3f0 [167576.049171] [<ffffffff811df235>] ? __fget_light+0x25/0x70 [167576.049173] [<ffffffff811d6f37>] SyS_poll+0x97/0x120 [167576.049176] [<ffffffff81530d69>] ? system_call_fastpath+0x16/0x1b [167576.049178] [<ffffffff81530d69>] system_call_fastpath+0x16/0x1b I'm interested in the the other side of the fence. The backtrace could/should stop at system_call_fastpath. Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
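For the wait-time half on its own, perf sched already aggregates scheduling delay and run time per task from the same sched tracepoints. It does not give userspace callchains, but it is a useful cross-check while the combined wall-clock view discussed here does not exist (a sketch):

~~~~~~~~~~~~~~
perf sched record -- ./a.out        # records the sched:* tracepoints
perf sched latency                  # per-task runtime, switch count, max delay
~~~~~~~~~~~~~~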
* Re: Perf event for Wall-time based sampling? 2014-09-19 8:11 ` Milian Wolff @ 2014-09-19 9:08 ` Milian Wolff 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 25+ messages in thread From: Milian Wolff @ 2014-09-19 9:08 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 10:11:21 Milian Wolff wrote: > On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > > >>This was also why I asked my initial question, which I want to repeat > > > >>once > > > >> > > > >>>more: Is there a technical reason to not offer a "timer" software > > > >>>event > > > >>>to > > > >>>perf? I'm a complete layman when it comes to Kernel internals, but > > > >>>from > > > >>>a user point of view this would be awesome: > > > >>> > > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > >>> > > > >>>This command would then create a timer in the kernel with a 100Hz > > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > > >>>$someapplication are sampled and written to perf.data. Is this > > > >>>technically not feasible? Or is it simply not implemented? > > > >>>I'm experimenting with a libunwind based profiler, and with some ugly > > > >>>signal hackery I can now grab backtraces by sending my application > > > >>>SIGUSR1. Based on> > > > > > > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > > >your app with the frequency you want, and then hook a 'perf probe' into > > > >your signal... /me tries some stuff, will get back with results... > > That is actually a very good idea. With the more powerful scripting > abilities in perf now, that could/should do the job indeed. I'll also try > this out. Indeed, this does not seem to work as intended: ~~~~~~~~~~~~~~~~~~ #include <csignal> extern "C" { void signalHandler(int sig) { } } struct Pace { Pace() { signal(SIGUSR1, &signalHandler); } }; static Pace initializeLibPace; ~~~~~~~~~~~~~~~~~~~ # as user, run: g++ -shared -fPIC -o libpace.so libpace.cpp LD_PRELOAD=$(readlink -f libpace.so) somebinary while true; do killall -s SIGUSR1 somebinary; done # as root, run: perf probe -x libpace.so signalHandler perf record --call-graph dwarf -e probe_libpace:signalHandler -p $(pidof yourApp) perf report -g graph --no-children Nice! But there are multiple issues with this approach, compared to potentially offering it in perf directly: 1) you need root/elevated access to not only create the probe, but also to use it then during record. this also requires you to to attach to the debuggee, as you don't want to run most user-space apps as root. while runtime-attaching is a powerful feature, it is cumbersome to use for the simple case 2) all of the issues of signal handling. Try the above on my initial test application: int main() { sleep(10); return 0; } After sending it the first SIGUSR1, the sleep will return. So we change the behavior of the application by profiling it, which is of course very bad. Sure, most of the time such sleeps are run in a loop, but even then one will influence the runtime behavior. And you will never find the spurious "sleep(100)" that you forgot in a debug session in your application... 3) What about multiple threads? 
One could potentially get it working by overwriting pthread_create/exit via LD_PRELOAD and propagating a custom signal via pthread_kill to all running threads when the external SIGUSR1 comes in. So, I hope this shows you all that having such a feature in perf directly would be a good thing. -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
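One small refinement to the driver loop above: the unbounded `while true; do killall ...; done` delivers signals as fast as the shell can spin, which by itself perturbs the target. Pacing it gives the fixed sampling frequency the whole idea is after (a sketch, roughly 100 Hz; "somebinary" is the example target from the mail above):

~~~~~~~~~~~~~~
pid=$(pidof somebinary)
while sleep 0.01; do                # ~100 signals per second
    kill -USR1 "$pid" || break      # stop once the target exits
done
~~~~~~~~~~~~~~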
* Re: Perf event for Wall-time based sampling? 2014-09-19 8:11 ` Milian Wolff 2014-09-19 9:08 ` Milian Wolff @ 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 2014-09-19 15:04 ` David Ahern 2014-09-19 15:05 ` Milian Wolff 1 sibling, 2 replies; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-19 14:47 UTC (permalink / raw) To: Milian Wolff Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart Em Fri, Sep 19, 2014 at 10:11:21AM +0200, Milian Wolff escreveu: > On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > > >>This was also why I asked my initial question, which I want to repeat > > > >>once > > > >> > > > >>>more: Is there a technical reason to not offer a "timer" software event > > > >>>to > > > >>>perf? I'm a complete layman when it comes to Kernel internals, but from > > > >>>a user point of view this would be awesome: > > > >>> > > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > >>> > > > >>>This command would then create a timer in the kernel with a 100Hz > > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > > >>>$someapplication are sampled and written to perf.data. Is this > > > >>>technically not feasible? Or is it simply not implemented? > > > >>>I'm experimenting with a libunwind based profiler, and with some ugly > > > >>>signal hackery I can now grab backtraces by sending my application > > > >>>SIGUSR1. Based on> > > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > > >your app with the frequency you want, and then hook a 'perf probe' into > > > >your signal... /me tries some stuff, will get back with results... > > That is actually a very good idea. With the more powerful scripting abilities > in perf now, that could/should do the job indeed. I'll also try this out. Now that the need for getting a backtrace from existing threads, a-la using ptrace via gdb to attach to it and then traverse its stack to provide that backtrace, I think we need to do something on the perf infrastructure in the kernel to do that, i.e. somehow signal the perf kernel part that we want a backtrace for some specific thread. Not at event time, but at some arbitrary time, be it creating an event that, as you suggested, will create a timer and then when that timer fires will use (parts of the) mechanism used by ptrace. But in the end we need a mechanism to ask for backtraces for existing, sleeping, threads. For waits in the future, we just need to ask for the tracepoints where waits take place and with the current infrastructure we can get most/all of what we need, no? > > > Current profiling options with perf require the process to be running. > > > What > > Ok, so you want to see what is the wait channel and unwind the stack > > from there? Is that the case? I.e. again, a sysrq-t equivalent? > Hey again :) > If I'm not mistaken, I could not yet bring my point across. The final goal is Right, but the discussion so far was to see if the existing kernel infrastructure would allow us to write such a tool. > to profile both, wait time _and_ CPU time, combined! By sampling a userspace > program with a constant frequency and some statistics one can get extremely > useful information out of the data. I want to use perf to automate the process > that Mike Dunlavey explains here: http://stackoverflow.com/a/378024/35250 Will read. 
> So yes, I want to see the wait channel and unwind from there, if the process > is waiting. Otherwise just do the normal unwinding you'd do when any other > perf event occurs. Ok, we need two mechanisms to get that, one for existing, sleeping threads at the time we start monitoring (we don't have that now other than freezing the process, looking for threads that were waiting, then using ptrace and asking for that backtrace), and another for threads that will sleep/wait _after_ we start monitoring. > > > Milian want is to grab samples every timer expiration even if process is > > > not running. > > > > What for? And by "grab samples" you want to know where it is waiting for > > something, together with its callchain? > > See above. If I can sample the callchain every N ms, preferrably in a per- Do you need to take periodic samples of the callchain? Or only when you wait for something? For things that happen at such a high freq, yeah, just sampling would be better, but you're thinking about slow things, so multiple samples for something waiting and waiting is not needed, just when it starts waiting, right? > thread manner, I can create tools which find the slowest userspace functions. > This is based on "inclusive" cost, which is made up out of CPU time _and_ wait > time combined. With it one will find all of the following: > > - CPU hotspots > - IO wait time > - lock contention in a multi-threaded application > > > Any limitations that would prevent doing this with a sw event? e.g, mimic > > > task-clock just don't disable the timer when the task is scheduled out. > > I'm just trying to figure out if what people want is a complete > > backtrace of all threads in a process, no matter what they are doing, at > > some given time, i.e. at "sysrq-t" time, is that the case? > Yes, that sounds correct to me. I tried it out on my system, but the output in > dmesg is far too large and I could not find the call stack information of my > test application while it slept. Looking at the other backtraces I see there, > I'm not sure whether sysrq-t only outputs a kernel-backtrace? Or maybe it's sysrq-t is just for the kernel, that is why I said that you seemed to want it to cross into userspace. > because I'm missing debug symbols for these applications? > > [167576.049113] mysqld S 0000000000000000 0 17023 1 > 0x00000000 > [167576.049115] ffff8801222a7ad0 0000000000000082 ffff8800aa8c9460 > 0000000000014580 > [167576.049117] ffff8801222a7fd8 0000000000014580 ffff8800aa8c9460 > ffff8801222a7a18 > [167576.049119] ffffffff812a2640 ffff8800bb975c40 ffff880237cd4ef0 > ffff8801222a7b20 > [167576.049121] Call Trace: > [167576.049124] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 > [167576.049126] [<ffffffff810b65b4>] ? add_wait_queue+0x44/0x50 > [167576.049128] [<ffffffff8152ceb9>] schedule+0x29/0x70 > [167576.049130] [<ffffffff8152c4ad>] schedule_hrtimeout_range_clock+0x3d/0x60 > [167576.049132] [<ffffffff8152c4e3>] schedule_hrtimeout_range+0x13/0x20 > [167576.049134] [<ffffffff811d5589>] poll_schedule_timeout+0x49/0x70 > [167576.049136] [<ffffffff811d6d5e>] do_sys_poll+0x4be/0x570 > [167576.049138] [<ffffffff81155880>] ? __alloc_pages_nodemask+0x180/0xc20 > [167576.049140] [<ffffffff8108e3ae>] ? alloc_pid+0x2e/0x4c0 > [167576.049142] [<ffffffff811d5700>] ? poll_select_copy_remaining+0x150/0x150 > [167576.049145] [<ffffffff810bd1f0>] ? cpuacct_charge+0x50/0x60 > [167576.049147] [<ffffffff811a5de3>] ? kmem_cache_alloc+0x143/0x170 > [167576.049149] [<ffffffff8108e3ae>] ? 
alloc_pid+0x2e/0x4c0 > [167576.049151] [<ffffffff810a7d08>] ? __enqueue_entity+0x78/0x80 > [167576.049153] [<ffffffff810addee>] ? enqueue_entity+0x24e/0xaa0 > [167576.049155] [<ffffffff812a2640>] ? cpumask_next_and+0x30/0x50 > [167576.049158] [<ffffffff810ae74d>] ? enqueue_task_fair+0x10d/0x5b0 > [167576.049160] [<ffffffff81045c3d>] ? native_smp_send_reschedule+0x4d/0x70 > [167576.049162] [<ffffffff8109e740>] ? resched_task+0xc0/0xd0 > [167576.049165] [<ffffffff8109fff7>] ? check_preempt_curr+0x57/0x90 > [167576.049167] [<ffffffff810a2e2b>] ? wake_up_new_task+0x11b/0x1c0 > [167576.049169] [<ffffffff8106d5c6>] ? do_fork+0x136/0x3f0 > [167576.049171] [<ffffffff811df235>] ? __fget_light+0x25/0x70 > [167576.049173] [<ffffffff811d6f37>] SyS_poll+0x97/0x120 > [167576.049176] [<ffffffff81530d69>] ? system_call_fastpath+0x16/0x1b > [167576.049178] [<ffffffff81530d69>] system_call_fastpath+0x16/0x1b > > I'm interested in the the other side of the fence. The backtrace could/should > stop at system_call_fastpath. Right, adding callchains to 'perf trace' and using it with --duration may provide a first approximation for threads that will start waiting after 'perf trace' starts, I guess. - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
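[For reference, the GDB-based poor-man's-profiler approach that this thread wants perf to automate can be scripted directly; a minimal sketch, assuming gdb is installed and is allowed to ptrace the target, with aggregation of the dumped stacks left out:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Wall-clock sampler in the spirit of the "poor man's profiler": at a fixed
# rate, attach gdb to the target and dump the backtraces of *all* threads,
# whether they are on-CPU or blocked in poll/futex/nanosleep.
import subprocess
import sys
import time

def sample(pid):
    # "thread apply all bt" prints one backtrace per thread, including
    # sleeping ones -- exactly the per-sample view discussed in this thread.
    return subprocess.check_output(
        ["gdb", "-batch", "-ex", "set pagination 0",
         "-ex", "thread apply all bt", "-p", str(pid)],
        stderr=subprocess.STDOUT)

def main():
    pid = int(sys.argv[1])
    freq = float(sys.argv[2]) if len(sys.argv) > 2 else 10.0  # Hz
    while True:
        try:
            out = sample(pid)
        except subprocess.CalledProcessError:
            break  # target exited or attach failed
        sys.stdout.write(out.decode("utf-8", "replace"))
        sys.stdout.write("\n--- sample ---\n")
        time.sleep(1.0 / freq)

if __name__ == "__main__":
    main()
~~~~~~~~~~~~~~

[Attaching gdb is far too heavy for anything like -F 100 and it stops the target for every sample, which is precisely why a native timer-driven event that can record callchains of sleeping threads from inside the kernel is attractive.]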
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:47 ` Arnaldo Carvalho de Melo @ 2014-09-19 15:04 ` David Ahern 2014-09-19 15:05 ` Milian Wolff 1 sibling, 0 replies; 25+ messages in thread From: David Ahern @ 2014-09-19 15:04 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Milian Wolff Cc: linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/19/14, 8:47 AM, Arnaldo Carvalho de Melo wrote: > Now that the need for getting a backtrace from existing threads, a-la > using ptrace via gdb to attach to it and then traverse its stack to > provide that backtrace, I think we need to do something on the perf > infrastructure in the kernel to do that, i.e. somehow signal the perf > kernel part that we want a backtrace for some specific thread. > > Not at event time, but at some arbitrary time, be it creating an event > that, as you suggested, will create a timer and then when that timer > fires will use (parts of the) mechanism used by ptrace. > > But in the end we need a mechanism to ask for backtraces for existing, > sleeping, threads. But perf already has that information -- stacktraces on a sched_switch. We already know where the task is when it is sleeping. > Ok, we need two mechanisms to get that, one for existing, sleeping > threads at the time we start monitoring (we don't have that now other > than freezing the process, looking for threads that were waiting, then > using ptrace and asking for that backtrace), and another for threads > that will sleep/wait_after_ we start monitoring. That's a missing piece -- callchains for sleeping threads when perf starts. >> >I'm interested in the the other side of the fence. The backtrace could/should >> >stop at system_call_fastpath. We talked about this last year - options to control stack depth obtained on a record. Most flexibility is to allow controls of user space stack and kernel. Here someone could set kstack depth to 0 to get only userspace stack. > Right, adding callchains to 'perf trace' and using it with --duration > may provide a first approximation for threads that will start waiting > after 'perf trace' starts, I guess. Talked about this one too. Integration of perf-script with perf-trace. So many features, so little time. David ^ permalink raw reply [flat|nested] 25+ messages in thread
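[The "where are the already-sleeping threads right now" snapshot can at least be approximated from procfs today, though only for the kernel side; a minimal sketch, assuming the kernel exposes /proc/<pid>/task/<tid>/stack (CONFIG_STACKTRACE) and the caller has the required privileges:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Dump the kernel-side stack of every thread of a process. This shows the
# wait channel and kernel frames of sleeping threads, but not the user-space
# part -- that is the piece that still needs ptrace/gdb or new perf kernel
# support, as discussed above.
import glob
import sys

def dump_kernel_stacks(pid):
    for path in sorted(glob.glob("/proc/%d/task/*/stack" % pid)):
        tid = path.split("/")[4]
        try:
            with open(path) as f:
                stack = f.read()
        except (IOError, OSError):
            continue  # thread exited or permission denied
        sys.stdout.write("tid %s:\n%s\n" % (tid, stack))

if __name__ == "__main__":
    dump_kernel_stacks(int(sys.argv[1]))
~~~~~~~~~~~~~~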
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:47 ` Arnaldo Carvalho de Melo 2014-09-19 15:04 ` David Ahern @ 2014-09-19 15:05 ` Milian Wolff 1 sibling, 0 replies; 25+ messages in thread From: Milian Wolff @ 2014-09-19 15:05 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: David Ahern, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 11:47:47 Arnaldo Carvalho de Melo wrote: > Em Fri, Sep 19, 2014 at 10:11:21AM +0200, Milian Wolff escreveu: > > On Thursday 18 September 2014 17:36:25 Arnaldo Carvalho de Melo wrote: > > > Em Thu, Sep 18, 2014 at 02:17:49PM -0600, David Ahern escreveu: > > > > On 9/18/14, 1:17 PM, Arnaldo Carvalho de Melo wrote: > > > > >>This was also why I asked my initial question, which I want to > > > > >>repeat > > > > >>once > > > > >> > > > > >>>more: Is there a technical reason to not offer a "timer" software > > > > >>>event > > > > >>>to > > > > >>>perf? I'm a complete layman when it comes to Kernel internals, but > > > > >>>from > > > > >>>a user point of view this would be awesome: > > > > >>> > > > > >>>perf record --call-graph dwarf -e sw-timer -F 100 someapplication > > > > >>> > > > > >>>This command would then create a timer in the kernel with a 100Hz > > > > >>>frequency. Whenever it fires, the callgraphs of all threads in > > > > >>>$someapplication are sampled and written to perf.data. Is this > > > > >>>technically not feasible? Or is it simply not implemented? > > > > >>>I'm experimenting with a libunwind based profiler, and with some > > > > >>>ugly > > > > >>>signal hackery I can now grab backtraces by sending my application > > > > >>>SIGUSR1. Based on> > > > > > > > > > > >Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > > > > >your app with the frequency you want, and then hook a 'perf probe' > > > > >into > > > > >your signal... /me tries some stuff, will get back with results... > > > > That is actually a very good idea. With the more powerful scripting > > abilities in perf now, that could/should do the job indeed. I'll also try > > this out. > > Now that the need for getting a backtrace from existing threads, a-la > using ptrace via gdb to attach to it and then traverse its stack to > provide that backtrace, I think we need to do something on the perf > infrastructure in the kernel to do that, i.e. somehow signal the perf > kernel part that we want a backtrace for some specific thread. > > Not at event time, but at some arbitrary time, be it creating an event > that, as you suggested, will create a timer and then when that timer > fires will use (parts of the) mechanism used by ptrace. > > But in the end we need a mechanism to ask for backtraces for existing, > sleeping, threads. Yes, such a capability would be tremendously helpful for writing profiling tools for userspace applications. It should also work for threads that are not sleeping though, or are all threads of a process frozen when a perf event fires anyways? > For waits in the future, we just need to ask for the tracepoints where > waits take place and with the current infrastructure we can get most/all > of what we need, no? See David's email, this is possible even now with the sched:* events. > > > > Current profiling options with perf require the process to be running. > > > > What > > > > > > Ok, so you want to see what is the wait channel and unwind the stack > > > from there? Is that the case? I.e. again, a sysrq-t equivalent? 
> > > > Hey again :) > > > > If I'm not mistaken, I could not yet bring my point across. The final goal > > is > Right, but the discussion so far was to see if the existing kernel > infrastructure would allow us to write such a tool. > > > to profile both, wait time _and_ CPU time, combined! By sampling a > > userspace program with a constant frequency and some statistics one can > > get extremely useful information out of the data. I want to use perf to > > automate the process that Mike Dunlavey explains here: > > http://stackoverflow.com/a/378024/35250 > Will read. > > > So yes, I want to see the wait channel and unwind from there, if the > > process is waiting. Otherwise just do the normal unwinding you'd do when > > any other perf event occurs. > > Ok, we need two mechanisms to get that, one for existing, sleeping > threads at the time we start monitoring (we don't have that now other > than freezing the process, looking for threads that were waiting, then > using ptrace and asking for that backtrace), and another for threads > that will sleep/wait _after_ we start monitoring. > > > > > Milian want is to grab samples every timer expiration even if process > > > > is > > > > not running. > > > > > > What for? And by "grab samples" you want to know where it is waiting for > > > something, together with its callchain? > > > > See above. If I can sample the callchain every N ms, preferrably in a per- > > Do you need to take periodic samples of the callchain? Or only when you > wait for something? > > For things that happen at such a high freq, yeah, just sampling would be > better, but you're thinking about slow things, so multiple samples for > something waiting and waiting is not needed, just when it starts > waiting, right? I need periodic samples. I'm not looking exclusively for wait time (that can be done already, see above). I'm looking for a generic overview of the userspace program. Where does it spent time? Without periodic samples, I cannot do any statistics. I think this should become clear when you read Mike Dunlavey's text that explains the GDB-based poor-mans-profiler (which is actually pretty helpful, despite the name). > > thread manner, I can create tools which find the slowest userspace > > functions. This is based on "inclusive" cost, which is made up out of CPU > > time _and_ wait> > > time combined. With it one will find all of the following: > > - CPU hotspots > > - IO wait time > > - lock contention in a multi-threaded application > > > > > > Any limitations that would prevent doing this with a sw event? e.g, > > > > mimic > > > > task-clock just don't disable the timer when the task is scheduled > > > > out. > > > > > > I'm just trying to figure out if what people want is a complete > > > backtrace of all threads in a process, no matter what they are doing, at > > > some given time, i.e. at "sysrq-t" time, is that the case? > > > > Yes, that sounds correct to me. I tried it out on my system, but the > > output in dmesg is far too large and I could not find the call stack > > information of my test application while it slept. Looking at the other > > backtraces I see there, I'm not sure whether sysrq-t only outputs a > > kernel-backtrace? Or maybe it's > sysrq-t is just for the kernel, that is why I said that you seemed to > want it to cross into userspace. Ah, ok. I misunderstood you. 
<snip> > Right, adding callchains to 'perf trace' and using it with --duration > may provide a first approximation for threads that will start waiting > after 'perf trace' starts, I guess. Yes, see also the other mail on that. This would be very helpful, but only partially related to the goal I have in mind here :) Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
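[The signal-driven variant discussed in this subthread is trivial to drive from the outside; a minimal sender, assuming the interesting part -- a 'perf probe' on the application's SIGUSR1 handler, or a handler that unwinds via libunwind -- is set up separately:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Deliver SIGUSR1 to the target at a fixed wall-clock rate so that whatever
# is hooked to that signal fires periodically, independent of whether the
# target is running or blocked.
import os
import signal
import sys
import time

def main():
    pid = int(sys.argv[1])
    freq = float(sys.argv[2]) if len(sys.argv) > 2 else 100.0  # Hz
    period = 1.0 / freq
    try:
        while True:
            os.kill(pid, signal.SIGUSR1)
            time.sleep(period)
    except OSError:
        pass  # target exited

if __name__ == "__main__":
    main()
~~~~~~~~~~~~~~

[Two caveats: a process-directed signal is handled by only one eligible thread, so this does not give the per-thread view asked for above, and it perturbs the target (blocked syscalls can return EINTR unless handlers use SA_RESTART) -- both arguments for a proper timer-based sampling event in perf itself.]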
* Re: Perf event for Wall-time based sampling? 2014-09-18 20:36 ` Arnaldo Carvalho de Melo 2014-09-18 20:39 ` David Ahern 2014-09-19 8:11 ` Milian Wolff @ 2014-09-19 14:17 ` David Ahern 2014-09-19 14:39 ` Milian Wolff 2 siblings, 1 reply; 25+ messages in thread From: David Ahern @ 2014-09-19 14:17 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/18/14, 2:36 PM, Arnaldo Carvalho de Melo wrote: > Ok, so you want to see what is the wait channel and unwind the stack > from there? Is that the case? I.e. again, a sysrq-t equivalent? > >> Milian want is to grab samples every timer expiration even if process is not >> running. > > What for? And by "grab samples" you want to know where it is waiting for > something, together with its callchain? > >> Any limitations that would prevent doing this with a sw event? e.g, mimic >> task-clock just don't disable the timer when the task is scheduled out. > > I'm just trying to figure out if what people want is a complete > backtrace of all threads in a process, no matter what they are doing, at > some given time, i.e. at "sysrq-t" time, is that the case? This has been discussed before: http://thread.gmane.org/gmane.linux.kernel/1307306 Example on the perf wiki: https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times David ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:17 ` David Ahern @ 2014-09-19 14:39 ` Milian Wolff 2014-09-19 14:55 ` David Ahern 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-19 14:39 UTC (permalink / raw) To: David Ahern Cc: Arnaldo Carvalho de Melo, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 08:17:47 David Ahern wrote: > On 9/18/14, 2:36 PM, Arnaldo Carvalho de Melo wrote: > > Ok, so you want to see what is the wait channel and unwind the stack > > from there? Is that the case? I.e. again, a sysrq-t equivalent? > > > >> Milian want is to grab samples every timer expiration even if process is > >> not running. > > > > What for? And by "grab samples" you want to know where it is waiting for > > something, together with its callchain? > > > >> Any limitations that would prevent doing this with a sw event? e.g, mimic > >> task-clock just don't disable the timer when the task is scheduled out. > > > > I'm just trying to figure out if what people want is a complete > > backtrace of all threads in a process, no matter what they are doing, at > > some given time, i.e. at "sysrq-t" time, is that the case? > > This has been discussed before: > http://thread.gmane.org/gmane.linux.kernel/1307306 > > Example on the perf wiki: > https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times Right, I also mentioned this in my initial email. This is _not_ what I want though :) Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:39 ` Milian Wolff @ 2014-09-19 14:55 ` David Ahern 0 siblings, 0 replies; 25+ messages in thread From: David Ahern @ 2014-09-19 14:55 UTC (permalink / raw) To: Milian Wolff Cc: Arnaldo Carvalho de Melo, linux-perf-users, Namhyung Kim, Ingo Molnar, Joseph Schuchart On 9/19/14, 8:39 AM, Milian Wolff wrote: >> This has been discussed before: >> > http://thread.gmane.org/gmane.linux.kernel/1307306 >> > >> > Example on the perf wiki: >> > https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times > Right, I also mentioned this in my initial email. This is _not_ what I want > though :) D'oh. missed that. And I did find a blurb on StackOverflow that distros are now disabling SCHEDSTATS so commands cannot rely on sched_stat tracepoints. Now I am glad I never converted my timehist command to rely on those. David ^ permalink raw reply [flat|nested] 25+ messages in thread
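[The SCHEDSTATS point is easy to check on a given machine before relying on the sched_stat_* tracepoints; a small sketch, assuming the distro ships its kernel config under /boot (other locations such as /proc/config.gz exist on some systems and are not handled here):]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Check whether the running kernel was built with CONFIG_SCHEDSTATS, which
# the sched_stat_* tracepoints used by the wiki's sleep-time recipe rely on.
import os

def schedstats_enabled():
    config = "/boot/config-%s" % os.uname()[2]  # e.g. /boot/config-3.16.2-1-ARCH
    try:
        with open(config) as f:
            return any(line.strip() == "CONFIG_SCHEDSTATS=y" for line in f)
    except (IOError, OSError):
        return None  # config not shipped here, result unknown

if __name__ == "__main__":
    print(schedstats_enabled())
~~~~~~~~~~~~~~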
* Re: Perf event for Wall-time based sampling? 2014-09-18 19:17 ` Arnaldo Carvalho de Melo 2014-09-18 19:31 ` Arnaldo Carvalho de Melo 2014-09-18 20:17 ` David Ahern @ 2014-09-19 5:59 ` Namhyung Kim 2014-09-19 14:33 ` Arnaldo Carvalho de Melo 2014-09-22 7:56 ` Namhyung Kim 2 siblings, 2 replies; 25+ messages in thread From: Namhyung Kim @ 2014-09-19 5:59 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Ingo Molnar, Joseph Schuchart Hi Arnaldo and Millan, On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: >> On Thursday 18 September 2014 12:57:45 Arnaldo Carvalho de Melo wrote: >> > Em Thu, Sep 18, 2014 at 05:26:33PM +0200, Milian Wolff escreveu: >> > > On Thursday 18 September 2014 11:51:24 Arnaldo Carvalho de Melo wrote: > >> <snip> > >> > > b) The callgraphs are really strange, imo. Different traces are printed >> > > with the same cost, which sounds wrong, no? See e.g. the multiple 44.44% >> > > traces in sched:sched_wakeup. > >> > Try using --no-children in the 'report' command line. > >> Nice, this is very useful. Many thanks! > > npo > >> > > c) Most of the traces point into the kernel, how can I hide these traces >> > > and only concentrate on the user-space? Do I have to grep manually for >> > > [.] ? I > >> > Oh well, for userspace you need to be aware of how callchains are >> > collected, i.e. if your binaries and libraries use >> > -fno-omit-frame-pointer, because if they do you will not get callchains >> > going into userspace, so you will need to specifically ask for 'DWARF' >> > callchains, from 'perf record' documentation: > >> I'm actually aware of that and I did add that option to my initial record >> call, sorry for not being clear here. > >> <snip> > >> > This has to be made automated, i.e. the tooling needs to figure out that >> > the binaries used do use %bp for optimization and automagically collect >> > DWARF, but till then, one needs to know about such issues and deal with >> > it. >> >> That would indeed be very welcome. There are multiple "defaults" in perf which >> I find highly confusing. The --no-children above e.g. could/should probably be >> the default, no? Similar, I find it extremely irritating that `perf report -g` > > It was, this is something we've actually been discussing recently: the > change that made --children be the default mode. That is why I added > Namhyung and Ingo to the CC list, so that they become aware of more > reaction to this change. Yeah, we should rethink about changing the default now. Actually I'm okay with the change, Ingo what do you think? > >> defaults to `-g fractal` and not `-g graph`. >> >> 100% foo >> 70% bar >> 70% asdf >> 30% lalala >> 30% baz >> >> is much harder to interpret than >> >> 100% foo >> 70% bar >> 49% asdf >> 21% lalala >> 30% baz I also agree with you. :) > > But the question then is if this is configurable, if not that would be a > first step, i.e. making this possible via some ~/.perfconfig change. Yes, we have record.call-graph and top.call-graph config options now so adding a new report.call-graph option should not be difficult. However I think it'd be better being call-graph.XXX as it can be applied to all other subcommands transparently. What about like below? [call-graph] mode = dwarf dump-size = 8192 print-type = fractal order = callee threshold = 0.5 print-limit = 128 sort-key = function > > Later we could advocate changing the default. Or perhaps provide some > "skins", i.e. 
config files that could be sourced into ~/.perfconfig so > that perf mimics the decisions of other profilers, with which people are > used to. > > Kinda like making mutt behave like pine (as I did a long time ago), even > if just for a while, till one gets used to the "superior" default way of > doing things of the new tool :-) > >> especially for more involved call chains. It took me quite some time to become >> aware of the ability to pass `-g graph` to get the desired output. KCacheGrind >> e.g. also defaults to something similar to `-g graph` and only optionally >> allows the user to get the "relative to parent" cost of `-g fractal`. >> >> > User space support is something that as you see, is still rough, we need >> > people like you trying it, but while it is rough, people tend to avoid >> > it... :-\ >> >> Yes. But already perf is extremely useful and I use it a lot. I'm also >> actively educating people about using it more. I've talked about it at last >> year's Akademy and Qt Developer Days, and again this year at a profiling >> workshop at Akademy. Please keep up the good work! > > Thanks a lot for doing that! > >> > > tried something like `perf report --parent "main"` but that makes no >> > > difference. > >> > > > I would recommend that you take a look at Brendan Greggs _excellent_ >> > > > tutorials at: > >> > > > http://www.brendangregg.com/perf.html > >> > > > He will explain all this in way more detail than I briefly skimmed >> > > > above. :-) > >> > > I did that already, but Brendan and the other available Perf documentation >> > > mostly concentrates on performance issues in the Kernel. I'm interested >> > > purely in the user space. Perf record with one of the hardware PMU events >> > > works nicely in that case, but one cannot use it to find locks&waits >> > > similar to what VTune offers. > >> > Humm, yeah, you need to figure out how to solve your issue, what I tried >> > was to show what kinds of building blocks you could use to build what >> > you need, but no, there is no ready to use tool for this, that I am >> > aware of. I'm also *very* interest in collecting idle/wait info using perf. Looks like we can somehow use sched:* tracepoints but it requires root privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). With that restriction however, we might improve perf sched (or even plain perf record/report) to provide such info.. David may have an idea. :) Thanks, Namhyung > >> > For instance, you need to collect scheduler events, then do some >> > scripting, perhaps using perl or python, perhaps using the scripting >> > support that is built into perf already, but yeah, not documented. > >> And also lacking the ability to get callgraphs, if I'm not mistaken. This is >> crucial for my undertaking. Or has this been added in the meantime? > > I guess it was: > > commit 57608cfd8827a74237d264a197722e2c99f72da4 > Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> > Date: Thu Jul 10 13:50:56 2014 +0200 > > perf script: Provide additional sample information on generic events > > To python scripts, including pid, tid, and cpu for which the event > was recorded. > > At the moment, the pointer to the sample struct is passed to > scripts, which seems to be of little use. > > The patch puts this information in dictionaries for easy access by > Python scripts. 
> > commit 0f5f5bcd112292f14b75750dde7461463bb1c7bb > Author: Joseph Schuchart <joseph.schuchart@tu-dresden.de> > Date: Thu Jul 10 13:50:51 2014 +0200 > > perf script: Add callchain to generic and tracepoint events > > This provides valuable information for tracing performance problems. > > Since this change alters the interface for the python scripts, also > adjust the script generation and the provided scripts. > >> <snip> >> >> This was also why I asked my initial question, which I want to repeat once >> more: Is there a technical reason to not offer a "timer" software event to >> perf? I'm a complete layman when it comes to Kernel internals, but from a user >> point of view this would be awesome: > >> perf record --call-graph dwarf -e sw-timer -F 100 someapplication > >> This command would then create a timer in the kernel with a 100Hz frequency. >> Whenever it fires, the callgraphs of all threads in $someapplication are >> sampled and written to perf.data. Is this technically not feasible? Or is it >> simply not implemented? > >> I'm experimenting with a libunwind based profiler, and with some ugly signal >> hackery I can now grab backtraces by sending my application SIGUSR1. Based on > > Humm, can't you do the same thing with perf? I.e. you send SIGUSR1 to > your app with the frequency you want, and then hook a 'perf probe' into > your signal... /me tries some stuff, will get back with results... > >> that, I can probably create a profiling tool that fits my needs. I just wonder >> why one cannot do the same with perf. > > - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 5:59 ` Namhyung Kim @ 2014-09-19 14:33 ` Arnaldo Carvalho de Melo 2014-09-19 14:53 ` Milian Wolff 2014-09-22 7:56 ` Namhyung Kim 1 sibling, 1 reply; 25+ messages in thread From: Arnaldo Carvalho de Melo @ 2014-09-19 14:33 UTC (permalink / raw) To: Namhyung Kim Cc: Milian Wolff, linux-perf-users, Ingo Molnar, Joseph Schuchart Em Fri, Sep 19, 2014 at 02:59:55PM +0900, Namhyung Kim escreveu: > Hi Arnaldo and Millan, > > On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > >> That would indeed be very welcome. There are multiple "defaults" in perf which > >> I find highly confusing. The --no-children above e.g. could/should probably be > >> the default, no? Similar, I find it extremely irritating that `perf report -g` > > It was, this is something we've actually been discussing recently: the > > change that made --children be the default mode. That is why I added > > Namhyung and Ingo to the CC list, so that they become aware of more > > reaction to this change. > Yeah, we should rethink about changing the default now. Actually I'm > okay with the change, Ingo what do you think? > >> defaults to `-g fractal` and not `-g graph`. > >> > >> 100% foo > >> 70% bar > >> 70% asdf > >> 30% lalala > >> 30% baz > >> is much harder to interpret than > >> 100% foo > >> 70% bar > >> 49% asdf > >> 21% lalala > >> 30% baz > I also agree with you. :) > > But the question then is if this is configurable, if not that would be a > > first step, i.e. making this possible via some ~/.perfconfig change. > Yes, we have record.call-graph and top.call-graph config options now so > adding a new report.call-graph option should not be difficult. However > I think it'd be better being call-graph.XXX as it can be applied to all No problem, with sourcing being supported in ~/.perfconfig, we can have as many #include call-graph.XXX as needed, multiple levels of includes and all. > other subcommands transparently. > What about like below? > [call-graph] > mode = dwarf > dump-size = 8192 > print-type = fractal > order = callee > threshold = 0.5 > print-limit = 128 > sort-key = function Milian, does this provide what you expect? How would we call this specific call-graph profile? Ideas on where to put this below tools/perf/? > > Later we could advocate changing the default. Or perhaps provide some > > "skins", i.e. config files that could be sourced into ~/.perfconfig so > > that perf mimics the decisions of other profilers, with which people are > > used to. > > Kinda like making mutt behave like pine (as I did a long time ago), even > > if just for a while, till one gets used to the "superior" default way of > > doing things of the new tool :-) > >> > > I did that already, but Brendan and the other available Perf documentation > >> > > mostly concentrates on performance issues in the Kernel. I'm interested > >> > > purely in the user space. Perf record with one of the hardware PMU events > >> > > works nicely in that case, but one cannot use it to find locks&waits > >> > > similar to what VTune offers. > >> > Humm, yeah, you need to figure out how to solve your issue, what I tried > >> > was to show what kinds of building blocks you could use to build what > >> > you need, but no, there is no ready to use tool for this, that I am > >> > aware of. > I'm also *very* interest in collecting idle/wait info using perf. 
Looks > like we can somehow use sched:* tracepoints but it requires root > privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). > With that restriction however, we might improve perf sched (or even > plain perf record/report) to provide such info.. David may have an > idea. :) So this is like, thinking just on userspace here for a change, using syscall entry as the wait time entry, then later, when we are at syscall exit time (another tracepoint) we subtract the time from syscall entry, and that is our waiting-for-the-kernel time, put a call-chain there and we can sort by that syscall time, i.e. a bit like 'perf trace' + callchains, no? I.e. if I want to see what syscalls are taking more than 50ms on my system: [root@zoo ProgramasRFB]# trace --duration 50 | head -10 132.962 (85.670 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 267.184 (134.110 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 328.303 (263.584 ms): chrome/17715 poll(ufds: 0x3ae9a95653f0, nfds: 6, timeout_msecs: 263) = 0 Timeout 380.623 (300.665 ms): xchat/17397 poll(ufds: 0x25397d0, nfds: 6, timeout_msecs: 300 ) = 0 Timeout 425.713 (250.214 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 545.482 (482.962 ms): chrome/18798 futex(uaddr: 0x7fff19d88624, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88590, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 547.938 (500.641 ms): chrome/18173 futex(uaddr: 0x7fe1c1ed796c, op: WAIT_BITSET|PRIV|CLKRT, val: 197751, utime: 0x7fe1c15b5a60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 547.969 (272.853 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 633.711 (500.637 ms): chrome/18829 futex(uaddr: 0x7fe1bdbb862c, op: WAIT_BITSET|PRIV|CLKRT, val: 89537, utime: 0x7fe1ba67ea60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 633.767 (85.697 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 [root@zoo ProgramasRFB]# But then want to filter this a bit more and exclude poll calls: [root@zoo ProgramasRFB]# trace --duration 50 -e \!poll | head -10 299.079 (221.049 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 304.830 (250.211 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 486.710 (187.180 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 555.125 (250.216 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 578.685 (500.638 ms): chrome/18829 futex(uaddr: 0x7fe1bdbb862c, op: WAIT_BITSET|PRIV|CLKRT, val: 89879, utime: 0x7fe1ba67ea60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 578.749 (91.940 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 800.123 (500.640 ms): chrome/18827 futex(uaddr: 0x7fe1b93aa7cc, op: WAIT_BITSET|PRIV|CLKRT, val: 89803, utime: 0x7fe1bd2b3a60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 800.152 (500.618 ms): chrome/18077 futex(uaddr: 0x7fe1c179e96c, op: WAIT_BITSET|PRIV|CLKRT, val: 198047, utime: 0x7fe1c1637a60, val3: 4294967295) = -1 ETIMEDOUT Connection timed out 800.165 (221.309 ms): chrome/18060 futex(uaddr: 0x7fff19d88924, op: 
WAIT_BITSET|PRIV, val: 1, utime: 0x7fff19d88890, val3: 4294967295) = 0 805.419 (250.219 ms): qemu-system-x8/17344 ioctl(fd: 18<anon_inode:kvm-vcpu>, cmd: 0xae80) = 0 [root@zoo ProgramasRFB]# format: time-since-start (syscall duration): ... - Arnaldo ^ permalink raw reply [flat|nested] 25+ messages in thread
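[The same enter/exit pairing can be prototyped outside 'perf trace', e.g. by recording raw_syscalls:sys_enter and raw_syscalls:sys_exit with callchains and post-processing 'perf script' output; a rough sketch, with the text field layout assumed (it varies between perf versions) and callchain/syscall-id decoding left out:]

~~~~~~~~~~~~~~
#!/usr/bin/env python
# Pair raw_syscalls:sys_enter / sys_exit per thread and report the longest
# in-kernel ("waiting for the kernel") intervals, roughly what
# 'perf trace --duration' prints above. Feed it with: perf script | python thisfile.py
import re
import sys

# assumed 'perf script' line shape:  comm  tid [cpu] 12345.678901: raw_syscalls:sys_enter: ...
LINE = re.compile(r"^\s*(\S+)\s+(\d+)\s+.*?(\d+\.\d+):\s+raw_syscalls:sys_(enter|exit):")

def main(min_duration_ms=50.0):
    pending = {}  # tid -> (timestamp, comm) of the unmatched sys_enter
    for line in sys.stdin:
        m = LINE.match(line)
        if not m:
            continue  # callchain lines etc. would be handled here
        comm, tid, ts, kind = m.group(1), int(m.group(2)), float(m.group(3)), m.group(4)
        if kind == "enter":
            pending[tid] = (ts, comm)
        elif tid in pending:
            start, comm = pending.pop(tid)
            ms = (ts - start) * 1000.0
            if ms >= min_duration_ms:
                print("%16s/%d waited %.3f ms in a syscall" % (comm, tid, ms))

if __name__ == "__main__":
    main()
~~~~~~~~~~~~~~

[As Milian notes further down, this still only sees time spent inside syscalls, not the wall-clock view across on-CPU and off-CPU time.]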
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:33 ` Arnaldo Carvalho de Melo @ 2014-09-19 14:53 ` Milian Wolff 2014-09-19 15:50 ` Namhyung Kim 0 siblings, 1 reply; 25+ messages in thread From: Milian Wolff @ 2014-09-19 14:53 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Namhyung Kim, linux-perf-users, Ingo Molnar, Joseph Schuchart On Friday 19 September 2014 11:33:40 Arnaldo Carvalho de Melo wrote: > Em Fri, Sep 19, 2014 at 02:59:55PM +0900, Namhyung Kim escreveu: > > Hi Arnaldo and Millan, > > > > On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > > > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > > >> That would indeed be very welcome. There are multiple "defaults" in > > >> perf which I find highly confusing. The --no-children above e.g. > > >> could/should probably be the default, no? Similar, I find it extremely > > >> irritating that `perf report -g`> > > > > It was, this is something we've actually been discussing recently: the > > > change that made --children be the default mode. That is why I added > > > Namhyung and Ingo to the CC list, so that they become aware of more > > > reaction to this change. > > > > Yeah, we should rethink about changing the default now. Actually I'm > > okay with the change, Ingo what do you think? > > > > >> defaults to `-g fractal` and not `-g graph`. > > >> > > >> 100% foo > > >> > > >> 70% bar > > >> > > >> 70% asdf > > >> 30% lalala > > >> > > >> 30% baz > > >> > > >> is much harder to interpret than > > >> > > >> 100% foo > > >> > > >> 70% bar > > >> > > >> 49% asdf > > >> 21% lalala > > >> > > >> 30% baz > > > > I also agree with you. :) > > > > > But the question then is if this is configurable, if not that would be a > > > first step, i.e. making this possible via some ~/.perfconfig change. > > > > Yes, we have record.call-graph and top.call-graph config options now so > > adding a new report.call-graph option should not be difficult. However > > I think it'd be better being call-graph.XXX as it can be applied to all > > No problem, with sourcing being supported in ~/.perfconfig, we can have > as many #include call-graph.XXX as needed, multiple levels of includes > and all. > > > other subcommands transparently. > > > > What about like below? > > > > [call-graph] > > > > mode = dwarf > > dump-size = 8192 > > print-type = fractal > > order = callee > > threshold = 0.5 > > print-limit = 128 > > sort-key = function > > Milian, does this provide what you expect? How would we call this > specific call-graph profile? print-type should be graph, not fractal. Otherwise it sounds good to me. But how would one use it? I tried putting it into ~/.perfconfig, but apparently my 3.16.2-1-ARCH Perf does not support this feature yet? How/when would that config be used? As soon as one does "perf record -g" (for mode and dump-size) or "perf report" (for a perf.data with call graphs)? That would be very useful! > Ideas on where to put this below tools/perf/? Nope. > > > Later we could advocate changing the default. Or perhaps provide some > > > "skins", i.e. config files that could be sourced into ~/.perfconfig so > > > that perf mimics the decisions of other profilers, with which people are > > > used to. 
> > > > > > Kinda like making mutt behave like pine (as I did a long time ago), even > > > if just for a while, till one gets used to the "superior" default way of > > > doing things of the new tool :-) > > > > > >> > > I did that already, but Brendan and the other available Perf > > >> > > documentation > > >> > > mostly concentrates on performance issues in the Kernel. I'm > > >> > > interested > > >> > > purely in the user space. Perf record with one of the hardware PMU > > >> > > events > > >> > > works nicely in that case, but one cannot use it to find > > >> > > locks&waits > > >> > > similar to what VTune offers. > > >> > > > >> > Humm, yeah, you need to figure out how to solve your issue, what I > > >> > tried > > >> > was to show what kinds of building blocks you could use to build what > > >> > you need, but no, there is no ready to use tool for this, that I am > > >> > aware of. > > > > I'm also *very* interest in collecting idle/wait info using perf. Looks > > like we can somehow use sched:* tracepoints but it requires root > > privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). > > > > With that restriction however, we might improve perf sched (or even > > plain perf record/report) to provide such info.. David may have an > > idea. :) > > So this is like, thinking just on userspace here for a change, using > syscall entry as the wait time entry, then later, when we are at syscall > exit time (another tracepoint) we subtract the time from syscall entry, > and that is our waiting-for-the-kernel time, put a call-chain there and > we can sort by that syscall time, i.e. a bit like 'perf trace' + > callchains, no? > > I.e. if I want to see what syscalls are taking more than 50ms on my > system: > > [root@zoo ProgramasRFB]# trace --duration 50 | head -10 <snip> > > But then want to filter this a bit more and exclude poll calls: > > [root@zoo ProgramasRFB]# trace --duration 50 -e \!poll | head -10 <snip> This sounds useful. If it now would also print backtraces... :) But note how this would not find a userspace function that does 5 syscalls, each taking 10ms. And again, just sampling syscalls is not what I'm looking for. It is definitely a good tool to have at hand, but very specific already. If you want to get an overview on where your app is spending time, having a Wall-time overview is much more helpful as it will also show CPU hotspots or stuff unrelated to syscalls. Bye -- Milian Wolff mail@milianw.de http://milianw.de ^ permalink raw reply [flat|nested] 25+ messages in thread
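[To spell out the relationship between the two call-graph print types using the numbers from this subthread: '-g fractal' shows each entry relative to its parent, '-g graph' relative to the total, so the absolute weight of a node is the product of the relative shares along its path. A toy conversion:]

~~~~~~~~~~~~~~
# Convert "fractal" (relative-to-parent) percentages into "graph" (absolute)
# percentages by multiplying along the path from the root.
# Numbers taken from the example quoted in this thread.
fractal_tree = ("foo", 1.00, [
    ("bar", 0.70, [
        ("asdf", 0.70, []),
        ("lalala", 0.30, []),
    ]),
    ("baz", 0.30, []),
])

def to_graph(node, parent_abs=1.0, depth=0):
    name, rel, children = node
    absolute = parent_abs * rel
    print("%s%5.1f%% %s" % ("  " * depth, 100.0 * absolute, name))
    for child in children:
        to_graph(child, absolute, depth + 1)

to_graph(fractal_tree)
# prints 100.0% foo / 70.0% bar / 49.0% asdf / 21.0% lalala / 30.0% baz
~~~~~~~~~~~~~~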
* Re: Perf event for Wall-time based sampling? 2014-09-19 14:53 ` Milian Wolff @ 2014-09-19 15:50 ` Namhyung Kim 0 siblings, 0 replies; 25+ messages in thread From: Namhyung Kim @ 2014-09-19 15:50 UTC (permalink / raw) To: Milian Wolff Cc: Arnaldo Carvalho de Melo, linux-perf-users, Ingo Molnar, Joseph Schuchart 2014-09-19 (금), 16:53 +0200, Milian Wolff: > On Friday 19 September 2014 11:33:40 Arnaldo Carvalho de Melo wrote: > > Em Fri, Sep 19, 2014 at 02:59:55PM +0900, Namhyung Kim escreveu: > > > Hi Arnaldo and Millan, > > > > > > On Thu, 18 Sep 2014 16:17:13 -0300, Arnaldo Carvalho de Melo wrote: > > > > Em Thu, Sep 18, 2014 at 06:37:47PM +0200, Milian Wolff escreveu: > > > >> That would indeed be very welcome. There are multiple "defaults" in > > > >> perf which I find highly confusing. The --no-children above e.g. > > > >> could/should probably be the default, no? Similar, I find it extremely > > > >> irritating that `perf report -g`> > > > > > It was, this is something we've actually been discussing recently: the > > > > change that made --children be the default mode. That is why I added > > > > Namhyung and Ingo to the CC list, so that they become aware of more > > > > reaction to this change. > > > > > > Yeah, we should rethink about changing the default now. Actually I'm > > > okay with the change, Ingo what do you think? > > > > > > >> defaults to `-g fractal` and not `-g graph`. > > > >> > > > >> 100% foo > > > >> > > > >> 70% bar > > > >> > > > >> 70% asdf > > > >> 30% lalala > > > >> > > > >> 30% baz > > > >> > > > >> is much harder to interpret than > > > >> > > > >> 100% foo > > > >> > > > >> 70% bar > > > >> > > > >> 49% asdf > > > >> 21% lalala > > > >> > > > >> 30% baz > > > > > > I also agree with you. :) > > > > > > > But the question then is if this is configurable, if not that would be a > > > > first step, i.e. making this possible via some ~/.perfconfig change. > > > > > > Yes, we have record.call-graph and top.call-graph config options now so > > > adding a new report.call-graph option should not be difficult. However > > > I think it'd be better being call-graph.XXX as it can be applied to all > > > > No problem, with sourcing being supported in ~/.perfconfig, we can have > > as many #include call-graph.XXX as needed, multiple levels of includes > > and all. > > > > > other subcommands transparently. > > > > > > What about like below? > > > > > > [call-graph] > > > > > > mode = dwarf > > > dump-size = 8192 > > > print-type = fractal > > > order = callee > > > threshold = 0.5 > > > print-limit = 128 > > > sort-key = function > > > > Milian, does this provide what you expect? How would we call this > > specific call-graph profile? > > print-type should be graph, not fractal. Otherwise it sounds good to me. But > how would one use it? I tried putting it into ~/.perfconfig, but apparently my > 3.16.2-1-ARCH Perf does not support this feature yet? How/when would that > config be used? As soon as one does "perf record -g" (for mode and dump-size) > or "perf report" (for a perf.data with call graphs)? That would be very > useful! This is a proposal for new config items which can be included into the ~/.perfconfig file. You can change it as you want later and it should apply to default behavior of perf record -g and perf report. Thanks, Namhyung ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Perf event for Wall-time based sampling? 2014-09-19 5:59 ` Namhyung Kim 2014-09-19 14:33 ` Arnaldo Carvalho de Melo @ 2014-09-22 7:56 ` Namhyung Kim 1 sibling, 0 replies; 25+ messages in thread From: Namhyung Kim @ 2014-09-22 7:56 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Milian Wolff, linux-perf-users, Ingo Molnar, Joseph Schuchart, David Ahern On Fri, 19 Sep 2014 14:59:55 +0900, Namhyung Kim wrote: > I'm also *very* interest in collecting idle/wait info using perf. Looks > like we can somehow use sched:* tracepoints but it requires root > privilege though (unless /proc/sys/kernel/perf_event_paranoid being -1). > > With that restriction however, we might improve perf sched (or even > plain perf record/report) to provide such info.. David may have an > idea. :) I wrote a naive script called tasktime that uses task-clock and sched:sched_switch events to record and report task execution and wait time. Very lightly tested only.. The idea is that it counts time between context-switch and next task-clock event as an wait time. Callchain is searched with a pattern similar to the parent pattern to find the reason of the switch. Please have a look and give some comments! :) Thanks, Namhyung From ff8497d91277c25d8c2c7688b7c056b38f799614 Mon Sep 17 00:00:00 2001 From: Namhyung Kim <namhyung@kernel.org> Date: Mon, 22 Sep 2014 16:08:30 +0900 Subject: [PATCH] perf script: Introduce a new 'tasktime' python script The tasktime script record and report task execution (and wait) time using task-clock and sched:sched_switch events. When a task goes to a context switch it searches callchain to have a certain pattern (sys_.*|.*[^b][Ll]ock|page_fault) and prints with a "wait: " prefix. If it fails to find the pattern in the callchain, it prints 'idle' or 'context-switch' depends on the next task scheduled. Below is an example run of this script. You can see 'context-switch' overhead is around 0.8% of total execution time in this workload. Note that this script is not complete and posted only for discussion. 
$ perf script record tasktime -- make clean all $ perf script report tasktime | head -20 Overhead Command Shared Object Symbol -------- ---------------- -------------------- ---------------- 2.15% cc1 cc1 ht_lookup_with_hash 2.10% cc1 cc1 _cpp_lex_direct 1.78% cc1 libc-2.17.so _int_malloc 1.44% cc1 cc1 ggc_internal_alloc_stat 1.39% cc1 cc1 htab_find_slot_with_hash 0.85% cc1 cc1 bitmap_set_bit 0.84% cc1 cc1 lex_identifier 0.82% cc1 libc-2.17.so _int_free 0.80% cc1 <kernel> context-switch 0.57% cc1 [kernel.kallsyms] __do_page_fault 0.55% cc1 cc1 cpp_get_token_1 0.54% cc1 cc1 df_note_compute 0.53% cc1 cc1 htab_find_with_hash 0.53% cc1 cc1 for_each_rtx_1 0.50% cc1 libc-2.17.so malloc_consolidate 0.48% cc1 cc1 grokdeclarator 0.46% as libc-2.17.so __strncmp_sse42 0.46% as as 4260905 Signed-off-by: Namhyung Kim <namhyung@kernel.org> --- tools/perf/scripts/python/bin/tasktime-record | 3 + tools/perf/scripts/python/bin/tasktime-report | 3 + tools/perf/scripts/python/tasktime.py | 125 ++++++++++++++++++++++++++ 3 files changed, 131 insertions(+) create mode 100644 tools/perf/scripts/python/bin/tasktime-record create mode 100644 tools/perf/scripts/python/bin/tasktime-report create mode 100644 tools/perf/scripts/python/tasktime.py diff --git a/tools/perf/scripts/python/bin/tasktime-record b/tools/perf/scripts/python/bin/tasktime-record new file mode 100644 index 000000000000..5aebcef93e55 --- /dev/null +++ b/tools/perf/scripts/python/bin/tasktime-record @@ -0,0 +1,3 @@ +#!/bin/bash +perf record -gR -e task-clock -e sched:sched_switch $@ + diff --git a/tools/perf/scripts/python/bin/tasktime-report b/tools/perf/scripts/python/bin/tasktime-report new file mode 100644 index 000000000000..a21f8dea15e7 --- /dev/null +++ b/tools/perf/scripts/python/bin/tasktime-report @@ -0,0 +1,3 @@ +#!/bin/bash +# description: analyze task execution and wait time +perf script $@ -s "$PERF_EXEC_PATH"/scripts/python/tasktime.py diff --git a/tools/perf/scripts/python/tasktime.py b/tools/perf/scripts/python/tasktime.py new file mode 100644 index 000000000000..0d69d94cb31b --- /dev/null +++ b/tools/perf/scripts/python/tasktime.py @@ -0,0 +1,125 @@ +# perf script event handlers, generated by perf script -g python +# Licensed under the terms of the GNU GPL License version 2 + +# The common_* event handler fields are the most useful fields common to +# all events. They don't necessarily correspond to the 'common_*' fields +# in the format files. Those fields not available as handler params can +# be retrieved using Python functions of the form common_*(context). +# See the perf-trace-python Documentation for the list of available functions. 
+ +import os +import sys +import re + +sys.path.append(os.environ['PERF_EXEC_PATH'] + \ + '/scripts/python/Perf-Trace-Util/lib/Perf/Trace') + +from perf_trace_context import * +from Core import * + +time_table = {} +wait_table = [] +evcount = {} +sample_count = {} +wait_pattern = '' + +total_time = 0 + +def trace_begin(): + global wait_pattern + wait_pattern = re.compile(".*[^b][Ll]ock.*|sys_.*|page_fault") + +def trace_end(): + output = [(k,time_table[k]) for k in time_table] + output.sort(key=lambda e: e[1], reverse=True) + + print " Overhead Command Shared Object Symbol" + print " -------- ---------------- -------------------- ----------------" + for entry in output: + key = entry[0].split(',') + print " %6.2f%% %-16s %-20s %s" % \ + ((100.0 * entry[1] / total_time), key[0], key[1], key[2]) + + print + print "[event stats]" + print "Total time:", total_time + print "Number of events:" + for k in evcount: + print " ", k, evcount[k] + +def sched__sched_switch(event_name, context, common_cpu, + common_secs, common_nsecs, common_pid, common_comm, + common_callchain, prev_comm, prev_pid, prev_prio, prev_state, + next_comm, next_pid, next_prio): + + global total_time + if 'sched_switch' not in evcount: + evcount['sched_switch'] = 0 + evcount['sched_switch'] += 1 + + ts = common_secs * 1000000000 + common_nsecs + + if next_comm == 'swapper': + name = "idle" + else: + name = "context-switch" + dso = '<kernel>' + + if common_callchain: + for entry in common_callchain: + if 'sym' not in entry: + continue + symname = entry['sym']['name'] + if symname and wait_pattern.match(symname): + name = "wait: " + symname + dso = '<unknown>' + break + + key = ','.join([common_comm, dso, name]) + if key not in sample_count: + sample_count[key] = 0 + sample_count[key] += 1 + wait_table.insert(0, {'start': ts, 'end': 0, 'key': key, + 'prev': prev_comm, 'next': next_comm }) + +def process_event(param_dict): + global total_time + event = param_dict['ev_name'] + if event not in evcount: + evcount[event] = 0 + evcount[event] += 1 + + comm = param_dict['comm'] + if 'dso' in param_dict: + dso = os.path.basename(param_dict['dso']) + else: + dso = '<unknown>' + if 'symbol' in param_dict: + sym = param_dict['symbol'] + else: + sym = str(param_dict['sample']['ip']) + + key = ','.join([comm, dso, sym]) + if key not in time_table: + time_table[key] = 0 + time_table[key] += param_dict['sample']['period'] + total_time += param_dict['sample']['period'] + + if key not in sample_count: + sample_count[key] = 0 + sample_count[key] += 1 + + if len(wait_table) > 0 and wait_table[0]['end'] == 0: + ts_end = param_dict['sample']['time'] + wait_table[0]['end'] = ts_end - param_dict['sample']['period'] + waittime = wait_table[0]['end'] - wait_table[0]['start'] + if waittime <= 0: + waittime = param_dict['sample']['period'] / 2 + key = wait_table[0]['key'] + if key not in time_table: + time_table[key] = 0 + time_table[key] += waittime + total_time += waittime + +def trace_unhandled(event_name, context, event_fields_dict): + print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())]) -- 2.1.0 ^ permalink raw reply related [flat|nested] 25+ messages in thread
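[The wait accounting in the script above, spelled out in isolation: the interval between a sched_switch and the point where the next task-clock sample's period began is attributed as wait time to the reason found in the switch-out callchain. A toy illustration with made-up numbers, assuming task-clock periods are reported in nanoseconds:]

~~~~~~~~~~~~~~
# Toy illustration of tasktime's wait accounting (numbers are made up).
period_ns = 1000000          # task-clock fires every 1 ms while on-CPU
switch_ts = 5000000          # task switched out at t = 5 ms
next_sample_ts = 12000000    # first task-clock sample after it runs again

# the sample at 12 ms covers the last 1 ms of CPU time, so the thread must
# have been back on the CPU at ~11 ms; everything between the switch-out
# and that point counts as waiting:
wait_ns = (next_sample_ts - period_ns) - switch_ts
assert wait_ns == 6000000    # 6 ms attributed to e.g. "wait: sys_poll"
~~~~~~~~~~~~~~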