From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Milian Wolff <milian.wolff@kdab.com>
Cc: linux-perf-users@vger.kernel.org
Subject: Re: usability issues with inlining and backtraces
Date: Mon, 23 May 2016 11:11:04 -0300
Message-ID: <20160523141104.GG8897@kernel.org>
In-Reply-To: <3421510.JYGX19gbWH@milian-kdab2>
On Mon, May 23, 2016 at 11:10:00AM +0200, Milian Wolff wrote:
> Hey all,
>
> I've recently encountered some nasty issues with perf which - I hope - can
> be fixed.
I've quickly read through this; it is all interesting, but I need some time
to digest it fully, which I don't have right now. I'll get to it as soon as
I can.

The fact that other open source tools do this the way you expect is a
valuable observation, as we can go and look at how they solved these
problems.

Yeah, agreed that we need to make these tools more useful for userland
developers and, as much as we can, connect events back to the source code.
Hopefully some other people will participate.
- Arnaldo
> First of all, I learned about `perf report -s sym,srcline`, most notably the
> `srcline` sort key, which finally (!) allows me to correlate sample cost
> directly to line numbers without having to interpret the often quite
> convoluted output of `perf annotate`. But take the following example code:
>
> ~~~~~~~~~~~~
> #include <complex>
> #include <cmath>
> #include <random>
> #include <iostream>
>
> using namespace std;
>
> int main()
> {
>     uniform_real_distribution<double> uniform(-1E5, 1E5);
>     default_random_engine engine;
>     double s = 0;
>     for (int i = 0; i < 1000000; ++i) {
>         s += norm(complex<double>(uniform(engine), uniform(engine)));
>     }
>     cout << s << '\n';
>     return 0;
> }
> ~~~~~~~~~~~~
>
> Compile it with
>
> g++ -std=c++11 -g -O2 test.cpp
>
> Then profile it with perf:
>
> perf record --call-graph dwarf ./a.out
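>
> (Side note for reproducibility: DWARF unwinding works on a copied stack
> snapshot, 8192 bytes per sample by default, so a larger snapshot may help
> with deep C++ call stacks; the 16384 below is an arbitrary example value I
> have not actually tuned:)
>
> ~~~~~~~~~~~~
> # same recording, but with a larger per-sample stack snapshot (default: 8192 bytes)
> perf record --call-graph dwarf,16384 ./a.out
> ~~~~~~~~~~~~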
>
> ########## perf report backtraces in the face of inlining
>
> Now let us go through the output of the report with srcline enabled:
>
> ~~~~~~~~~~~~
> perf report --stdio -s sym,srcline --no-children
>
> 21.07% [.] main random.tcc:3332
> |
> ---__libc_start_main
> _start
> 20.30% [.] main random.h:143
> |
> ---__libc_start_main
> _start
> ~~~~~~~~~~~~
>
> Here, the cost of inlined functions (random.tcc:3332, random.h:143) is
> attributed to the main function. This is of course correct, but very unhelpful
> to me as a programmer. I'm much more interested in the line inside test.cpp
> which triggered the call to random.tcc:3332 etc. Is there a way to get that
> data? Note how you get a much better backtrace when using GDB on the same
> binary as above:
>
> ~~~~~~~~~~~~
> gdb a.out
> break random.tcc:3332
> run
> Breakpoint 1, std::generate_canonical<double, 53ul,
> std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> >
> (__urng=...) at /usr/include/c++/6.1.1/bits/random.tcc:3332
> 3332 __sum += _RealType(__urng() - __urng.min()) * __tmp;
> (gdb) bt
> #0 std::generate_canonical<double, 53ul,
> std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> >
> (__urng=...) at /usr/include/c++/6.1.1/bits/random.tcc:3332
> #1 std::__detail::_Adaptor<std::linear_congruential_engine<unsigned long,
> 16807ul, 0ul, 2147483647ul>, double>::operator() (this=<optimized out>)
> at /usr/include/c++/6.1.1/bits/random.h:185
> #2
> std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned
> long, 16807ul, 0ul, 2147483647ul> > (this=<optimized out>, __p=...,
> __urng=...)
> at /usr/include/c++/6.1.1/bits/random.h:1818
> #3
> std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned
> long, 16807ul, 0ul, 2147483647ul> > (__urng=<synthetic pointer>,
> this=<synthetic pointer>)
> at /usr/include/c++/6.1.1/bits/random.h:1809
> #4 main () at test.cpp:14
> ~~~~~~~~~~~~
>
> Why is perf (or is it libunwind?) not capable of getting the same backtrace?
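>
> (For what it's worth, the inline chain is present in the DWARF data and can
> be recovered offline with binutils' addr2line; the address below is only a
> placeholder for an instruction pointer taken from a sample:)
>
> ~~~~~~~~~~~~
> # -i walks the chain of inlined callers, -f prints function names, -C demangles
> addr2line -e ./a.out -f -C -i 0x400a80   # 0x400a80 is a made-up example address
> ~~~~~~~~~~~~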
>
> ######### srcline in perf report backtraces
>
> If we continue with the `perf report` output from the command above, we'll
> eventually see:
>
> ~~~~~~~~~~~~
>
> 4.60% [.] main test.cpp:14
> |
> ---__libc_start_main
> _start
>
> 3.63% [.] __hypot_finite __hypot_finite+257
> |
> ---hypot
> main
> __libc_start_main
> _start
>
> 3.25% [.] hypot hypot+55
> |
> ---main
> __libc_start_main
> _start
>
> 3.21% [.] __hypot_finite __hypot_finite+152
> |
> ---hypot
> main
> __libc_start_main
> _start
>
> 2.63% [.] __hypot_finite __hypot_finite+66
> |
> ---hypot
> main
> ... continued
> ~~~~~~~~~
>
> Here, I have again a few things to note:
>
> On one hand, we have the same issue as above, namely inlined functions being
> attributed directly to the parent function. See how the backtrace shows main
> calling hypot? Look at the source: I'm not calling hypot anywhere - it's
> std::norm that eventually calls it internally. And GDB does know about that
> and can give me a proper backtrace.
>
> But, unlike above, the major gripe I have with this output is exemplified by
> this part:
>
> ~~~~~~~~~
> 3.63% [.] __hypot_finite __hypot_finite+257
> |
> ---hypot
> main
> __libc_start_main
> _start
> ~~~~~~~~~
>
> I added `srcline` to the report, but the backtrace still only contains the
> symbol name. I hope that we can simply honor srcline there as well, to at
> least print something like this instead:
>
> ~~~~~~~~~
> 3.63% [.] __hypot_finite __hypot_finite+257
> |
> ---hypot
> main at test.cpp:14 ### <-- srcline here as well please?
> __libc_start_main
> _start
> ~~~~~~~~~
>
> Is there a reason why this was not implemented yet? If not, then I'll prepare
> a patch for that, as I hope it will be easy to implement.
>
> ########## symbol cost aggregation
>
> Once we have that implemented, can we maybe also account for the following: I
> told `perf report` to aggregate by symbol first, then srcline, i.e. `-s
> sym,srcline`. But the report seems to aggregate by `srcline`, because I see
> some symbols (__hypot_finite) multiple times, for different code points. Can
> we merge those and then start the backtrace at the different code points?
> Something like this would be my desired output:
>
> ~~~~~~~~
> # see above for what I actually get, here I'm accumulating manually
> $ perf report --stdio -s sym,srcline --no-children |grep __hypot_finite+ | awk
> '{sum+=$1}END{print sum}'
> 34.96
> # this is what I would like to get directly
> $ perf report --stdio -s sym,srcline --no-children
> ...
> 34.96% [.] __hypot_finite
> |
> 3.63% | __hypot_finite __hypot_finite+257
> - ...
> 3.21% | __hypot_finite __hypot_finite+152
> - ...
> ...
> ~~~~~~~~~
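>
> (For comparison, dropping srcline from the sort key already gives the merged
> per-symbol total, just without the per-code-point breakdown underneath it:)
>
> ~~~~~~~~~
> # merged per-symbol totals, but no srcline/offset breakdown below each symbol
> perf report --stdio -s sym --no-children
> ~~~~~~~~~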
>
> What do you think?
>
> ####### perf annotate for mere mortals
>
> To conclude this already long email, I want to raise the mess that is `perf
> annotate`. If I run it on my above data file, I get the following output:
>
> ~~~~~~~~~
> 1.23 │ mulsd 0x2f2(%rip),%xmm0 # 400bd0 <_IO_stdin_used+0x20>
> 2.47 │ subsd 0x2f2(%rip),%xmm0 # 400bd8 <_IO_stdin_used+0x28>
> │ _ZSt13__complex_absCd():
> │ #if _GLIBCXX_USE_C99_COMPLEX
> │ inline float
> │ __complex_abs(__complex__ float __z) { return __builtin_cabsf(__z); }
> │
> │ inline double
> │ __complex_abs(__complex__ double __z) { return __builtin_cabs(__z); }
> 1.23 │ → callq cabs@plt
> │ _ZNSt12_Norm_helperILb1EE8_S_do_itIdEET_RKSt7complexIS2_E():
> │ {
> │ template<typename _Tp>
> │ static inline _Tp _S_do_it(const complex<_Tp>& __z)
> │ {
> │ _Tp __res = std::abs(__z);
> │ return __res * __res;
> │ mulsd %xmm0,%xmm0
> │ main():
> │ for (int i = 0; i < 1000000; ++i) {
> 1.23 │ sub $0x1,%ebp
> │ s += norm(complex<double>(uniform(engine), uniform(engine)));
> 7.41 │ addsd 0x8(%rsp),%xmm0
> │ movsd %xmm0,0x8(%rsp)
> │ int main()
> │ {
> │ uniform_real_distribution<double> uniform(-1E5, 1E5);
> │ default_random_engine engine;
> │ double s = 0;
> │ for (int i = 0; i < 1000000; ++i) {
> │ ↓ je 24e
> │ _ZNSt8__detail4_ModImLm2147483647ELm16807ELm0ELb1ELb1EE6__calcEm():
> │ struct _Mod<_Tp, __m, __a, __c, true, __s>
> │ {
> │ static _Tp
> │ __calc(_Tp __x)
> │ {
> │ _Tp __res = __a * __x + __c;
> │194: imul $0x41a7,%rbx,%rbx
>
> ~~~~~~~
>
> I personally think this output is extremely hard to interpret. I see the value
> of the above output when you want to look at the cost of hot computational
> kernels, but for most user space applications the above is pretty much
> useless. Most notably, the above shows self cost only, and I could not find a
> way to trigger an inclusive cost visualization. This makes it look as if the
> addition of the complex norms were the most costly operation in my code. That
> is not true: the inclusive cost of the random engine, as well as of actually
> computing the norm, is way higher. I.e. I see
>
> 1.23 │ → callq cabs@plt
>
> but I know from the output of `perf report` above that this triggers the calls
> to `__hypot`, which account for close to 40% of the cost.
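>
> (The closest I currently get to an inclusive view is at symbol granularity via
> the children column, which loses the source-line association entirely -
> roughly:)
>
> ~~~~~~~~~
> # inclusive ("children") cost per symbol; no srcline association
> perf report --stdio --children -s sym
> ~~~~~~~~~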
>
> As such, I propose a "simplified" `perf annotate` output which gets closer to
> what you'll see in other profilers, e.g. VTune or Microsoft Visual Studio's
> sampling profiler: use the source files and annotate them with inclusive
> cost. Only show the cost of individual binary instructions (the status quo)
> when explicitly asked. I.e. I want the output to look something like this (I
> made up the actual percentages):
>
> ~~~~~~~~~
> #include <complex>
> #include <cmath>
> #include <random>
> #include <iostream>
>
> using namespace std;
>
> int main()
> {
>  1%     uniform_real_distribution<double> uniform(-1E5, 1E5);
>  1%     default_random_engine engine;
>         double s = 0;
>  1%     for (int i = 0; i < 1000000; ++i) {
> 95%         s += norm(complex<double>(uniform(engine), uniform(engine)));
>         }
>  2%     cout << s << '\n';
>         return 0;
> }
> ~~~~~~~~~
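>
> (For reference, the closest approximations I have found so far only give
> per-line *self* cost, either via the srcline sort key or via annotate's
> source-line mode; neither gives the inclusive view sketched above:)
>
> ~~~~~~~~~
> # self cost aggregated per source line across the whole profile
> perf report --stdio -s srcline --no-children
> # self cost per matching source line within one symbol
> perf annotate --stdio --print-line main
> ~~~~~~~~~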
>
> Bonus points if I could then navigate to a source line and dig deeper, i.e.
> see the instructions for this line and then, e.g., jump to a function that is
> being called.
>
> Thanks a lot for reading this far. Please note that I really like perf, I just
> have trouble teaching people to use it, as many things are really hard, and I
> hope we can work together to make perf much easier to use!
>
> Thanks
> --
> Milian Wolff | milian.wolff@kdab.com | Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts