Subject: usability issues with inlining and backtraces
From: Milian Wolff @ 2016-05-23  9:10 UTC
  To: linux-perf-users

Hey all,

I've recently encountered some nasty issues with perf which I hope can be
fixed.

First of all, I learned about `perf report -s sym,srcline`, most notably
`srcline`, which finally (!) allows me to correlate sample cost directly to
line numbers, without having to interpret the often quite convoluted output of
`perf annotate`. But take the following example code:

~~~~~~~~~~~~
#include <complex>
#include <cmath>
#include <random>
#include <iostream>

using namespace std;

int main()
{
    uniform_real_distribution<double> uniform(-1E5, 1E5);
    default_random_engine engine;
    double s = 0;
    for (int i = 0; i < 1000000; ++i) {
        s += norm(complex<double>(uniform(engine), uniform(engine)));
    }
    cout << s << '\n';
    return 0;
}
~~~~~~~~~~~~

Compile it with

    g++ -std=c++11 -g -O2 test.cpp

Then profile it with perf:

    perf record --call-graph dwarf ./a.out
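
As an aside, in case deep C++ call chains come out truncated during unwinding:
if I read the perf-record man page correctly, the DWARF stack dump size can be
raised from its default of 8192 bytes, e.g.:

~~~~~~~~~~~~
    # 16384 is just an illustrative value for the per-sample stack dump size
    perf record --call-graph dwarf,16384 ./a.out
~~~~~~~~~~~~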

########## perf report backtraces in face of inlining

Now let us go through the output of the report with srcline enabled:

~~~~~~~~~~~~
    perf report --stdio -s sym,srcline --no-children

    21.07%  [.] main                   random.tcc:3332
            |
            ---__libc_start_main
               _start
    20.30%  [.] main                   random.h:143
            |
            ---__libc_start_main
               _start
~~~~~~~~~~~~

Here, the cost of inlined functions (random.tcc:3332, random.h:143) is
attributed to the main function. This is of course correct, but very unhelpful
to me as a programmer. I'm much more interested in the line inside test.cpp
which triggered the call to random.tcc:3332 etc. Is there a way to get that
data? Note how you get a much better backtrace when using GDB on the same
binary as above:

~~~~~~~~~~~~
    gdb a.out
    break random.tcc:3332
    run
Breakpoint 1, std::generate_canonical<double, 53ul, 
std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > 
(__urng=...) at /usr/include/c++/6.1.1/bits/random.tcc:3332
3332                  __sum += _RealType(__urng() - __urng.min()) * __tmp;
(gdb) bt
#0  std::generate_canonical<double, 53ul, 
std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > 
(__urng=...) at /usr/include/c++/6.1.1/bits/random.tcc:3332
#1  std::__detail::_Adaptor<std::linear_congruential_engine<unsigned long, 
16807ul, 0ul, 2147483647ul>, double>::operator() (this=<optimized out>)
    at /usr/include/c++/6.1.1/bits/random.h:185
#2  
std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned 
long, 16807ul, 0ul, 2147483647ul> > (this=<optimized out>, __p=..., 
__urng=...)
    at /usr/include/c++/6.1.1/bits/random.h:1818
#3  
std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned 
long, 16807ul, 0ul, 2147483647ul> > (__urng=<synthetic pointer>, 
this=<synthetic pointer>)
    at /usr/include/c++/6.1.1/bits/random.h:1809
#4  main () at test.cpp:14
~~~~~~~~~~~~

Why is perf (or is it libunwind?) not capable of getting the same backtrace?
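
For what it's worth, the inline chain does seem to be present in the DWARF info
of the binary: plain binutils addr2line can resolve it for an address inside
main (the address below is made up, just for illustration):

~~~~~~~~~~~~
    # -f prints function names, -i walks the inlined-call chain, -C demangles;
    # 0x400a80 is a hypothetical sampled instruction pointer inside main
    addr2line -C -f -i -e a.out 0x400a80
~~~~~~~~~~~~

So the data needed for better backtraces appears to be available; it just does
not get used when the callchain entries are symbolized.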

######### srcline in perf report backtraces

If we continue with the `perf report` output from the command above, we'll
eventually see:

~~~~~~~~~~~~

     4.60%  [.] main                   test.cpp:14
            |
            ---__libc_start_main
               _start

     3.63%  [.] __hypot_finite         __hypot_finite+257
            |
            ---hypot
               main
               __libc_start_main
               _start

     3.25%  [.] hypot                  hypot+55
            |
            ---main
               __libc_start_main
               _start

     3.21%  [.] __hypot_finite         __hypot_finite+152
            |
            ---hypot
               main
               __libc_start_main
               _start

     2.63%  [.] __hypot_finite         __hypot_finite+66
            |
            ---hypot
               main
... continued
~~~~~~~~~

Here, I again have a few things to note:

On one hand, we have the same issue as above, namely inlined functions being
attributed directly to the parent function. See how the backtrace shows main
calling hypot? Looking at the source, I'm not calling hypot anywhere - it's
std::norm that eventually calls it internally. And GDB does know about that and
can give me a proper backtrace.

But, unlike above, the major gripe I have with this output is exemplified by
this part:

~~~~~~~~~
     3.63%  [.] __hypot_finite         __hypot_finite+257
            |
            ---hypot
               main
               __libc_start_main
               _start
~~~~~~~~~

I added `srcline` to the report, but the backtrace still only contains the
symbol name. I hope we can simply honor srcline there as well, and at least
print something like this instead:

~~~~~~~~~
     3.63%  [.] __hypot_finite         __hypot_finite+257
            |
            ---hypot
               main at test.cpp:14 ### <-- srcline here as well please?
               __libc_start_main
               _start
~~~~~~~~~

Is there a reason why this has not been implemented yet? If not, I'll prepare a
patch for it, as I expect it to be easy to implement.

########## symbol cost aggregation

Once we have that implemented, can we maybe also account for the following: I
told `perf report` to aggregate by symbol first, then srcline, i.e. `-s
sym,srcline`. But the report seems to aggregate by `srcline` only, since I see
some symbols (__hypot_finite) multiple times, once per code location. Can we
merge those entries into one and only then split the backtraces at the
different code locations? Something like this would be my desired output:

~~~~~~~~
# see above for what I actually get, here I'm accumulating manually
$ perf report --stdio -s sym,srcline --no-children | grep __hypot_finite+ | awk '{sum+=$1}END{print sum}'
34.96
# this is what I would like to get directly
$ perf report --stdio -s sym,srcline --no-children 
...
34.96%  [.] __hypot_finite
            |
     3.63%  | __hypot_finite         __hypot_finite+257
            - ...
     3.21%  | __hypot_finite         __hypot_finite+152
            - ...
     ...
~~~~~~~~~

What do you think?
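
For completeness, the closest I can get today seems to be to drop the srcline
key, or - if I read the documentation correctly - to use the new --hierarchy
mode of recent perf to nest the sort keys; I have not verified that the latter
matches the layout sketched above:

~~~~~~~~~
    # aggregate per symbol only, dropping the per-srcline split
    perf report --stdio -s sym --no-children

    # nest the sort keys (symbol, then srcline) on perf versions that support it
    perf report --stdio --hierarchy -s sym,srcline --no-children
~~~~~~~~~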

####### perf annotate for mere mortals

To conclude this already long email, I want to raise the mess that is `perf
annotate`. If I run it on the data file recorded above, I get the following
output:

~~~~~~~~~
  1.23 │       mulsd  0x2f2(%rip),%xmm0        # 400bd0 <_IO_stdin_used+0x20>
  2.47 │       subsd  0x2f2(%rip),%xmm0        # 400bd8 <_IO_stdin_used+0x28>
       │     _ZSt13__complex_absCd():
       │     #if _GLIBCXX_USE_C99_COMPLEX
       │       inline float
       │       __complex_abs(__complex__ float __z) { return __builtin_cabsf(__z); }
       │
       │       inline double
       │       __complex_abs(__complex__ double __z) { return __builtin_cabs(__z); }
  1.23 │     → callq  cabs@plt
       │     _ZNSt12_Norm_helperILb1EE8_S_do_itIdEET_RKSt7complexIS2_E():
       │         {
       │           template<typename _Tp>
       │             static inline _Tp _S_do_it(const complex<_Tp>& __z)
       │             {
       │               _Tp __res = std::abs(__z);
       │               return __res * __res;
       │       mulsd  %xmm0,%xmm0
       │     main():
       │         for (int i = 0; i < 1000000; ++i) {
  1.23 │       sub    $0x1,%ebp
       │             s += norm(complex<double>(uniform(engine), uniform(engine)));
  7.41 │       addsd  0x8(%rsp),%xmm0
       │       movsd  %xmm0,0x8(%rsp)
       │     int main()
       │     {
       │         uniform_real_distribution<double> uniform(-1E5, 1E5);
       │         default_random_engine engine;
       │         double s = 0;
       │         for (int i = 0; i < 1000000; ++i) {
       │     ↓ je     24e
       │     _ZNSt8__detail4_ModImLm2147483647ELm16807ELm0ELb1ELb1EE6__calcEm():
       │           struct _Mod<_Tp, __m, __a, __c, true, __s>
       │           {
       │             static _Tp
       │             __calc(_Tp __x)
       │             {
       │               _Tp __res = __a * __x + __c;
       │194:   imul   $0x41a7,%rbx,%rbx

~~~~~~~

I personally think this output is extremely hard to interpret. I see the value
of such output when you want to look at the cost of hot computational kernels,
but for most user space applications it is pretty much useless. Most notably,
the above shows self cost only, and I could not find a way to trigger inclusive
cost visualization. It makes it look as if the addition of the complex norms is
the most costly operation in my code. That is not true: the inclusive cost of
the random engine, as well as of actually computing the norm, is way higher,
i.e. I see

   1.23 │     → callq  cabs@plt

but from the output of `perf report` above, that call triggers the calls to
`__hypot` which account for close to 40% of the cost.
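
For `perf report` itself the children column does provide inclusive cost, so
the numbers above can at least be cross-checked there:

~~~~~~~~~
    # inclusive ("children") cost per symbol, to compare against the self cost
    perf report --stdio --children -s sym
~~~~~~~~~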

As such, I propose a "simplified" `perf annotate` output, which gets closer to
what you'll see in other profilers, e.g. VTune or Microsoft Visual Studio's
sampling profiler: annotate the source files with inclusive cost, and only show
the cost of individual binary instructions (the status quo) when explicitly
asked for. I.e. I want the output to look something like this (I made up the
actual percentages):

~~~~~~~~~
        #include <complex>
        #include <cmath>
        #include <random>
        #include <iostream>

        using namespace std;

        int main()
        {
1%          uniform_real_distribution<double> uniform(-1E5, 1E5);
1%          default_random_engine engine;
            double s = 0;
1%          for (int i = 0; i < 1000000; ++i) {
95%            s += norm(complex<double>(uniform(engine), uniform(engine)));
            }
2%          cout << s << '\n';
            return 0;
        }
~~~~~~~~~

Bonus points if I could then navigate to a source line and dig deeper (e.g.
see the instructions for that line and then jump to a function that is being
called from there).
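
In the meantime, the closest approximation I am aware of still only gives self
cost per source line, assuming `--print-line` behaves as documented:

~~~~~~~~~
    # self cost per matching source line of one symbol - still no inclusive cost
    perf annotate --stdio -l main
~~~~~~~~~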

Thanks a lot for reading this far. Please note that I really like perf, I just
have trouble teaching people how to use it, as many things are really hard, and
I hope we can work together to make perf much easier to use!

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
