* PROBLEM: The --call-graph=fp data does not agree with the --call-graph=dwarf results
@ 2022-08-30 14:21 William Cohen
From: William Cohen @ 2022-08-30 14:21 UTC (permalink / raw)
To: linux-perf-users; +Cc: wcohen
While using Fedora 36's perf-tools-5.18.13-200.fc36 I was examining
where perf-report was spending its time when generating its report and
found an efficiency issue in Fedora 36's binutils-2.37. The
efficiency issue has been addressed in Fedora rawhide and will be
backported to Fedora 36
(https://bugzilla.redhat.com/show_bug.cgi?id=2120752). This was
initially discovered when processing perf.data files created with
--call-graph=dwarf. The perf-report call-graph output for
dwarf data marks inlined functions in the report. The excessive
time spent in binutils bfd's lookup_func_by_offset was caused by perf-report
building up a red-black tree mapping IP addresses to functions,
including inlined functions.
I ran a similar experiment with --call-graph=fp to see if it triggered
the same excessive overhead in building the red-black tree for
inlined functions. It did not. The resulting perf-report output
for --call-graph=fp does not include information about inlined functions.
I have a small reproducer in the attached perf_inlined.tar.gz that
demonstrates the difference between the two methods of storing
call-chain information. Compile and collect data with:
tar xvfz perf_inlined.tar.gz
cd perf_inlined
make all
perf report --input=perf_fp.data > fp.log
perf report --input=perf_dwarf.data > dwarf.log
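(For anyone reading without the tarball: the test program is roughly
of the following shape. This is reconstructed from the call graph
below, so the function names match the symbols shown, but the loop
body, array size, and build flags are illustrative guesses rather than
the exact attachment. 'make all' effectively does something like
'gcc -O2 -g -fno-omit-frame-pointer time_waste.c -lm -o time_waste'
followed by 'perf record --call-graph=fp -o perf_fp.data ./time_waste'
and 'perf record --call-graph=dwarf -o perf_dwarf.data ./time_waste'.)

#include <math.h>
#include <stdlib.h>

#define N 10000000

static double array[N];

/* Small enough that gcc -O2 inlines it into fill_array. */
static double compute_sqrt(double x)
{
        return sqrt(x);
}

/* Inlined into main by the compiler even without the inline keyword. */
static void fill_array(void)
{
        for (int i = 0; i < N; i++)
                array[i] = compute_sqrt((double) rand());
}

int main(void)
{
        for (int j = 0; j < 20; j++)
                fill_array();
        return 0;
}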
The dwarf.log has the expected call structure for main:
main
|
--85.72%--fill_array (inlined)
          |
          |--78.09%--rand
          |          |
          |          --75.10%--__random
          |                    |
          |                    --9.14%--__random_r
          |
          |--1.58%--compute_sqrt (inlined)
          |
          --1.32%--_init
The fp.log looks odd given the program:
99.99%  0.00%  time_waste  libc.so.6  [.] __libc_start_call_main
|
---__libc_start_call_main
   |
   |--66.07%--__random
   |
   |--21.28%--main
   |
   |--8.42%--__random_r
   |
   |--2.91%--rand
   |
   --1.31%--_init
Given how commonly functions are inlined in optimized code, it seems
like perf-report for --call-graph=fp should include information about
time spent in inlined functions.
-Will Cohen
[-- Attachment #2: perf_inlined.tar.gz --]
[-- Type: application/gzip, Size: 649 bytes --]
* Re: PROBLEM: The --call-graph=fp data does not agree with the --call-graph=dwarf results
From: Jiri Olsa @ 2022-08-31 12:17 UTC (permalink / raw)
To: William Cohen; +Cc: linux-perf-users
On Tue, Aug 30, 2022 at 10:21:28AM -0400, William Cohen wrote:
> While using Fedora 36's perf-tools-5.18.13-200.fc36 I was examining
> where perf-report was spending its time when generating its report and
> found an efficiency issue in Fedora 36's binutils-2.37. The
> efficiency issue has been addressed in Fedora rawhide and will be
> backported to Fedora 36
> (https://bugzilla.redhat.com/show_bug.cgi?id=2120752). This was
nice :)
> initially discovered when processing perf.data files created with
> --call-graph=dwarf. The perf-report call-graph output for
> dwarf data marks inlined functions in the report. The excessive
> time spent in binutils bfd's lookup_func_by_offset was caused by perf-report
> building up a red-black tree mapping IP addresses to functions,
> including inlined functions.
>
> I ran a similar experiment with --call-graph=fp to see if it triggered
> the same excessive overhead in building the red-black tree for
> inlined functions. It did not. The resulting perf-report output
> for --call-graph=fp does not include information about inlined functions.
>
> I have a small reproducer in the attached perf_inlined.tar.gz that
> demonstrates the difference between the two methods of storing
> call-chain information. Compile and collect data with:
>
> tar xvfz perf_inlined.tar.gz
> cd perf_inlined
> make all
> perf report --input=perf_fp.data > fp.log
> perf report --input=perf_dwarf.data > dwarf.log
>
> The dwarf.log has the expected call structure for main:
>
>
> main
> |
> --85.72%--fill_array (inlined)
>           |
>           |--78.09%--rand
>           |          |
>           |          --75.10%--__random
>           |                    |
>           |                    --9.14%--__random_r
>           |
>           |--1.58%--compute_sqrt (inlined)
>           |
>           --1.32%--_init
>
> The fp.log looks odd given the program:
what's odd about that? perhaps the confusion is that you are
in children mode? you could try --no-children
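for example:

  perf report --no-children --input=perf_fp.data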
>
> 99.99%  0.00%  time_waste  libc.so.6  [.] __libc_start_call_main
> |
> ---__libc_start_call_main
>    |
>    |--66.07%--__random
>    |
>    |--21.28%--main
>    |
>    |--8.42%--__random_r
>    |
>    |--2.91%--rand
>    |
>    --1.31%--_init
>
> Given how commonly functions are inlined in optimized code, it seems
> like perf-report for --call-graph=fp should include information about
> time spent in inlined functions.
hum, so 'fp' call graph is traversing frame pointers which I would
not expect for inlined functions
jirka
* Re: PROBLEM: The --call-graph=fp data does not agree with the --call-graph=dwarf results
From: William Cohen @ 2022-08-31 13:47 UTC (permalink / raw)
To: Jiri Olsa; +Cc: wcohen, linux-perf-users
On 8/31/22 08:17, Jiri Olsa wrote:
> On Tue, Aug 30, 2022 at 10:21:28AM -0400, William Cohen wrote:
>> While using Fedora 36's perf-tools-5.18.13-200.fc36 I was examining
>> where perf-report was spending its time when generating its report and
>> found an efficiency issue in Fedora 36's binutils-2.37. The
>> efficiency issue has been addressed in Fedora rawhide and will be
>> backported to Fedora 36
>> (https://bugzilla.redhat.com/show_bug.cgi?id=2120752). This was
>
> nice :)
>
>> initially discovered when processing perf.data files created with
>> --call-graph=dwarf. The perf-report call-graph output for
>> dwarf data marks inlined functions in the report. The excessive
>> time spent in binutils bfd's lookup_func_by_offset was caused by perf-report
>> building up a red-black tree mapping IP addresses to functions,
>> including inlined functions.
>>
>> I ran a similar experiment with --call-graph=fp to see if it triggered
>> the same excessive overhead in building the red-black tree for
>> inlined functions. It did not. The resulting perf-report output
>> for --call-graph=fp does not include information about inlined functions.
>>
>> I have a small reproducer in the attached perf_inlined.tar.gz that
>> demonstrates the difference between the two methods of storing
>> call-chain information. Compile and collect data with:
>>
>> tar xvfz perf_inlined.tar.gz
>> cd perf_inlined
>> make all
>> perf report --input=perf_fp.data > fp.log
>> perf report --input=perf_dwarf.data > dwarf.log
>>
>> The dwarf.log has the expected call structure for main:
>>
>>
>> main
>> |
>> --85.72%--fill_array (inlined)
>>           |
>>           |--78.09%--rand
>>           |          |
>>           |          --75.10%--__random
>>           |                    |
>>           |                    --9.14%--__random_r
>>           |
>>           |--1.58%--compute_sqrt (inlined)
>>           |
>>           --1.32%--_init
>>
>> The fp.log looks odd given the program:
>
> what's odd about that? perhaps the confusion is that you are
> in children mode? you could try --no-children
>
>>
>> 99.99%  0.00%  time_waste  libc.so.6  [.] __libc_start_call_main
>> |
>> ---__libc_start_call_main
>>    |
>>    |--66.07%--__random
>>    |
>>    |--21.28%--main
>>    |
>>    |--8.42%--__random_r
>>    |
>>    |--2.91%--rand
>>    |
>>    --1.31%--_init
>>
>> Given how commonly functions are inlined in optimized code, it seems
>> like perf-report for --call-graph=fp should include information about
>> time spent in inlined functions.
>
> hum, so 'fp' call graph is traversing frame pointers which I would
> not expect for inlined functions
Hi,
The --call-graph=dwarf version gives a more accurate picture of what
is going on in the code. The output of --call-graph=fp makes it appear
that the call to rand is coming from main when it is actually the
result of the inlined fill_array function. The fp results would be
even more confusing if both main and fill_array had a call to rand:
the perf user might overlook the contribution from fill_array's call
to rand. The dwarf handling makes it clearer where to look in the
source code. One could resort to using perf's annotation of the binary
to figure out where in the code the time is going, but it seems like a
better approach would be to add the additional expansion that
--call-graph=dwarf does to the backtrace to --call-graph=fp as well.
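For example, consider this hypothetical variant of the reproducer (not
in the attached tarball):

#include <stdlib.h>

/* Both helpers get inlined into main at -O2, so with fp unwinding
 * every rand() sample unwinds as main -> rand and the report cannot
 * say which inlined caller is responsible. */
static int noise(void)    { return rand(); }
static int fill_one(void) { return rand(); }

int main(void)
{
        long sum = 0;
        for (int i = 0; i < 100000000; i++)
                sum += noise() + fill_one();
        return sum == 0;
}

With fp unwinding the two call sites collapse into a single
main -> rand edge; the dwarf expansion keeps them apart.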
-Will
>
> jirka
>
* Re: PROBLEM: The --call-graph=fp data does not agree with the --call-graph=dwarf results
From: Milian Wolff @ 2022-08-31 14:22 UTC (permalink / raw)
To: William Cohen, Jiri Olsa; +Cc: linux-perf-users
On Mittwoch, 31. August 2022 14:17:26 CEST Jiri Olsa wrote:
> On Tue, Aug 30, 2022 at 10:21:28AM -0400, William Cohen wrote:
> > While using Fedora 36's perf-tools-5.18.13-200.fc36 I was examining
> > where perf-report was spending its time when generating its report and
> > found an efficiency issue in Fedora 36's binutils-2.37. The
> > efficiency issue has been addressed in Fedora rawhide and will be
> > backported to Fedora 36
> > (https://bugzilla.redhat.com/show_bug.cgi?id=2120752). This was
>
> nice :)
>
> > initially discovered when processing perf.data files created with
> > --call-graph=dwarf. The perf-report call-graph output for
> > dwarf data marks inlined functions in the report. The excessive
> > time spent in binutils bfd's lookup_func_by_offset was caused by
> > perf-report building up a red-black tree mapping IP addresses to
> > functions, including inlined functions.
> >
> > I ran a similar experiment with --call-graph=fp to see if it triggered
> > the same excessive overhead in building the red-black tree for
> > inlined functions. It did not. The resulting perf-report output
> > for --call-graph=fp does not include information about inlined functions.
> >
> > I have a small reproducer in the attached perf_inlined.tar.gz that
> > demonstrates the difference between the two methods of storing
> > call-chain information. Compile and collect data with:
> >
> > tar xvfz perf_inlined.tar.gz
> > cd perf_inlined
> > make all
> > perf report --input=perf_fp.data > fp.log
> > perf report --input=perf_dwarf.data > dwarf.log
> >
> > The dwarf.log has the expected call structure for main:
> > main
> > |
> > --85.72%--fill_array (inlined)
> >           |
> >           |--78.09%--rand
> >           |          |
> >           |          --75.10%--__random
> >           |                    |
> >           |                    --9.14%--__random_r
> >           |
> >           |--1.58%--compute_sqrt (inlined)
> >           |
> >           --1.32%--_init
> >
> > The fp.log looks odd given the program:
> what's odd about that? perhaps the confusion is that you are
> in children mode? you could try --no-children
>
> > 99.99%  0.00%  time_waste  libc.so.6  [.] __libc_start_call_main
> > |
> > ---__libc_start_call_main
> >    |
> >    |--66.07%--__random
> >    |
> >    |--21.28%--main
> >    |
> >    |--8.42%--__random_r
> >    |
> >    |--2.91%--rand
> >    |
> >    --1.31%--_init
> >
> > Given how commonly functions are inlined in optimized code, it seems
> > like perf-report for --call-graph=fp should include information about
> > time spent in inlined functions.
>
> hum, so 'fp' call graph is traversing frame pointers which I would
> not expect for inlined functions
I think the oddity is that there is no mention of any (inlined) frames at all
in frame pointer unwinding. I guess this is simply an oversight - I never had
a setup where frame pointers are available and thus never implemented that. It
might be that similar issues arise for LBR unwinding? Hopefully it's just a
matter of reusing some of the DWARF code for FP/LBR.
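(Untested guess at a quick way to check the LBR path, on Intel
hardware that supports it; time_waste is the binary from the
reproducer tarball:

  perf record --call-graph=lbr -o perf_lbr.data ./time_waste
  perf report --input=perf_lbr.data > lbr.log

If lbr.log also lacks the inlined frames, the same DWARF-based
expansion should apply there.)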
Cheers
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts
* Re: PROBLEM: The --call-graph=fp data does not agree with the --call-graph=dwarf results
From: Arnaldo Carvalho de Melo @ 2022-08-31 16:15 UTC (permalink / raw)
To: Milian Wolff; +Cc: William Cohen, Jiri Olsa, linux-perf-users
Em Wed, Aug 31, 2022 at 04:22:31PM +0200, Milian Wolff escreveu:
> On Mittwoch, 31. August 2022 14:17:26 CEST Jiri Olsa wrote:
> > > initially discovered when processing perf.data files created with
> > > --call-graph=dwarf. The perf-report call-graph output for
> > > dwarf data marks inlined functions in the report. The
> > > excessive time spent in binutils bfd's lookup_func_by_offset was
> > > caused by perf-report building up a red-black tree mapping IP
> > > addresses to functions, including inlined functions.
> > > I ran a similar experiment with --call-graph=fp to see if it
> > > triggered the same excessive overhead in building the red-black
> > > tree for inlined functions. It did not. The resulting
> > > perf-report output for --call-graph=fp does not include
> > > information about inlined functions.
> > > I have a small reproducer in the attached perf_inlined.tar.gz that
> > > demonstrates the difference between the two methods of storing
> > > call-chain information. Compile and collect data with:
> > > tar xvfz perf_inlined.tar.gz
> > > cd perf_inlined
> > > make all
> > > perf report --input=perf_fp.data > fp.log
> > > perf report --input=perf_dwarf.data > dwarf.log
> > >
> > > The dwarf.log has the expected call structure for main:
> > > main
> > > |
> > > --85.72%--fill_array (inlined)
> > >           |
> > >           |--78.09%--rand
> > >           |          |
> > >           |          --75.10%--__random
> > >           |                    |
> > >           |                    --9.14%--__random_r
> > >           |
> > >           |--1.58%--compute_sqrt (inlined)
> > >           |
> > >           --1.32%--_init
> > >
> > The fp.log looks odd given the program:
> > what's odd about that? perhaps the confusion is that you are
> > in children mode? you could try --no-children
> >
> > > 99.99%  0.00%  time_waste  libc.so.6  [.] __libc_start_call_main
> > > |
> > > ---__libc_start_call_main
> > >    |
> > >    |--66.07%--__random
> > >    |
> > >    |--21.28%--main
> > >    |
> > >    |--8.42%--__random_r
> > >    |
> > >    |--2.91%--rand
> > >    |
> > >    --1.31%--_init
> > > Given how commonly functions are inlined in optimized code, it seems
> > > like perf-report for --call-graph=fp should include information about
> > > time spent in inlined functions.
> > hum, so 'fp' call graph is traversing frame pointers which I would
> > not expect for inlined functions
> I think the oddity is that there is no mention of any (inlined) frames at all
> in frame pointer unwinding. I guess this is simply an oversight - I never had
> a setup where frame pointers are available and thus never implemented that. It
> might be that similar issues arise for LBR unwinding? Hopefully it's just a
> matter of reusing some of the DWARF code for FP/LBR.
Right, probably add the inlining info in DWARF for the ranges where it's
expanded, things like:
<1><9da>: Abbrev Number: 54 (DW_TAG_subprogram)
<9db> DW_AT_external : 1
<9db> DW_AT_name : (indirect string, offset: 0x5680): cmd_bench
<9df> DW_AT_decl_file : 1
<9e0> DW_AT_decl_line : 252
<9e1> DW_AT_decl_column : 5
<9e2> DW_AT_prototyped : 1
<9e2> DW_AT_type : <0x55>
<9e6> DW_AT_low_pc : 0x40fa40
<9ee> DW_AT_high_pc : 0x461
<9f6> DW_AT_frame_base : 1 byte block: 9c (DW_OP_call_frame_cfa)
<9f8> DW_AT_call_all_calls: 1
<9f8> DW_AT_sibling : <0x11b4>
<SNIP>
<3><a76>: Abbrev Number: 24 (DW_TAG_inlined_subroutine)
<a77> DW_AT_abstract_origin: <0x177b>
<a7b> DW_AT_entry_pc : 0x40fc15
<a83> DW_AT_GNU_entry_view: 1
<a84> DW_AT_low_pc : 0x40fc15
<a8c> DW_AT_high_pc : 0x4c
<a94> DW_AT_call_file : 1
<a94> DW_AT_call_line : 318
<a96> DW_AT_call_column : 4
<a97> DW_AT_sibling : <0xb48>
<SNIP>
<1><177b>: Abbrev Number: 43 (DW_TAG_subprogram)
<177c> DW_AT_name : (indirect string, offset: 0xa42b): dump_benchmarks
<1780> DW_AT_decl_file : 1
<1780> DW_AT_decl_line : 133
<1781> DW_AT_decl_column : 13
<1781> DW_AT_prototyped : 1
<1781> DW_AT_inline : 1 (inlined)
<1781> DW_AT_sibling : <0x179d>
Where, in the perf binary, cmd_bench() calls dump_benchmarks(), which gets
inlined at that DW_AT_entry_pc/DW_AT_low_pc/DW_AT_high_pc range inside
cmd_bench().
Probably we could add those to the dso symbols list with some flag to
mean it's an inline and indicate where to get more info about it.
This is important because, as Will points out, functions not explicitly
marked inline end up being inlined by the compiler.
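A rough, illustrative sketch of how those inline instance ranges could
be enumerated with elfutils' libdw (just a sketch, not what perf does
today; error handling mostly omitted):

/* gcc inline_ranges.c -ldw -o inline_ranges */
#include <dwarf.h>
#include <elfutils/libdw.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

/* Called once per concrete DW_TAG_inlined_subroutine instance. */
static int instance_cb(Dwarf_Die *die, void *arg)
{
	const char *name = arg;
	Dwarf_Addr lo, hi;

	/* Some instances use DW_AT_ranges instead; skipped here. */
	if (dwarf_lowpc(die, &lo) == 0 && dwarf_highpc(die, &hi) == 0)
		printf("%s inlined at [%#" PRIx64 ", %#" PRIx64 ")\n",
		       name, lo, hi);
	return DWARF_CB_OK;
}

/* Called once per DW_TAG_subprogram DIE in the CU. */
static int func_cb(Dwarf_Die *func, void *arg)
{
	if (dwarf_func_inline(func))	/* abstract inline definition */
		dwarf_func_inline_instances(func, instance_cb,
					    (void *) dwarf_diename(func));
	return DWARF_CB_OK;
}

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDONLY);
	Dwarf *dw = dwarf_begin(fd, DWARF_C_READ);
	Dwarf_Off off = 0, next;
	size_t hsize;

	while (dwarf_nextcu(dw, off, &next, &hsize, NULL, NULL, NULL) == 0) {
		Dwarf_Die cu;

		if (dwarf_offdie(dw, off + hsize, &cu))
			dwarf_getfuncs(&cu, func_cb, NULL, 0);
		off = next;
	}
	dwarf_end(dw);
	close(fd);
	return 0;
}

Those per-instance [low, high) ranges are what would need to land in
the dso symbols list, with an inline flag, for the fp unwinder to
resolve them.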
- Arnaldo
* Re: PROBLEM: The --call-graph=fp data does not agree with the --call-graph=dwarf results
From: William Cohen @ 2022-09-08 19:12 UTC (permalink / raw)
To: William Cohen, linux-perf-users@vger.kernel.org; +Cc: wcohen
Hi,
Because of the previous issue of fp unwinding not including
information about inlined functions, I wanted to get an idea of how
often inlined functions occur in code. I have a simple script that
uses dwgrep to extract the number of actual functions and inlined
functions and then compute the average number of inlined functions for
each actual function
(https://github.com/wcohen/quality_info/blob/master/dwgrep/inlined_funcs).
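(A rough cross-check without dwgrep is to count the DWARF tags
directly, e.g. on the reproducer binary:

  readelf --debug-dump=info ./time_waste | grep -c DW_TAG_inlined_subroutine
  readelf --debug-dump=info ./time_waste | grep -c DW_TAG_subprogram

This is only approximate, since DW_TAG_subprogram also matches
declarations and abstract instances.)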
I ran a quick experiment on a few binaries on various
architectures running Fedora to get some actual numbers.
average inlined functions per real function

                        architecture
binary           i386   x86_64  armv7  aarch64  riscv
vmlinux          -      6.9     3.5    7.4      7.7
libc             1.33   1.3     1.3    1.3      1.1
emacs            -      17.3    16.8   18.3     18.6
stap             -      75.9    82.3   77.3     50.3
libdw-0.187.so   -      3.3     3.7    3.7      3.7
Xwayland         -      3.9     -      3.9      -
libxul.so        -      19.2    -      27.8     -
One thing to note is that for some executables, like emacs and
systemtap's stap command, there is a significant number of inlined
functions per function. For stap, part of this might be due to it
using C++. With that ratio of inlined functions to real functions,
omitting the inlined functions from the backtrace information is
likely to provide an incomplete picture of what is occurring in the
code.
One concern would be sampling skid for inlined functions in the
current function. Rather than attributing the sample to the
instruction that triggered the event, it gets attached to another
instruction belonging to an inlined function that has nothing to do
with the sample. The triggering instruction could have been from
another inlined function or from the real function everything has been
inlined into. That might cause people to look in the wrong area of the
source code. This would only be an issue for the currently executing
function; for the unwinding of the callers that information should be
more precise.
-Will