* perf performance with libdw vs libunwind
@ 2026-04-22 13:26 Guilherme Amadio
2026-04-23 4:21 ` Ian Rogers
0 siblings, 1 reply; 4+ messages in thread
From: Guilherme Amadio @ 2026-04-22 13:26 UTC (permalink / raw)
To: Ian Rogers; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
Dear Ian,
Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
to use libdw, as libunwind has been deprecated. However, when I tried
to use perf, I noticed a substantial performance regression and some
other problems, which I report below.
I use here an example that serves as my own "standard candle"¹ for
checking that stack unwinding works properly: the startup of ROOT²,
a C++ interpreter heavily used in high energy physics data analysis.
I simply run 'root -l -q', which is the ROOT equivalent of
'python -c ""'. It takes less than a second to run, but since it
performs a full initialization of Clang/LLVM as part of the
interpreter, it produces a rich flame graph whose shape I know ahead
of time, so I use it to check that stack unwinding and symbol
resolution are working.
1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
2. https://root.cern
Below I show a comparison of the timings of perf record/report for this.
First, I run it with perf-6.19.12 which is configured to use libunwind:
$ perf config
call-graph.record-mode=fp
$ perf -vv
perf version 6.19.12
aio: [ on ] # HAVE_AIO_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
bpf_skeletons: [ on ] # HAVE_BPF_SKEL
debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
dwarf: [ on ] # HAVE_LIBDW_SUPPORT
dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpfm4: [ on ] # HAVE_LIBPFM
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
lzma: [ on ] # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
1.25
$ time perf report -q --stdio -g none --children --percent-limit 75
92.63% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.63% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.63% 0.00% root.exe root.exe [.] _start
92.63% 0.00% root.exe root.exe [.] main
91.53% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.36% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.18% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
88.10% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
88.08% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
75.62% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
75.62% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
1.86
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
4.08
$ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg
So, as you can see above, a simple perf-report took 1.86 seconds, and
perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
perf-7.0 with libdw, this is what I see:
$ perf -vv
perf version 7.0
aio: [ on ] # HAVE_AIO_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
bpf_skeletons: [ on ] # HAVE_BPF_SKEL
debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
dwarf: [ on ] # HAVE_LIBDW_SUPPORT
dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
libbabeltrace: [ on ] # HAVE_LIBBABELTRACE_SUPPORT
libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpfm4: [ on ] # HAVE_LIBPFM
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
lzma: [ on ] # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
rust: [ on ] # HAVE_RUST_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
1.28
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
92.44% 0.00% root.exe root.exe [.] main
92.44% 0.00% root.exe root.exe [.] _start
92.44% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
87.95% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
75.78% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
User time (seconds): 250.33
System time (seconds): 21.18
Percent of CPU this job got: 99%
** Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
** Maximum resident set size (kbytes): 4433000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 9850739
Voluntary context switches: 226
Involuntary context switches: 11388
Swaps: 0
File system inputs: 80776
File system outputs: 232
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After seeing how much memory perf was using, I decided to record that
too. As you can see above, perf 7.0 with libdw took 4 minutes and 33
seconds for the same simple perf-report that took 1.86 seconds before,
and the symbol names are not as complete as with libunwind.
Stack unwinding itself also seems inconsistent with the previous run.
Here's the equivalent with perf-6.19.12:
Here's the equivalent with perf-6.19.12:
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
92.46% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.46% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.46% 0.00% root.exe root.exe [.] _start
92.46% 0.00% root.exe root.exe [.] main
91.38% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.24% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.05% 0.01% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
87.96% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
87.95% 0.01% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
75.50% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
75.50% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
User time (seconds): 1.79
System time (seconds): 0.08
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 265108
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 37887
Voluntary context switches: 4
Involuntary context switches: 77
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Then, this is perf-script:
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
... (line repeated many times)
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
273.49
I see many of these cmd__addr2line errors, and perf-script takes
273.49 seconds compared with 4.08 seconds with perf-6.19.12. The
flame graph also has abbreviated function names like "operator()"
instead of the full name, which is somewhat problematic, as there's a
loss of information relative to what libunwind used to provide. The
flame graphs for the two runs above are available at
https://cern.ch/amadio/perf (I didn't attach the files here, as I
don't want to send big files to the lists).
For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.
If you'd like to perform the same kind of test, you can install ROOT
from EPEL on a RHEL-like distribution inside a container with a simple
"dnf install root", or just try the same record/report commands with a
clang++ compilation of a simple program as a decent replacement.
Best regards,
-Guilherme
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: perf performance with libdw vs libunwind
2026-04-22 13:26 perf performance with libdw vs libunwind Guilherme Amadio
@ 2026-04-23 4:21 ` Ian Rogers
2026-04-23 9:49 ` Guilherme Amadio
0 siblings, 1 reply; 4+ messages in thread
From: Ian Rogers @ 2026-04-23 4:21 UTC (permalink / raw)
To: Guilherme Amadio; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
On Wed, Apr 22, 2026 at 6:26 AM Guilherme Amadio <amadio@gentoo.org> wrote:
>
> Dear Ian,
>
> Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
> to use libdw, as libunwind has been deprecated. However, when I tried
> to use perf, I noticed a substantial performance regression and some
> other problems, which I report below.
>
> I use here an example that serves as my own "standard candle"¹ for
> checking that stack unwinding works properly: the startup of ROOT²,
> a C++ interpreter heavily used in high energy physics data analysis.
> I simply run 'root -l -q', which is the ROOT equivalent of
> 'python -c ""'. It takes less than a second to run, but since it
> performs a full initialization of Clang/LLVM as part of the
> interpreter, it produces a rich flame graph whose shape I know ahead
> of time, so I use it to check that stack unwinding and symbol
> resolution are working.
>
> 1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
> 2. https://root.cern
>
> Below I show a comparison of the timings of perf record/report for this.
>
> First, I run it with perf-6.19.12 which is configured to use libunwind:
>
> $ perf config
> call-graph.record-mode=fp
> $ perf -vv
> perf version 6.19.12
> aio: [ on ] # HAVE_AIO_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> bpf_skeletons: [ on ] # HAVE_BPF_SKEL
> debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
> dwarf: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
> libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
> libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
> libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpfm4: [ on ] # HAVE_LIBPFM
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 79800 Hz
>
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
> 1.25
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.63% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.63% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.63% 0.00% root.exe root.exe [.] _start
> 92.63% 0.00% root.exe root.exe [.] main
> 91.53% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.36% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.18% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 88.10% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 88.08% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 75.62% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 75.62% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
> 1.86
> $ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
> 4.08
> $ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg
>
> So, as you can see above, a simple perf-report took 1.86 seconds, and
> perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
> perf-7.0 with libdw, this is what I see:
>
> $ perf -vv
> perf version 7.0
> aio: [ on ] # HAVE_AIO_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> bpf_skeletons: [ on ] # HAVE_BPF_SKEL
> debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
> dwarf: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
> libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
> libbabeltrace: [ on ] # HAVE_LIBBABELTRACE_SUPPORT
> libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
> libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpfm4: [ on ] # HAVE_LIBPFM
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
> libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
> rust: [ on ] # HAVE_RUST_SUPPORT
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 79800 Hz
>
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
> 1.28
> $ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
> 92.44% 0.00% root.exe root.exe [.] main
> 92.44% 0.00% root.exe root.exe [.] _start
> 92.44% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 87.95% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
> 75.78% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
>
> Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
> User time (seconds): 250.33
> System time (seconds): 21.18
> Percent of CPU this job got: 99%
> ** Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> ** Maximum resident set size (kbytes): 4433000
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 7
> Minor (reclaiming a frame) page faults: 9850739
> Voluntary context switches: 226
> Involuntary context switches: 11388
> Swaps: 0
> File system inputs: 80776
> File system outputs: 232
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> After seeing how much memory perf was using, I decided to record that
> too. As you can see above, perf 7.0 with libdw took 4 minutes and 33
> seconds for the same simple perf-report that took 1.86 seconds before,
> and the symbol names are not as complete as with libunwind.
Hi Guilherme,
Thanks for the feedback, but I'm a little confused. Your .perfconfig
is set to use frame-pointer-based unwinding, so neither libunwind nor
libdw should be used for unwinding. With frame-pointer unwinding, a
sample contains an array of IPs gathered by walking the linked list
of frame pointers on the stack. With --call-graph=dwarf, a region of
memory (the stack) is copied into a sample along with some initial
register values; libdw or libunwind is then used to process this
memory using the DWARF information in the ELF binary.
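To make the distinction concrete, frame-pointer unwinding is
essentially a pointer walk. A toy Python illustration with a mock
stack (this is not perf's actual code; the addresses and frame layout
are made up, following the common x86-64 convention):

```python
# Illustrative sketch: frame-pointer unwinding follows a linked list
# on the stack. Each frame stores the caller's saved frame pointer
# and, next to it, the return address into the caller.
#
# Mock stack memory: address -> 64-bit value, laid out as
#   [fp]     = saved caller frame pointer (0 terminates the chain)
#   [fp + 8] = return address in the caller
stack = {
    0x7f10: 0x7f30, 0x7f18: 0x401234,   # innermost frame
    0x7f30: 0x7f50, 0x7f38: 0x401100,   # middle frame
    0x7f50: 0x0,    0x7f58: 0x401000,   # outermost frame
}

def fp_unwind(fp, ip):
    """Collect the array of IPs by walking saved frame pointers."""
    ips = [ip]                      # the sampled instruction pointer
    while fp:
        ips.append(stack[fp + 8])   # return address of this frame
        fp = stack[fp]              # hop to the caller's frame
    return ips

print([hex(ip) for ip in fp_unwind(0x7f10, 0x401300)])
```

With --call-graph=dwarf there is no such chain to follow at sample
time; the raw stack bytes are saved and the unwind happens later in
userspace.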
> Stack unwinding itself also seems inconsistent with the previous run.
> Here's the equivalent with perf-6.19.12:
>
> $ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
> 92.46% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.46% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.46% 0.00% root.exe root.exe [.] _start
> 92.46% 0.00% root.exe root.exe [.] main
> 91.38% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.24% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.05% 0.01% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 87.96% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 87.95% 0.01% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 75.50% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 75.50% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
>
> Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
> User time (seconds): 1.79
> System time (seconds): 0.08
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 265108
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 0
> Minor (reclaiming a frame) page faults: 37887
> Voluntary context switches: 4
> Involuntary context switches: 77
> Swaps: 0
> File system inputs: 0
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> Then, this is perf-script:
>
> $ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
> cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
> ... (line repeated many times)
> cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
> 273.49
>
> I see many of these cmd__addr2line errors, and perf-script takes
> 273.49 seconds compared with 4.08 seconds with perf-6.19.12. The
> flame graph also has abbreviated function names like "operator()"
> instead of the full name, which is somewhat problematic, as there's a
> loss of information relative to what libunwind used to provide. The
> flame graphs for the two runs above are available at
> https://cern.ch/amadio/perf (I didn't attach the files here, as I
> don't want to send big files to the lists).
> For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.
>
> If you'd like to perform the same kind of test, you can install ROOT
> from EPEL on a RHEL-like distribution inside a container with a simple
> "dnf install root", or just try the same record/report commands with a
> clang++ compilation of a simple program as a decent replacement.
So, something that changed in v7.0: with DWARF-based unwinding
(libdw or libunwind) we always had all inline functions on the
stack, but not with frame pointers. The IP in the frame-pointer
array can be of an instruction within an inlined function. In v7.0
we added a patch that includes inline information for both
frame-pointer and LBR-based stack traces:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
By default we try to add inline information using libdw; if that
fails we try llvm, then libbfd, and finally the command-line
addr2line tool:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
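That fallback order can be pictured as a simple first-success chain.
A hypothetical Python sketch (the function and backend names here are
illustrative, not perf's internals):

```python
# Illustrative sketch of the fallback order described above: try each
# addr2line backend in turn and return the first result that works.

def resolve_inline(addr, backends):
    """Return (backend_name, inline_stack) from the first backend
    that can resolve the address, or (None, None) if all fail."""
    for name, backend in backends:
        result = backend(addr)
        if result is not None:
            return name, result
    return None, None

# Stand-in backends: libdw, llvm and libbfd all fail (e.g. because of
# missing debug info), so the external addr2line tool finally answers.
backends = [
    ("libdw",     lambda a: None),
    ("llvm",      lambda a: None),
    ("libbfd",    lambda a: None),
    ("addr2line", lambda a: ["operator()", "CreateInterpreter"]),
]

print(resolve_inline(0x401234, backends))
```

When the early backends fail for every sampled address, the cost of
reaching the last backend is paid over and over, which matches the
slowdown pattern reported here.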
I suspect the slowdown is from doing all this addr2line work on a
binary that's been stripped. The good news is that you can add a
config option to avoid all the fallbacks: "addr2line.style=libdw".
You can also disable inline information by adding "--no-inline" to
your `perf report` command line.
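Combining that with the record-mode setting already in use, the
~/.perfconfig would look something like this (the [addr2line] section
spelling is inferred from the option name above; please double-check
against perf-config(1)):

```
# ~/.perfconfig (sketch)
[call-graph]
	record-mode = fp

[addr2line]
	style = libdw
```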
Your report suggests we should tweak the defaults for showing inline
information. Could you try the options I've suggested and see if they
remedy the issue for you?
Many thanks,
Ian
> Best regards,
> -Guilherme
* Re: perf performance with libdw vs libunwind
2026-04-23 4:21 ` Ian Rogers
@ 2026-04-23 9:49 ` Guilherme Amadio
2026-04-23 22:28 ` Ian Rogers
0 siblings, 1 reply; 4+ messages in thread
From: Guilherme Amadio @ 2026-04-23 9:49 UTC (permalink / raw)
To: Ian Rogers; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
Hi Ian,
On Wed, Apr 22, 2026 at 09:21:52PM -0700, Ian Rogers wrote:
> Hi Guilherme,
>
> Thanks for the feedback, but I'm a little confused. Your .perfconfig
> is set to use frame-pointer-based unwinding, so neither libunwind nor
> libdw should be used for unwinding. With frame-pointer unwinding, a
> sample contains an array of IPs gathered by walking the linked list
> of frame pointers on the stack. With --call-graph=dwarf, a region of
> memory (the stack) is copied into a sample along with some initial
> register values; libdw or libunwind is then used to process this
> memory using the DWARF information in the ELF binary.
Thanks for your reply, and pardon my ignorance: I thought the
libraries were used generically for stack unwinding, regardless of
whether it's fp or dwarf. I should have looked a bit deeper before
reporting this, but we are on the right track.
> So, something that changed in v7.0: with DWARF-based unwinding
> (libdw or libunwind) we always had all inline functions on the
> stack, but not with frame pointers. The IP in the frame-pointer
> array can be of an instruction within an inlined function. In v7.0
> we added a patch that includes inline information for both
> frame-pointer and LBR-based stack traces:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
This is a nice development. I have been using --call-graph=dwarf to
see the inlined symbols, so having the ability to see inlined
functions with fp unwinding, which is much more lightweight in terms
of space (i.e. the size of the final perf.data files), is great.
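As a back-of-envelope illustration of that space difference (the
numbers are assumptions for the sake of the sketch: perf's default
dwarf stack-dump size of 8192 bytes per sample, and a notional
64-frame fp callchain):

```python
# Rough per-sample callchain payload for each record mode.
# These are illustrative estimates, not exact perf.data accounting.
def sample_bytes(mode, depth=64, dump=8192):
    """Approximate callchain bytes added to one sample."""
    if mode == "fp":
        return depth * 8        # one 64-bit IP per frame
    if mode == "dwarf":
        return dump + 3 * 8     # copied stack plus a few registers
    raise ValueError(mode)

nsamples = 19922                # sample count from the perf.data above
for mode in ("fp", "dwarf"):
    mb = nsamples * sample_bytes(mode) / 1e6
    print(f"{mode:5s}: ~{mb:.1f} MB of callchain payload")
```

Even with generous assumptions for fp depth, the dwarf stack copies
dominate the file size by more than an order of magnitude.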
> By default we try to add inline information using libdw; if that
> fails we try llvm, then libbfd, and finally the command-line
> addr2line tool:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
> I suspect the slowdown is from doing all this addr2line work on a
> binary that's been stripped. The good news is that you can add a
> config option to avoid all the fallbacks: "addr2line.style=libdw".
> You can also disable inline information by adding "--no-inline" to
> your `perf report` command line.
The binary and its directly dependent libraries, as well as most
other dependencies, are not stripped, but it's possible that some
dependency in the full chain is.
When I run perf report with --no-inline, I indeed recover the
performance I had before with perf-6.19.12. However, setting
addr2line.style=libdw did not help much. Here is what I observe:
$ perf config
call-graph.record-mode=fp
$ perf version
perf version 7.0
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 63800 Hz
[ perf record: Woken up 18 times to write data ]
[ perf record: Captured and wrote 4.720 MB perf.data (16552 samples) ]
1.26
$ time perf report -q --stdio -g none --children --no-inline --percent-limit 75
92.65% 0.00% root.exe root.exe [.] main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.64% 0.00% root.exe root.exe [.] _start
91.63% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.46% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.31% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
88.22% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
88.20% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
76.34% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
1.70
Flamegraph: https://amadio.web.cern.ch/perf/perf-report-noinline.svg
Now without --no-inline, and this first command is without addr2line.style=libdw in the config:
$ time perf report -q --stdio -g none --children --percent-limit 75
92.65% 0.00% root.exe root.exe [.] main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.64% 0.00% root.exe root.exe [.] _start
88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
241.60
Flamegraph: https://amadio.web.cern.ch/perf/perf-report-addr2line.svg
$ perf config addr2line.style=libdw
$ perf config
call-graph.record-mode=fp
addr2line.style=libdw
$ time perf report -q --stdio -g none --children --percent-limit 75
92.65% 0.00% root.exe root.exe [.] main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.64% 0.00% root.exe root.exe [.] _start
88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
137.93
Flamegraph: https://amadio.web.cern.ch/perf/perf-report-libdw.svg
The flame graphs above are for the perf-report commands themselves.
So, the performance is fine with --no-inline, and it's better with addr2line.style=libdw.
However, the function names are not the best in the last two reports, so this problem remains.
> Your report suggests we should tweak the defaults for showing inline
> information. Could you try the options I've suggested and see if they
> remedy the issue for you?
Thank you for the suggestions. Indeed --no-inline seems to bring back the
previous performance. Please let me know if you would like me to try more
things and what other information you need for the cases without --no-inline.
Best regards,
-Guilherme
* Re: perf performance with libdw vs libunwind
2026-04-23 9:49 ` Guilherme Amadio
@ 2026-04-23 22:28 ` Ian Rogers
0 siblings, 0 replies; 4+ messages in thread
From: Ian Rogers @ 2026-04-23 22:28 UTC (permalink / raw)
To: Guilherme Amadio; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
On Thu, Apr 23, 2026 at 2:49 AM Guilherme Amadio <amadio@gentoo.org> wrote:
>
> Hi Ian,
>
> On Wed, Apr 22, 2026 at 09:21:52PM -0700, Ian Rogers wrote:
> > Hi Guilherme,
> >
> > Thanks for the feedback, but I'm a little confused. Your .perfconfig
> > is set to use frame-pointer-based unwinding, so neither libunwind nor
> > libdw should be used for unwinding. With frame-pointer unwinding, a
> > sample contains an array of IPs gathered by walking the linked list
> > of frame pointers on the stack. With --call-graph=dwarf, a region of
> > memory (the stack) is copied into a sample along with some initial
> > register values; libdw or libunwind is then used to process this
> > memory using the DWARF information in the ELF binary.
>
> Thanks for your reply, and pardon my ignorance: I thought the
> libraries were used generically for stack unwinding, regardless of
> whether it's fp or dwarf. I should have looked a bit deeper before
> reporting this, but we are on the right track.
>
> > So, something that changed in v7.0: with DWARF-based unwinding
> > (libdw or libunwind) we always had all inline functions on the
> > stack, but not with frame pointers. The IP in the frame-pointer
> > array can be of an instruction within an inlined function. In v7.0
> > we added a patch that includes inline information for both
> > frame-pointer and LBR-based stack traces:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
>
> This is a nice development. I have been using --call-graph=dwarf to
> see the inlined symbols, so having the ability to see inlined
> functions with fp unwinding, which is much more lightweight in terms
> of space (i.e. the size of the final perf.data files), is great.
>
> > By default we try to add inline information using libdw; if that
> > fails we try llvm, then libbfd, and finally the command-line
> > addr2line tool:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
> > I suspect the slowdown is from doing all this addr2line work on a
> > binary that's been stripped. The good news is that you can add a
> > config option to avoid all the fallbacks: "addr2line.style=libdw".
> > You can also disable inline information by adding "--no-inline" to
> > your `perf report` command line.
>
> The binary and its directly dependent libraries, as well as most
> other dependencies, are not stripped, but it's possible that some
> dependency in the full chain is.
>
> When I run perf report with --no-inline, I indeed recover the
> performance I had before with perf-6.19.12. However, setting
> addr2line.style=libdw did not help much. Here is what I observe:
>
> $ perf config
> call-graph.record-mode=fp
> $ perf version
> perf version 7.0
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 63800 Hz
> [ perf record: Woken up 18 times to write data ]
> [ perf record: Captured and wrote 4.720 MB perf.data (16552 samples) ]
> 1.26
> $ time perf report -q --stdio -g none --children --no-inline --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.64% 0.00% root.exe root.exe [.] _start
> 91.63% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.46% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.31% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 88.20% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
> 1.70
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-noinline.svg
Thanks for all the reporting!
So here the C++ demangler accounts for about a third of the execution
time, and there is no DWARF decoding for inline functions.
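As a rough illustration of that demangling cost: c++filt (from binutils) applies the same Itanium-ABI transformation that perf's demangler runs for every symbol it prints. The mangled name below is one I reconstructed by hand from the TRint constructor shown in the report above, not one taken from the actual binary.

```shell
# Demangle a hand-reconstructed Itanium-ABI name for the TRint constructor;
# perf performs this transformation for every C++ symbol in the report.
c++filt _ZN5TRintC1EPKcPiPPcPvibb
# → TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
```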
> Now without --no-inline; this first command is also without addr2line.style=libdw in the config:
>
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe root.exe [.] _start
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
>
> 241.60
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-addr2line.svg
Here perf is first trying libdw for the addr2line lookup and then
falling back to the addr2line command. Time is mainly spent gathering
addr2line inline information.
> $ perf config addr2line.style=libdw
> $ perf config
> call-graph.record-mode=fp
> addr2line.style=libdw
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe root.exe [.] _start
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
>
> 137.93
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-libdw.svg
Here time is just spent in libdw.
> The flame graphs above are for the perf-report commands themselves.
>
> So, the performance is fine with --no-inline, and it's better with addr2line.style=libdw.
> However, the function names are not the best in the last two reports, so this problem remains.
>
> > Your report suggests we should tweak the defaults for showing inline
> > information. Could you try the options I've suggested and see if they
> > remedy the issue for you?
>
> Thank you for the suggestions. Indeed --no-inline seems to bring back the
> previous performance. Please let me know if you would like me to try more
> things and what other information you need for the cases without --no-inline.
Performance-wise, things are working as expected. I'm confused about
why we see different symbol names; perhaps this points to a libdw bug.
With or without --no-inline, libelf gets the symbol name; libdw is only
used to get the source line and inlining information. Perhaps this is
more of a bug with `-g none`, which is an option I've never used. I'm
quite busy at the moment, so it's not easy for me to dig into this.
Perhaps we can create a test and try to get an LLM to investigate it.
Thanks,
Ian
> Best regards,
> -Guilherme
Thread overview: 4+ messages
2026-04-22 13:26 perf performance with libdw vs libunwind Guilherme Amadio
2026-04-23 4:21 ` Ian Rogers
2026-04-23 9:49 ` Guilherme Amadio
2026-04-23 22:28 ` Ian Rogers