public inbox for linux-kernel@vger.kernel.org
* perf performance with libdw vs libunwind
@ 2026-04-22 13:26 Guilherme Amadio
  2026-04-23  4:21 ` Ian Rogers
  0 siblings, 1 reply; 4+ messages in thread
From: Guilherme Amadio @ 2026-04-22 13:26 UTC (permalink / raw)
  To: Ian Rogers; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel

Dear Ian,

Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
to use libdw, as libunwind has been deprecated. However, when I tried
to use perf, I noticed a substantial performance regression and some
other problems, which I report below.

The example I use here is my own "standard candle"¹ for checking
that stack unwinding works properly: the startup of ROOT², a C++
interpreter heavily used in high energy physics data analysis.
I simply run 'root -l -q', which is the ROOT equivalent of
'python -c ""'. It takes less than a second to run, but since it
performs a full initialization of Clang/LLVM as part of the
interpreter, it produces a rich flamegraph whose shape I know ahead
of time, so I use it to check that stack unwinding and symbol
resolution are working.

1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
2. https://root.cern

Below is a comparison of perf record/report timings for this workload.

First, I run it with perf-6.19.12 which is configured to use libunwind:

$ perf config
call-graph.record-mode=fp
$ perf -vv
perf version 6.19.12
                   aio: [ on  ]  # HAVE_AIO_SUPPORT
                   bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT
         bpf_skeletons: [ on  ]  # HAVE_BPF_SKEL
            debuginfod: [ on  ]  # HAVE_DEBUGINFOD_SUPPORT
                 dwarf: [ on  ]  # HAVE_LIBDW_SUPPORT
    dwarf_getlocations: [ on  ]  # HAVE_LIBDW_SUPPORT
          dwarf-unwind: [ on  ]  # HAVE_DWARF_UNWIND_SUPPORT
                libbfd: [ OFF ]  # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
        libbpf-strings: [ on  ]  # HAVE_LIBBPF_STRINGS_SUPPORT
           libcapstone: [ on  ]  # HAVE_LIBCAPSTONE_SUPPORT
    libdw-dwarf-unwind: [ on  ]  # HAVE_LIBDW_SUPPORT
                libelf: [ on  ]  # HAVE_LIBELF_SUPPORT
               libLLVM: [ on  ]  # HAVE_LIBLLVM_SUPPORT
               libnuma: [ on  ]  # HAVE_LIBNUMA_SUPPORT
            libopencsd: [ OFF ]  # HAVE_CSTRACE_SUPPORT
               libperl: [ on  ]  # HAVE_LIBPERL_SUPPORT
               libpfm4: [ on  ]  # HAVE_LIBPFM
             libpython: [ on  ]  # HAVE_LIBPYTHON_SUPPORT
              libslang: [ on  ]  # HAVE_SLANG_SUPPORT
         libtraceevent: [ on  ]  # HAVE_LIBTRACEEVENT
             libunwind: [ on  ]  # HAVE_LIBUNWIND_SUPPORT
                  lzma: [ on  ]  # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on  ]  # HAVE_LIBNUMA_SUPPORT
                  zlib: [ on  ]  # HAVE_ZLIB_SUPPORT
                  zstd: [ on  ]  # HAVE_ZSTD_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz

[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
1.25
$ time perf report -q --stdio -g none --children --percent-limit 75
    92.63%     0.00%  root.exe  libc.so.6             [.] __libc_start_call_main
    92.63%     0.00%  root.exe  libc.so.6             [.] __libc_start_main@@GLIBC_2.34
    92.63%     0.00%  root.exe  root.exe              [.] _start
    92.63%     0.00%  root.exe  root.exe              [.] main
    91.53%     0.01%  root.exe  libCore.so.6.38.04    [.] ROOT::GetROOT()
    89.36%     0.00%  root.exe  libRint.so.6.38.04    [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
    88.18%     0.00%  root.exe  libCore.so.6.38.04    [.] TApplication::TApplication(char const*, int*, char**, void*, int)
    88.10%     0.01%  root.exe  libCore.so.6.38.04    [.] ROOT::Internal::GetROOT2()
    88.08%     0.00%  root.exe  libCore.so.6.38.04    [.] TROOT::InitInterpreter()
    75.62%     0.00%  root.exe  libCling.so.6.38.04   [.] CreateInterpreter
    75.62%     0.00%  root.exe  libCling.so.6.38.04   [.] TCling::TCling(char const*, char const*, char const* const*, void*)

1.86
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
4.08
$ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg

So, as you can see above, a simple perf-report took 1.86 seconds, and
perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
perf-7.0 with libdw, this is what I see:

$ perf -vv
perf version 7.0
                   aio: [ on  ]  # HAVE_AIO_SUPPORT
                   bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT
         bpf_skeletons: [ on  ]  # HAVE_BPF_SKEL
            debuginfod: [ on  ]  # HAVE_DEBUGINFOD_SUPPORT
                 dwarf: [ on  ]  # HAVE_LIBDW_SUPPORT
    dwarf_getlocations: [ on  ]  # HAVE_LIBDW_SUPPORT
          dwarf-unwind: [ on  ]  # HAVE_DWARF_UNWIND_SUPPORT
                libbfd: [ OFF ]  # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
         libbabeltrace: [ on  ]  # HAVE_LIBBABELTRACE_SUPPORT
        libbpf-strings: [ on  ]  # HAVE_LIBBPF_STRINGS_SUPPORT
           libcapstone: [ on  ]  # HAVE_LIBCAPSTONE_SUPPORT
    libdw-dwarf-unwind: [ on  ]  # HAVE_LIBDW_SUPPORT
                libelf: [ on  ]  # HAVE_LIBELF_SUPPORT
               libLLVM: [ on  ]  # HAVE_LIBLLVM_SUPPORT
               libnuma: [ on  ]  # HAVE_LIBNUMA_SUPPORT
            libopencsd: [ OFF ]  # HAVE_CSTRACE_SUPPORT
               libperl: [ on  ]  # HAVE_LIBPERL_SUPPORT
               libpfm4: [ on  ]  # HAVE_LIBPFM
             libpython: [ on  ]  # HAVE_LIBPYTHON_SUPPORT
              libslang: [ on  ]  # HAVE_SLANG_SUPPORT
         libtraceevent: [ on  ]  # HAVE_LIBTRACEEVENT
             libunwind: [ OFF ]  # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
                  lzma: [ on  ]  # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on  ]  # HAVE_LIBNUMA_SUPPORT
                  zlib: [ on  ]  # HAVE_ZLIB_SUPPORT
                  zstd: [ on  ]  # HAVE_ZSTD_SUPPORT
                  rust: [ on  ]  # HAVE_RUST_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz

[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
1.28
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
    92.44%     0.00%  root.exe  root.exe              [.] main
    92.44%     0.00%  root.exe  root.exe              [.] _start
    92.44%     0.00%  root.exe  libc.so.6             [.] __libc_start_call_main
    87.95%     0.00%  root.exe  libCore.so.6.38.04    [.] GetROOT2 (inlined)
    75.78%     0.00%  root.exe  libCling.so.6.38.04   [.] CreateInterpreter

	Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
	User time (seconds): 250.33
	System time (seconds): 21.18
	Percent of CPU this job got: 99%
**	Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
**	Maximum resident set size (kbytes): 4433000
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 7
	Minor (reclaiming a frame) page faults: 9850739
	Voluntary context switches: 226
	Involuntary context switches: 11388
	Swaps: 0
	File system inputs: 80776
	File system outputs: 232
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After seeing how much memory perf was using, I decided to record that
too (the lines marked ** above). As you can see, perf 7.0 with libdw
took 4 minutes and 33 seconds for the same simple perf-report that
took 1.86 seconds before, and the symbol names are not as complete as
with libunwind. Stack unwinding itself also seems inconsistent with
the previous run.
Here's the equivalent with perf-6.19.12:

$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
    92.46%     0.00%  root.exe  libc.so.6             [.] __libc_start_call_main
    92.46%     0.00%  root.exe  libc.so.6             [.] __libc_start_main@@GLIBC_2.34
    92.46%     0.00%  root.exe  root.exe              [.] _start
    92.46%     0.00%  root.exe  root.exe              [.] main
    91.38%     0.01%  root.exe  libCore.so.6.38.04    [.] ROOT::GetROOT()
    89.24%     0.00%  root.exe  libRint.so.6.38.04    [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
    88.05%     0.01%  root.exe  libCore.so.6.38.04    [.] TApplication::TApplication(char const*, int*, char**, void*, int)
    87.96%     0.01%  root.exe  libCore.so.6.38.04    [.] ROOT::Internal::GetROOT2()
    87.95%     0.01%  root.exe  libCore.so.6.38.04    [.] TROOT::InitInterpreter()
    75.50%     0.00%  root.exe  libCling.so.6.38.04   [.] CreateInterpreter
    75.50%     0.00%  root.exe  libCling.so.6.38.04   [.] TCling::TCling(char const*, char const*, char const* const*, void*)


	Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
	User time (seconds): 1.79
	System time (seconds): 0.08
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 265108
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 37887
	Voluntary context switches: 4
	Involuntary context switches: 77
	Swaps: 0
	File system inputs: 0
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Then, this is perf-script:

$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
... (line repeated many times)
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
273.49

I see many of these cmd__addr2line errors, and perf-script takes
273.49 seconds compared with 4.08 seconds with perf-6.19.12. The
flamegraph also has abbreviated function names like "operator()"
instead of the full name, which is problematic as there's a loss of
information relative to what libunwind used to provide. The
flamegraphs for the two runs above are available at
https://cern.ch/amadio/perf. I didn't want to attach the files here,
as I don't want to send big files to the lists.
For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.

If you'd like to perform the same kind of test, you can install ROOT
from EPEL on a RHEL-like distribution inside a container with a simple
"dnf install root", or just try the same record/report commands on a
clang++ compilation of a simple program as a decent replacement.

Best regards,
-Guilherme



Thread overview: 4+ messages
2026-04-22 13:26 perf performance with libdw vs libunwind Guilherme Amadio
2026-04-23  4:21 ` Ian Rogers
2026-04-23  9:49   ` Guilherme Amadio
2026-04-23 22:28     ` Ian Rogers
