From: Guilherme Amadio <amadio@gentoo.org>
To: Ian Rogers <irogers@google.com>
Cc: acme@kernel.org, linux-perf-users@vger.kernel.org,
linux-kernel@vger.kernel.org, libunwind-devel@nongnu.org
Subject: perf performance with libdw vs libunwind
Date: Wed, 22 Apr 2026 15:26:18 +0200 [thread overview]
Message-ID: <aejMem8Pt3iCxaka@gentoo.org> (raw)
Dear Ian,
Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
to use libdw, as libunwind has been deprecated. However, when I tried
to use perf, I noticed a substantial performance regression and some
other problems, which I report below.
Here I use an example that serves as my own "standard candle"¹ for checking
that stack unwinding is working properly: the startup of ROOT², a C++
interpreter heavily used in high-energy-physics data analysis. I simply
run 'root -l -q', which is the equivalent of 'python -c ""' for ROOT.
It takes less than a second to run, but since it performs a full
initialization of Clang/LLVM as part of the interpreter, it produces a
rich flamegraph whose shape I know ahead of time, so I use it to check
that stack unwinding and symbol resolution are working.
1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
2. https://root.cern
Below I show a comparison of the timings of perf record/report for this.
First, I run it with perf-6.19.12 which is configured to use libunwind:
$ perf config
call-graph.record-mode=fp
$ perf -vv
perf version 6.19.12
aio: [ on ] # HAVE_AIO_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
bpf_skeletons: [ on ] # HAVE_BPF_SKEL
debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
dwarf: [ on ] # HAVE_LIBDW_SUPPORT
dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpfm4: [ on ] # HAVE_LIBPFM
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
lzma: [ on ] # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
1.25
$ time perf report -q --stdio -g none --children --percent-limit 75
92.63% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.63% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.63% 0.00% root.exe root.exe [.] _start
92.63% 0.00% root.exe root.exe [.] main
91.53% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.36% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.18% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
88.10% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
88.08% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
75.62% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
75.62% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
1.86
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
4.08
$ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg
So, as you can see above, a simple perf-report took 1.86 seconds, and
perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
perf-7.0 with libdw, this is what I see:
$ perf -vv
perf version 7.0
aio: [ on ] # HAVE_AIO_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
bpf_skeletons: [ on ] # HAVE_BPF_SKEL
debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
dwarf: [ on ] # HAVE_LIBDW_SUPPORT
dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
libbabeltrace: [ on ] # HAVE_LIBBABELTRACE_SUPPORT
libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpfm4: [ on ] # HAVE_LIBPFM
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
lzma: [ on ] # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
rust: [ on ] # HAVE_RUST_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
1.28
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
92.44% 0.00% root.exe root.exe [.] main
92.44% 0.00% root.exe root.exe [.] _start
92.44% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
87.95% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
75.78% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
User time (seconds): 250.33
System time (seconds): 21.18
Percent of CPU this job got: 99%
** Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
** Maximum resident set size (kbytes): 4433000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 9850739
Voluntary context switches: 226
Involuntary context switches: 11388
Swaps: 0
File system inputs: 80776
File system outputs: 232
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After seeing how much memory perf was using, I decided to record that
too. As you can see above, perf 7.0 with libdw took 4 minutes and 33
seconds for the same simple perf-report that took 1.86 seconds before,
the symbol names are not as complete as with libunwind, and stack
unwinding itself also seems inconsistent with the previous run.
Here's the equivalent with perf-6.19.12:
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
92.46% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.46% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.46% 0.00% root.exe root.exe [.] _start
92.46% 0.00% root.exe root.exe [.] main
91.38% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.24% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.05% 0.01% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
87.96% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
87.95% 0.01% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
75.50% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
75.50% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
User time (seconds): 1.79
System time (seconds): 0.08
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 265108
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 37887
Voluntary context switches: 4
Involuntary context switches: 77
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Then, this is perf-script:
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
... (line repeated many times)
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
273.49
I see many of these cmd__addr2line errors, and perf-script takes 273.49
seconds compared with 4.08 seconds with perf-6.19.12. The flamegraph also
has abbreviated function names like "operator()" instead of the full name,
which is somewhat problematic, as it loses information relative to what
libunwind used to provide. The flamegraphs for the two runs above are
available at https://cern.ch/amadio/perf. I didn't want to attach the
files here, as I don't want to send big files to the lists.
For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.
If you'd like to run the same kind of test, you can install ROOT from
EPEL on a RHEL-like distribution inside a container with a simple
"dnf install root", or just try the same record/report commands on a
clang++ compilation of a simple program as a decent replacement.
Best regards,
-Guilherme
Thread overview: 4+ messages
2026-04-22 13:26 Guilherme Amadio [this message]
2026-04-23 4:21 ` perf performance with libdw vs libunwind Ian Rogers
2026-04-23 9:49 ` Guilherme Amadio
2026-04-23 22:28 ` Ian Rogers