* perf performance with libdw vs libunwind
@ 2026-04-22 13:26 Guilherme Amadio
2026-04-23 4:21 ` Ian Rogers
0 siblings, 1 reply; 4+ messages in thread
From: Guilherme Amadio @ 2026-04-22 13:26 UTC (permalink / raw)
To: Ian Rogers; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
Dear Ian,
Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
to use libdw, as libunwind has been deprecated. However, when I tried
to use perf, I noticed a substantial performance regression and some
other problems, which I report below.
I use here an example that serves as my own "standard candle"¹ for
checking that stack unwinding works properly: the startup of ROOT²,
a C++ interpreter heavily used in high energy physics data analysis.
I simply run 'root -l -q', which is the ROOT equivalent of
'python -c ""'. It takes less than a second to run, but since it
performs a full initialization of Clang/LLVM as part of the
interpreter, it produces a rich flame graph whose shape I know ahead
of time, so I use it to check that stack unwinding and symbol
resolution are working.
1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
2. https://root.cern
Below I show a comparison of the timings of perf record/report for this.
First, I run it with perf-6.19.12 which is configured to use libunwind:
$ perf config
call-graph.record-mode=fp
$ perf -vv
perf version 6.19.12
aio: [ on ] # HAVE_AIO_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
bpf_skeletons: [ on ] # HAVE_BPF_SKEL
debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
dwarf: [ on ] # HAVE_LIBDW_SUPPORT
dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpfm4: [ on ] # HAVE_LIBPFM
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
lzma: [ on ] # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
1.25
$ time perf report -q --stdio -g none --children --percent-limit 75
92.63% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.63% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.63% 0.00% root.exe root.exe [.] _start
92.63% 0.00% root.exe root.exe [.] main
91.53% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.36% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.18% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
88.10% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
88.08% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
75.62% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
75.62% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
1.86
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
4.08
$ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg
So, as you can see above, a simple perf-report took 1.86 seconds, and
perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
perf-7.0 with libdw, this is what I see:
$ perf -vv
perf version 7.0
aio: [ on ] # HAVE_AIO_SUPPORT
bpf: [ on ] # HAVE_LIBBPF_SUPPORT
bpf_skeletons: [ on ] # HAVE_BPF_SKEL
debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
dwarf: [ on ] # HAVE_LIBDW_SUPPORT
dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
libbabeltrace: [ on ] # HAVE_LIBBABELTRACE_SUPPORT
libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
libelf: [ on ] # HAVE_LIBELF_SUPPORT
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
libperl: [ on ] # HAVE_LIBPERL_SUPPORT
libpfm4: [ on ] # HAVE_LIBPFM
libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
libslang: [ on ] # HAVE_SLANG_SUPPORT
libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
lzma: [ on ] # HAVE_LZMA_SUPPORT
numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
zlib: [ on ] # HAVE_ZLIB_SUPPORT
zstd: [ on ] # HAVE_ZSTD_SUPPORT
rust: [ on ] # HAVE_RUST_SUPPORT
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 79800 Hz
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
1.28
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
92.44% 0.00% root.exe root.exe [.] main
92.44% 0.00% root.exe root.exe [.] _start
92.44% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
87.95% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
75.78% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
User time (seconds): 250.33
System time (seconds): 21.18
Percent of CPU this job got: 99%
** Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
** Maximum resident set size (kbytes): 4433000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 9850739
Voluntary context switches: 226
Involuntary context switches: 11388
Swaps: 0
File system inputs: 80776
File system outputs: 232
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After seeing how much memory perf was using, I decided to record that
too. As you can see above, perf 7.0 with libdw took 4 minutes and 33
seconds for the same simple perf-report that took 1.86 seconds before,
and the symbol names are not as complete as with libunwind.
Stack unwinding itself also seems inconsistent with the previous run.
Here's the equivalent with perf-6.19.12:
Here's the equivalent with perf-6.19.12:
$ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
92.46% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.46% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.46% 0.00% root.exe root.exe [.] _start
92.46% 0.00% root.exe root.exe [.] main
91.38% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.24% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.05% 0.01% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
87.96% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
87.95% 0.01% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
75.50% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
75.50% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
User time (seconds): 1.79
System time (seconds): 0.08
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 265108
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 37887
Voluntary context switches: 4
Involuntary context switches: 77
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Then, this is perf-script:
$ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
... (line repeated many times)
cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
273.49
I see many of these cmd__addr2line errors, and perf-script takes
273.49 seconds compared with 4.08 seconds with perf-6.19.12. The
flame graph also has abbreviated function names like "operator()"
instead of the full name, which is somewhat problematic, as there's a
loss of information relative to what libunwind used to provide. The
flame graphs for the two runs above are available at
https://cern.ch/amadio/perf (I didn't attach the files here, as I
don't want to send big files to the lists).
For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.
If you'd like to perform the same kind of test, you can install ROOT
from EPEL on a RHEL-like distribution inside a container with a simple
"dnf install root", or just try the same record/report commands with a
clang++ compilation of a simple program as a decent replacement.
Best regards,
-Guilherme
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: perf performance with libdw vs libunwind
2026-04-22 13:26 perf performance with libdw vs libunwind Guilherme Amadio
@ 2026-04-23 4:21 ` Ian Rogers
2026-04-23 9:49 ` Guilherme Amadio
0 siblings, 1 reply; 4+ messages in thread
From: Ian Rogers @ 2026-04-23 4:21 UTC (permalink / raw)
To: Guilherme Amadio; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
On Wed, Apr 22, 2026 at 6:26 AM Guilherme Amadio <amadio@gentoo.org> wrote:
>
> Dear Ian,
>
> Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
> to use libdw, as libunwind has been deprecated. However, when I tried
> to use perf, I noticed a substantial performance regression and some
> other problems, which I report below.
>
> I use here an example that serves as my own "standard candle"¹ for
> checking that stack unwinding works properly: the startup of ROOT²,
> a C++ interpreter heavily used in high energy physics data analysis.
> I simply run 'root -l -q', which is the ROOT equivalent of
> 'python -c ""'. It takes less than a second to run, but since it
> performs a full initialization of Clang/LLVM as part of the
> interpreter, it produces a rich flame graph whose shape I know ahead
> of time, so I use it to check that stack unwinding and symbol
> resolution are working.
>
> 1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
> 2. https://root.cern
>
> Below I show a comparison of the timings of perf record/report for this.
>
> First, I run it with perf-6.19.12 which is configured to use libunwind:
>
> $ perf config
> call-graph.record-mode=fp
> $ perf -vv
> perf version 6.19.12
> aio: [ on ] # HAVE_AIO_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> bpf_skeletons: [ on ] # HAVE_BPF_SKEL
> debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
> dwarf: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
> libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
> libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
> libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpfm4: [ on ] # HAVE_LIBPFM
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 79800 Hz
>
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
> 1.25
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.63% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.63% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.63% 0.00% root.exe root.exe [.] _start
> 92.63% 0.00% root.exe root.exe [.] main
> 91.53% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.36% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.18% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 88.10% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 88.08% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 75.62% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 75.62% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
> 1.86
> $ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
> 4.08
> $ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg
>
> So, as you can see above, a simple perf-report took 1.86 seconds, and
> perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
> perf-7.0 with libdw, this is what I see:
>
> $ perf -vv
> perf version 7.0
> aio: [ on ] # HAVE_AIO_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> bpf_skeletons: [ on ] # HAVE_BPF_SKEL
> debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
> dwarf: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
> libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
> libbabeltrace: [ on ] # HAVE_LIBBABELTRACE_SUPPORT
> libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
> libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpfm4: [ on ] # HAVE_LIBPFM
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
> libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
> rust: [ on ] # HAVE_RUST_SUPPORT
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 79800 Hz
>
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
> 1.28
> $ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
> 92.44% 0.00% root.exe root.exe [.] main
> 92.44% 0.00% root.exe root.exe [.] _start
> 92.44% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 87.95% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
> 75.78% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
>
> Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
> User time (seconds): 250.33
> System time (seconds): 21.18
> Percent of CPU this job got: 99%
> ** Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> ** Maximum resident set size (kbytes): 4433000
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 7
> Minor (reclaiming a frame) page faults: 9850739
> Voluntary context switches: 226
> Involuntary context switches: 11388
> Swaps: 0
> File system inputs: 80776
> File system outputs: 232
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> After seeing how much memory perf was using, I decided to record that
> too. As you can see above, perf 7.0 with libdw took 4 minutes and 33
> seconds for the same simple perf-report that took 1.86 seconds before,
> and the symbol names are not as complete as with libunwind.
Hi Guilherme,
Thanks for the feedback, but I'm a little confused. Your .perfconfig
is set to use frame-pointer-based unwinding, so neither libunwind nor
libdw should be used for unwinding. With frame-pointer unwinding, a
sample contains an array of IPs gathered by walking the linked list
of frame pointers on the stack. With --call-graph=dwarf, a region of
memory (the stack) is copied into a sample along with some initial
register values; libdw or libunwind is then used to process this
memory using the DWARF information in the ELF binary.
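To make the distinction concrete, frame-pointer unwinding is
essentially a pointer walk. A toy Python illustration with a mock
stack (this is not perf's actual code; the addresses and frame layout
are made up, following the common x86-64 convention):

```python
# Illustrative sketch: frame-pointer unwinding follows a linked list
# on the stack. Each frame stores the caller's saved frame pointer
# and, next to it, the return address into the caller.
#
# Mock stack memory: address -> 64-bit value, laid out as
#   [fp]     = saved caller frame pointer (0 terminates the chain)
#   [fp + 8] = return address in the caller
stack = {
    0x7f10: 0x7f30, 0x7f18: 0x401234,   # innermost frame
    0x7f30: 0x7f50, 0x7f38: 0x401100,   # middle frame
    0x7f50: 0x0,    0x7f58: 0x401000,   # outermost frame
}

def fp_unwind(fp, ip):
    """Collect the array of IPs by walking saved frame pointers."""
    ips = [ip]                      # the sampled instruction pointer
    while fp:
        ips.append(stack[fp + 8])   # return address of this frame
        fp = stack[fp]              # hop to the caller's frame
    return ips

print([hex(ip) for ip in fp_unwind(0x7f10, 0x401300)])
```

With --call-graph=dwarf there is no such chain to follow at sample
time; the raw stack bytes are saved and the unwind happens later in
userspace.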
> Stack unwinding itself also seems inconsistent with the previous run.
> Here's the equivalent with perf-6.19.12:
>
> $ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
> 92.46% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.46% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.46% 0.00% root.exe root.exe [.] _start
> 92.46% 0.00% root.exe root.exe [.] main
> 91.38% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.24% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.05% 0.01% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 87.96% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 87.95% 0.01% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 75.50% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 75.50% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
>
> Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
> User time (seconds): 1.79
> System time (seconds): 0.08
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 265108
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 0
> Minor (reclaiming a frame) page faults: 37887
> Voluntary context switches: 4
> Involuntary context switches: 77
> Swaps: 0
> File system inputs: 0
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> Then, this is perf-script:
>
> $ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
> cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
> ... (line repeated many times)
> cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
> 273.49
>
> I see many of these cmd__addr2line errors, and perf-script takes
> 273.49 seconds compared with 4.08 seconds with perf-6.19.12. The
> flame graph also has abbreviated function names like "operator()"
> instead of the full name, which is somewhat problematic, as there's a
> loss of information relative to what libunwind used to provide. The
> flame graphs for the two runs above are available at
> https://cern.ch/amadio/perf (I didn't attach the files here, as I
> don't want to send big files to the lists).
> For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.
>
> If you'd like to perform the same kind of test, you can install ROOT
> from EPEL on a RHEL-like distribution inside a container with a simple
> "dnf install root", or just try the same record/report commands with a
> clang++ compilation of a simple program as a decent replacement.
So, something that changed in v7.0: with DWARF-based unwinding
(libdw or libunwind) we always had all inline functions on the
stack, but not with frame pointers. The IP in the frame-pointer
array can be of an instruction within an inlined function. In v7.0
we added a patch that includes inline information for both
frame-pointer and LBR-based stack traces:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
By default we try to add inline information using libdw; if that
fails we try llvm, then libbfd, and finally the command-line
addr2line tool:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
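That fallback order can be pictured as a simple first-success chain.
A hypothetical Python sketch (the function and backend names here are
illustrative, not perf's internals):

```python
# Illustrative sketch of the fallback order described above: try each
# addr2line backend in turn and return the first result that works.

def resolve_inline(addr, backends):
    """Return (backend_name, inline_stack) from the first backend
    that can resolve the address, or (None, None) if all fail."""
    for name, backend in backends:
        result = backend(addr)
        if result is not None:
            return name, result
    return None, None

# Stand-in backends: libdw, llvm and libbfd all fail (e.g. because of
# missing debug info), so the external addr2line tool finally answers.
backends = [
    ("libdw",     lambda a: None),
    ("llvm",      lambda a: None),
    ("libbfd",    lambda a: None),
    ("addr2line", lambda a: ["operator()", "CreateInterpreter"]),
]

print(resolve_inline(0x401234, backends))
```

When the early backends fail for every sampled address, the cost of
reaching the last backend is paid over and over, which matches the
slowdown pattern reported here.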
I suspect the slowdown is from doing all this addr2line work on a
binary that's been stripped. The good news is that you can add a
config option to avoid all the fallbacks: "addr2line.style=libdw".
You can also disable inline information by adding "--no-inline" to
your `perf report` command line.
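Combining that with the record-mode setting already in use, the
~/.perfconfig would look something like this (the [addr2line] section
spelling is inferred from the option name above; please double-check
against perf-config(1)):

```
# ~/.perfconfig (sketch)
[call-graph]
	record-mode = fp

[addr2line]
	style = libdw
```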
Your report suggests we should tweak the defaults for showing inline
information. Could you try the options I've suggested and see if they
remedy the issue for you?
Many thanks,
Ian
> Best regards,
> -Guilherme
* Re: perf performance with libdw vs libunwind
2026-04-23 4:21 ` Ian Rogers
@ 2026-04-23 9:49 ` Guilherme Amadio
2026-04-23 22:28 ` Ian Rogers
0 siblings, 1 reply; 4+ messages in thread
From: Guilherme Amadio @ 2026-04-23 9:49 UTC (permalink / raw)
To: Ian Rogers; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
Hi Ian,
On Wed, Apr 22, 2026 at 09:21:52PM -0700, Ian Rogers wrote:
> Hi Guilherme,
>
> Thanks for the feedback, but I'm a little confused. Your .perfconfig
> is set to use frame-pointer-based unwinding, so neither libunwind nor
> libdw should be used for unwinding. With frame-pointer unwinding, a
> sample contains an array of IPs gathered by walking the linked list
> of frame pointers on the stack. With --call-graph=dwarf, a region of
> memory (the stack) is copied into a sample along with some initial
> register values; libdw or libunwind is then used to process this
> memory using the DWARF information in the ELF binary.
Thanks for your reply, and pardon my ignorance: I thought the
libraries were used generically for stack unwinding, regardless of
whether it's fp or dwarf. I should have looked a bit deeper before
reporting this, but we are on the right track.
> So, something that changed in v7.0: with DWARF-based unwinding
> (libdw or libunwind) we always had all inline functions on the
> stack, but not with frame pointers. The IP in the frame-pointer
> array can be of an instruction within an inlined function. In v7.0
> we added a patch that includes inline information for both
> frame-pointer and LBR-based stack traces:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
This is a nice development. I have been using --call-graph=dwarf to
see the inlined symbols, so having the ability to see inlined
functions with fp unwinding, which is much more lightweight in terms
of space (i.e. the size of the final perf.data files), is great.
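As a back-of-envelope illustration of that space difference (the
numbers are assumptions for the sake of the sketch: perf's default
dwarf stack-dump size of 8192 bytes per sample, and a notional
64-frame fp callchain):

```python
# Rough per-sample callchain payload for each record mode.
# These are illustrative estimates, not exact perf.data accounting.
def sample_bytes(mode, depth=64, dump=8192):
    """Approximate callchain bytes added to one sample."""
    if mode == "fp":
        return depth * 8        # one 64-bit IP per frame
    if mode == "dwarf":
        return dump + 3 * 8     # copied stack plus a few registers
    raise ValueError(mode)

nsamples = 19922                # sample count from the perf.data above
for mode in ("fp", "dwarf"):
    mb = nsamples * sample_bytes(mode) / 1e6
    print(f"{mode:5s}: ~{mb:.1f} MB of callchain payload")
```

Even with generous assumptions for fp depth, the dwarf stack copies
dominate the file size by more than an order of magnitude.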
> By default we try to add inline information using libdw; if that
> fails we try llvm, then libbfd, and finally the command-line
> addr2line tool:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
> I suspect the slowdown is from doing all this addr2line work on a
> binary that's been stripped. The good news is that you can add a
> config option to avoid all the fallbacks: "addr2line.style=libdw".
> You can also disable inline information by adding "--no-inline" to
> your `perf report` command line.
The binary and its directly dependent libraries, as well as most
other dependencies, are not stripped, but it's possible that some
dependency in the full chain is.
When I run perf report with --no-inline, I indeed recover the
performance I had before with perf-6.19.12. However, setting
addr2line.style=libdw did not help much. Here is what I observe:
$ perf config
call-graph.record-mode=fp
$ perf version
perf version 7.0
$ time perf record -g -F max -e cycles:uk -- root.exe -l -q
info: Using a maximum frequency rate of 63800 Hz
[ perf record: Woken up 18 times to write data ]
[ perf record: Captured and wrote 4.720 MB perf.data (16552 samples) ]
1.26
$ time perf report -q --stdio -g none --children --no-inline --percent-limit 75
92.65% 0.00% root.exe root.exe [.] main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
92.64% 0.00% root.exe root.exe [.] _start
91.63% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
89.46% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
88.31% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
88.22% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
88.20% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
76.34% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
1.70
Flamegraph: https://amadio.web.cern.ch/perf/perf-report-noinline.svg
Now without --no-inline, and this first command is without addr2line.style=libdw in the config:
$ time perf report -q --stdio -g none --children --percent-limit 75
92.65% 0.00% root.exe root.exe [.] main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.64% 0.00% root.exe root.exe [.] _start
88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
241.60
Flamegraph: https://amadio.web.cern.ch/perf/perf-report-addr2line.svg
$ perf config addr2line.style=libdw
$ perf config
call-graph.record-mode=fp
addr2line.style=libdw
$ time perf report -q --stdio -g none --children --percent-limit 75
92.65% 0.00% root.exe root.exe [.] main
92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
92.64% 0.00% root.exe root.exe [.] _start
88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
137.93
Flamegraph: https://amadio.web.cern.ch/perf/perf-report-libdw.svg
The flame graphs above are for the perf-report commands themselves.
So, the performance is fine with --no-inline, and it's better with addr2line.style=libdw.
However, the function names are not the best in the last two reports, so this problem remains.
> Your report suggests we should tweak the defaults for showing inline
> information. Could you try the options I've suggested and see if they
> remedy the issue for you?
Thank you for the suggestions. Indeed --no-inline seems to bring back the
previous performance. Please let me know if you would like me to try more
things and what other information you need for the cases without --no-inline.
Best regards,
-Guilherme
* Re: perf performance with libdw vs libunwind
2026-04-23 9:49 ` Guilherme Amadio
@ 2026-04-23 22:28 ` Ian Rogers
0 siblings, 0 replies; 4+ messages in thread
From: Ian Rogers @ 2026-04-23 22:28 UTC (permalink / raw)
To: Guilherme Amadio; +Cc: acme, linux-perf-users, linux-kernel, libunwind-devel
On Thu, Apr 23, 2026 at 2:49 AM Guilherme Amadio <amadio@gentoo.org> wrote:
>
> Hi Ian,
>
> On Wed, Apr 22, 2026 at 09:21:52PM -0700, Ian Rogers wrote:
> > Hi Guilherme,
> >
> > Thanks for the feedback, but I'm a little confused. Your .perfconfig
> > is set to use frame-pointer-based unwinding, so neither libunwind nor
> > libdw should be used for unwinding. With frame-pointer unwinding, a
> > sample contains an array of IPs gathered by walking the linked list
> > of frame pointers on the stack. With --call-graph=dwarf, a region of
> > memory (the stack) is copied into a sample along with some initial
> > register values; libdw or libunwind is then used to process this
> > memory using the DWARF information in the ELF binary.
>
> Thanks for your reply, and pardon my ignorance: I thought the
> libraries were used generically for stack unwinding, regardless of
> whether it's fp or dwarf. I should have looked a bit deeper before
> reporting this, but we are on the right track.
>
> > So, something that changed in v7.0: with DWARF-based unwinding
> > (libdw or libunwind) we always had all inline functions on the
> > stack, but not with frame pointers. The IP in the frame-pointer
> > array can be of an instruction within an inlined function. In v7.0
> > we added a patch that includes inline information for both
> > frame-pointer and LBR-based stack traces:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
>
> This is a nice development. I have been using --call-graph=dwarf to
> see the inlined symbols, so having the ability to see inlined
> functions with fp unwinding, which is much more lightweight in terms
> of space (i.e. the size of the final perf.data files), is great.
>
> > By default we try to add inline information using libdw; if that
> > fails we try llvm, then libbfd, and finally the command-line
> > addr2line tool:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
> > I suspect the slowdown is from doing all this addr2line work on a
> > binary that's been stripped. The good news is that you can add a
> > config option to avoid all the fallbacks: "addr2line.style=libdw".
> > You can also disable inline information by adding "--no-inline" to
> > your `perf report` command line.
>
> The binary and its directly dependent libraries, as well as most
> other dependencies, are not stripped, but it's possible that some
> dependency in the full chain is.
>
> When I run perf report with --no-inline, I indeed recover the
> performance I had before with perf-6.19.12. However, setting
> addr2line.style=libdw did not help much. Here is what I observe:
>
> $ perf config
> call-graph.record-mode=fp
> $ perf version
> perf version 7.0
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 63800 Hz
> [ perf record: Woken up 18 times to write data ]
> [ perf record: Captured and wrote 4.720 MB perf.data (16552 samples) ]
> 1.26
> $ time perf report -q --stdio -g none --children --no-inline --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.64% 0.00% root.exe root.exe [.] _start
> 91.63% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.46% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.31% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 88.20% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
> 1.70
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-noinline.svg
Thanks for all the reporting!
So here the C++ demangler accounts for about a third of the execution
time, and there is no DWARF decoding for inline functions.
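As a rough illustration of that demangling cost: c++filt (from binutils) applies the same Itanium-ABI transformation that perf's demangler runs for every symbol it prints. The mangled name below is one I reconstructed by hand from the TRint constructor shown in the report above, not one taken from the actual binary.

```shell
# Demangle a hand-reconstructed Itanium-ABI name for the TRint constructor;
# perf performs this transformation for every C++ symbol in the report.
c++filt _ZN5TRintC1EPKcPiPPcPvibb
# → TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
```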
> Now without --no-inline; this first command is also without addr2line.style=libdw in the config:
>
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe root.exe [.] _start
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
>
> 241.60
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-addr2line.svg
Here perf is first trying libdw for the addr2line lookup and then
falling back to the addr2line command. Time is mainly spent gathering
addr2line inline information.
> $ perf config addr2line.style=libdw
> $ perf config
> call-graph.record-mode=fp
> addr2line.style=libdw
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe root.exe [.] _start
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
>
> 137.93
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-libdw.svg
Here time is just spent in libdw.
> The flame graphs above are for the perf-report commands themselves.
>
> So, the performance is fine with --no-inline, and it's better with addr2line.style=libdw.
> However, the function names are not the best in the last two reports, so this problem remains.
>
> > Your report suggests we should tweak the defaults for showing inline
> > information. Could you try the options I've suggested and see if they
> > remedy the issue for you?
>
> Thank you for the suggestions. Indeed --no-inline seems to bring back the
> previous performance. Please let me know if you would like me to try more
> things and what other information you need for the cases without --no-inline.
Performance-wise, things are working as expected. I'm confused about
why we see different symbol names; perhaps this points to a libdw bug.
With or without --no-inline, libelf gets the symbol name; libdw is only
used to get the source line and inlining information. Perhaps this is
more of a bug with `-g none`, which is an option I've never used. I'm
quite busy at the moment, so it's not easy for me to dig into this.
Perhaps we can create a test and try to get an LLM to investigate it.
Thanks,
Ian
> Best regards,
> -Guilherme
Thread overview: 4+ messages
2026-04-22 13:26 perf performance with libdw vs libunwind Guilherme Amadio
2026-04-23 4:21 ` Ian Rogers
2026-04-23 9:49 ` Guilherme Amadio
2026-04-23 22:28 ` Ian Rogers