From: Guilherme Amadio <amadio@cern.ch>
To: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Subject: perf mem segfaults
Date: Fri, 31 May 2024 17:30:06 +0200
Message-ID: <Zlns_o_IE5L28168@cern.ch>
Hi Arnaldo,
I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
and am including some stack traces below. This is all with perf 6.9 for
now. In the first one, a null pointer seems to be passed as the format
string in frame #5, along with an empty mem_loads_name:
gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
GNU gdb (Gentoo 14.2 vanilla) 14.2
Reading symbols from perf...
(gdb) run
Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
__strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
67 vmovdqu (%rdi), %ymm2
(gdb) bt
#0 __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
#1 0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
#2 __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
#3 0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
#4 0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
args=<optimized out>) at vsnprintf.c:103
#5 0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
at ../lib/vsprintf.c:21
#6 0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
#7 0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
at util/mem-events.c:252
#8 0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
#9 0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
#10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
#11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
#12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
#13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
(gdb)
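For what it's worth, the crash mechanism in frames #1-#5 is easy to see
outside of perf: glibc's printf machinery walks the format string in
__find_specmb() without checking for NULL, so a NULL fmt reaching
scnprintf() faults exactly like this. The snippet below is only a
standalone illustration of that mechanism, not perf code (built with -O0
so the compiler doesn't elide the obviously invalid call):
#include <stdio.h>
int main(void)
{
	char buf[100];
	const char *fmt = NULL;	/* same as fmt=0x0 in frame #5 */
	/* faults inside glibc's printf machinery, as in frames #0-#2 above */
	snprintf(buf, sizeof(buf), fmt);
	return 0;
}
So my guess is that perf_pmu__mem_events_name() ends up with a NULL event
name for the "load" event on this PMU and passes it to scnprintf() as the
format, but I haven't dug into util/mem-events.c far enough to tell where
the NULL comes from.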
When I record without -t load, it works. However, when I then report
with --hierarchy, I get a crash as well:
gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
...
Program received signal SIGSEGV, Segmentation fault.
0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
376 if (sym->type == STT_OBJECT) {
(gdb) bt
#0 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
#1 0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
at util/sort.c:1384
#2 0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
at util/sort.c:2505
#3 0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
#4 0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
#5 0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
#6 0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
#7 0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
#8 0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
#9 0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
#10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
#11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
#12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
#13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
#14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
#15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
(gdb)
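One detail that caught my eye in this second trace: the ip argument in
frame #0 is 7237970023832232038, i.e. 0x64726f6365722066, which is plain
ASCII when read as little-endian bytes ("f record"). So whatever
hist_entry__daddr_snprintf() passes down here looks like freed or
overwritten memory rather than a NULL pointer, but that is only a guess
from the decoded value. A standalone snippet (not perf code) to check the
decoding:
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
int main(void)
{
	uint64_t ip = 7237970023832232038ULL;	/* ip from frame #0 above */
	char bytes[sizeof(ip) + 1] = { 0 };
	memcpy(bytes, &ip, sizeof(ip));		/* x86-64 is little-endian */
	printf("0x%016" PRIx64 " = \"%s\"\n", ip, bytes);
	return 0;
}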
The other thing I noticed recently is that --hierarchy sometimes
fails to show stacks correctly:
$ sudo perf record -a --off-cpu -g -- sleep 10
gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead Shared Object Symbol
# ........ ................................... .........................................
#
39.71% [kernel.kallsyms] [k] io_idle
|
--32.55%--io_idle
acpi_idle_enter
cpuidle_enter_state
cpuidle_enter
call_cpuidle
do_idle
cpu_startup_entry
|
--31.71%--0xffffffffa18d046c
secondary_startup_64
# Samples: 525 of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead Shared Object Symbol
# ........ ............. ......
#
$ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead Command / Shared Object / Symbol
# .............. ...............................................
#
47.61% swapper
47.58% [kernel.kallsyms]
39.71% [k] io_idle
|
--32.55%--0xffffffffa36504b3
0xffffffffa3650837
0xffffffffa364f9a4
0xffffffffa2e63049
0xffffffffa1a36d42
0xffffffffa1a3c963
0xffffffffa1a3cbc9
|
--31.71%--0xffffffffa18d046c
0xffffffffa189aefd
# Samples: 525 of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead Command / Shared Object / Symbol
# .............. ................................
#
So the off-cpu stacks are not shown, and with --hierarchy symbol
resolution no longer seems to work, which is somewhat strange.
This is the hardware I'm using:
gentoo ~ $ lscpu | head
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 3950X 16-Core Processor
CPU family: 23
Model: 113
This CPU is known to be buggy with NMI interrupts, which still causes
dmesg to show strange messages like the ones below:
[286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
[286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
[288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
[289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
[289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
[289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
[289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
[289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
[289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
[289115.070582] 00000000078840a2: 0000000000000000 ...
[289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
[289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
[289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
[289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
[289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
[289115.071189] 00000000e2a64928: 0000000000000000 ...
and later (the usual "Dazed and confused" ones):
[289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
[289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
[289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
[289117.772468] Dazed and confused, but trying to continue
I also wonder whether the taint from the nvidia drivers has anything to
do with this. I'd appreciate your insights on debugging this last
problem, as I couldn't figure out why it's happening. I'm running
Linux 6.9.0.
I hope this is enough information to at least fix the segfaults, but if
you need any extra information, please let me know.
Best regards,
-Guilherme