* perf mem segfaults
From: Guilherme Amadio @ 2024-05-31 15:30 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users
Hi Arnaldo,
I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
and am sending some stack traces below. This is all with perf 6.9 for
now. The first one seems to pass a null pointer as the format in frame #5,
along with a blank mem_loads_name:
gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
GNU gdb (Gentoo 14.2 vanilla) 14.2
Reading symbols from perf...
(gdb) run
Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
__strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
67 vmovdqu (%rdi), %ymm2
(gdb) bt
#0 __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
#1 0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
#2 __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
#3 0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
#4 0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
args=<optimized out>) at vsnprintf.c:103
#5 0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
at ../lib/vsprintf.c:21
#6 0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
#7 0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
at util/mem-events.c:252
#8 0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
#9 0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
#10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
#11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
#12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
#13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
(gdb)
When I record without -t load, it works. However, when I report
with --hierarchy, I get a crash as well:
gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
...
Program received signal SIGSEGV, Segmentation fault.
0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
376 if (sym->type == STT_OBJECT) {
(gdb) bt
#0 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
#1 0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
at util/sort.c:1384
#2 0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
at util/sort.c:2505
#3 0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
#4 0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
#5 0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
#6 0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
#7 0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
#8 0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
#9 0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
#10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
#11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
#12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
#13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
#14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
#15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
(gdb)
The other thing I noticed recently is that --hierarchy sometimes
fails to show stacks correctly:
$ sudo perf record -a --off-cpu -g -- sleep 10
gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead Shared Object Symbol
# ........ ................................... .........................................
#
39.71% [kernel.kallsyms] [k] io_idle
|
--32.55%--io_idle
acpi_idle_enter
cpuidle_enter_state
cpuidle_enter
call_cpuidle
do_idle
cpu_startup_entry
|
--31.71%--0xffffffffa18d046c
secondary_startup_64
# Samples: 525 of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead Shared Object Symbol
# ........ ............. ......
#
$ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead Command / Shared Object / Symbol
# .............. ...............................................
#
47.61% swapper
47.58% [kernel.kallsyms]
39.71% [k] io_idle
|
--32.55%--0xffffffffa36504b3
0xffffffffa3650837
0xffffffffa364f9a4
0xffffffffa2e63049
0xffffffffa1a36d42
0xffffffffa1a3c963
0xffffffffa1a3cbc9
|
--31.71%--0xffffffffa18d046c
0xffffffffa189aefd
# Samples: 525 of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead Command / Shared Object / Symbol
# .............. ................................
#
So the off-cpu stacks are not shown, and with --hierarchy symbol
resolution no longer seems to work, which is somewhat strange.
This is the hardware I'm using:
gentoo ~ $ lscpu | head
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 3950X 16-Core Processor
CPU family: 23
Model: 113
It's known to be buggy with NMI interrupts, which still causes dmesg
to show strange messages like the ones below:
[286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
[286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
[288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
[289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
[289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
[289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
[289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
[289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
[289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
[289115.070582] 00000000078840a2: 0000000000000000 ...
[289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
[289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
[289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
[289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
[289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
[289115.071189] 00000000e2a64928: 0000000000000000 ...
and later (the usual "Dazed and confused" ones):
[289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
[289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
[289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
[289117.772468] Dazed and confused, but trying to continue
I also wonder if the nvidia driver taint has anything to do with this.
I'd appreciate your insights in debugging this last problem, as I
couldn't figure out why it's happening. I'm running Linux 6.9.0.
I hope this is enough information to at least fix the segfaults, but if
you need any extra information, please let me know.
Best regards,
-Guilherme
* Re: perf mem segfaults
From: Ian Rogers @ 2024-05-31 17:02 UTC (permalink / raw)
To: Guilherme Amadio; +Cc: Arnaldo Carvalho de Melo, linux-perf-users
On Fri, May 31, 2024 at 8:30 AM Guilherme Amadio <amadio@cern.ch> wrote:
>
> Hi Arnaldo,
>
> I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
> and am sending some stack traces below. This is all with perf 6.9 for
> now. The first one seems to pass a null pointer as the format in frame #5,
> along with a blank mem_loads_name:
>
> gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
> GNU gdb (Gentoo 14.2 vanilla) 14.2
> Reading symbols from perf...
> (gdb) run
> Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
>
> Program received signal SIGSEGV, Segmentation fault.
> __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> 67 vmovdqu (%rdi), %ymm2
> (gdb) bt
> #0 __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> #1 0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
> #2 __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
> mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
> #3 0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
> args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
> #4 0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
> args=<optimized out>) at vsnprintf.c:103
> #5 0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
> at ../lib/vsprintf.c:21
> #6 0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
> #7 0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
> at util/mem-events.c:252
> #8 0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
> #9 0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
> #10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
> #11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
> #12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
> #13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
> (gdb)
>
> When I record without -t load, it works. However, when I report
> with --hierarchy, I get a crash as well:
>
> gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
> ...
> Program received signal SIGSEGV, Segmentation fault.
> 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
> bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> 376 if (sym->type == STT_OBJECT) {
> (gdb) bt
> #0 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
> bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> #1 0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
> at util/sort.c:1384
> #2 0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
> at util/sort.c:2505
> #3 0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
> fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
> #4 0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
> fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
> #5 0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
> fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
> #6 0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
> help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
> #7 0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
> #8 0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
> #9 0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
> #10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
> #11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
> #12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
> #13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
> #14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
> #15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
> (gdb)
Maybe this relates to the cleanup in:
https://lore.kernel.org/all/20240123185036.3461837-1-kan.liang@linux.intel.com/
I also did some smaller cleanup trying to make sanitizers happy:
https://lore.kernel.org/all/20240507183545.1236093-1-irogers@google.com/
> The other thing I noticed recently is that --hierarchy sometimes
> fails to show stacks correctly:
>
> $ sudo perf record -a --off-cpu -g -- sleep 10
> gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
> # To display the perf.data header info, please use --header/--header-only options.
> #
> # comm: swapper
> #
> # Total Lost Samples: 0
> #
> # Samples: 17K of event 'cycles:P'
> # Event count (approx.): 6796556494
> #
> # Overhead Shared Object Symbol
> # ........ ................................... .........................................
> #
> 39.71% [kernel.kallsyms] [k] io_idle
> |
> --32.55%--io_idle
> acpi_idle_enter
> cpuidle_enter_state
> cpuidle_enter
> call_cpuidle
> do_idle
> cpu_startup_entry
> |
> --31.71%--0xffffffffa18d046c
> secondary_startup_64
>
> # Samples: 525 of event 'offcpu-time'
> # Event count (approx.): 294273818205
> #
> # Overhead Shared Object Symbol
> # ........ ............. ......
> #
>
> $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
> # To display the perf.data header info, please use --header/--header-only options.
> #
> # comm: swapper
> #
> # Total Lost Samples: 0
> #
> # Samples: 17K of event 'cycles:P'
> # Event count (approx.): 6796556494
> #
> # Overhead Command / Shared Object / Symbol
> # .............. ...............................................
> #
> 47.61% swapper
> 47.58% [kernel.kallsyms]
> 39.71% [k] io_idle
> |
> --32.55%--0xffffffffa36504b3
> 0xffffffffa3650837
> 0xffffffffa364f9a4
> 0xffffffffa2e63049
> 0xffffffffa1a36d42
> 0xffffffffa1a3c963
> 0xffffffffa1a3cbc9
> |
> --31.71%--0xffffffffa18d046c
> 0xffffffffa189aefd
>
>
>
> # Samples: 525 of event 'offcpu-time'
> # Event count (approx.): 294273818205
> #
> # Overhead Command / Shared Object / Symbol
> # .............. ................................
> #
>
> So the off-cpu stacks are not shown, and with --hierarchy symbol
> resolution no longer seems to work, which is somewhat strange.
Perhaps this is another instance of this regression I introduced:
https://lore.kernel.org/linux-perf-users/CAM9d7cjCL2=h6jwQ_cM24JqT2Um=zfWtJYdQbiy6BAcktyudnQ@mail.gmail.com/
Apologies if it is.
Thanks,
Ian
> This is the hardware I'm using:
>
> gentoo ~ $ lscpu | head
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 43 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 32
> On-line CPU(s) list: 0-31
> Vendor ID: AuthenticAMD
> Model name: AMD Ryzen 9 3950X 16-Core Processor
> CPU family: 23
> Model: 113
>
> It's known to be buggy with NMI interrupts, which still causes dmesg
> to show strange messages like the ones below:
>
> [286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
> [286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
> [288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
> [289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
> [289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
> [289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
> [289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
> [289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
> [289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
> [289115.070582] 00000000078840a2: 0000000000000000 ...
> [289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
> [289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
> [289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
> [289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
> [289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
> [289115.071189] 00000000e2a64928: 0000000000000000 ...
>
> and later (the usual "Dazed and confused" ones):
>
> [289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
> [289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
> [289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
> [289117.772468] Dazed and confused, but trying to continue
>
> I also wonder if the nvidia driver taint has anything to do with this.
> I'd appreciate your insights in debugging this last problem, as I
> couldn't figure out why it's happening. I'm running Linux 6.9.0.
>
> I hope this is enough information to at least fix the segfaults, but if
> you need any extra information, please let me know.
>
> Best regards,
> -Guilherme
>
>
* Re: perf mem segfaults
From: Namhyung Kim @ 2024-06-04 4:26 UTC (permalink / raw)
To: Ian Rogers
Cc: namhyung, Guilherme Amadio, Arnaldo Carvalho de Melo,
linux-perf-users
On Fri, 31 May 2024 10:02:18 -0700 Ian Rogers <irogers@google.com> wrote:
> On Fri, May 31, 2024 at 8:30 AM Guilherme Amadio <amadio@cern.ch> wrote:
> >
> > Hi Arnaldo,
> >
> > I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
> > and am sending some stack traces below. This is all with perf 6.9 for
> > now. The first one seems to pass a null pointer as the format in frame #5,
> > along with a blank mem_loads_name:
> >
> > gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
> > GNU gdb (Gentoo 14.2 vanilla) 14.2
> > Reading symbols from perf...
> > (gdb) run
> > Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> > 67 vmovdqu (%rdi), %ymm2
> > (gdb) bt
> > #0 __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> > #1 0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
> > #2 __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
> > mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
> > #3 0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
> > args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
> > #4 0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
> > args=<optimized out>) at vsnprintf.c:103
> > #5 0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
> > at ../lib/vsprintf.c:21
> > #6 0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
> > #7 0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
> > at util/mem-events.c:252
> > #8 0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
> > #9 0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
> > #10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
> > #11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
> > #12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
> > #13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
> > (gdb)
> >
> > When I record without -t load, it works. However, when I report
> > with --hierarchy, I get a crash as well:
> >
> > gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
> > ...
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
> > bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> > 376 if (sym->type == STT_OBJECT) {
> > (gdb) bt
> > #0 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
> > bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> > #1 0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
> > at util/sort.c:1384
> > #2 0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
> > at util/sort.c:2505
> > #3 0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
> > fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
> > #4 0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
> > fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
> > #5 0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
> > fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
> > #6 0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
> > help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
> > #7 0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
> > #8 0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
> > #9 0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
> > #10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
> > #11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
> > #12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
> > #13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
> > #14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
> > #15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
> > (gdb)
>
> Maybe this relates to the cleanup in:
> https://lore.kernel.org/all/20240123185036.3461837-1-kan.liang@linux.intel.com/
> I also did some smaller cleanup trying to make sanitizers happy:
> https://lore.kernel.org/all/20240507183545.1236093-1-irogers@google.com/
How about this? I'm not sure why we think it's supported if it has no
event name.
Thanks,
Namhyung
---8<---
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 6dda47bb774f..e6051f0cf4db 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -86,7 +86,7 @@ static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
 		return NULL;
 
 	e = &pmu->mem_events[i];
-	if (!e)
+	if (!e || !e->name)
 		return NULL;
 
 	if (i == PERF_MEM_EVENTS__LOAD || i == PERF_MEM_EVENTS__LOAD_STORE) {
@@ -184,7 +184,7 @@ static bool perf_pmu__mem_events_supported(const char *mnt, struct perf_pmu *pmu
 	struct stat st;
 
 	if (!e->event_name)
-		return true;
+		return false;
 
 	scnprintf(path, PATH_MAX, "%s/devices/%s/events/%s", mnt, pmu->name, e->event_name);