linux-perf-users.vger.kernel.org archive mirror
* perf mem segfaults
From: Guilherme Amadio @ 2024-05-31 15:30 UTC
  To: Arnaldo Carvalho de Melo; +Cc: linux-perf-users

Hi Arnaldo,

I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
and am sending some stack traces below. This is all with perf 6.9 for
now. The first crash seems to pass a null pointer as the format string
in frame #5, along with an empty mem_loads_name:

gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
GNU gdb (Gentoo 14.2 vanilla) 14.2
Reading symbols from perf...
(gdb) run
Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
__strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
67		vmovdqu	(%rdi), %ymm2
(gdb) bt
#0  __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
#1  0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
#2  __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
    mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
#3  0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
    args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
#4  0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
    args=<optimized out>) at vsnprintf.c:103
#5  0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
    at ../lib/vsprintf.c:21
#6  0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
#7  0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
    at util/mem-events.c:252
#8  0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
#9  0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
#10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
#11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
#12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
#13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
(gdb)
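
As a quick sanity check, a NULL format string by itself is enough to
produce exactly this glibc backtrace. A minimal standalone snippet (my
own test of the crash mode, not perf code):

#include <stdio.h>

int main(void)
{
	char buf[100];
	const char *fmt = NULL;	/* stands in for the fmt=0x0 in frame #5 */

	/* Passing a NULL format to snprintf() is undefined behavior;
	 * glibc segfaults in __find_specmb()/__strchrnul_avx2() while
	 * scanning the format, matching frames #0-#2 above. */
	snprintf(buf, sizeof(buf), fmt);
	return 0;
}

So the real question seems to be how the event name (presumably
e->name) ends up NULL in perf_pmu__mem_events_name() for this PMU.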

When I record without -t load, it works. However, when I report
with --hierarchy, I get a crash as well:

gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
...
Program received signal SIGSEGV, Segmentation fault.
0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
    bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
376			if (sym->type == STT_OBJECT) {
(gdb) bt
#0  0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
    bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
#1  0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
    at util/sort.c:1384
#2  0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
    at util/sort.c:2505
#3  0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
#4  0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
#5  0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
#6  0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
    help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
#7  0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
#8  0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
#9  0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
#10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
#11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
#12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
#13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
#14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
#15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
(gdb)
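
One hint: the ip argument in frame #0, 7237970023832232038, is
0x64726f... in hex, i.e. ASCII bytes ("dro..."), and the crash is on
sym->type after the "[.] " level prefix was already formatted into bf.
So the daddr map_symbol (coming from he->mem_info, if I read frame #1
right) seems to hold garbage rather than NULL pointers -- possibly
stale or uninitialized memory. Next time it crashes I can inspect it,
e.g.:

(gdb) frame 0
(gdb) print *ms
(gdb) print ms->sym
(gdb) print/x ip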

Another thing I noticed recently is that --hierarchy sometimes fails
to show stacks correctly:

$ sudo perf record -a --off-cpu -g -- sleep 10
gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead  Shared Object                        Symbol
# ........  ...................................  .........................................
#
    39.71%  [kernel.kallsyms]                    [k] io_idle
            |
             --32.55%--io_idle
                       acpi_idle_enter
                       cpuidle_enter_state
                       cpuidle_enter
                       call_cpuidle
                       do_idle
                       cpu_startup_entry
                       |
                        --31.71%--0xffffffffa18d046c
                                  secondary_startup_64

# Samples: 525  of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead  Shared Object  Symbol
# ........  .............  ......
#

$ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
#       Overhead  Command / Shared Object / Symbol
# ..............  ...............................................
#
    47.61%        swapper
       47.58%        [kernel.kallsyms]
          39.71%        [k] io_idle
            |
             --32.55%--0xffffffffa36504b3
                       0xffffffffa3650837
                       0xffffffffa364f9a4
                       0xffffffffa2e63049
                       0xffffffffa1a36d42
                       0xffffffffa1a3c963
                       0xffffffffa1a3cbc9
                       |
                        --31.71%--0xffffffffa18d046c
                                  0xffffffffa189aefd



# Samples: 525  of event 'offcpu-time'
# Event count (approx.): 294273818205
#
#       Overhead  Command / Shared Object / Symbol
# ..............  ................................
#

So the off-cpu stacks are not shown, and with --hierarchy symbol
resolution no longer seems to work, which is somewhat strange.
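
To rule out the report side, I can also check whether the offcpu-time
samples carry callchains in the perf.data file at all, with something
like:

$ sudo perf script -F comm,event,ip,sym | grep -B2 -A10 offcpu-time | head -40

(not sure that is the ideal filter, but it should print the raw sample
IPs with resolved symbols if symbol resolution itself works).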

This is the hardware I'm using:

gentoo ~ $ lscpu | head
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 9 3950X 16-Core Processor
CPU family:                           23
Model:                                113

This CPU is known to be buggy with NMI interrupts, which still causes
dmesg to show strange messages like the ones below:

[286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
[286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
[288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
[289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
[289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
[289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
[289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
[289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
[289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
[289115.070582] 00000000078840a2: 0000000000000000 ...
[289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
[289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
[289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
[289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
[289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
[289115.071189] 00000000e2a64928: 0000000000000000 ...

and later (the usual "Dazed and confused" ones):

[289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
[289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
[289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
[289117.772468] Dazed and confused, but trying to continue

I also wonder whether the nvidia driver taint has anything to do with
this. I'd appreciate your insights in debugging this last problem, as I
couldn't figure out why it's happening. I'm running Linux 6.9.0.

I hope this is enough information to at least fix the segfaults, but if
you need any extra information, please let me know.

Best regards,
-Guilherme



* Re: perf mem segfaults
From: Ian Rogers @ 2024-05-31 17:02 UTC
  To: Guilherme Amadio; +Cc: Arnaldo Carvalho de Melo, linux-perf-users

On Fri, May 31, 2024 at 8:30 AM Guilherme Amadio <amadio@cern.ch> wrote:
>
> Hi Arnaldo,
>
> I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
> and am sending some stack traces below. This is all with perf 6.9 for
> now. The first crash seems to pass a null pointer as the format string
> in frame #5, along with an empty mem_loads_name:
>
> gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
> GNU gdb (Gentoo 14.2 vanilla) 14.2
> Reading symbols from perf...
> (gdb) run
> Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
>
> Program received signal SIGSEGV, Segmentation fault.
> __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> 67              vmovdqu (%rdi), %ymm2
> (gdb) bt
> #0  __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> #1  0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
> #2  __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
>     mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
> #3  0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
>     args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
> #4  0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
>     args=<optimized out>) at vsnprintf.c:103
> #5  0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
>     at ../lib/vsprintf.c:21
> #6  0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
> #7  0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
>     at util/mem-events.c:252
> #8  0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
> #9  0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
> #10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
> #11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
> #12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
> #13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
> (gdb)
>
> When I record without -t load, it works. However, when I report
> with --hierarchy, I get a crash as well:
>
> gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
> ...
> Program received signal SIGSEGV, Segmentation fault.
> 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
>     bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> 376                     if (sym->type == STT_OBJECT) {
> (gdb) bt
> #0  0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
>     bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> #1  0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
>     at util/sort.c:1384
> #2  0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
>     at util/sort.c:2505
> #3  0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
>     fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
> #4  0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
>     fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
> #5  0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
>     fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
> #6  0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
>     help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
> #7  0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
> #8  0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
> #9  0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
> #10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
> #11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
> #12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
> #13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
> #14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
> #15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
> (gdb)

Maybe this relates to the cleanup in:
https://lore.kernel.org/all/20240123185036.3461837-1-kan.liang@linux.intel.com/
I also did some smaller cleanup trying to make sanitizers happy:
https://lore.kernel.org/all/20240507183545.1236093-1-irogers@google.com/

> Another thing I noticed recently is that --hierarchy sometimes fails
> to show stacks correctly:
>
> $ sudo perf record -a --off-cpu -g -- sleep 10
> gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
> # To display the perf.data header info, please use --header/--header-only options.
> #
> # comm: swapper
> #
> # Total Lost Samples: 0
> #
> # Samples: 17K of event 'cycles:P'
> # Event count (approx.): 6796556494
> #
> # Overhead  Shared Object                        Symbol
> # ........  ...................................  .........................................
> #
>     39.71%  [kernel.kallsyms]                    [k] io_idle
>             |
>              --32.55%--io_idle
>                        acpi_idle_enter
>                        cpuidle_enter_state
>                        cpuidle_enter
>                        call_cpuidle
>                        do_idle
>                        cpu_startup_entry
>                        |
>                         --31.71%--0xffffffffa18d046c
>                                   secondary_startup_64
>
> # Samples: 525  of event 'offcpu-time'
> # Event count (approx.): 294273818205
> #
> # Overhead  Shared Object  Symbol
> # ........  .............  ......
> #
>
> $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
> # To display the perf.data header info, please use --header/--header-only options.
> #
> # comm: swapper
> #
> # Total Lost Samples: 0
> #
> # Samples: 17K of event 'cycles:P'
> # Event count (approx.): 6796556494
> #
> #       Overhead  Command / Shared Object / Symbol
> # ..............  ...............................................
> #
>     47.61%        swapper
>        47.58%        [kernel.kallsyms]
>           39.71%        [k] io_idle
>             |
>              --32.55%--0xffffffffa36504b3
>                        0xffffffffa3650837
>                        0xffffffffa364f9a4
>                        0xffffffffa2e63049
>                        0xffffffffa1a36d42
>                        0xffffffffa1a3c963
>                        0xffffffffa1a3cbc9
>                        |
>                         --31.71%--0xffffffffa18d046c
>                                   0xffffffffa189aefd
>
>
>
> # Samples: 525  of event 'offcpu-time'
> # Event count (approx.): 294273818205
> #
> #       Overhead  Command / Shared Object / Symbol
> # ..............  ................................
> #
>
> So the off-cpu stacks are not shown, and with --hierarchy symbol
> resolution no longer seems to work, which is somewhat strange.

Perhaps this is another instance of this regression I introduced:
https://lore.kernel.org/linux-perf-users/CAM9d7cjCL2=h6jwQ_cM24JqT2Um=zfWtJYdQbiy6BAcktyudnQ@mail.gmail.com/
Apologies if it is.

Thanks,
Ian

> This is the hardware I'm using:
>
> gentoo ~ $ lscpu | head
> Architecture:                         x86_64
> CPU op-mode(s):                       32-bit, 64-bit
> Address sizes:                        43 bits physical, 48 bits virtual
> Byte Order:                           Little Endian
> CPU(s):                               32
> On-line CPU(s) list:                  0-31
> Vendor ID:                            AuthenticAMD
> Model name:                           AMD Ryzen 9 3950X 16-Core Processor
> CPU family:                           23
> Model:                                113
>
> This CPU is known to be buggy with NMI interrupts, which still causes
> dmesg to show strange messages like the ones below:
>
> [286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
> [286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
> [288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
> [289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
> [289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
> [289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
> [289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
> [289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
> [289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
> [289115.070582] 00000000078840a2: 0000000000000000 ...
> [289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
> [289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
> [289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
> [289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
> [289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
> [289115.071189] 00000000e2a64928: 0000000000000000 ...
>
> and later (the usual "Dazed and confused" ones):
>
> [289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
> [289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
> [289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
> [289117.772468] Dazed and confused, but trying to continue
>
> I also wonder whether the nvidia driver taint has anything to do with
> this. I'd appreciate your insights in debugging this last problem, as I
> couldn't figure out why it's happening. I'm running Linux 6.9.0.
>
> I hope this is enough information to at least fix the segfaults, but if
> you need any extra information, please let me know.
>
> Best regards,
> -Guilherme
>
>


* Re: perf mem segfaults
From: Namhyung Kim @ 2024-06-04  4:26 UTC
  To: Ian Rogers
  Cc: namhyung, Guilherme Amadio, Arnaldo Carvalho de Melo,
	linux-perf-users

On Fri, 31 May 2024 10:02:18 -0700 Ian Rogers <irogers@google.com> wrote:

> On Fri, May 31, 2024 at 8:30 AM Guilherme Amadio <amadio@cern.ch> wrote:
> >
> > Hi Arnaldo,
> >
> > I ran into some segfaults with perf mem, so I recompiled with DEBUG=1
> > and am sending some stack traces below. This is all with perf 6.9 for
> > now. The first crash seems to pass a null pointer as the format string
> > in frame #5, along with an empty mem_loads_name:
> >
> > gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
> > GNU gdb (Gentoo 14.2 vanilla) 14.2
> > Reading symbols from perf...
> > (gdb) run
> > Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> > 67              vmovdqu (%rdi), %ymm2
> > (gdb) bt
> > #0  __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
> > #1  0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
> > #2  __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
> >     mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
> > #3  0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
> >     args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
> > #4  0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
> >     args=<optimized out>) at vsnprintf.c:103
> > #5  0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
> >     at ../lib/vsprintf.c:21
> > #6  0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
> > #7  0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
> >     at util/mem-events.c:252
> > #8  0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
> > #9  0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
> > #10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
> > #11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
> > #12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
> > #13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
> > (gdb)
> >
> > When I record without -t load, it works. However, when I report
> > with --hierarchy, I get a crash as well:
> >
> > gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
> > ...
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
> >     bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> > 376                     if (sym->type == STT_OBJECT) {
> > (gdb) bt
> > #0  0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
> >     bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
> > #1  0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
> >     at util/sort.c:1384
> > #2  0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
> >     at util/sort.c:2505
> > #3  0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
> >     fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
> > #4  0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
> >     fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
> > #5  0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
> >     fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
> > #6  0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
> >     help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
> > #7  0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
> > #8  0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
> > #9  0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
> > #10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
> > #11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
> > #12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
> > #13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
> > #14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
> > #15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
> > (gdb)
> 
> Maybe this relates to the cleanup in:
> https://lore.kernel.org/all/20240123185036.3461837-1-kan.liang@linux.intel.com/
> I also did some smaller cleanup trying to make sanitizers happy:
> https://lore.kernel.org/all/20240507183545.1236093-1-irogers@google.com/

How about this?  I'm not sure why we think an event is supported when
it has no event name. If I read the first backtrace right, the NULL
format in frame #5 is e->name, so bailing out of
perf_pmu__mem_events_name() early should cover that crash too.

Thanks,
Namhyung

---8<---

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 6dda47bb774f..e6051f0cf4db 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -86,7 +86,7 @@ static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
                return NULL;
 
        e = &pmu->mem_events[i];
-       if (!e)
+       if (!e || !e->name)
                return NULL;
 
        if (i == PERF_MEM_EVENTS__LOAD || i == PERF_MEM_EVENTS__LOAD_STORE) {
@@ -184,7 +184,7 @@ static bool perf_pmu__mem_events_supported(const char *mnt, struct perf_pmu *pmu
        struct stat st;
 
        if (!e->event_name)
-               return true;
+               return false;
 
        scnprintf(path, PATH_MAX, "%s/devices/%s/events/%s", mnt, pmu->name, e->event_name);
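
With this applied, recording on a PMU whose mem event lacks a sysfs
event name should fail up front (with an error along the lines of
"failed: memory events not supported") instead of crashing inside
scnprintf() -- untested on the Zen 2 machine that triggered this,
though.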
 


