perf mem segfaults - Guilherme Amadio

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Guilherme Amadio <amadio@cern.ch>
To: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Subject: perf mem segfaults
Date: Fri, 31 May 2024 17:30:06 +0200	[thread overview]
Message-ID: <Zlns_o_IE5L28168@cern.ch> (raw)

Hi Arnaldo,

I reached some segfaults with perf mem, so I recompiled with DEBUG=1
and am sending below some stack traces. This is all with perf 6.9 for
now. The first one seems to pass a null pointer as format in frame #5,
along with a blank mem_loads_name:

gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
GNU gdb (Gentoo 14.2 vanilla) 14.2
Reading symbols from perf...
(gdb) run
Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
__strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
67		vmovdqu	(%rdi), %ymm2
(gdb) bt
#0  __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
#1  0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
#2  __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
    mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
#3  0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
    args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
#4  0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
    args=<optimized out>) at vsnprintf.c:103
#5  0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
    at ../lib/vsprintf.c:21
#6  0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
#7  0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
    at util/mem-events.c:252
#8  0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
#9  0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
#10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
#11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
#12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
#13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
(gdb)

when I record without -t load, then it works. However, when I report
with --hierarchy, I get a crash as well:

gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
...
Program received signal SIGSEGV, Segmentation fault.
0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
    bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
376			if (sym->type == STT_OBJECT) {
(gdb) bt
#0  0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
    bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
#1  0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
    at util/sort.c:1384
#2  0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
    at util/sort.c:2505
#3  0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
#4  0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
#5  0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
#6  0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
    help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
#7  0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
#8  0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
#9  0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
#10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
#11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
#12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
#13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
#14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
#15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
(gdb)

The other thing I noticed as of recently is that --hierarchy sometimes
fails to correctly show stacks:

$ sudo perf record -a --off-cpu -g -- sleep 10
gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead  Shared Object                        Symbol
# ........  ...................................  .........................................
#
    39.71%  [kernel.kallsyms]                    [k] io_idle
            |
             --32.55%--io_idle
                       acpi_idle_enter
                       cpuidle_enter_state
                       cpuidle_enter
                       call_cpuidle
                       do_idle
                       cpu_startup_entry
                       |
                        --31.71%--0xffffffffa18d046c
                                  secondary_startup_64

# Samples: 525  of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead  Shared Object  Symbol
# ........  .............  ......
#

$ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
#       Overhead  Command / Shared Object / Symbol
# ..............  ...............................................
#
    47.61%        swapper
       47.58%        [kernel.kallsyms]
          39.71%        [k] io_idle
            |
             --32.55%--0xffffffffa36504b3
                       0xffffffffa3650837
                       0xffffffffa364f9a4
                       0xffffffffa2e63049
                       0xffffffffa1a36d42
                       0xffffffffa1a3c963
                       0xffffffffa1a3cbc9
                       |
                        --31.71%--0xffffffffa18d046c
                                  0xffffffffa189aefd



# Samples: 525  of event 'offcpu-time'
# Event count (approx.): 294273818205
#
#       Overhead  Command / Shared Object / Symbol
# ..............  ................................
#

So the off-cpu stacks are not shown, and with --hierarchy symbol
resolution seems to not work anymore, which is somewhat strange.

This is the hardware I'm using:

gentoo ~ $ lscpu | head
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 9 3950X 16-Core Processor
CPU family:                           23
Model:                                113

It's known to be buggy with the NMI interrupts, which still causes dmesg
to show strange messages like shown below:

[286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
[286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
[288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
[289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
[289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
[289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
[289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
[289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
[289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
[289115.070582] 00000000078840a2: 0000000000000000 ...
[289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
[289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
[289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
[289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
[289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
[289115.071189] 00000000e2a64928: 0000000000000000 ...

and later (the usual "Dazed and confused ones"):

[289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
[289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
[289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
[289117.772468] Dazed and confused, but trying to continue

I wonder also if the nvidia drivers taint have anything to do with this.
I'd appreciate your insights in helping to debug this last problem, as I
couldn't find out why it's happening. I'm running Linux 6.9.0.

I hope this is enough information to at least fix the segfaults, but if
you need any extra information, please let me know.

Best regards,
-Guilherme

next             reply	other threads:[~2024-05-31 15:30 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-31 15:30 Guilherme Amadio [this message]
2024-05-31 17:02 ` perf mem segfaults Ian Rogers
2024-06-04  4:26   ` Namhyung Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zlns_o_IE5L28168@cern.ch \
    --to=amadio@cern.ch \
    --cc=acme@kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.