linux-perf-users.vger.kernel.org archive mirror
From: Guilherme Amadio <amadio@cern.ch>
To: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Subject: perf mem segfaults
Date: Fri, 31 May 2024 17:30:06 +0200	[thread overview]
Message-ID: <Zlns_o_IE5L28168@cern.ch> (raw)

Hi Arnaldo,

I hit some segfaults with perf mem, so I recompiled with DEBUG=1
and am sending some stack traces below. This is all with perf 6.9 for
now. The first one seems to pass a null pointer as the format string in
frame #5, along with a blank mem_loads_name:

gentoo ~ $ gdb --args perf mem record -t load -- root.exe -l -q
GNU gdb (Gentoo 14.2 vanilla) 14.2
Reading symbols from perf...
(gdb) run
Starting program: /usr/bin/perf mem record -t load -- root.exe -l -q
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
__strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
67		vmovdqu	(%rdi), %ymm2
(gdb) bt
#0  __strchrnul_avx2 () at ../sysdeps/x86_64/multiarch/strchr-avx2.S:67
#1  0x00007ffff6c982de in __find_specmb (format=0x0) at printf-parse.h:82
#2  __printf_buffer (buf=buf@entry=0x7fffffffc760, format=format@entry=0x0, ap=ap@entry=0x7fffffffc880,
    mode_flags=mode_flags@entry=0) at vfprintf-internal.c:649
#3  0x00007ffff6cb7840 in __vsnprintf_internal (string=<optimized out>, maxlen=<optimized out>, format=0x0,
    args=0x7fffffffc880, mode_flags=mode_flags@entry=0) at vsnprintf.c:96
#4  0x00007ffff6cb787f in ___vsnprintf (string=<optimized out>, maxlen=<optimized out>, format=<optimized out>,
    args=<optimized out>) at vsnprintf.c:103
#5  0x00005555557b9391 in scnprintf (buf=0x555555fe9320 <mem_loads_name> "", size=100, fmt=0x0)
    at ../lib/vsprintf.c:21
#6  0x00005555557b74c3 in perf_pmu__mem_events_name (i=0, pmu=0x555556832180) at util/mem-events.c:106
#7  0x00005555557b7ab9 in perf_mem_events__record_args (rec_argv=0x55555684c000, argv_nr=0x7fffffffca20)
    at util/mem-events.c:252
#8  0x00005555555e370d in __cmd_record (argc=3, argv=0x7fffffffd760, mem=0x7fffffffcd80) at builtin-mem.c:156
#9  0x00005555555e49c4 in cmd_mem (argc=4, argv=0x7fffffffd760) at builtin-mem.c:514
#10 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=8, argv=0x7fffffffd760) at perf.c:349
#11 0x0000555555697402 in handle_internal_command (argc=8, argv=0x7fffffffd760) at perf.c:402
#12 0x0000555555697560 in run_argv (argcp=0x7fffffffd59c, argv=0x7fffffffd590) at perf.c:446
#13 0x00005555556978a6 in main (argc=8, argv=0x7fffffffd760) at perf.c:562
(gdb)

When I record without -t load, it works. However, when I report
with --hierarchy, I get a crash as well:

gentoo ~ $ gdb --args perf mem report --stdio --hierarchy
...
Program received signal SIGSEGV, Segmentation fault.
0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
    bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
376			if (sym->type == STT_OBJECT) {
(gdb) bt
#0  0x000055555574b228 in _hist_entry__sym_snprintf (ms=0x555558256940, ip=7237970023832232038, level=46 '.',
    bf=0x555558261200 "[.] ", size=525, width=42) at util/sort.c:376
#1  0x000055555574e148 in hist_entry__daddr_snprintf (he=0x5555568d66c0, bf=0x555558261200 "[.] ", size=525, width=42)
    at util/sort.c:1384
#2  0x0000555555750acb in __sort__hpp_entry (fmt=0x555556844a00, hpp=0x7fffffffa550, he=0x5555568d66c0)
    at util/sort.c:2505
#3  0x0000555555829cce in hist_entry__hierarchy_fprintf (he=0x5555568d66c0, hpp=0x7fffffffa550, hists=0x555556818270,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>) at ui/stdio/hist.c:517
#4  0x000055555582a039 in hist_entry__fprintf (he=0x5555568d66c0, size=525, bf=0x555558261200 "[.] ", bfsz=525,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:598
#5  0x000055555582adf5 in hists__fprintf (hists=0x555556818270, show_header=true, max_rows=0, max_cols=0, min_pcnt=0,
    fp=0x7ffff6df25c0 <_IO_2_1_stdout_>, ignore_callchains=true) at ui/stdio/hist.c:868
#6  0x00005555555b2442 in evlist__tty_browse_hists (evlist=0x55555682e000, rep=0x7fffffffa930,
    help=0x555556924240 "Tip: System-wide collection from all CPUs: perf record -a") at builtin-report.c:581
#7  0x00005555555b28c0 in report__browse_hists (rep=0x7fffffffa930) at builtin-report.c:684
#8  0x00005555555b3e7f in __cmd_report (rep=0x7fffffffa930) at builtin-report.c:1141
#9  0x00005555555b668a in cmd_report (argc=0, argv=0x555556806100) at builtin-report.c:1800
#10 0x00005555555e42d2 in report_events (argc=3, argv=0x7fffffffd790, mem=0x7fffffffcdb0) at builtin-mem.c:374
#11 0x00005555555e4a1c in cmd_mem (argc=3, argv=0x7fffffffd790) at builtin-mem.c:516
#12 0x000055555569716c in run_builtin (p=0x555555fcde80 <commands+672>, argc=4, argv=0x7fffffffd790) at perf.c:349
#13 0x0000555555697402 in handle_internal_command (argc=4, argv=0x7fffffffd790) at perf.c:402
#14 0x0000555555697560 in run_argv (argcp=0x7fffffffd5cc, argv=0x7fffffffd5c0) at perf.c:446
#15 0x00005555556978a6 in main (argc=4, argv=0x7fffffffd790) at perf.c:562
(gdb)

The other thing I noticed recently is that --hierarchy sometimes
fails to show call stacks correctly:

$ sudo perf record -a --off-cpu -g -- sleep 10
gentoo ~ $ sudo perf report --stdio --no-children --percent-limit 5 -c swapper
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
# Overhead  Shared Object                        Symbol
# ........  ...................................  .........................................
#
    39.71%  [kernel.kallsyms]                    [k] io_idle
            |
             --32.55%--io_idle
                       acpi_idle_enter
                       cpuidle_enter_state
                       cpuidle_enter
                       call_cpuidle
                       do_idle
                       cpu_startup_entry
                       |
                        --31.71%--0xffffffffa18d046c
                                  secondary_startup_64

# Samples: 525  of event 'offcpu-time'
# Event count (approx.): 294273818205
#
# Overhead  Shared Object  Symbol
# ........  .............  ......
#

$ sudo perf report --stdio --no-children --percent-limit 5 -c swapper --hierarchy
# To display the perf.data header info, please use --header/--header-only options.
#
# comm: swapper
#
# Total Lost Samples: 0
#
# Samples: 17K of event 'cycles:P'
# Event count (approx.): 6796556494
#
#       Overhead  Command / Shared Object / Symbol
# ..............  ...............................................
#
    47.61%        swapper
       47.58%        [kernel.kallsyms]
          39.71%        [k] io_idle
            |
             --32.55%--0xffffffffa36504b3
                       0xffffffffa3650837
                       0xffffffffa364f9a4
                       0xffffffffa2e63049
                       0xffffffffa1a36d42
                       0xffffffffa1a3c963
                       0xffffffffa1a3cbc9
                       |
                        --31.71%--0xffffffffa18d046c
                                  0xffffffffa189aefd



# Samples: 525  of event 'offcpu-time'
# Event count (approx.): 294273818205
#
#       Overhead  Command / Shared Object / Symbol
# ..............  ................................
#

So the off-cpu stacks are not shown, and with --hierarchy symbol
resolution seems to stop working, which is somewhat strange.

This is the hardware I'm using:

gentoo ~ $ lscpu | head
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 9 3950X 16-Core Processor
CPU family:                           23
Model:                                113

This CPU is known to be buggy with NMI interrupts, and dmesg still
shows strange messages like the ones below:

[286943.722814] perf[1010806]: segfault at 0 ip 00007f1dc277ab43 sp 00007ffdcd302cd8 error 4 in libc.so.6[7f1dc2666000+134000] likely on CPU 5 (core 5, socket 0)
[286943.722827] Code: f8 77 c3 0f 1f 40 00 f3 0f 1e fa c5 f9 6e c6 89 f8 25 ff 0f 00 00 c4 e2 7d 78 c0 c5 f1 ef c9 3d e0 0f 00 00 0f 87 8d 01 00 00 <c5> fe 6f 17 c5 fd 74 da c5 f5 74 d2 c5 ed eb db c5 fd d7 c3 85 c0
[288940.208010] traps: perf[1111710] general protection fault ip:55c0999fb228 sp:7ffd96023f00 error:0 in perf[55c09983d000+31d000]
[289114.949941] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79600
[289115.070561] WARNING: kernel stack frame pointer at 00000000189df501 in swapper/20:0 has bad value 000000005dc8ab50
[289115.070568] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
[289115.070570] 00000000b2bdf5fc: ffffab26805f4ba0 (0xffffab26805f4ba0)
[289115.070575] 000000008d353c11: ffffffffa365c2d2 (_raw_spin_lock_irqsave+0x52/0x70)
[289115.070580] 000000002fccc0bb: ffff9ca4057d5be8 (0xffff9ca4057d5be8)
[289115.070582] 00000000078840a2: 0000000000000000 ...
[289115.070583] 000000002c7f6250: ffffab26805f4bb8 (0xffffab26805f4bb8)
[289115.070585] 00000000afe09fb3: ffffffffc03eef12 (os_acquire_spinlock+0x12/0x30 [nvidia])
[289115.070844] 000000005d608f8f: ffff9ca4057d5bc8 (0xffff9ca4057d5bc8)
[289115.070846] 00000000189df501: ffff9ca4297ade60 (0xffff9ca4297ade60)
[289115.070848] 00000000b4c4e61a: ffffffffc0cfd120 (_nv044096rm+0x10/0x20 [nvidia])
[289115.071189] 00000000e2a64928: 0000000000000000 ...

and later the usual "Dazed and confused" messages:

[289115.075161] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 4.605 msecs
[289115.075166] perf: interrupt took too long (38058 > 3138), lowering kernel.perf_event_max_sample_rate to 5200
[289117.772464] Uhhuh. NMI received for unknown reason 3c on CPU 16.
[289117.772468] Dazed and confused, but trying to continue

I also wonder whether the nvidia driver taint has anything to do with
this. I'd appreciate your insights in helping to debug this last
problem, as I couldn't find out why it's happening. I'm running Linux 6.9.0.

I hope this is enough information to at least fix the segfaults, but if
you need any extra information, please let me know.

Best regards,
-Guilherme


Thread overview: 3+ messages
2024-05-31 15:30 Guilherme Amadio [this message]
2024-05-31 17:02 ` perf mem segfaults Ian Rogers
2024-06-04  4:26   ` Namhyung Kim
