linux-perf-users.vger.kernel.org archive mirror
* Broken stack traces with --call-graph=fp and a multi-threaded app due to page faults?
@ 2023-11-08 10:46 Maksymilian Graczyk
  2023-11-10 10:45 ` Maksymilian Graczyk
  2023-11-10 15:59 ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 7+ messages in thread
From: Maksymilian Graczyk @ 2023-11-08 10:46 UTC (permalink / raw)
  To: linux-perf-users; +Cc: syclops-project, Guilherme Amadio, Stephan Hageboeck

Hello all,

I have a problem with broken stack traces in perf when I profile a small 
multi-threaded program in C producing deep (~1000 entries) stacks with 
"perf record --call-graph=fp -e task-clock -F <any number> --off-cpu" 
attached to the program's PID. The callchains seem to stop at "random" 
points throughout my application, occasionally managing to reach the 
bottom (i.e. either "start" or "clone3").

This is the machine configuration I use:

* Intel Xeon Silver 4216 @ 2.10 GHz, with two 16-core CPUs and 
hyper-threading disabled
* 180 GB RAM, with no errors detected by Memtest86+ and no swap
* Gentoo with Linux 6.5.8 (installed from the Gentoo repo with 
gentoo-sources and compiled using this config: 
https://gist.github.com/maksgraczyk/1bce96841a5b2cb2a92f725635c04bf2)
* perf 6.6 with this quick patch of mine: 
https://gist.github.com/maksgraczyk/ee1dd98dda79129a35f7fd3acffb35fd
* Everything is compiled with "-fno-omit-frame-pointer" and 
"-mno-omit-leaf-frame-pointer" gcc flags
* kernel.perf_event_max_stack is set to 1024 and 
kernel.perf_event_paranoid is set to -1
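
For reproducibility, those two sysctl settings can be applied like this 
(a sketch; the command form is mine, the values are the ones listed 
above):

```shell
# Allow callchains of up to 1024 entries (the default is 127),
# enough for the ~1000-deep stacks of the test program
sysctl -w kernel.perf_event_max_stack=1024
# Allow unprivileged users unrestricted access to perf events
sysctl -w kernel.perf_event_paranoid=-1
```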

The code of the profiled program is at 
https://gist.github.com/maksgraczyk/da2bc6d0be9d4e7d88f8bea45221a542 
(the higher the value of NUM_THREADS, the more likely the issue is to 
occur; you may need to compile the code without compiler optimisations).

Alongside sampling-based profiling, I run syscall profiling with a 
separate "perf record" instance attached to the same PID.

When I debug the kernel using kgdb, I see roughly the following 
behaviour in the stack traversal loop in perf_callchain_user() 
(arch/x86/events/core.c) for the same thread being profiled:

1. The first sample goes fine: the entire stack is traversed.
2. The second sample breaks at some point inside my program, with a page 
fault due to page not present.
3. The third sample breaks at another *earlier* point inside my program, 
with a page fault due to page not present.
4. The fourth sample breaks at another *later* point inside my program, 
with a page fault due to page not present.

The stack frame addresses do not change throughout profiling and all 
page faults happen at __get_user(frame.next_frame, &fp->next_frame). The 
behaviour above also occasionally occurs in a single-threaded variant 
of the code (without pthread at all) at a very high sampling frequency 
(tens of thousands of Hz).

This issue makes profiling results unreliable for my use case: I 
usually profile multi-threaded applications with deep stacks of 
hundreds of entries (which is why my test program also produces a deep 
stack) and use flame graphs for later analysis.

Could you help me diagnose the problem? For example, what may be 
causing the page faults? I also ran tests (without kernel debugging) 
with syscall profiling and the "--off-cpu" flag disabled; broken stacks 
still appeared.

(I cannot use DWARF because it makes profiling too slow and perf.data 
size too large in my tests. I also want to avoid using 
non-portable/vendor-specific stack unwinding solutions like LBR, as we 
may need to run profiling on non-Intel CPUs.)

Best regards,
Maks Graczyk



Thread overview: 7+ messages
2023-11-08 10:46 Broken stack traces with --call-graph=fp and a multi-threaded app due to page faults? Maksymilian Graczyk
2023-11-10 10:45 ` Maksymilian Graczyk
2023-11-10 10:51   ` Maksymilian Graczyk
2023-11-10 15:59 ` Arnaldo Carvalho de Melo
2023-11-10 17:40   ` long BPF stack traces " Arnaldo Carvalho de Melo
2023-11-10 23:01     ` Namhyung Kim
2023-11-11 13:37       ` Maksymilian Graczyk
