* [PATCH/RFC 0/2] bpf: Disable preemption in bpf_perf_event_output
From: Jiri Olsa @ 2023-07-17 11:17 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

hi,
we got a report of a kernel crash [1][3] within the bpf_event_output helper.

The reason seems to be the nesting protection code in bpf_event_output,
which expects preemption to be disabled; that is not guaranteed for
programs executed by bpf_prog_run_array_cg.
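
For context, both helpers use the same per-CPU pattern for the nesting
protection (a simplified sketch using the bpf_perf_event_output variable
names, not a verbatim copy of kernel/trace/bpf_trace.c):

  struct bpf_trace_sample_data {
          struct perf_sample_data sds[3];
  };
  static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
  static DEFINE_PER_CPU(int, bpf_trace_nest_level);

          /* in the helper */
          nest_level = this_cpu_inc_return(bpf_trace_nest_level);
          sd = &this_cpu_ptr(&bpf_trace_sds)->sds[nest_level - 1];
          /* ... fill sd and call perf_event_output() ... */
          this_cpu_dec(bpf_trace_nest_level);

If a task gets preempted between the increment and the decrement,
another task on the same cpu bumps the counter again, and the two can
end up writing to the same sds slot.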

I managed to reproduce it on the tracing side, where we have the same
problem in bpf_perf_event_output. The reproducer [2] just creates a busy
uprobe and a kprobe that call the bpf_perf_event_output helper a lot.
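
For illustration, the bpf side of such a reproducer can be as small as
the program below (a sketch only -- [2] is the actual reproducer; the
map name and attach point here are made up):

  #include <linux/bpf.h>
  #include <linux/ptrace.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
          __uint(key_size, sizeof(int));
          __uint(value_size, sizeof(int));
  } events SEC(".maps");

  SEC("kprobe/vfs_read")
  int krava(struct pt_regs *ctx)
  {
          __u64 data = 0;

          /* call the helper on every hit so preemption eventually
           * lands inside the nesting-protected section */
          return bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                                       &data, sizeof(data));
  }

  char LICENSE[] SEC("license") = "GPL";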

Eventually the code gets preempted within the nesting protection and
causes a crash like:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
  BUG: unable to handle page fault for address: ffffffff82be3eea
  ...
  Call Trace:
   <TASK>
   ? __die+0x1f/0x70
   ? page_fault_oops+0x176/0x4d0
   ? exc_page_fault+0x132/0x230
   ? asm_exc_page_fault+0x22/0x30
   ? perf_output_sample+0x12b/0x910
   ? perf_event_output+0xd0/0x1d0
   ? bpf_perf_event_output+0x162/0x1d0
   ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
   ? __uprobe_perf_func+0x12b/0x540
   ? uprobe_dispatcher+0x2c4/0x430
   ? uprobe_notify_resume+0x2da/0xce0
   ? atomic_notifier_call_chain+0x7b/0x110
   ? exit_to_user_mode_prepare+0x13e/0x290
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_exc_int3+0x35/0x40

I did not come up with a reproducer for the bpf_event_output case and
have no way to reproduce or test it so far, hence I'm sending patch 2 as
an RFC for discussion. However, it seems to me that it suffers from the
same issue as bpf_perf_event_output on the tracing side.

thanks,
jirka


[1] https://github.com/cilium/cilium/issues/26756
[2] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=bpf_output_fix_reproducer&id=8054dcc634121b884c7c331329d61d93351d03b5
[3] slack:
    [66194.378161] BUG: kernel NULL pointer dereference, address: 0000000000000001
    [66194.378324] #PF: supervisor instruction fetch in kernel mode
    [66194.378447] #PF: error_code(0x0010) - not-present page
    ...
    [66194.378692] Oops: 0010 [#1] PREEMPT SMP NOPTI
    ...
    [66194.380666]  <TASK>
    [66194.380775]  ? perf_output_sample+0x12a/0x9a0
    [66194.380902]  ? finish_task_switch.isra.0+0x81/0x280
    [66194.381024]  ? perf_event_output+0x66/0xa0
    [66194.381148]  ? bpf_event_output+0x13a/0x190
    [66194.381270]  ? bpf_event_output_data+0x22/0x40
    [66194.381391]  ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb
    [66194.381519]  ? xa_load+0x87/0xe0
    [66194.381635]  ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0
    [66194.381759]  ? release_sock+0x3e/0x90
    [66194.381876]  ? sk_setsockopt+0x1a1/0x12f0
    [66194.381996]  ? udp_pre_connect+0x36/0x50
    [66194.382114]  ? inet_dgram_connect+0x93/0xa0
    [66194.382233]  ? __sys_connect+0xb4/0xe0
    [66194.382353]  ? udp_setsockopt+0x27/0x40
    [66194.382470]  ? __pfx_udp_push_pending_frames+0x10/0x10
    [66194.382593]  ? __sys_setsockopt+0xdf/0x1a0
    [66194.382713]  ? __x64_sys_connect+0xf/0x20
    [66194.382832]  ? do_syscall_64+0x3a/0x90
    [66194.382949]  ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
    [66194.383077]  </TASK>


---
Jiri Olsa (2):
      bpf: Disable preemption in bpf_perf_event_output
      bpf: Disable preemption in bpf_event_output

 kernel/trace/bpf_trace.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)


* [PATCH 1/2] bpf: Disable preemption in bpf_perf_event_output
From: Jiri Olsa @ 2023-07-17 11:17 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

The nesting protection in bpf_perf_event_output relies on disabled
preemption, which is guaranteed for kprobes and tracepoints.

However, bpf_perf_event_output can also be called from the uprobe
context through the bpf_prog_run_array_sleepable function, which
disables migration but keeps preemption enabled.
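
For reference, the relevant part of that path looks roughly like this
(an abridged sketch, not the exact code):

        might_fault();
        rcu_read_lock_trace();
        migrate_disable();
        rcu_read_lock();
        /* run the attached programs, which may call
         * bpf_perf_event_output() and can be preempted there */
        rcu_read_unlock();
        migrate_enable();
        rcu_read_unlock_trace();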

This can cause a task to be preempted by another one inside the nesting
protection, eventually leading to two tasks using the same
perf_sample_data buffer and causing crashes like:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
  BUG: unable to handle page fault for address: ffffffff82be3eea
  ...
  Call Trace:
   ? __die+0x1f/0x70
   ? page_fault_oops+0x176/0x4d0
   ? exc_page_fault+0x132/0x230
   ? asm_exc_page_fault+0x22/0x30
   ? perf_output_sample+0x12b/0x910
   ? perf_event_output+0xd0/0x1d0
   ? bpf_perf_event_output+0x162/0x1d0
   ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
   ? __uprobe_perf_func+0x12b/0x540
   ? uprobe_dispatcher+0x2c4/0x430
   ? uprobe_notify_resume+0x2da/0xce0
   ? atomic_notifier_call_chain+0x7b/0x110
   ? exit_to_user_mode_prepare+0x13e/0x290
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_exc_int3+0x35/0x40

Fix this by disabling preemption in bpf_perf_event_output.

Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/bpf_trace.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c92eb8c6ff08..2a6ba05d8aee 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -661,8 +661,7 @@ static DEFINE_PER_CPU(int, bpf_trace_nest_level);
 BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags, void *, data, u64, size)
 {
-	struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
-	int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
+	struct bpf_trace_sample_data *sds;
 	struct perf_raw_record raw = {
 		.frag = {
 			.size = size,
@@ -670,7 +669,12 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
 		},
 	};
 	struct perf_sample_data *sd;
-	int err;
+	int nest_level, err;
+
+	preempt_disable();
+
+	sds = this_cpu_ptr(&bpf_trace_sds);
+	nest_level = this_cpu_inc_return(bpf_trace_nest_level);
 
 	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
 		err = -EBUSY;
@@ -691,6 +695,7 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
 
 out:
 	this_cpu_dec(bpf_trace_nest_level);
+	preempt_enable();
 	return err;
 }
 
-- 
2.41.0



* [PATCH/RFC 2/2] bpf: Disable preemption in bpf_event_output
From: Jiri Olsa @ 2023-07-17 11:17 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

We received a report [1] of a kernel crash caused by using the
nesting protection without preemption disabled.

bpf_event_output can be called by programs executed by the
bpf_prog_run_array_cg function, which disables migration but
keeps preemption enabled.
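
The cgroup runner has a similar shape (an abridged sketch of
bpf_prog_run_array_cg, not the exact code):

        rcu_read_lock();
        migrate_disable();
        /* run the attached cgroup programs, which may call
         * bpf_event_output() through bpf_event_output_data() */
        migrate_enable();
        rcu_read_unlock();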

This can cause a task to be preempted by another one inside the
nesting protection, eventually leading to two tasks using the same
perf_sample_data buffer and causing crashes like:

  BUG: kernel NULL pointer dereference, address: 0000000000000001
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0010) - not-present page
  ...
  ? perf_output_sample+0x12a/0x9a0
  ? finish_task_switch.isra.0+0x81/0x280
  ? perf_event_output+0x66/0xa0
  ? bpf_event_output+0x13a/0x190
  ? bpf_event_output_data+0x22/0x40
  ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb
  ? xa_load+0x87/0xe0
  ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0
  ? release_sock+0x3e/0x90
  ? sk_setsockopt+0x1a1/0x12f0
  ? udp_pre_connect+0x36/0x50
  ? inet_dgram_connect+0x93/0xa0
  ? __sys_connect+0xb4/0xe0
  ? udp_setsockopt+0x27/0x40
  ? __pfx_udp_push_pending_frames+0x10/0x10
  ? __sys_setsockopt+0xdf/0x1a0
  ? __x64_sys_connect+0xf/0x20
  ? do_syscall_64+0x3a/0x90
  ? entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fix this by disabling preemption in bpf_event_output.

[1] https://github.com/cilium/cilium/issues/26756
Reported-by: Oleg "livelace" Popov <o.popov@livelace.ru>
Fixes: 768fb61fcc13 ("bpf: Fix bpf_event_output re-entry issue")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/bpf_trace.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 2a6ba05d8aee..36fb6e483952 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -720,7 +720,6 @@ static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_misc_sds);
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 		     void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
 {
-	int nest_level = this_cpu_inc_return(bpf_event_output_nest_level);
 	struct perf_raw_frag frag = {
 		.copy		= ctx_copy,
 		.size		= ctx_size,
@@ -737,8 +736,13 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 	};
 	struct perf_sample_data *sd;
 	struct pt_regs *regs;
+	int nest_level;
 	u64 ret;
 
+	preempt_disable();
+
+	nest_level = this_cpu_inc_return(bpf_event_output_nest_level);
+
 	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(bpf_misc_sds.sds))) {
 		ret = -EBUSY;
 		goto out;
@@ -753,6 +757,7 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 	ret = __bpf_perf_event_output(regs, map, flags, sd);
 out:
 	this_cpu_dec(bpf_event_output_nest_level);
+	preempt_enable();
 	return ret;
 }
 
-- 
2.41.0



* Re: [PATCH 1/2] bpf: Disable preemption in bpf_perf_event_output
From: Alexei Starovoitov @ 2023-07-19  0:59 UTC
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

On Mon, Jul 17, 2023 at 4:17 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> The nesting protection in bpf_perf_event_output relies on disabled
> preemption, which is guaranteed for kprobes and tracepoints.

I don't understand why you came up with such a conclusion.
bpf_perf_event_output needs migration disabled and doesn't mind
being preempted.
That's what the nested counter is for.

Stack trace also doesn't look like it's related to that.
More like stack corruption in perf_output_sample.

Do you have
commit eb81a2ed4f52 ("perf/core: Fix perf_output_begin parameter is
incorrectly invoked in perf_event_bpf_output")
in your kernel?


* Re: [PATCH/RFC 2/2] bpf: Disable preemption in bpf_event_output
From: Alexei Starovoitov @ 2023-07-19  1:01 UTC
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

On Mon, Jul 17, 2023 at 4:18 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> We received a report [1] of a kernel crash caused by using the
> nesting protection without preemption disabled.

Same question. I don't see why it would be preemption related.


* Re: [PATCH 1/2] bpf: Disable preemption in bpf_perf_event_output
From: Jiri Olsa @ 2023-07-19 14:08 UTC
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

On Tue, Jul 18, 2023 at 05:59:53PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 17, 2023 at 4:17 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > The nesting protection in bpf_perf_event_output relies on disabled
> > preemption, which is guaranteed for kprobes and tracepoints.
> 
> I don't understand why you came up with such a conclusion.
> bpf_perf_event_output needs migration disabled and doesn't mind
> being preempted.
> That's what the nested counter is for.
> 
> Stack trace also doesn't look like it's related to that.
> More like stack corruption in perf_output_sample.

hum, I think that with preemption enabled the following scenario is
possible, where at the end 2 tasks on the same cpu can end up sharing
the same pointer to struct perf_sample_data:


        task-1
        --------------------------------------------------------
        uprobe hit

        uprobe_dispatcher
        {
          __uprobe_perf_func
            bpf_prog_run_array_sleepable
            {
              might_fault
              rcu_read_lock_trace
              migrate_disable
              rcu_read_lock

              bpf_prog
                ...
                bpf_perf_event_output
                {
                  nest_level = bpf_trace_nest_level = 1
                  sd = &sds->sds[0];

        -> preempted by task-2


                                                                        task-2
                                                                        --------------------------------------------------------
                                                                        uprobe hit

                                                                        uprobe_dispatcher
                                                                          __uprobe_perf_func
                                                                            bpf_prog_run_array_sleepable

                                                                              might_fault
                                                                              rcu_read_lock_trace
                                                                              migrate_disable
                                                                              rcu_read_lock

                                                                              bpf_prog
                                                                                ...
                                                                                bpf_perf_event_output
                                                                                {
                                                                                  nest_level = bpf_trace_nest_level = 2
                                                                                  sd = &sds->sds[1];

                                                                        -> preempted by task-1



                  __bpf_perf_event_output(regs, map, flags, sd);
                    perf_output_sample(data)

                  bpf_trace_nest_level = 1

                } /* bpf_perf_event_output */

              rcu_read_unlock
              migrate_enable
              rcu_read_unlock_trace

            } /* bpf_prog_run_array_sleepable */
        } /* uprobe_dispatcher */

        uprobe hit

        uprobe_dispatcher
        {
          __uprobe_perf_func
            bpf_prog_run_array_sleepable
            {
              might_fault
              rcu_read_lock_trace
              migrate_disable
              rcu_read_lock

              bpf_prog
                ...
                bpf_perf_event_output {
                  nest_level = bpf_trace_nest_level = 2
                  sd = &sds->sds[1];


now task-1 and task-2 share the same bpf_trace_nest_level value and the
same 'struct perf_sample_data' buffer on top of &sds->sds[1]

I have not figured out the exact scenario/cause of the crash yet; I
suspect one of the tasks copies data over some boundary, but all the
ideas I had so far did not match the instructions from the crash

anyway, I thought that having 2 tasks sharing the same perf_sample_data
was bad enough to warrant sending the patch

> 
> Do you have
> commit eb81a2ed4f52 ("perf/core: Fix perf_output_begin parameter is
> incorrectly invoked in perf_event_bpf_output")
> in your kernel?

yes, I just retested and still see it on the latest bpf-next/master

jirka


* Re: [PATCH 1/2] bpf: Disable preemption in bpf_perf_event_output
From: Alexei Starovoitov @ 2023-07-19 18:45 UTC
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

On Wed, Jul 19, 2023 at 7:08 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> now task-1 and task-2 share same bpf_trace_nest_level value and same
> 'struct perf_sample_data' buffer on top of &sds->sds[1]
>
> I did not figure out yet the actual exact scenario/cause of the crash yet,
> I suspect one of the tasks copies data over some boundary, but all the
> ideas I had so far did not match the instructions from the crash
>
> anyway I thought that having 2 tasks sharing the same perf_sample_data
> is bad enough to send the patch

It makes sense now. We forgot to update this part during the
transition from preempt_disable to migrate_disable.

But do you have PREEMPT_RCU in your kernel?
If not then the above race shouldn't be possible.
Worth fixing anyway, of course.
Can you repro with a crafted test?
Multiple uprobes doing bpf_perf_event_output should be enough, right?
For kprobes we're "lucky" due to bpf_prog_active.
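
For reference, the kprobe path is guarded in trace_call_bpf() roughly
like this (abridged; preemption is already disabled by the kprobe
itself):

        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
                /* another bpf program is already running on this cpu,
                 * skip this one to avoid reentrancy */
                ret = 0;
                goto out;
        }
        ...
 out:
        __this_cpu_dec(bpf_prog_active);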


* Re: [PATCH 1/2] bpf: Disable preemption in bpf_perf_event_output
From: Jiri Olsa @ 2023-07-19 21:21 UTC
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	bpf, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo

On Wed, Jul 19, 2023 at 11:45:26AM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 19, 2023 at 7:08 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > now task-1 and task-2 share same bpf_trace_nest_level value and same
> > 'struct perf_sample_data' buffer on top of &sds->sds[1]
> >
> > I did not figure out yet the actual exact scenario/cause of the crash yet,
> > I suspect one of the tasks copies data over some boundary, but all the
> > ideas I had so far did not match the instructions from the crash
> >
> > anyway I thought that having 2 tasks sharing the same perf_sample_data
> > is bad enough to send the patch
> 
> It makes sense now. We forgot to update this part during
> transition from preempt_disable to migrate_disable.
> 
> But do you have PREEMPT_RCU in your kernel?

yes, I have that enabled and it's also enabled in the kernel
that originally hit this

> If not then the above race shouldn't be possible.
> Worth fixing anyway, of course.
> Can you repro with a crafted test?
> Multiple uprobes doing bpf_perf_event_output should be enough, right?
> For kprobes we're "lucky" due to bpf_prog_active.

right, I can reproduce it with just a uprobe

I realized the changes are on top of bpf-next/master... I'll rebase
them on top of bpf/master and send without the RFC tag

thanks,
jirka

