* [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
@ 2024-07-12 16:53 Joe Damato
From: Joe Damato @ 2024-07-12 16:53 UTC
To: me, linux-kernel, bpf, netdev
Cc: acme, andrii.nakryiko, elver, jolsa, khuey, mingo, namhyung,
peterz, robert, yonghong.song, mkarsten, kuba
Greetings:
(I am reposting this question after 2 days and to a wider audience
as I didn't hear back [1]; my apologies. It just seemed like a
possible bug slipped into 6.10-rc1, and I wanted to bring attention
to it before 6.10 is released.)
While testing some unrelated networking code with Martin Karsten (cc'd on
this email), we discovered what appears to be some sort of overflow bug in
bpf.
git bisect suggests that commit f11f10bfa1ca ("perf/bpf: Call BPF handler
directly, not through overflow machinery") is the first commit where the
(I assume) buggy behavior appears.
Running the following on my machine as of the commit mentioned above:
bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'
while simultaneously transferring data to the target machine (in my case, I
scp'd a 100MiB file of zeros in a loop) results in very strange output
(snipped):
@[11]: 5
@[18]: 5
@[-30590]: 6
@[10]: 7
@[14]: 9
It does not seem that the driver I am using on my test system (mlx5) would
ever return a negative value from its napi poll function and likewise for
the driver Martin is using (mlx4).
As such, I don't think it is possible for args->work to ever be a large
negative number, but perhaps I am misunderstanding something?
I would like to note that commit 14e40a9578b7 ("perf/bpf: Remove #ifdef
CONFIG_BPF_SYSCALL from struct perf_event members") does not exhibit this
behavior and the output seems reasonable on my test system. Martin confirms
the same for both commits on his test system, which uses different hardware
than mine.
Is this an expected side effect of this change? I would expect it is not
and that the output is a bug of some sort. My apologies that I am not
particularly familiar with the bpf code and so cannot suggest what the root
cause might be.
If it is not a bug:
1. Sorry for the noise :(
2. Can anyone suggest what this output might mean or how the
script run above should be modified? AFAIK this is a fairly
common bpftrace one-liner that many folks run for profiling/debugging
purposes.
Thanks,
Joe
[1]: https://lore.kernel.org/bpf/Zo64cpho2cFQiOeE@LQ3V64L9R2/T/#u
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Jiri Olsa @ 2024-07-12 22:18 UTC
To: Joe Damato, Kyle Huey
Cc: linux-kernel, bpf, netdev, acme, andrii.nakryiko, elver, khuey, mingo,
    namhyung, peterz, robert, yonghong.song, mkarsten, kuba

On Fri, Jul 12, 2024 at 09:53:53AM -0700, Joe Damato wrote:
> [...]
>
> git bisect suggests that commit f11f10bfa1ca ("perf/bpf: Call BPF handler
> directly, not through overflow machinery") is the first commit where the
> (I assume) buggy behavior appears.

heya, nice catch!

I can reproduce this. It seems that after f11f10bfa1ca we allow a tracepoint
program to run as a perf event overflow program.

bpftrace's bpf program returns 1, which means that perf_trace_run_bpf_submit
will continue on to perf_tp_event and then:

  perf_tp_event
    perf_swevent_event
      __perf_event_overflow
        bpf_overflow_handler

bpf_overflow_handler then executes event->prog on the wrong arguments, which
results in the bogus 'work' data in the bpftrace output.
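To make the mismatch concrete, here is a rough sketch of the two context
layouts involved (the napi_poll record below is abridged and its exact layout
is an assumption taken from the tracepoint's format file; the kernel-side
overflow context follows include/linux/perf_event.h):

/* What a tracepoint:napi:napi_poll program is verified against and what
 * perf_trace_run_bpf_submit() hands it (abridged, layout assumed from
 * /sys/kernel/debug/tracing/events/napi/napi_poll/format): */
struct napi_poll_record {
	unsigned long long	common;		/* common trace fields */
	void			*napi;
	unsigned int		dev_name;	/* __data_loc */
	int			work;		/* @[args->work] loads this offset */
	int			budget;
};

/* What bpf_overflow_handler() hands event->prog instead: */
struct bpf_perf_event_data_kern {
	bpf_user_pt_regs_t	*regs;
	struct perf_sample_data *data;
	struct perf_event	*event;
};

/* The program still does its fixed-offset load where 'work' used to be, but
 * against the second layout that offset falls in the middle of a pointer.
 * That would be consistent with the reported values: 0xffff8882, the top
 * half of a typical kernel pointer, is -30590 as a signed 32-bit int. */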
I can 'fix' that by checking the event type before running the program, as
in the change below, but I wonder if there's a better fix.

Kyle, any idea?

> If it is not a bug:
> 	1. Sorry for the noise :(

your report is great, thanks a lot!

jirka

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6a6936183d5..0045dc754ef7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9580,7 +9580,7 @@ static int bpf_overflow_handler(struct perf_event *event,
 		goto out;
 	rcu_read_lock();
 	prog = READ_ONCE(event->prog);
-	if (prog) {
+	if (prog && prog->type == BPF_PROG_TYPE_PERF_EVENT) {
 		perf_prepare_sample(data, event, regs);
 		ret = bpf_prog_run(prog, &ctx);
 	}
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Jiri Olsa @ 2024-07-12 22:39 UTC
To: Jiri Olsa
Cc: Joe Damato, Kyle Huey, linux-kernel, bpf, netdev, acme, andrii.nakryiko,
    elver, khuey, mingo, namhyung, peterz, robert, yonghong.song, mkarsten,
    kuba, Viktor Malik

On Sat, Jul 13, 2024 at 12:18:26AM +0200, Jiri Olsa wrote:
> [...]
>
> bpftrace's bpf program returns 1, which means that perf_trace_run_bpf_submit
> will continue on to perf_tp_event and then:

also bpftrace should perhaps return 0 in tracepoint programs and cut the
extra processing in any case

cc-ing Viktor
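For illustration, in a plain libbpf tracepoint program the difference is just
the return value (a minimal sketch, assuming libbpf; bpftrace generates its
own program and map handling):

#include "vmlinux.h"		/* sketch only */
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tracepoint/napi/napi_poll")
int napi_poll_count(void *ctx)
{
	/* ... update the per-'work' count map here ... */

	/*
	 * trace_call_bpf() hands this value back to
	 * perf_trace_run_bpf_submit(): returning 0 stops processing right
	 * here, while a non-zero return (what bpftrace currently does)
	 * continues into perf_tp_event() and, after f11f10bfa1ca, into the
	 * overflow path that re-runs the same program on the wrong context.
	 */
	return 0;
}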
jirka
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Kyle Huey @ 2024-07-12 22:49 UTC
To: Jiri Olsa
Cc: Joe Damato, linux-kernel, bpf, netdev, acme, andrii.nakryiko, elver,
    khuey, mingo, namhyung, peterz, robert, yonghong.song, mkarsten, kuba

On Fri, Jul 12, 2024 at 3:18 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> [...]
>
> I can 'fix' that by checking the event type before running the program, as
> in the change below, but I wonder if there's a better fix.
>
> Kyle, any idea?

Thanks for doing the hard work here Jiri. I did see the original email
a couple days ago but the cause was far from obvious to me so I was
waiting until I had more time to dig in.

The issue here is that kernel/trace/bpf_trace.c pokes at event->prog
directly, so the assumption made in my patch series (based on the
suggested patch at
https://lore.kernel.org/lkml/ZXJJa5re536_e7c1@google.com/) that having
a BPF program in event->prog means we also use the BPF overflow
handler is wrong.
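(For reference, the tracing attach path stores the program in the same
event->prog field; an abridged sketch of that path, with the checks and prog
array handling elided:)

int perf_event_attach_bpf_prog(struct perf_event *event,
			       struct bpf_prog *prog, u64 bpf_cookie)
{
	/* ... compatibility checks and prog array update elided ... */
	event->prog = prog;
	event->bpf_cookie = bpf_cookie;
	/* ... */
	return 0;
}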
I'll think about how to fix it.

- Kyle
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Kyle Huey @ 2024-07-12 23:05 UTC
To: Jiri Olsa
Cc: Joe Damato, linux-kernel, bpf, netdev, acme, andrii.nakryiko, elver,
    khuey, mingo, namhyung, peterz, robert, yonghong.song, mkarsten, kuba

On Fri, Jul 12, 2024 at 3:49 PM Kyle Huey <me@kylehuey.com> wrote:
> [...]
>
> The issue here is that kernel/trace/bpf_trace.c pokes at event->prog
> directly, so the assumption made in my patch series (based on the
> suggested patch at
> https://lore.kernel.org/lkml/ZXJJa5re536_e7c1@google.com/) that having
> a BPF program in event->prog means we also use the BPF overflow
> handler is wrong.
>
> I'll think about how to fix it.

The good news is that perf_event_attach_bpf_prog() (where we have a
program but no overflow handler) and perf_event_set_bpf_handler()
(where we have a program and an overflow handler) appear to be
mutually exclusive, gated on perf_event_is_tracing(). So I believe we
can fix this with a more generic version of your patch.
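If I'm reading it right, the split lives in perf_event_set_bpf_prog();
roughly (abridged sketch, not the exact code):

int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
			    u64 bpf_cookie)
{
	if (!perf_event_is_tracing(event))
		/* overflow-handler path: prog runs from __perf_event_overflow() */
		return perf_event_set_bpf_handler(event, prog, bpf_cookie);

	/* ... tracepoint/kprobe/uprobe compatibility checks elided ... */

	/* tracing path: prog runs from the tracepoint via trace_call_bpf() */
	return perf_event_attach_bpf_prog(event, prog, bpf_cookie);
}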
- Kyle
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Kyle Huey @ 2024-07-12 23:30 UTC
To: Joe Damato
Cc: Jiri Olsa, linux-kernel, bpf, netdev, acme, andrii.nakryiko, elver,
    khuey, mingo, namhyung, peterz, robert, yonghong.song, mkarsten, kuba

Joe, can you test this?

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f908f077935..f0d7119585dc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9666,6 +9666,8 @@ static inline void perf_event_free_bpf_handler(struct perf_event *event)
  * Generic event overflow handling, sampling.
  */
 
+static bool perf_event_is_tracing(struct perf_event *event);
+
 static int __perf_event_overflow(struct perf_event *event,
 				 int throttle, struct perf_sample_data *data,
 				 struct pt_regs *regs)
@@ -9682,7 +9684,9 @@ static int __perf_event_overflow(struct perf_event *event,
 
 	ret = __perf_event_account_interrupt(event, throttle);
 
-	if (event->prog && !bpf_overflow_handler(event, data, regs))
+	if (event->prog &&
+	    !perf_event_is_tracing(event) &&
+	    !bpf_overflow_handler(event, data, regs))
 		return ret;
 
 	/*
@@ -10612,6 +10616,11 @@ void perf_event_free_bpf_prog(struct perf_event *event)
 
 #else
 
+static inline bool perf_event_is_tracing(struct perf_event *event)
+{
+	return false;
+}
+
 static inline void perf_tp_register(void)
 {
 }

- Kyle
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Joe Damato @ 2024-07-13  0:45 UTC
To: Kyle Huey
Cc: Jiri Olsa, linux-kernel, bpf, netdev, acme, andrii.nakryiko, elver,
    khuey, mingo, namhyung, peterz, robert, yonghong.song, mkarsten, kuba

On Fri, Jul 12, 2024 at 04:30:31PM -0700, Kyle Huey wrote:
> Joe, can you test this?
>
> [patch snipped]

Thank you!

I've applied the above patch on top of commit 338a93cf4a18 ("net:
mctp-i2c: invalidate flows immediately on TX errors"), which seems
to be the latest on net-next/main.

I built and booted that kernel on my mlx5 test machine and re-ran
the same bpftrace invocation:

  bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'

I then scp-ed a 100MiB zero-filled file to the target 48 times back
to back (e.g. scp zeroes target:~/ && scp zeroes target:~/ && ...)
and the bpftrace output seems reasonable; there are no negative
numbers and the values output *look* reasonable to me.

The patch seems reasonable as well, with the major caveat that I've
only hacked on drivers and networking stuff and know absolutely
nothing about bpf internals.

All that said:

Tested-by: Joe Damato <jdamato@fastly.com>
* Re: [bpf?] [net-next ?] [RESEND] possible bpf overflow/output bug introduced in 6.10rc1 ?
From: Kyle Huey @ 2024-07-13  4:47 UTC
To: Joe Damato, Kyle Huey, Jiri Olsa, linux-kernel, bpf, netdev, acme,
    andrii.nakryiko, elver, khuey, mingo, namhyung, peterz, robert,
    yonghong.song, mkarsten, kuba

On Fri, Jul 12, 2024 at 5:45 PM Joe Damato <jdamato@fastly.com> wrote:
> [...]
>
> All that said:
>
> Tested-by: Joe Damato <jdamato@fastly.com>

Thanks, I've sent a patch formally. Hopefully this can slip into 6.10
before it ships.

- Kyle