linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf] libbpf: eBPF fails on events with auxiliary data
@ 2025-07-25  9:34 Thomas Richter
  2025-07-25 14:06 ` Jiri Olsa
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Richter @ 2025-07-25  9:34 UTC (permalink / raw)
  To: linux-kernel, linux-s390, linux-perf-users, acme, namhyung, bpf
  Cc: agordeev, gor, sumanthk, hca, japo, Thomas Richter,
	Ilya Leoshkevich

On linux-next
commit b4c658d4d63d61 ("perf target: Remove uid from target")
introduces a regression on s390. In fact the regression exists
on all platforms when the event supports auxiliary data gathering.

Command
   # ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
   [ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.011 MB perf.data ]
   # ./perf report --stats | grep SAMPLE
   #

does not generate samples in the perf.data file.
On x86 command 
  # sudo perf record -e intel_pt// -u 0 ls
is broken too.

Looking at the sequence of calls in 'perf record' reveals this
behavior:
1. The event 'cycles' is created and enabled:
   record__open()
   +-> evlist__apply_filters()
       +-> perf_bpf_filter__prepare()
	   +-> bpf_program.attach_perf_event()
	       +-> bpf_program.attach_perf_event_opts()
	           +-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...)
   The event 'cycles' is enabled and active now. However the event's
   ring-buffer to store the samples generated by hardware is not
   allocated yet. This happens now after enabling the event:

2. The event's fd is mmap() to create the ring buffer:
   record__open()
   +-> record__mmap()
       +-> record__mmap_evlist()
	   +-> evlist__mmap_ex()
	       +-> perf_evlist__mmap_ops()
	           +-> mmap_per_cpu()
	               +-> mmap_per_evsel()
	                   +-> mmap__mmap()
	                       +-> perf_mmap__mmap()
	                           +-> mmap()

   This allocates the ring-buffer for the event 'cycles'.  With mmap()
   the kernel creates the ring buffer:

   perf_mmap(): kernel function to create the event's ring
   |            buffer to save the sampled data.
   |
   +-> ring_buffer_attach(): Allocates memory for ring buffer.
       |        The PMU has auxiliary data setup function. The
       |        has_aux(event) condition is true and the PMU's
       |        stop() is called to stop sampling. It is not
       |        restarted:
       |        if (has_aux(event))
       |                perf_event_stop(event, 0);
       |
       +-> cpumsf_pmu_stop():

   Hardware sampling is stopped. No samples are generated and saved
   anymore.

3. After the event 'cycles' has been mapped, the event is enabled a
   second time in:
   __cmd_record()
   +-> evlist__enable()
       +-> __evlist__enable()
	   +-> evsel__enable_cpu()
	       +-> perf_evsel__enable_cpu()
	           +-> perf_evsel__run_ioctl()
	               +-> perf_evsel__ioctl()
	                   +-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .)
   The second
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
   is just a NOP in this case. The first invocation in (1.) sets the
   event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions
   perf_ioctl()
   +-> _perf_ioctl()
       +-> _perf_event_enable()
           +-> __perf_event_enable() returns immediately because
	              event::state is already set to
		      PERF_EVENT_STATE_ACTIVE.

This happens on s390, because the event 'cycles' offers the possibility
to save auxilary data. The PMU call backs setup_aux() and
free_aux() are defined. Without both call back functions,
cpumsf_pmu_stop() is not invoked and sampling continues.

To remedy this, remove the first invocation of
   ioctl(..., PERF_EVENT_IOC_ENABLE, ...).
in step (1.) Create the event in step (1.) and enable it in step (3.)
after the ring buffer has been mapped.

Output after:
 # ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2
 [ perf record: Woken up 3 times to write data ]
 [ perf record: Captured and wrote 0.876 MB perf.data ]
 # ./perf  report --stats | grep SAMPLE
              SAMPLE events:      16200  (99.5%)
              SAMPLE events:      16200
 #

The software event succeeded before and after the patch:
 # ./perf record -e cpu-clock -aB --synth=no -u 0 -- ./perf test -w thloop 2
 [ perf record: Woken up 7 times to write data ]
 [ perf record: Captured and wrote 2.870 MB perf.data ]
 # ./perf  report --stats | grep SAMPLE
              SAMPLE events:      53506  (99.8%)
              SAMPLE events:      53506
 #

Fixes: 63f2f5ee856ba ("libbpf: add ability to attach/detach BPF program to perf event")
To: Andrii Nakryiko <andriin@fb.com>
To: Ian Rogers <irogers@google.com>
To: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
 tools/lib/bpf/libbpf.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 162ebd16a59f..5973412a1031 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -10960,12 +10960,6 @@ struct bpf_link *bpf_program__attach_perf_event_opts(const struct bpf_program *p
 		}
 		link->link.fd = pfd;
 	}
-	if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
-		err = -errno;
-		pr_warn("prog '%s': failed to enable perf_event FD %d: %s\n",
-			prog->name, pfd, errstr(err));
-		goto err_out;
-	}
 
 	return &link->link;
 err_out:
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH bpf] libbpf: eBPF fails on events with auxiliary data
  2025-07-25  9:34 [PATCH bpf] libbpf: eBPF fails on events with auxiliary data Thomas Richter
@ 2025-07-25 14:06 ` Jiri Olsa
  2025-07-31  7:41   ` Thomas Richter
  0 siblings, 1 reply; 3+ messages in thread
From: Jiri Olsa @ 2025-07-25 14:06 UTC (permalink / raw)
  To: Thomas Richter
  Cc: linux-kernel, linux-s390, linux-perf-users, acme, namhyung, bpf,
	agordeev, gor, sumanthk, hca, japo, Ilya Leoshkevich

On Fri, Jul 25, 2025 at 11:34:05AM +0200, Thomas Richter wrote:
> On linux-next
> commit b4c658d4d63d61 ("perf target: Remove uid from target")
> introduces a regression on s390. In fact the regression exists
> on all platforms when the event supports auxiliary data gathering.
> 
> Command
>    # ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
>    [ perf record: Woken up 1 times to write data ]
>    [ perf record: Captured and wrote 0.011 MB perf.data ]
>    # ./perf report --stats | grep SAMPLE
>    #
> 
> does not generate samples in the perf.data file.
> On x86 command 
>   # sudo perf record -e intel_pt// -u 0 ls
> is broken too.
> 
> Looking at the sequence of calls in 'perf record' reveals this
> behavior:
> 1. The event 'cycles' is created and enabled:
>    record__open()
>    +-> evlist__apply_filters()
>        +-> perf_bpf_filter__prepare()
> 	   +-> bpf_program.attach_perf_event()
> 	       +-> bpf_program.attach_perf_event_opts()
> 	           +-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...)
>    The event 'cycles' is enabled and active now. However the event's
>    ring-buffer to store the samples generated by hardware is not
>    allocated yet. This happens now after enabling the event:
> 
> 2. The event's fd is mmap() to create the ring buffer:
>    record__open()
>    +-> record__mmap()
>        +-> record__mmap_evlist()
> 	   +-> evlist__mmap_ex()
> 	       +-> perf_evlist__mmap_ops()
> 	           +-> mmap_per_cpu()
> 	               +-> mmap_per_evsel()
> 	                   +-> mmap__mmap()
> 	                       +-> perf_mmap__mmap()
> 	                           +-> mmap()
> 
>    This allocates the ring-buffer for the event 'cycles'.  With mmap()
>    the kernel creates the ring buffer:
> 
>    perf_mmap(): kernel function to create the event's ring
>    |            buffer to save the sampled data.
>    |
>    +-> ring_buffer_attach(): Allocates memory for ring buffer.
>        |        The PMU has auxiliary data setup function. The
>        |        has_aux(event) condition is true and the PMU's
>        |        stop() is called to stop sampling. It is not
>        |        restarted:
>        |        if (has_aux(event))
>        |                perf_event_stop(event, 0);
>        |
>        +-> cpumsf_pmu_stop():
> 
>    Hardware sampling is stopped. No samples are generated and saved
>    anymore.
> 
> 3. After the event 'cycles' has been mapped, the event is enabled a
>    second time in:
>    __cmd_record()
>    +-> evlist__enable()
>        +-> __evlist__enable()
> 	   +-> evsel__enable_cpu()
> 	       +-> perf_evsel__enable_cpu()
> 	           +-> perf_evsel__run_ioctl()
> 	               +-> perf_evsel__ioctl()
> 	                   +-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .)
>    The second
>       ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
>    is just a NOP in this case. The first invocation in (1.) sets the
>    event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions
>    perf_ioctl()
>    +-> _perf_ioctl()
>        +-> _perf_event_enable()
>            +-> __perf_event_enable() returns immediately because
> 	              event::state is already set to
> 		      PERF_EVENT_STATE_ACTIVE.
> 
> This happens on s390, because the event 'cycles' offers the possibility
> to save auxilary data. The PMU call backs setup_aux() and
> free_aux() are defined. Without both call back functions,
> cpumsf_pmu_stop() is not invoked and sampling continues.
> 
> To remedy this, remove the first invocation of
>    ioctl(..., PERF_EVENT_IOC_ENABLE, ...).
> in step (1.) Create the event in step (1.) and enable it in step (3.)
> after the ring buffer has been mapped.
> 
> Output after:
>  # ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2
>  [ perf record: Woken up 3 times to write data ]
>  [ perf record: Captured and wrote 0.876 MB perf.data ]
>  # ./perf  report --stats | grep SAMPLE
>               SAMPLE events:      16200  (99.5%)
>               SAMPLE events:      16200
>  #
> 
> The software event succeeded before and after the patch:
>  # ./perf record -e cpu-clock -aB --synth=no -u 0 -- ./perf test -w thloop 2
>  [ perf record: Woken up 7 times to write data ]
>  [ perf record: Captured and wrote 2.870 MB perf.data ]
>  # ./perf  report --stats | grep SAMPLE
>               SAMPLE events:      53506  (99.8%)
>               SAMPLE events:      53506
>  #
> 
> Fixes: 63f2f5ee856ba ("libbpf: add ability to attach/detach BPF program to perf event")
> To: Andrii Nakryiko <andriin@fb.com>
> To: Ian Rogers <irogers@google.com>
> To: Ilya Leoshkevich <iii@linux.ibm.com>
> Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---
>  tools/lib/bpf/libbpf.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 162ebd16a59f..5973412a1031 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -10960,12 +10960,6 @@ struct bpf_link *bpf_program__attach_perf_event_opts(const struct bpf_program *p
>  		}
>  		link->link.fd = pfd;
>  	}
> -	if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
> -		err = -errno;
> -		pr_warn("prog '%s': failed to enable perf_event FD %d: %s\n",
> -			prog->name, pfd, errstr(err));
> -		goto err_out;
> -	}

I think this might break existing users depending on this

could we instead add some 'enable' flag to bpf_perf_event_opts and perf
would use bpf_program__attach_perf_event_opts function instead?

jirka

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH bpf] libbpf: eBPF fails on events with auxiliary data
  2025-07-25 14:06 ` Jiri Olsa
@ 2025-07-31  7:41   ` Thomas Richter
  0 siblings, 0 replies; 3+ messages in thread
From: Thomas Richter @ 2025-07-31  7:41 UTC (permalink / raw)
  To: Jiri Olsa, Ian Rogers, Ilya Leoshkevich
  Cc: linux-kernel, linux-s390, linux-perf-users, acme, namhyung, bpf,
	agordeev, gor, sumanthk, hca, japo

On 7/25/25 16:06, Jiri Olsa wrote:
> On Fri, Jul 25, 2025 at 11:34:05AM +0200, Thomas Richter wrote:
>> On linux-next
>> commit b4c658d4d63d61 ("perf target: Remove uid from target")
>> introduces a regression on s390. In fact the regression exists
>> on all platforms when the event supports auxiliary data gathering.
>>
>> Command
>>    # ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
>>    [ perf record: Woken up 1 times to write data ]
>>    [ perf record: Captured and wrote 0.011 MB perf.data ]
>>    # ./perf report --stats | grep SAMPLE
>>    #

....

>> -		goto err_out;
>> -	}
> 
> I think this might break existing users depending on this
> 
> could we instead add some 'enable' flag to bpf_perf_event_opts and perf
> would use bpf_program__attach_perf_event_opts function instead?
> 
> jirka
> 

Hi Jiri, Ilya and Ian,

Jiri recommended a more flexible approach and I submitted version 2
https://lore.kernel.org/all/20250728144340.711196-1-tmricht@linux.ibm.com

This version now also passed the eBPF ci-test suite.

I am blind on the eBPF topic and do not have enough knowledge to dig deeper into eBPF.
Please note that this is not an s390 specific issue. It happens on other platforms
when the event supports auxiliary data gathering, i. e. has_aux() returns true.

I would like to ask someone with more eBPF knowledge for some help to drive this further.
Thanks a lot.
-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
--
IBM Deutschland Research & Development GmbH

Vorsitzender des Aufsichtsrats: Wolfgang Wendt

Geschäftsführung: David Faller

Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-07-31  7:41 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-25  9:34 [PATCH bpf] libbpf: eBPF fails on events with auxiliary data Thomas Richter
2025-07-25 14:06 ` Jiri Olsa
2025-07-31  7:41   ` Thomas Richter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).