* [PATCH v4 0/2] perf/s390: Regression: Move uid filtering to BPF filters
@ 2025-08-06 11:40 Ilya Leoshkevich
2025-08-06 11:40 ` [PATCH v4 1/2] libbpf: Add the ability to suppress perf event enablement Ilya Leoshkevich
2025-08-06 11:40 ` [PATCH v4 2/2] perf bpf-filter: Enable events manually Ilya Leoshkevich
0 siblings, 2 replies; 7+ messages in thread
From: Ilya Leoshkevich @ 2025-08-06 11:40 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Ian Rogers,
Arnaldo Carvalho de Melo
Cc: bpf, linux-perf-users, linux-kernel, linux-s390, Thomas Richter,
Jiri Olsa, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Ilya Leoshkevich
v3: https://lore.kernel.org/bpf/20250805130346.1225535-1-iii@linux.ibm.com/
v3 -> v4: Rename the new field to dont_enable (Alexei, Eduard).
Switch the Fixes: tag in patch 2 (Alexander, Thomas).
Fix typos in the cover letter (Thomas).
v2: https://lore.kernel.org/bpf/20250728144340.711196-1-tmricht@linux.ibm.com/
v2 -> v3: Use no_ioctl_enable in perf.
v1: https://lore.kernel.org/bpf/20250725093405.3629253-1-tmricht@linux.ibm.com/
v1 -> v2: Introduce no_ioctl_enable (Jiri).
Hi,
This series fixes a regression caused by moving UID filtering to BPF.
The regression affects all events that support auxiliary data, most
notably, "cycles" events on s390, but also PT events on Intel. The
symptom is missing events when UID filtering is enabled.
Patch 1 introduces a new option for the
bpf_program__attach_perf_event_opts() function.
Patch 2 makes use of it in perf; its commit message explains in detail
why the problem occurs.
Thanks to Thomas Richter for the investigation and the initial version
of this fix, and to Jiri Olsa for suggestions.
Best regards,
Ilya
Ilya Leoshkevich (2):
libbpf: Add the ability to suppress perf event enablement
perf bpf-filter: Enable events manually
tools/lib/bpf/libbpf.c | 13 ++++++++-----
tools/lib/bpf/libbpf.h | 4 +++-
tools/perf/util/bpf-filter.c | 5 ++++-
3 files changed, 15 insertions(+), 7 deletions(-)
--
2.50.1
* [PATCH v4 1/2] libbpf: Add the ability to suppress perf event enablement
From: Ilya Leoshkevich @ 2025-08-06 11:40 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Ian Rogers,
Arnaldo Carvalho de Melo
Cc: bpf, linux-perf-users, linux-kernel, linux-s390, Thomas Richter,
Jiri Olsa, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Ilya Leoshkevich, Eduard Zingerman
Automatically enabling a perf event after attaching a BPF prog to it is
not always desirable.
Add a new no_ioctl_enable field to struct bpf_perf_event_opts. While
introducing ioctl_enable instead would be nicer in that it would avoid
a double negation in the implementation, it would make
DECLARE_LIBBPF_OPTS() less efficient.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
tools/lib/bpf/libbpf.c | 13 ++++++++-----
tools/lib/bpf/libbpf.h | 4 +++-
2 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index fb4d92c5c339..8f5a81b672e1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -10965,11 +10965,14 @@ struct bpf_link *bpf_program__attach_perf_event_opts(const struct bpf_program *p
}
link->link.fd = pfd;
}
- if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
- err = -errno;
- pr_warn("prog '%s': failed to enable perf_event FD %d: %s\n",
- prog->name, pfd, errstr(err));
- goto err_out;
+
+ if (!OPTS_GET(opts, dont_enable, false)) {
+ if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
+ err = -errno;
+ pr_warn("prog '%s': failed to enable perf_event FD %d: %s\n",
+ prog->name, pfd, errstr(err));
+ goto err_out;
+ }
}
return &link->link;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index d1cf813a057b..455a957cb702 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -499,9 +499,11 @@ struct bpf_perf_event_opts {
__u64 bpf_cookie;
/* don't use BPF link when attach BPF program */
bool force_ioctl_attach;
+ /* don't automatically enable the event */
+ bool dont_enable;
size_t :0;
};
-#define bpf_perf_event_opts__last_field force_ioctl_attach
+#define bpf_perf_event_opts__last_field dont_enable
LIBBPF_API struct bpf_link *
bpf_program__attach_perf_event(const struct bpf_program *prog, int pfd);
--
2.50.1
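The effect of a negatively named, default-false option can be
illustrated with a small standalone model. This is only a sketch:
struct demo_perf_event_opts, demo_get_dont_enable() and
demo_should_enable() are hypothetical names invented for this
illustration and are not libbpf API. The size check merely imitates,
under that assumption, how a size-tagged options struct lets a library
ignore fields that an older caller does not know about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an options struct; NOT the libbpf type. */
struct demo_perf_event_opts {
	size_t sz;                     /* size of the struct as the caller sees it */
	unsigned long long bpf_cookie;
	bool force_ioctl_attach;
	bool dont_enable;              /* new option, appended after older fields */
};

/* New fields are appended so that an older, smaller struct stays a valid prefix. */
static_assert(offsetof(struct demo_perf_event_opts, dont_enable) >
	      offsetof(struct demo_perf_event_opts, force_ioctl_attach),
	      "new option must come after existing fields");

/* Read dont_enable only if the caller's struct is large enough to hold it. */
bool demo_get_dont_enable(const struct demo_perf_event_opts *opts)
{
	const size_t field_end =
		offsetof(struct demo_perf_event_opts, dont_enable) + sizeof(bool);

	if (opts == NULL || opts->sz < field_end)
		return false;          /* fallback: option not set */
	return opts->dont_enable;
}

/* True if the attach path should still enable the event itself. */
bool demo_should_enable(const struct demo_perf_event_opts *opts)
{
	return !demo_get_dont_enable(opts);
}
```

Because the option is phrased as a negation, callers that pass NULL or
a zero-initialized struct keep the previous behavior (the event is
enabled on attach); only callers that explicitly set the flag opt out.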
* [PATCH v4 2/2] perf bpf-filter: Enable events manually
From: Ilya Leoshkevich @ 2025-08-06 11:40 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Ian Rogers,
Arnaldo Carvalho de Melo
Cc: bpf, linux-perf-users, linux-kernel, linux-s390, Thomas Richter,
Jiri Olsa, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Ilya Leoshkevich
On s390, and, in general, on all platforms where the respective event
supports auxiliary data gathering, the command:
# ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data ]
# ./perf report --stats | grep SAMPLE
#
does not generate samples in the perf.data file. On x86 the command:
# sudo perf record -e intel_pt// -u 0 ls
is broken too.
Looking at the sequence of calls in 'perf record' reveals this
behavior:
1. The event 'cycles' is created and enabled:
record__open()
+-> evlist__apply_filters()
+-> perf_bpf_filter__prepare()
+-> bpf_program.attach_perf_event()
+-> bpf_program.attach_perf_event_opts()
+-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...)
The event 'cycles' is now enabled and active. However, the ring buffer
that stores the samples generated by the hardware has not been
allocated yet.
2. The event's fd is mmap()ed to create the ring buffer:
record__open()
+-> record__mmap()
+-> record__mmap_evlist()
+-> evlist__mmap_ex()
+-> perf_evlist__mmap_ops()
+-> mmap_per_cpu()
+-> mmap_per_evsel()
+-> mmap__mmap()
+-> perf_mmap__mmap()
+-> mmap()
This allocates the ring buffer for the event 'cycles'. With mmap()
the kernel creates the ring buffer:
perf_mmap(): kernel function to create the event's ring
| buffer to save the sampled data.
|
+-> ring_buffer_attach(): Allocates memory for ring buffer.
| The PMU has an auxiliary data setup function. The
| has_aux(event) condition is true and the PMU's
| stop() is called to stop sampling. It is not
| restarted:
|
| if (has_aux(event))
| perf_event_stop(event, 0);
|
+-> cpumsf_pmu_stop():
Hardware sampling is stopped. No samples are generated and saved
anymore.
3. After the event 'cycles' has been mapped, the event is enabled a
second time in:
__cmd_record()
+-> evlist__enable()
+-> __evlist__enable()
+-> evsel__enable_cpu()
+-> perf_evsel__enable_cpu()
+-> perf_evsel__run_ioctl()
+-> perf_evsel__ioctl()
+-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .)
The second
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
is just a NOP in this case. The first invocation in (1.) sets the
event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions
perf_ioctl()
+-> _perf_ioctl()
+-> _perf_event_enable()
+-> __perf_event_enable()
return immediately because event::state is already set to
PERF_EVENT_STATE_ACTIVE.
This happens on s390 because the event 'cycles' can save auxiliary
data: the PMU callbacks setup_aux() and free_aux() are defined. Without
these callbacks, cpumsf_pmu_stop() is not invoked and sampling
continues.
To remedy this, remove the first invocation of
  ioctl(..., PERF_EVENT_IOC_ENABLE, ...)
in step (1.). Create the event in step (1.) and enable it only in step
(3.), after the ring buffer has been mapped.
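The three steps above can be replayed with a toy state machine. This is
a minimal sketch under simplifying assumptions, not kernel code: struct
demo_event and the demo_*() functions are hypothetical names, the
has_aux flag stands in for a PMU that defines setup_aux()/free_aux(),
and hw_sampling stands in for whether the hardware is actually
producing samples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_state { DEMO_OFF, DEMO_ACTIVE };

/* Hypothetical model of a perf event; NOT the kernel's struct perf_event. */
struct demo_event {
	enum demo_state state; /* models event::state */
	bool has_aux;          /* PMU offers auxiliary data callbacks */
	bool hw_sampling;      /* hardware is currently generating samples */
	bool ring_buffer;      /* ring buffer has been mapped */
};

/* Models PERF_EVENT_IOC_ENABLE: returns early if the event is already active. */
void demo_enable(struct demo_event *ev)
{
	assert(ev != NULL);
	if (ev->state == DEMO_ACTIVE)
		return;
	ev->state = DEMO_ACTIVE;
	ev->hw_sampling = true;
}

/* Models mmap(): maps the ring buffer; an aux-capable PMU is stopped
 * and not restarted, while event::state is left untouched. */
void demo_mmap(struct demo_event *ev)
{
	assert(ev != NULL);
	ev->ring_buffer = true;
	if (ev->has_aux)
		ev->hw_sampling = false;
}

/* Old order: enable on attach (1.), mmap (2.), enable again (3.). */
bool demo_record_before_fix(bool has_aux)
{
	struct demo_event ev = { .state = DEMO_OFF, .has_aux = has_aux };

	demo_enable(&ev);  /* step 1: attach enables the event */
	demo_mmap(&ev);    /* step 2: sampling stopped if has_aux */
	demo_enable(&ev);  /* step 3: no-op, state is already ACTIVE */
	return ev.ring_buffer && ev.hw_sampling;
}

/* New order: attach without enabling (1.), mmap (2.), enable (3.). */
bool demo_record_after_fix(bool has_aux)
{
	struct demo_event ev = { .state = DEMO_OFF, .has_aux = has_aux };

	demo_mmap(&ev);    /* step 2: nothing to stop, event not enabled yet */
	demo_enable(&ev);  /* step 3: first and only enable */
	return ev.ring_buffer && ev.hw_sampling;
}
```

In this model only the combination of the old order and an aux-capable
PMU ends up with a mapped ring buffer but no sampling, which matches the
observation that the software event cpu-clock worked both before and
after the patch.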
Output after:
# ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.876 MB perf.data ]
# ./perf report --stats | grep SAMPLE
SAMPLE events: 16200 (99.5%)
SAMPLE events: 16200
#
The software event succeeded both before and after the patch:
# ./perf record -e cpu-clock -aB --synth=no -u 0 -- \
./perf test -w thloop 2
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 2.870 MB perf.data ]
# ./perf report --stats | grep SAMPLE
SAMPLE events: 53506 (99.8%)
SAMPLE events: 53506
#
Fixes: b4c658d4d63d61 ("perf target: Remove uid from target")
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
tools/perf/util/bpf-filter.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
index d0e013eeb0f7..a0b11f35395f 100644
--- a/tools/perf/util/bpf-filter.c
+++ b/tools/perf/util/bpf-filter.c
@@ -451,6 +451,8 @@ int perf_bpf_filter__prepare(struct evsel *evsel, struct target *target)
struct bpf_link *link;
struct perf_bpf_filter_entry *entry;
bool needs_idx_hash = !target__has_cpu(target);
+ DECLARE_LIBBPF_OPTS(bpf_perf_event_opts, pe_opts,
+ .dont_enable = true);
entry = calloc(MAX_FILTERS, sizeof(*entry));
if (entry == NULL)
@@ -522,7 +524,8 @@ int perf_bpf_filter__prepare(struct evsel *evsel, struct target *target)
prog = skel->progs.perf_sample_filter;
for (x = 0; x < xyarray__max_x(evsel->core.fd); x++) {
for (y = 0; y < xyarray__max_y(evsel->core.fd); y++) {
- link = bpf_program__attach_perf_event(prog, FD(evsel, x, y));
+ link = bpf_program__attach_perf_event_opts(prog, FD(evsel, x, y),
+ &pe_opts);
if (IS_ERR(link)) {
pr_err("Failed to attach perf sample-filter program\n");
ret = PTR_ERR(link);
--
2.50.1
* Re: [PATCH v4 1/2] libbpf: Add the ability to suppress perf event enablement
From: Yonghong Song @ 2025-08-06 15:25 UTC (permalink / raw)
To: Ilya Leoshkevich, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Ian Rogers, Arnaldo Carvalho de Melo
Cc: bpf, linux-perf-users, linux-kernel, linux-s390, Thomas Richter,
Jiri Olsa, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Eduard Zingerman
On 8/6/25 4:40 AM, Ilya Leoshkevich wrote:
> Automatically enabling a perf event after attaching a BPF prog to it is
> not always desirable.
>
> Add a new no_ioctl_enable field to struct bpf_perf_event_opts. While
The commit message still says no_ioctl_enable; it should be dont_enable.
> introducing ioctl_enable instead would be nicer in that it would avoid
> a double negation in the implementation, it would make
> DECLARE_LIBBPF_OPTS() less efficient.
> [...]
* Re: [PATCH v4 2/2] perf bpf-filter: Enable events manually
From: Namhyung Kim @ 2025-08-06 22:53 UTC (permalink / raw)
To: Ilya Leoshkevich
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Ian Rogers,
Arnaldo Carvalho de Melo, bpf, linux-perf-users, linux-kernel,
linux-s390, Thomas Richter, Jiri Olsa, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev
Hello,
On Wed, Aug 06, 2025 at 01:40:35PM +0200, Ilya Leoshkevich wrote:
> [...]
Acked-by: Namhyung Kim <namhyung@kernel.org>
Thanks,
Namhyung
* Re: [PATCH v4 2/2] perf bpf-filter: Enable events manually
From: Alexei Starovoitov @ 2025-08-06 23:38 UTC (permalink / raw)
To: Namhyung Kim
Cc: Ilya Leoshkevich, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Ian Rogers, Arnaldo Carvalho de Melo, bpf,
linux-perf-use., LKML, linux-s390, Thomas Richter, Jiri Olsa,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
On Wed, Aug 6, 2025 at 3:53 PM Namhyung Kim <namhyung@kernel.org> wrote:
> [...]
> Acked-by: Namhyung Kim <namhyung@kernel.org>
Do you mind if I take the whole set through the bpf tree ?
I'm planning to send bpf PR in a couple days, so by -rc1
all trees will see the fix.
* Re: [PATCH v4 2/2] perf bpf-filter: Enable events manually
From: Namhyung Kim @ 2025-08-07 5:02 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Ilya Leoshkevich, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Ian Rogers, Arnaldo Carvalho de Melo, bpf,
linux-perf-use., LKML, linux-s390, Thomas Richter, Jiri Olsa,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
Hi Alexei,
On Wed, Aug 06, 2025 at 04:38:09PM -0700, Alexei Starovoitov wrote:
> On Wed, Aug 6, 2025 at 3:53 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > [...]
> > Acked-by: Namhyung Kim <namhyung@kernel.org>
>
> Do you mind if I take the whole set through the bpf tree ?
>
> I'm planning to send bpf PR in a couple days, so by -rc1
> all trees will see the fix.
Sure, I don't think we have conflicting changes and we'll sync
perf-tools-next once -rc1 is released.
Thanks,
Namhyung