* [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable
@ 2025-03-07 19:33 Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 1/7] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
` (10 more replies)
0 siblings, 11 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Hi!
Much smaller this time because Ingo merged a whole lot of the preparatory
patches, thanks!
This appears to survive a few hours of perf_fuzzer combined with the tinypmu testcase.
Patches also at:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/pmu-unregister
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v3 1/7] perf: Ensure bpf_perf_link path is properly serialized
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 2/7] perf: Simplify child event tear-down Peter Zijlstra
` (9 subsequent siblings)
10 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6087,6 +6087,9 @@ static int perf_event_set_output(struct
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
static int perf_copy_attr(struct perf_event_attr __user *uattr,
struct perf_event_attr *attr);
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie);
static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
@@ -6149,7 +6152,7 @@ static long _perf_ioctl(struct perf_even
if (IS_ERR(prog))
return PTR_ERR(prog);
- err = perf_event_set_bpf_prog(event, prog, 0);
+ err = __perf_event_set_bpf_prog(event, prog, 0);
if (err) {
bpf_prog_put(prog);
return err;
@@ -10875,8 +10878,9 @@ static inline bool perf_event_is_tracing
return false;
}
-int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
- u64 bpf_cookie)
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie)
{
bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
@@ -10914,6 +10918,20 @@ int perf_event_set_bpf_prog(struct perf_
return perf_event_attach_bpf_prog(event, prog, bpf_cookie);
}
+int perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie)
+{
+ struct perf_event_context *ctx;
+ int ret;
+
+ ctx = perf_event_ctx_lock(event);
+ ret = __perf_event_set_bpf_prog(event, prog, bpf_cookie);
+ perf_event_ctx_unlock(event, ctx);
+
+ return ret;
+}
+
void perf_event_free_bpf_prog(struct perf_event *event)
{
if (!event->prog)
@@ -10936,7 +10954,15 @@ static void perf_event_free_filter(struc
{
}
-int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie)
+{
+ return -ENOENT;
+}
+
+int perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
u64 bpf_cookie)
{
return -ENOENT;
* [PATCH v3 2/7] perf: Simplify child event tear-down
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 1/7] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait Peter Zijlstra
` (8 subsequent siblings)
10 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Currently perf_event_release_kernel() will iterate the child events and attempt
tear-down. However, it removes them from the child_list using list_move(),
notably skipping the state management done by perf_child_detach().
Crucially, it fails to clear PERF_ATTACH_CHILD, which leaves the door
open for a concurrent perf_remove_from_context() to race.
Fix this by letting perf_remove_from_context() do the detach through
DETACH_CHILD; this way child_list management stays fully serialized
using child_mutex.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2338,7 +2338,11 @@ static void perf_child_detach(struct per
if (WARN_ON_ONCE(!parent_event))
return;
+ /*
+ * Can't check this from an IPI, the holder is likely another CPU.
+ *
lockdep_assert_held(&parent_event->child_mutex);
+ */
sync_child_event(event);
list_del_init(&event->child_list);
@@ -5611,8 +5615,8 @@ int perf_event_release_kernel(struct per
tmp = list_first_entry_or_null(&event->child_list,
struct perf_event, child_list);
if (tmp == child) {
- perf_remove_from_context(child, DETACH_GROUP);
- list_move(&child->child_list, &free_list);
+ perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
+ list_add(&child->child_list, &free_list);
/*
* This matches the refcount bump in inherit_event();
* this can't be the last reference.
* [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 1/7] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 2/7] perf: Simplify child event tear-down Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-03-17 6:49 ` Ravi Bangoria
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 4/7] perf: Simplify perf_event_release_kernel() Peter Zijlstra
` (7 subsequent siblings)
10 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Simplify the code by moving the duplicated wakeup condition into
put_ctx().
Notably, the wait_var_event() in question sits in perf_event_free_task(),
which will have set ctx->task = TASK_TOMBSTONE by then.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 32 ++++++++------------------------
1 file changed, 8 insertions(+), 24 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1223,8 +1223,14 @@ static void put_ctx(struct perf_event_co
if (refcount_dec_and_test(&ctx->refcount)) {
if (ctx->parent_ctx)
put_ctx(ctx->parent_ctx);
- if (ctx->task && ctx->task != TASK_TOMBSTONE)
- put_task_struct(ctx->task);
+ if (ctx->task) {
+ if (ctx->task == TASK_TOMBSTONE) {
+ smp_mb(); /* pairs with wait_var_event() */
+ wake_up_var(&ctx->refcount);
+ } else {
+ put_task_struct(ctx->task);
+ }
+ }
call_rcu(&ctx->rcu_head, free_ctx);
}
}
@@ -5492,8 +5498,6 @@ int perf_event_release_kernel(struct per
again:
mutex_lock(&event->child_mutex);
list_for_each_entry(child, &event->child_list, child_list) {
- void *var = NULL;
-
/*
* Cannot change, child events are not migrated, see the
* comment with perf_event_ctx_lock_nested().
@@ -5533,39 +5537,19 @@ int perf_event_release_kernel(struct per
* this can't be the last reference.
*/
put_event(event);
- } else {
- var = &ctx->refcount;
}
mutex_unlock(&event->child_mutex);
mutex_unlock(&ctx->mutex);
put_ctx(ctx);
- if (var) {
- /*
- * If perf_event_free_task() has deleted all events from the
- * ctx while the child_mutex got released above, make sure to
- * notify about the preceding put_ctx().
- */
- smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(var);
- }
goto again;
}
mutex_unlock(&event->child_mutex);
list_for_each_entry_safe(child, tmp, &free_list, child_list) {
- void *var = &child->ctx->refcount;
-
list_del(&child->child_list);
free_event(child);
-
- /*
- * Wake any perf_event_free_task() waiting for this event to be
- * freed.
- */
- smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(var);
}
no_ctx:
* [PATCH v3 4/7] perf: Simplify perf_event_release_kernel()
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (2 preceding siblings ...)
2025-03-07 19:33 ` [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context() Peter Zijlstra
` (6 subsequent siblings)
10 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
There is no good reason to have the free list anymore. It is possible
to call free_event() after the locks have been dropped in the main
loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5462,7 +5462,6 @@ int perf_event_release_kernel(struct per
{
struct perf_event_context *ctx = event->ctx;
struct perf_event *child, *tmp;
- LIST_HEAD(free_list);
/*
* If we got here through err_alloc: free_event(event); we will not
@@ -5531,27 +5530,26 @@ int perf_event_release_kernel(struct per
struct perf_event, child_list);
if (tmp == child) {
perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
- list_add(&child->child_list, &free_list);
/*
* This matches the refcount bump in inherit_event();
* this can't be the last reference.
*/
put_event(event);
+ } else {
+ child = NULL;
}
mutex_unlock(&event->child_mutex);
mutex_unlock(&ctx->mutex);
+
+ if (child)
+ free_event(child);
put_ctx(ctx);
goto again;
}
mutex_unlock(&event->child_mutex);
- list_for_each_entry_safe(child, tmp, &free_list, child_list) {
- list_del(&child->child_list);
- free_event(child);
- }
-
no_ctx:
put_event(event); /* Must be the 'last' reference */
return 0;
* [PATCH v3 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context()
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (3 preceding siblings ...)
2025-03-07 19:33 ` [PATCH v3 4/7] perf: Simplify perf_event_release_kernel() Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-03-10 15:35 ` [PATCH v3a " Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child) Peter Zijlstra
` (5 subsequent siblings)
10 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Both perf_event_free_task() and perf_event_exit_task_context() are
very similar, except perf_event_exit_task_context() is a little more
generic and makes fewer assumptions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 88 ++++++++++++---------------------------------------
1 file changed, 22 insertions(+), 66 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13488,7 +13488,7 @@ perf_event_exit_event(struct perf_event
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child)
+static void perf_event_exit_task_context(struct task_struct *child, bool exit)
{
struct perf_event_context *child_ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
@@ -13539,13 +13539,31 @@ static void perf_event_exit_task_context
* won't get any samples after PERF_RECORD_EXIT. We can however still
* get a few PERF_RECORD_READ events.
*/
- perf_event_task(child, child_ctx, 0);
+ if (exit)
+ perf_event_task(child, child_ctx, 0);
list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
perf_event_exit_event(child_event, child_ctx);
mutex_unlock(&child_ctx->mutex);
+ if (!exit) {
+ /*
+ * perf_event_release_kernel() could still have a reference on
+ * this context. In that case we must wait for these events to
+ * have been freed (in particular all their references to this
+ * task must've been dropped).
+ *
+ * Without this copy_process() will unconditionally free this
+ * task (irrespective of its reference count) and
+ * _free_event()'s put_task_struct(event->hw.target) will be a
+ * use-after-free.
+ *
+ * Wait for all events to drop their context reference.
+ */
+ wait_var_event(&child_ctx->refcount,
+ refcount_read(&child_ctx->refcount) == 1);
+ }
put_ctx(child_ctx);
}
@@ -13573,7 +13591,7 @@ void perf_event_exit_task(struct task_st
}
mutex_unlock(&child->perf_event_mutex);
- perf_event_exit_task_context(child);
+ perf_event_exit_task_context(child, true);
/*
* The perf_event_exit_task_context calls perf_event_task
@@ -13584,27 +13602,6 @@ void perf_event_exit_task(struct task_st
perf_event_task(child, NULL, 0);
}
-static void perf_free_event(struct perf_event *event,
- struct perf_event_context *ctx)
-{
- struct perf_event *parent = event->parent;
-
- if (WARN_ON_ONCE(!parent))
- return;
-
- mutex_lock(&parent->child_mutex);
- list_del_init(&event->child_list);
- mutex_unlock(&parent->child_mutex);
-
- put_event(parent);
-
- raw_spin_lock_irq(&ctx->lock);
- perf_group_detach(event);
- list_del_event(event, ctx);
- raw_spin_unlock_irq(&ctx->lock);
- free_event(event);
-}
-
/*
* Free a context as created by inheritance by perf_event_init_task() below,
* used by fork() in case of fail.
@@ -13614,48 +13611,7 @@ static void perf_free_event(struct perf_
*/
void perf_event_free_task(struct task_struct *task)
{
- struct perf_event_context *ctx;
- struct perf_event *event, *tmp;
-
- ctx = rcu_access_pointer(task->perf_event_ctxp);
- if (!ctx)
- return;
-
- mutex_lock(&ctx->mutex);
- raw_spin_lock_irq(&ctx->lock);
- /*
- * Destroy the task <-> ctx relation and mark the context dead.
- *
- * This is important because even though the task hasn't been
- * exposed yet the context has been (through child_list).
- */
- RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
- WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(task); /* cannot be last */
- raw_spin_unlock_irq(&ctx->lock);
-
-
- list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
- perf_free_event(event, ctx);
-
- mutex_unlock(&ctx->mutex);
-
- /*
- * perf_event_release_kernel() could've stolen some of our
- * child events and still have them on its free_list. In that
- * case we must wait for these events to have been freed (in
- * particular all their references to this task must've been
- * dropped).
- *
- * Without this copy_process() will unconditionally free this
- * task (irrespective of its reference count) and
- * _free_event()'s put_task_struct(event->hw.target) will be a
- * use-after-free.
- *
- * Wait for all events to drop their context reference.
- */
- wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
- put_ctx(ctx); /* must be last */
+ perf_event_exit_task_context(task, false);
}
void perf_event_delayed_put(struct task_struct *task)
* [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (4 preceding siblings ...)
2025-03-07 19:33 ` [PATCH v3 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context() Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-03-10 11:08 ` Ravi Bangoria
2025-03-10 15:37 ` [PATCH v3a " Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
` (4 subsequent siblings)
10 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
The task passed to perf_event_exit_task() is not a child; it is
current. Fix this confusing naming, since much of the rest of the code
also relies on it being current.
Specifically, both exec() and exit() callers use it with current as
the argument.
Notably, task_ctx_sched_out() doesn't make much sense outside of
current.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 60 ++++++++++++++++++++++++++-------------------------
1 file changed, 31 insertions(+), 29 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13488,15 +13488,15 @@ perf_event_exit_event(struct perf_event
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child, bool exit)
+static void perf_event_exit_task_context(struct task_struct *task, bool exit)
{
- struct perf_event_context *child_ctx, *clone_ctx = NULL;
+ struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
- WARN_ON_ONCE(child != current);
+ WARN_ON_ONCE(task != current);
- child_ctx = perf_pin_task_context(child);
- if (!child_ctx)
+ ctx = perf_pin_task_context(task);
+ if (!ctx)
return;
/*
@@ -13509,27 +13509,27 @@ static void perf_event_exit_task_context
* without ctx::mutex (it cannot because of the move_group double mutex
* lock thing). See the comments in perf_install_in_context().
*/
- mutex_lock(&child_ctx->mutex);
+ mutex_lock(&ctx->mutex);
/*
* In a single ctx::lock section, de-schedule the events and detach the
* context from the task such that we cannot ever get it scheduled back
* in.
*/
- raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
+ raw_spin_lock_irq(&ctx->lock);
+ task_ctx_sched_out(ctx, NULL, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
* and mark the context dead.
*/
- RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
- put_ctx(child_ctx); /* cannot be last */
- WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
+ RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+ put_ctx(ctx); /* cannot be last */
+ WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
put_task_struct(current); /* cannot be last */
- clone_ctx = unclone_ctx(child_ctx);
- raw_spin_unlock_irq(&child_ctx->lock);
+ clone_ctx = unclone_ctx(ctx);
+ raw_spin_unlock_irq(&ctx->lock);
if (clone_ctx)
put_ctx(clone_ctx);
@@ -13540,12 +13540,12 @@ static void perf_event_exit_task_context
* get a few PERF_RECORD_READ events.
*/
if (exit)
- perf_event_task(child, child_ctx, 0);
+ perf_event_task(task, ctx, 0);
- list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
- perf_event_exit_event(child_event, child_ctx);
+ list_for_each_entry_safe(child_event, next, &ctx->event_list, event_entry)
+ perf_event_exit_event(child_event, ctx);
- mutex_unlock(&child_ctx->mutex);
+ mutex_unlock(&ctx->mutex);
if (!exit) {
/*
@@ -13561,24 +13561,26 @@ static void perf_event_exit_task_context
*
* Wait for all events to drop their context reference.
*/
- wait_var_event(&child_ctx->refcount,
- refcount_read(&child_ctx->refcount) == 1);
+ wait_var_event(&ctx->refcount,
+ refcount_read(&ctx->refcount) == 1);
}
- put_ctx(child_ctx);
+ put_ctx(ctx);
}
/*
- * When a child task exits, feed back event values to parent events.
+ * When a task exits, feed back event values to parent events.
*
* Can be called with exec_update_lock held when called from
* setup_new_exec().
*/
-void perf_event_exit_task(struct task_struct *child)
+void perf_event_exit_task(struct task_struct *task)
{
struct perf_event *event, *tmp;
- mutex_lock(&child->perf_event_mutex);
- list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+ WARN_ON_ONCE(task != current);
+
+ mutex_lock(&task->perf_event_mutex);
+ list_for_each_entry_safe(event, tmp, &task->perf_event_list,
owner_entry) {
list_del_init(&event->owner_entry);
@@ -13589,17 +13591,17 @@ void perf_event_exit_task(struct task_st
*/
smp_store_release(&event->owner, NULL);
}
- mutex_unlock(&child->perf_event_mutex);
+ mutex_unlock(&task->perf_event_mutex);
- perf_event_exit_task_context(child, true);
+ perf_event_exit_task_context(task, true);
/*
* The perf_event_exit_task_context calls perf_event_task
- * with child's task_ctx, which generates EXIT events for
- * child contexts and sets child->perf_event_ctxp[] to NULL.
+ * with task's task_ctx, which generates EXIT events for
+ * task contexts and sets task->perf_event_ctxp[] to NULL.
* At this point we need to send EXIT events to cpu contexts.
*/
- perf_event_task(child, NULL, 0);
+ perf_event_task(task, NULL, 0);
}
/*
* [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (5 preceding siblings ...)
2025-03-07 19:33 ` [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child) Peter Zijlstra
@ 2025-03-07 19:33 ` Peter Zijlstra
2025-03-10 15:35 ` Ravi Bangoria
` (3 more replies)
2025-03-17 6:54 ` [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Ravi Bangoria
` (3 subsequent siblings)
10 siblings, 4 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-07 19:33 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Previously it was only safe to call perf_pmu_unregister() if there
were no active events of that pmu around -- which was impossible to
guarantee since it races in all sorts of ways against
perf_init_event().
Rework the whole thing by:
- keeping track of all events for a given pmu
- 'hiding' the pmu from perf_init_event()
- waiting for the appropriate (s)rcu grace periods such that all
prior references to the PMU will have completed
- detaching all still-existing events of that pmu (see the first
point) and moving them to a new REVOKED state
- actually freeing the pmu data
Notably, the new REVOKED state must inhibit all event actions from
reaching code that wants to use event->pmu.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/perf_event.h | 15 +-
kernel/events/core.c | 294 ++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 274 insertions(+), 35 deletions(-)
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -325,6 +325,9 @@ struct perf_output_handle;
struct pmu {
struct list_head entry;
+ spinlock_t events_lock;
+ struct list_head events;
+
struct module *module;
struct device *dev;
struct device *parent;
@@ -632,9 +635,10 @@ struct perf_addr_filter_range {
* enum perf_event_state - the states of an event:
*/
enum perf_event_state {
- PERF_EVENT_STATE_DEAD = -4,
- PERF_EVENT_STATE_EXIT = -3,
- PERF_EVENT_STATE_ERROR = -2,
+ PERF_EVENT_STATE_DEAD = -5,
+ PERF_EVENT_STATE_REVOKED = -4, /* pmu gone, must not touch */
+ PERF_EVENT_STATE_EXIT = -3, /* task died, still inherit */
+ PERF_EVENT_STATE_ERROR = -2, /* scheduling error, can enable */
PERF_EVENT_STATE_OFF = -1,
PERF_EVENT_STATE_INACTIVE = 0,
PERF_EVENT_STATE_ACTIVE = 1,
@@ -875,6 +879,7 @@ struct perf_event {
void *security;
#endif
struct list_head sb_list;
+ struct list_head pmu_list;
/*
* Certain events gets forwarded to another pmu internally by over-
@@ -1132,7 +1137,7 @@ extern void perf_aux_output_flag(struct
extern void perf_event_itrace_started(struct perf_event *event);
extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
-extern void perf_pmu_unregister(struct pmu *pmu);
+extern int perf_pmu_unregister(struct pmu *pmu);
extern void __perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task);
@@ -1734,7 +1739,7 @@ static inline bool needs_branch_stack(st
static inline bool has_aux(struct perf_event *event)
{
- return event->pmu->setup_aux;
+ return event->pmu && event->pmu->setup_aux;
}
static inline bool has_aux_action(struct perf_event *event)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -207,6 +207,7 @@ static void perf_ctx_unlock(struct perf_
}
#define TASK_TOMBSTONE ((void *)-1L)
+#define EVENT_TOMBSTONE ((void *)-1L)
static bool is_kernel_event(struct perf_event *event)
{
@@ -2348,6 +2349,11 @@ static void perf_child_detach(struct per
sync_child_event(event);
list_del_init(&event->child_list);
+ /*
+ * Cannot set to NULL, as that would confuse the situation vs
+ * not being a child event. See for example unaccount_event().
+ */
+ event->parent = EVENT_TOMBSTONE;
}
static bool is_orphaned_event(struct perf_event *event)
@@ -2469,7 +2475,9 @@ ctx_time_update_event(struct perf_event_
#define DETACH_GROUP 0x01UL
#define DETACH_CHILD 0x02UL
-#define DETACH_DEAD 0x04UL
+#define DETACH_EXIT 0x04UL
+#define DETACH_REVOKE 0x08UL
+#define DETACH_DEAD 0x10UL
/*
* Cross CPU call to remove a performance event
@@ -2484,6 +2492,7 @@ __perf_remove_from_context(struct perf_e
void *info)
{
struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
+ enum perf_event_state state = PERF_EVENT_STATE_OFF;
unsigned long flags = (unsigned long)info;
ctx_time_update(cpuctx, ctx);
@@ -2492,16 +2501,22 @@ __perf_remove_from_context(struct perf_e
* Ensure event_sched_out() switches to OFF, at the very least
* this avoids raising perf_pending_task() at this time.
*/
- if (flags & DETACH_DEAD)
+ if (flags & DETACH_EXIT)
+ state = PERF_EVENT_STATE_EXIT;
+ if (flags & DETACH_REVOKE)
+ state = PERF_EVENT_STATE_REVOKED;
+ if (flags & DETACH_DEAD) {
event->pending_disable = 1;
+ state = PERF_EVENT_STATE_DEAD;
+ }
event_sched_out(event, ctx);
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
- if (flags & DETACH_DEAD)
- event->state = PERF_EVENT_STATE_DEAD;
+
+ event->state = min(event->state, state);
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
@@ -4560,7 +4575,8 @@ static void perf_event_enable_on_exec(st
static void perf_remove_from_owner(struct perf_event *event);
static void perf_event_exit_event(struct perf_event *event,
- struct perf_event_context *ctx);
+ struct perf_event_context *ctx,
+ bool revoke);
/*
* Removes all events from the current task that have been marked
@@ -4587,7 +4603,7 @@ static void perf_event_remove_on_exec(st
modified = true;
- perf_event_exit_event(event, ctx);
+ perf_event_exit_event(event, ctx, false);
}
raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -5187,6 +5203,7 @@ static bool is_sb_event(struct perf_even
attr->context_switch || attr->text_poke ||
attr->bpf_event)
return true;
+
return false;
}
@@ -5388,6 +5405,8 @@ static void perf_pending_task_sync(struc
/* vs perf_event_alloc() error */
static void __free_event(struct perf_event *event)
{
+ struct pmu *pmu = event->pmu;
+
if (event->attach_state & PERF_ATTACH_CALLCHAIN)
put_callchain_buffers();
@@ -5414,6 +5433,7 @@ static void __free_event(struct perf_eve
* put_pmu_ctx() needs an event->ctx reference, because of
* epc->ctx.
*/
+ WARN_ON_ONCE(!pmu);
WARN_ON_ONCE(!event->ctx);
WARN_ON_ONCE(event->pmu_ctx->ctx != event->ctx);
put_pmu_ctx(event->pmu_ctx);
@@ -5426,8 +5446,13 @@ static void __free_event(struct perf_eve
if (event->ctx)
put_ctx(event->ctx);
- if (event->pmu)
- module_put(event->pmu->module);
+ if (pmu) {
+ module_put(pmu->module);
+ scoped_guard (spinlock, &pmu->events_lock) {
+ list_del(&event->pmu_list);
+ wake_up_var(pmu);
+ }
+ }
call_rcu(&event->rcu_head, free_event_rcu);
}
@@ -5575,7 +5600,11 @@ int perf_event_release_kernel(struct per
* Thus this guarantees that we will in fact observe and kill _ALL_
* child events.
*/
- perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+ if (event->state > PERF_EVENT_STATE_REVOKED) {
+ perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+ } else {
+ event->state = PERF_EVENT_STATE_DEAD;
+ }
perf_event_ctx_unlock(event, ctx);
@@ -5628,7 +5657,7 @@ int perf_event_release_kernel(struct per
mutex_unlock(&ctx->mutex);
if (child)
- free_event(child);
+ put_event(child);
put_ctx(ctx);
goto again;
@@ -5863,7 +5892,7 @@ __perf_read(struct perf_event *event, ch
* error state (i.e. because it was pinned but it couldn't be
* scheduled on to the CPU at some point).
*/
- if (event->state == PERF_EVENT_STATE_ERROR)
+ if (event->state <= PERF_EVENT_STATE_ERROR)
return 0;
if (count < event->read_size)
@@ -5902,8 +5931,14 @@ static __poll_t perf_poll(struct file *f
struct perf_buffer *rb;
__poll_t events = EPOLLHUP;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return EPOLLERR;
+
poll_wait(file, &event->waitq, wait);
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return EPOLLERR;
+
if (is_event_hup(event))
return events;
@@ -6078,6 +6113,9 @@ static long _perf_ioctl(struct perf_even
void (*func)(struct perf_event *);
u32 flags = arg;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return -ENODEV;
+
switch (cmd) {
case PERF_EVENT_IOC_ENABLE:
func = _perf_event_enable;
@@ -6453,9 +6491,22 @@ void ring_buffer_put(struct perf_buffer
call_rcu(&rb->rcu_head, rb_free_rcu);
}
+typedef void (*mapped_f)(struct perf_event *event, struct mm_struct *mm);
+
+#define get_mapped(event, func) \
+({ struct pmu *pmu; \
+ mapped_f f = NULL; \
+ guard(rcu)(); \
+ pmu = READ_ONCE(event->pmu); \
+ if (pmu) \
+ f = pmu->func; \
+ f; \
+})
+
static void perf_mmap_open(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
+ mapped_f mapped = get_mapped(event, event_mapped);
atomic_inc(&event->mmap_count);
atomic_inc(&event->rb->mmap_count);
@@ -6463,8 +6514,8 @@ static void perf_mmap_open(struct vm_are
if (vma->vm_pgoff)
atomic_inc(&event->rb->aux_mmap_count);
- if (event->pmu->event_mapped)
- event->pmu->event_mapped(event, vma->vm_mm);
+ if (mapped)
+ mapped(event, vma->vm_mm);
}
static void perf_pmu_output_stop(struct perf_event *event);
@@ -6480,14 +6531,16 @@ static void perf_pmu_output_stop(struct
static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
+ mapped_f unmapped = get_mapped(event, event_unmapped);
struct perf_buffer *rb = ring_buffer_get(event);
struct user_struct *mmap_user = rb->mmap_user;
int mmap_locked = rb->mmap_locked;
unsigned long size = perf_data_size(rb);
bool detach_rest = false;
- if (event->pmu->event_unmapped)
- event->pmu->event_unmapped(event, vma->vm_mm);
+ /* FIXME vs perf_pmu_unregister() */
+ if (unmapped)
+ unmapped(event, vma->vm_mm);
/*
* The AUX buffer is strictly a sub-buffer, serialize using aux_mutex
@@ -6680,6 +6733,7 @@ static int perf_mmap(struct file *file,
unsigned long nr_pages;
long user_extra = 0, extra = 0;
int ret, flags = 0;
+ mapped_f mapped;
/*
* Don't allow mmap() of inherited per-task counters. This would
@@ -6710,6 +6764,16 @@ static int perf_mmap(struct file *file,
mutex_lock(&event->mmap_mutex);
ret = -EINVAL;
+ /*
+ * This relies on __pmu_detach_event() taking mmap_mutex after marking
+ * the event REVOKED. Either we observe the state, or __pmu_detach_event()
+ * will detach the rb created here.
+ */
+ if (event->state <= PERF_EVENT_STATE_REVOKED) {
+ ret = -ENODEV;
+ goto unlock;
+ }
+
if (vma->vm_pgoff == 0) {
nr_pages -= 1;
@@ -6888,8 +6952,9 @@ static int perf_mmap(struct file *file,
if (!ret)
ret = map_range(rb, vma);
- if (!ret && event->pmu->event_mapped)
- event->pmu->event_mapped(event, vma->vm_mm);
+ mapped = get_mapped(event, event_mapped);
+ if (mapped)
+ mapped(event, vma->vm_mm);
return ret;
}
@@ -6900,6 +6965,9 @@ static int perf_fasync(int fd, struct fi
struct perf_event *event = filp->private_data;
int retval;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return -ENODEV;
+
inode_lock(inode);
retval = fasync_helper(fd, filp, on, &event->fasync);
inode_unlock(inode);
@@ -10866,6 +10934,9 @@ static int __perf_event_set_bpf_prog(str
{
bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return -ENODEV;
+
if (!perf_event_is_tracing(event))
return perf_event_set_bpf_handler(event, prog, bpf_cookie);
@@ -12049,6 +12120,9 @@ int perf_pmu_register(struct pmu *_pmu,
if (!pmu->event_idx)
pmu->event_idx = perf_event_idx_default;
+ INIT_LIST_HEAD(&pmu->events);
+ spin_lock_init(&pmu->events_lock);
+
/*
* Now that the PMU is complete, make it visible to perf_try_init_event().
*/
@@ -12062,21 +12136,143 @@ int perf_pmu_register(struct pmu *_pmu,
}
EXPORT_SYMBOL_GPL(perf_pmu_register);
-void perf_pmu_unregister(struct pmu *pmu)
+static void __pmu_detach_event(struct pmu *pmu, struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ /*
+ * De-schedule the event and mark it REVOKED.
+ */
+ perf_event_exit_event(event, ctx, true);
+
+ /*
+ * All _free_event() bits that rely on event->pmu:
+ *
+ * Notably, perf_mmap() relies on the ordering here.
+ */
+ scoped_guard (mutex, &event->mmap_mutex) {
+ WARN_ON_ONCE(pmu->event_unmapped);
+ /*
+ * Mostly an empty lock sequence, such that perf_mmap(), which
+ * relies on mmap_mutex, is sure to observe the state change.
+ */
+ }
+
+ perf_event_free_bpf_prog(event);
+ perf_free_addr_filters(event);
+
+ if (event->destroy) {
+ event->destroy(event);
+ event->destroy = NULL;
+ }
+
+ if (event->pmu_ctx) {
+ put_pmu_ctx(event->pmu_ctx);
+ event->pmu_ctx = NULL;
+ }
+
+ exclusive_event_destroy(event);
+ module_put(pmu->module);
+
+ event->pmu = NULL; /* force fault instead of UAF */
+}
+
+static void pmu_detach_event(struct pmu *pmu, struct perf_event *event)
+{
+ struct perf_event_context *ctx;
+
+ ctx = perf_event_ctx_lock(event);
+ __pmu_detach_event(pmu, event, ctx);
+ perf_event_ctx_unlock(event, ctx);
+
+ scoped_guard (spinlock, &pmu->events_lock)
+ list_del(&event->pmu_list);
+}
+
+static struct perf_event *pmu_get_event(struct pmu *pmu)
+{
+ struct perf_event *event;
+
+ guard(spinlock)(&pmu->events_lock);
+ list_for_each_entry(event, &pmu->events, pmu_list) {
+ if (atomic_long_inc_not_zero(&event->refcount))
+ return event;
+ }
+
+ return NULL;
+}
+
+static bool pmu_empty(struct pmu *pmu)
+{
+ guard(spinlock)(&pmu->events_lock);
+ return list_empty(&pmu->events);
+}
+
+static void pmu_detach_events(struct pmu *pmu)
+{
+ struct perf_event *event;
+
+ for (;;) {
+ event = pmu_get_event(pmu);
+ if (!event)
+ break;
+
+ pmu_detach_event(pmu, event);
+ put_event(event);
+ }
+
+ /*
+ * wait for pending _free_event()s
+ */
+ wait_var_event(pmu, pmu_empty(pmu));
+}
+
+int perf_pmu_unregister(struct pmu *pmu)
{
scoped_guard (mutex, &pmus_lock) {
+ if (!idr_cmpxchg(&pmu_idr, pmu->type, pmu, NULL))
+ return -EINVAL;
+
list_del_rcu(&pmu->entry);
- idr_remove(&pmu_idr, pmu->type);
}
/*
* We dereference the pmu list under both SRCU and regular RCU, so
* synchronize against both of those.
+ *
+ * Notably, the entirety of event creation, from perf_init_event()
+ * (which will now fail, because of the above) until
+ * perf_install_in_context() should be under SRCU such that
+ * this synchronizes against event creation. This avoids trying to
+ * detach events that are not fully formed.
*/
synchronize_srcu(&pmus_srcu);
synchronize_rcu();
+ if (pmu->event_unmapped && !pmu_empty(pmu)) {
+ /*
+ * Can't force remove events when pmu::event_unmapped()
+ * is used in perf_mmap_close().
+ */
+ guard(mutex)(&pmus_lock);
+ idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu);
+ list_add_rcu(&pmu->entry, &pmus);
+ return -EBUSY;
+ }
+
+ scoped_guard (mutex, &pmus_lock)
+ idr_remove(&pmu_idr, pmu->type);
+
+ /*
+ * PMU is removed from the pmus list, so no new events will
+ * be created, now take care of the existing ones.
+ */
+ pmu_detach_events(pmu);
+
+ /*
+ * PMU is unused, make it go away.
+ */
perf_pmu_free(pmu);
+ return 0;
}
EXPORT_SYMBOL_GPL(perf_pmu_unregister);
@@ -12170,7 +12366,7 @@ static struct pmu *perf_init_event(struc
struct pmu *pmu;
int type, ret;
- guard(srcu)(&pmus_srcu);
+ guard(srcu)(&pmus_srcu); /* pmu idr/list access */
/*
* Save original type before calling pmu->event_init() since certain
@@ -12394,6 +12590,7 @@ perf_event_alloc(struct perf_event_attr
INIT_LIST_HEAD(&event->active_entry);
INIT_LIST_HEAD(&event->addr_filters.list);
INIT_HLIST_NODE(&event->hlist_entry);
+ INIT_LIST_HEAD(&event->pmu_list);
init_waitqueue_head(&event->waitq);
@@ -12561,6 +12758,13 @@ perf_event_alloc(struct perf_event_attr
/* symmetric to unaccount_event() in _free_event() */
account_event(event);
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ lockdep_assert_held(&pmus_srcu);
+ scoped_guard (spinlock, &pmu->events_lock)
+ list_add(&event->pmu_list, &pmu->events);
+
return_ptr(event);
}
@@ -12760,6 +12964,9 @@ perf_event_set_output(struct perf_event
goto unlock;
if (output_event) {
+ if (output_event->state <= PERF_EVENT_STATE_REVOKED)
+ goto unlock;
+
/* get the rb we want to redirect to */
rb = ring_buffer_get(output_event);
if (!rb)
@@ -12941,6 +13148,11 @@ SYSCALL_DEFINE5(perf_event_open,
if (event_fd < 0)
return event_fd;
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ guard(srcu)(&pmus_srcu);
+
CLASS(fd, group)(group_fd); // group_fd == -1 => empty
if (group_fd != -1) {
if (!is_perf_file(group)) {
@@ -12948,6 +13160,10 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_fd;
}
group_leader = fd_file(group)->private_data;
+ if (group_leader->state <= PERF_EVENT_STATE_REVOKED) {
+ err = -ENODEV;
+ goto err_fd;
+ }
if (flags & PERF_FLAG_FD_OUTPUT)
output_event = group_leader;
if (flags & PERF_FLAG_FD_NO_GROUP)
@@ -13281,6 +13497,11 @@ perf_event_create_kernel_counter(struct
if (attr->aux_output || attr->aux_action)
return ERR_PTR(-EINVAL);
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ guard(srcu)(&pmus_srcu);
+
event = perf_event_alloc(attr, cpu, task, NULL, NULL,
overflow_handler, context, -1);
if (IS_ERR(event)) {
@@ -13492,10 +13713,14 @@ static void sync_child_event(struct perf
}
static void
-perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
+perf_event_exit_event(struct perf_event *event,
+ struct perf_event_context *ctx, bool revoke)
{
struct perf_event *parent_event = event->parent;
- unsigned long detach_flags = 0;
+ unsigned long detach_flags = DETACH_EXIT;
+
+ if (parent_event == EVENT_TOMBSTONE)
+ parent_event = NULL;
if (parent_event) {
/*
@@ -13510,16 +13735,14 @@ perf_event_exit_event(struct perf_event
* Do destroy all inherited groups, we don't care about those
* and being thorough is better.
*/
- detach_flags = DETACH_GROUP | DETACH_CHILD;
+ detach_flags |= DETACH_GROUP | DETACH_CHILD;
mutex_lock(&parent_event->child_mutex);
}
- perf_remove_from_context(event, detach_flags);
+ if (revoke)
+ detach_flags |= DETACH_GROUP | DETACH_REVOKE;
- raw_spin_lock_irq(&ctx->lock);
- if (event->state > PERF_EVENT_STATE_EXIT)
- perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
- raw_spin_unlock_irq(&ctx->lock);
+ perf_remove_from_context(event, detach_flags);
/*
* Child events can be freed.
@@ -13530,7 +13753,10 @@ perf_event_exit_event(struct perf_event
* Kick perf_poll() for is_event_hup();
*/
perf_event_wakeup(parent_event);
- free_event(event);
+ /*
+ * pmu_detach_event() will have an extra refcount.
+ */
+ put_event(event);
put_event(parent_event);
return;
}
@@ -13596,7 +13822,7 @@ static void perf_event_exit_task_context
perf_event_task(task, ctx, 0);
list_for_each_entry_safe(child_event, next, &ctx->event_list, event_entry)
- perf_event_exit_event(child_event, ctx);
+ perf_event_exit_event(child_event, ctx, false);
mutex_unlock(&ctx->mutex);
@@ -13743,6 +13969,14 @@ inherit_event(struct perf_event *parent_
if (parent_event->parent)
parent_event = parent_event->parent;
+ if (parent_event->state <= PERF_EVENT_STATE_REVOKED)
+ return NULL;
+
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ guard(srcu)(&pmus_srcu);
+
child_event = perf_event_alloc(&parent_event->attr,
parent_event->cpu,
child,
^ permalink raw reply [flat|nested] 43+ messages in thread
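The pmu_get_event()/pmu_detach_events() loop in the patch above can be sketched as a userspace model (a minimal sketch: the list walk and atomic_long_inc_not_zero() are modeled with C11 atomics, events_lock and RCU are elided, and all names here are stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures in the patch. */
struct event {
	atomic_long refcount;
	struct event *next;		/* models ->pmu_list */
};

struct pmu {
	struct event *events;		/* head of the ->events list */
};

/* atomic_long_inc_not_zero() analogue: take a reference only while
 * the event is still live; refcount 0 means a free is in flight. */
static int event_get_not_zero(struct event *e)
{
	long old = atomic_load(&e->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&e->refcount, &old, old + 1))
			return 1;
	}
	return 0;
}

/* pmu_get_event() analogue: first event we can still pin, or NULL;
 * dying events (refcount already 0) are skipped, which is why the
 * unregister path must separately wait for pending frees. */
static struct event *pmu_get_event(struct pmu *pmu)
{
	for (struct event *e = pmu->events; e; e = e->next) {
		if (event_get_not_zero(e))
			return e;
	}
	return NULL;
}
```

The skip-if-zero behaviour is the reason pmu_detach_events() ends with a wait_var_event(): an event whose refcount already hit zero is invisible to the loop but has not necessarily left the list yet.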
* Re: [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-07 19:33 ` [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child) Peter Zijlstra
@ 2025-03-10 11:08 ` Ravi Bangoria
2025-03-10 14:47 ` Peter Zijlstra
2025-03-10 15:37 ` [PATCH v3a " Peter Zijlstra
1 sibling, 1 reply; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-10 11:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
Hi Peter,
On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
> The task passed to perf_event_exit_task() is not a child, it is
> current. Fix this confusing naming, since much of the rest of the code
> also relies on it being current.
>
> Specifically, both exec() and exit() callers use it with current as
> the argument.
...
> -static void perf_event_exit_task_context(struct task_struct *child, bool exit)
> +static void perf_event_exit_task_context(struct task_struct *task, bool exit)
> {
> - struct perf_event_context *child_ctx, *clone_ctx = NULL;
> + struct perf_event_context *ctx, *clone_ctx = NULL;
> struct perf_event *child_event, *next;
>
> - WARN_ON_ONCE(child != current);
> + WARN_ON_ONCE(task != current);
exec() codepath (i.e. copy_process()) passes child pointer, not 'current'.
Thanks,
Ravi
* Re: [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-10 11:08 ` Ravi Bangoria
@ 2025-03-10 14:47 ` Peter Zijlstra
2025-03-10 15:20 ` Ravi Bangoria
0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-10 14:47 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
On Mon, Mar 10, 2025 at 04:38:36PM +0530, Ravi Bangoria wrote:
> Hi Peter,
>
> On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
> > The task passed to perf_event_exit_task() is not a child, it is
> > current. Fix this confusing naming, since much of the rest of the code
> > also relies on it being current.
> >
> > Specifically, both exec() and exit() callers use it with current as
> > the argument.
>
> ...
>
> > -static void perf_event_exit_task_context(struct task_struct *child, bool exit)
> > +static void perf_event_exit_task_context(struct task_struct *task, bool exit)
> > {
> > - struct perf_event_context *child_ctx, *clone_ctx = NULL;
> > + struct perf_event_context *ctx, *clone_ctx = NULL;
> > struct perf_event *child_event, *next;
> >
> > - WARN_ON_ONCE(child != current);
> > + WARN_ON_ONCE(task != current);
>
> exec() codepath (i.e. copy_process()) passes child pointer, not 'current'.
I am confused, this is not a new warning. Also, copy_process() is clone(),
exec() is another code path.
* Re: [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-10 14:47 ` Peter Zijlstra
@ 2025-03-10 15:20 ` Ravi Bangoria
2025-03-10 15:27 ` Peter Zijlstra
0 siblings, 1 reply; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-10 15:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
On 10-Mar-25 8:17 PM, Peter Zijlstra wrote:
> On Mon, Mar 10, 2025 at 04:38:36PM +0530, Ravi Bangoria wrote:
>> Hi Peter,
>>
>> On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
>>> The task passed to perf_event_exit_task() is not a child, it is
>>> current. Fix this confusing naming, since much of the rest of the code
>>> also relies on it being current.
>>>
>>> Specifically, both exec() and exit() callers use it with current as
>>> the argument.
>>
>> ...
>>
>>> -static void perf_event_exit_task_context(struct task_struct *child, bool exit)
>>> +static void perf_event_exit_task_context(struct task_struct *task, bool exit)
>>> {
>>> - struct perf_event_context *child_ctx, *clone_ctx = NULL;
>>> + struct perf_event_context *ctx, *clone_ctx = NULL;
>>> struct perf_event *child_event, *next;
>>>
>>> - WARN_ON_ONCE(child != current);
>>> + WARN_ON_ONCE(task != current);
>>
>> exec() codepath (i.e. copy_process()) passes child pointer, not 'current'.
>
> I am confused, this is not a new warning.
Right. However, the WARN was present only in perf_event_exit_task_context()
before it was merged with perf_event_free_task() (patch #5). And
perf_event_free_task() is called for the child task.
> Also, copy_process() is clone(), exec() is another code path.
My bad. I meant clone() code path:
copy_process()
p = dup_task_struct(current);
perf_event_init_task(p);
perf_event_free_task(p);
perf_event_exit_task_context(p);
WARN_ON_ONCE(task != current);
Another one:
copy_process()
p = dup_task_struct(current);
...
bad_fork_cleanup_perf:
perf_event_free_task(p);
perf_event_exit_task_context(p);
WARN_ON_ONCE(task != current);
Or am I missing something?
Thanks,
Ravi
* Re: [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-10 15:20 ` Ravi Bangoria
@ 2025-03-10 15:27 ` Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-10 15:27 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
On Mon, Mar 10, 2025 at 08:50:55PM +0530, Ravi Bangoria wrote:
> On 10-Mar-25 8:17 PM, Peter Zijlstra wrote:
> > On Mon, Mar 10, 2025 at 04:38:36PM +0530, Ravi Bangoria wrote:
> >> Hi Peter,
> >>
> >> On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
> >>> The task passed to perf_event_exit_task() is not a child, it is
> >>> current. Fix this confusing naming, since much of the rest of the code
> >>> also relies on it being current.
> >>>
> >>> Specifically, both exec() and exit() callers use it with current as
> >>> the argument.
> >>
> >> ...
> >>
> >>> -static void perf_event_exit_task_context(struct task_struct *child, bool exit)
> >>> +static void perf_event_exit_task_context(struct task_struct *task, bool exit)
> >>> {
> >>> - struct perf_event_context *child_ctx, *clone_ctx = NULL;
> >>> + struct perf_event_context *ctx, *clone_ctx = NULL;
> >>> struct perf_event *child_event, *next;
> >>>
> >>> - WARN_ON_ONCE(child != current);
> >>> + WARN_ON_ONCE(task != current);
> >>
> >> exec() codepath (i.e. copy_process()) passes child pointer, not 'current'.
> >
> > I am confused, this is not a new warning.
>
> Right. However, the WARN was present only in perf_event_exit_task_context()
> before it was merged with perf_event_free_task() (patch #5). And
> perf_event_free_task() is called for the child task.
Argh, yes.
> > Also, copy_process() is clone(), exec() is another code path.
>
> My bad. I meant clone() code path:
>
> copy_process()
> p = dup_task_struct(current);
> perf_event_init_task(p);
> perf_event_free_task(p);
> perf_event_exit_task_context(p);
> WARN_ON_ONCE(task != current);
>
> Another one:
>
> copy_process()
> p = dup_task_struct(current);
> ...
> bad_fork_cleanup_perf:
> perf_event_free_task(p);
> perf_event_exit_task_context(p);
> WARN_ON_ONCE(task != current);
>
> Or am I missing something?
No, the perf_event_free_task() callchain has a problem.
I'll remove that WARN_ON_ONCE() since perf_event_exit_task() has the
same check. I'll do that in the merge patch, not this rename patch.
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
@ 2025-03-10 15:35 ` Ravi Bangoria
2025-03-10 16:14 ` Peter Zijlstra
2025-03-10 16:46 ` Ravi Bangoria
` (2 subsequent siblings)
3 siblings, 1 reply; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-10 15:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
> @@ -207,6 +207,7 @@ static void perf_ctx_unlock(struct perf_
> }
>
> #define TASK_TOMBSTONE ((void *)-1L)
> +#define EVENT_TOMBSTONE ((void *)-1L)
>
> static bool is_kernel_event(struct perf_event *event)
> {
> @@ -2348,6 +2349,11 @@ static void perf_child_detach(struct per
>
> sync_child_event(event);
> list_del_init(&event->child_list);
> + /*
> + * Cannot set to NULL, as that would confuse the situation vs
> + * not being a child event. See for example unaccount_event().
> + */
> + event->parent = EVENT_TOMBSTONE;
This will cause issues where we do `event = event->parent`. No? For ex:
perf_pmu_unregister()
...
perf_event_exit_event()
perf_event_wakeup()
ring_buffer_wakeup()
if (event->parent)
event = event->parent;
Thanks,
Ravi
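The NULL-vs-tombstone distinction behind EVENT_TOMBSTONE, and the hazard Ravi points out, can be illustrated with a small sketch (the helper names are hypothetical; only ->parent is modeled). A NULL parent means the event never was a child; the tombstone means it was a child whose parent is gone, so "is this a child?" checks keep working but the pointer must never be followed:

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_TOMBSTONE ((void *)-1L)

struct event {
	struct event *parent;
};

/* Accounting-style check (cf. unaccount_event()): a detached child
 * must still count as a child, which is why ->parent cannot simply
 * be set to NULL on detach. */
static int is_child_event(const struct event *e)
{
	return e->parent != NULL;
}

/* The pattern Ravi flags: code doing 'event = event->parent' must
 * filter the tombstone before dereferencing, or it chases (void *)-1. */
static struct event *event_parent(struct event *e)
{
	struct event *p = e->parent;

	if (p == EVENT_TOMBSTONE)
		return NULL;
	return p;
}
```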
* [PATCH v3a 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context()
2025-03-07 19:33 ` [PATCH v3 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context() Peter Zijlstra
@ 2025-03-10 15:35 ` Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-10 15:35 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang
Both perf_event_free_task() and perf_event_exit_task_context() are
very similar, except perf_event_exit_task_context() is a little more
generic / makes fewer assumptions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 90 ++++++++++++---------------------------------------
1 file changed, 22 insertions(+), 68 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13546,13 +13546,11 @@ perf_event_exit_event(struct perf_event
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child)
+static void perf_event_exit_task_context(struct task_struct *child, bool exit)
{
struct perf_event_context *child_ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
- WARN_ON_ONCE(child != current);
-
child_ctx = perf_pin_task_context(child);
if (!child_ctx)
return;
@@ -13597,13 +13595,31 @@ static void perf_event_exit_task_context
* won't get any samples after PERF_RECORD_EXIT. We can however still
* get a few PERF_RECORD_READ events.
*/
- perf_event_task(child, child_ctx, 0);
+ if (exit)
+ perf_event_task(child, child_ctx, 0);
list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
perf_event_exit_event(child_event, child_ctx);
mutex_unlock(&child_ctx->mutex);
+ if (!exit) {
+ /*
+ * perf_event_release_kernel() could still have a reference on
+ * this context. In that case we must wait for these events to
+ * have been freed (in particular all their references to this
+ * task must've been dropped).
+ *
+ * Without this copy_process() will unconditionally free this
+ * task (irrespective of its reference count) and
+ * _free_event()'s put_task_struct(event->hw.target) will be a
+ * use-after-free.
+ *
+ * Wait for all events to drop their context reference.
+ */
+ wait_var_event(&child_ctx->refcount,
+ refcount_read(&child_ctx->refcount) == 1);
+ }
put_ctx(child_ctx);
}
@@ -13631,7 +13647,7 @@ void perf_event_exit_task(struct task_st
}
mutex_unlock(&child->perf_event_mutex);
- perf_event_exit_task_context(child);
+ perf_event_exit_task_context(child, true);
/*
* The perf_event_exit_task_context calls perf_event_task
@@ -13642,27 +13658,6 @@ void perf_event_exit_task(struct task_st
perf_event_task(child, NULL, 0);
}
-static void perf_free_event(struct perf_event *event,
- struct perf_event_context *ctx)
-{
- struct perf_event *parent = event->parent;
-
- if (WARN_ON_ONCE(!parent))
- return;
-
- mutex_lock(&parent->child_mutex);
- list_del_init(&event->child_list);
- mutex_unlock(&parent->child_mutex);
-
- put_event(parent);
-
- raw_spin_lock_irq(&ctx->lock);
- perf_group_detach(event);
- list_del_event(event, ctx);
- raw_spin_unlock_irq(&ctx->lock);
- free_event(event);
-}
-
/*
* Free a context as created by inheritance by perf_event_init_task() below,
* used by fork() in case of fail.
@@ -13672,48 +13667,7 @@ static void perf_free_event(struct perf_
*/
void perf_event_free_task(struct task_struct *task)
{
- struct perf_event_context *ctx;
- struct perf_event *event, *tmp;
-
- ctx = rcu_access_pointer(task->perf_event_ctxp);
- if (!ctx)
- return;
-
- mutex_lock(&ctx->mutex);
- raw_spin_lock_irq(&ctx->lock);
- /*
- * Destroy the task <-> ctx relation and mark the context dead.
- *
- * This is important because even though the task hasn't been
- * exposed yet the context has been (through child_list).
- */
- RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
- WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(task); /* cannot be last */
- raw_spin_unlock_irq(&ctx->lock);
-
-
- list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
- perf_free_event(event, ctx);
-
- mutex_unlock(&ctx->mutex);
-
- /*
- * perf_event_release_kernel() could've stolen some of our
- * child events and still have them on its free_list. In that
- * case we must wait for these events to have been freed (in
- * particular all their references to this task must've been
- * dropped).
- *
- * Without this copy_process() will unconditionally free this
- * task (irrespective of its reference count) and
- * _free_event()'s put_task_struct(event->hw.target) will be a
- * use-after-free.
- *
- * Wait for all events to drop their context reference.
- */
- wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
- put_ctx(ctx); /* must be last */
+ perf_event_exit_task_context(task, false);
}
void perf_event_delayed_put(struct task_struct *task)
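The wait_var_event(&ctx->refcount, ...) step that both paths now share can be sketched in userspace with a condition variable (a rough analogue with stand-in names; the real kernel primitive hashes on the variable's address instead of embedding a condvar in the object):

```c
#include <assert.h>
#include <pthread.h>

struct ctx {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int refcount;
};

/* put_ctx() analogue: drop a reference and wake any waiter, the way
 * the final put on the kernel side would via wake_up_var(). */
static void ctx_put(struct ctx *c)
{
	pthread_mutex_lock(&c->lock);
	c->refcount--;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* wait_var_event(&ctx->refcount, refcount_read(..) == 1) analogue:
 * block until the caller holds the only remaining reference, so the
 * task can be freed without a use-after-free in _free_event(). */
static void ctx_wait_last_ref(struct ctx *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->refcount != 1)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```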
* [PATCH v3a 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-07 19:33 ` [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child) Peter Zijlstra
2025-03-10 11:08 ` Ravi Bangoria
@ 2025-03-10 15:37 ` Peter Zijlstra
2025-03-12 6:31 ` Ravi Bangoria
1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-10 15:37 UTC (permalink / raw)
To: mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang
The task passed to perf_event_exit_task() is not a child, it is
current. Fix this confusing naming, since much of the rest of the code
also relies on it being current.
Specifically, both exec() and exit() callers use it with current as
the argument.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 54 ++++++++++++++++++++++++++-------------------------
1 file changed, 28 insertions(+), 26 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13551,8 +13551,8 @@ static void perf_event_exit_task_context
struct perf_event_context *child_ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
- child_ctx = perf_pin_task_context(child);
- if (!child_ctx)
+ ctx = perf_pin_task_context(task);
+ if (!ctx)
return;
/*
@@ -13565,27 +13565,27 @@ static void perf_event_exit_task_context
* without ctx::mutex (it cannot because of the move_group double mutex
* lock thing). See the comments in perf_install_in_context().
*/
- mutex_lock(&child_ctx->mutex);
+ mutex_lock(&ctx->mutex);
/*
* In a single ctx::lock section, de-schedule the events and detach the
* context from the task such that we cannot ever get it scheduled back
* in.
*/
- raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
+ raw_spin_lock_irq(&ctx->lock);
+ task_ctx_sched_out(ctx, NULL, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
* and mark the context dead.
*/
- RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
- put_ctx(child_ctx); /* cannot be last */
- WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
+ RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+ put_ctx(ctx); /* cannot be last */
+ WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
put_task_struct(current); /* cannot be last */
- clone_ctx = unclone_ctx(child_ctx);
- raw_spin_unlock_irq(&child_ctx->lock);
+ clone_ctx = unclone_ctx(ctx);
+ raw_spin_unlock_irq(&ctx->lock);
if (clone_ctx)
put_ctx(clone_ctx);
@@ -13596,12 +13596,12 @@ static void perf_event_exit_task_context
* get a few PERF_RECORD_READ events.
*/
if (exit)
- perf_event_task(child, child_ctx, 0);
+ perf_event_task(task, ctx, 0);
- list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
- perf_event_exit_event(child_event, child_ctx);
+ list_for_each_entry_safe(child_event, next, &ctx->event_list, event_entry)
+ perf_event_exit_event(child_event, ctx);
- mutex_unlock(&child_ctx->mutex);
+ mutex_unlock(&ctx->mutex);
if (!exit) {
/*
@@ -13617,24 +13617,26 @@ static void perf_event_exit_task_context
*
* Wait for all events to drop their context reference.
*/
- wait_var_event(&child_ctx->refcount,
- refcount_read(&child_ctx->refcount) == 1);
+ wait_var_event(&ctx->refcount,
+ refcount_read(&ctx->refcount) == 1);
}
- put_ctx(child_ctx);
+ put_ctx(ctx);
}
/*
- * When a child task exits, feed back event values to parent events.
+ * When a task exits, feed back event values to parent events.
*
* Can be called with exec_update_lock held when called from
* setup_new_exec().
*/
-void perf_event_exit_task(struct task_struct *child)
+void perf_event_exit_task(struct task_struct *task)
{
struct perf_event *event, *tmp;
- mutex_lock(&child->perf_event_mutex);
- list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+ WARN_ON_ONCE(task != current);
+
+ mutex_lock(&task->perf_event_mutex);
+ list_for_each_entry_safe(event, tmp, &task->perf_event_list,
owner_entry) {
list_del_init(&event->owner_entry);
@@ -13645,17 +13647,17 @@ void perf_event_exit_task(struct task_st
*/
smp_store_release(&event->owner, NULL);
}
- mutex_unlock(&child->perf_event_mutex);
+ mutex_unlock(&task->perf_event_mutex);
- perf_event_exit_task_context(child, true);
+ perf_event_exit_task_context(task, true);
/*
* The perf_event_exit_task_context calls perf_event_task
- * with child's task_ctx, which generates EXIT events for
- * child contexts and sets child->perf_event_ctxp[] to NULL.
+ * with task's task_ctx, which generates EXIT events for
+ * task contexts and sets task->perf_event_ctxp[] to NULL.
* At this point we need to send EXIT events to cpu contexts.
*/
- perf_event_task(child, NULL, 0);
+ perf_event_task(task, NULL, 0);
}
/*
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-10 15:35 ` Ravi Bangoria
@ 2025-03-10 16:14 ` Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-10 16:14 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
On Mon, Mar 10, 2025 at 09:05:45PM +0530, Ravi Bangoria wrote:
> > @@ -207,6 +207,7 @@ static void perf_ctx_unlock(struct perf_
> > }
> >
> > #define TASK_TOMBSTONE ((void *)-1L)
> > +#define EVENT_TOMBSTONE ((void *)-1L)
> >
> > static bool is_kernel_event(struct perf_event *event)
> > {
> > @@ -2348,6 +2349,11 @@ static void perf_child_detach(struct per
> >
> > sync_child_event(event);
> > list_del_init(&event->child_list);
> > + /*
> > + * Cannot set to NULL, as that would confuse the situation vs
> > + * not being a child event. See for example unaccount_event().
> > + */
> > + event->parent = EVENT_TOMBSTONE;
>
> This will cause issues where we do `event = event->parent`. No? For ex:
>
> perf_pmu_unregister()
> ...
> perf_event_exit_event()
> perf_event_wakeup()
> ring_buffer_wakeup()
> if (event->parent)
> event = event->parent;
It will. However, if we do not do this, we have a potential
use-after-free, and people seem to take a dim view of those too :-)
I had convinced myself all paths that were doing this 'event =
event->parent' were unreachable by the time we set the TOMBSTONE, but
clearly I missed one.
I suppose we can do something like so, not really pretty though.
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13722,9 +13722,12 @@ perf_event_exit_event(struct perf_event
{
struct perf_event *parent_event = event->parent;
unsigned long detach_flags = DETACH_EXIT;
+ bool parent_dead = false;
- if (parent_event == EVENT_TOMBSTONE)
+ if (parent_event == EVENT_TOMBSTONE) {
parent_event = NULL;
+ parent_dead = true;
+ }
if (parent_event) {
/*
@@ -13748,6 +13751,9 @@ perf_event_exit_event(struct perf_event
perf_remove_from_context(event, detach_flags);
+ if (parent_dead)
+ return;
+
/*
* Child events can be freed.
*/
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
2025-03-10 15:35 ` Ravi Bangoria
@ 2025-03-10 16:46 ` Ravi Bangoria
2025-03-12 12:57 ` Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-04-14 0:37 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Mi, Dapeng
3 siblings, 1 reply; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-10 16:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
> Previously it was only safe to call perf_pmu_unregister() if there
> were no active events of that pmu around -- which was impossible to
> guarantee since it races all sorts against perf_init_event().
>
> Rework the whole thing by:
>
> - keeping track of all events for a given pmu
>
> - 'hiding' the pmu from perf_init_event()
>
> - waiting for the appropriate (s)rcu grace periods such that all
> prior references to the PMU will be completed
>
> - detaching all still existing events of that pmu (see first point)
> and moving them to a new REVOKED state.
>
> - actually freeing the pmu data.
>
> Where notably the new REVOKED state must inhibit all event actions
> from reaching code that wants to use event->pmu.
I think the perf_event_init_task() failure path can still race with
perf_pmu_unregister() and trigger a WARN():
CPU 1 CPU 2
perf_event_init_task()
perf_event_free_task()
perf_free_event(event1)
/* event1->refcount is 1 */
perf_pmu_unregister()
pmu_detach_events()
pmu_get_event(pmu) /* Picks event1 */
atomic_long_inc_not_zero(&event1->refcount) /* event1 */
/* event1->refcount became 2 (by CPU 2) */
free_event(event1)
WARN()
I'll schedule a perf_fuzzer + tinypmu test overnight and see if it
reports any other issues.

Thanks,
Ravi
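The window described in this report can be reduced to a single-counter model (hypothetical names; a sketch, not the kernel code): CPU 2's free_event() expects to be dropping the last reference, but CPU 1's pmu_get_event() has transiently raised it, and inc_not_zero only succeeds because the free has not yet zeroed the count:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_long refcount;

/* atomic_long_inc_not_zero() analogue: succeeds only while the
 * count is non-zero, i.e. while the free is not yet in flight. */
static int inc_not_zero(atomic_long *r)
{
	long old = atomic_load(r);

	while (old != 0) {
		if (atomic_compare_exchange_weak(r, &old, old + 1))
			return 1;
	}
	return 0;
}

/* free_event() WARNs unless the caller owns the final reference. */
static int free_event_would_warn(atomic_long *r)
{
	return atomic_load(r) != 1;
}
```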
* Re: [PATCH v3a 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-10 15:37 ` [PATCH v3a " Peter Zijlstra
@ 2025-03-12 6:31 ` Ravi Bangoria
2025-03-12 10:16 ` Peter Zijlstra
0 siblings, 1 reply; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-12 6:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
Hi Peter,
> The task passed to perf_event_exit_task() is not a child, it is
> current. Fix this confusing naming, since much of the rest of the code
> also relies on it being current.
>
> Specifically, both exec() and exit() callers use it with current as
> the argument.
When perf_event_exit_task_context() gets called by perf_event_free_task():
1) task_ctx_sched_out(ctx) should be avoided, because the 'ctx'
   argument belongs to the (half-baked) child task, whereas
   task_ctx_sched_out() expects 'ctx' to be the context of 'current'.
2) Similarly, the 'task' argument != 'current'.
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13573,7 +13573,8 @@ static void perf_event_exit_task_context(struct task_struct *task, bool exit)
* in.
*/
raw_spin_lock_irq(&ctx->lock);
- task_ctx_sched_out(ctx, NULL, EVENT_ALL);
+ if (exit)
+ task_ctx_sched_out(ctx, NULL, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
@@ -13582,7 +13583,7 @@ static void perf_event_exit_task_context(struct task_struct *task, bool exit)
RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
put_ctx(ctx); /* cannot be last */
WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(current); /* cannot be last */
+ put_task_struct(task); /* cannot be last */
clone_ctx = unclone_ctx(ctx);
raw_spin_unlock_irq(&ctx->lock);
Thanks,
Ravi
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3a 6/7] perf: Rename perf_event_exit_task(.child)
2025-03-12 6:31 ` Ravi Bangoria
@ 2025-03-12 10:16 ` Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-12 10:16 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
On Wed, Mar 12, 2025 at 12:01:00PM +0530, Ravi Bangoria wrote:
> Hi Peter,
>
> > The task passed to perf_event_exit_task() is not a child, it is
> > current. Fix this confusing naming, since much of the rest of the code
> > also relies on it being current.
> >
> > Specifically, both exec() and exit() callers use it with current as
> > the argument.
>
> When perf_event_exit_task_context() gets called by perf_event_free_task():
>
> 1) task_ctx_sched_out(ctx) should be avoided, because the 'ctx'
>    argument belongs to the (half-baked) child task, whereas
>    task_ctx_sched_out() expects 'ctx' to be the context of 'current'.
> 2) Similarly, the 'task' argument != 'current'.
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -13573,7 +13573,8 @@ static void perf_event_exit_task_context(struct task_struct *task, bool exit)
> * in.
> */
> raw_spin_lock_irq(&ctx->lock);
> - task_ctx_sched_out(ctx, NULL, EVENT_ALL);
> + if (exit)
> + task_ctx_sched_out(ctx, NULL, EVENT_ALL);
>
> /*
> * Now that the context is inactive, destroy the task <-> ctx relation
> @@ -13582,7 +13583,7 @@ static void perf_event_exit_task_context(struct task_struct *task, bool exit)
> RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
> put_ctx(ctx); /* cannot be last */
> WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
> - put_task_struct(current); /* cannot be last */
> + put_task_struct(task); /* cannot be last */
>
> clone_ctx = unclone_ctx(ctx);
> raw_spin_unlock_irq(&ctx->lock);
Right you are. Thanks!
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-10 16:46 ` Ravi Bangoria
@ 2025-03-12 12:57 ` Peter Zijlstra
2025-03-12 13:57 ` Ravi Bangoria
0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-03-12 12:57 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
On Mon, Mar 10, 2025 at 10:16:09PM +0530, Ravi Bangoria wrote:
> On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
> > Previously it was only safe to call perf_pmu_unregister() if there
> > were no active events of that pmu around -- which was impossible to
> > guarantee since it races all sorts against perf_init_event().
> >
> > Rework the whole thing by:
> >
> > - keeping track of all events for a given pmu
> >
> > - 'hiding' the pmu from perf_init_event()
> >
> > - waiting for the appropriate (s)rcu grace periods such that all
> > prior references to the PMU will be completed
> >
> > - detaching all still existing events of that pmu (see first point)
> > and moving them to a new REVOKED state.
> >
> > - actually freeing the pmu data.
> >
> > Where notably the new REVOKED state must inhibit all event actions
> > from reaching code that wants to use event->pmu.
>
> I think perf_event_init_task() failure path can still race with
> perf_pmu_unregister() and trigger a WARN():
>
> CPU 1                                          CPU 2
>
> perf_event_init_task()
>   perf_event_free_task()
>     perf_free_event(event1)
>     /* event1->refcount is 1 */
>                                                perf_pmu_unregister()
>                                                  pmu_detach_events()
>                                                    pmu_get_event(pmu) /* Picks event1 */
>                                                      atomic_long_inc_not_zero(&event1->refcount) /* event1 */
>     /* event1->refcount became 2 (by CPU 2) */
>     free_event(event1)
>       WARN()
>
Yeah, I think most free_event() users are broken at this point. Let me
rip all that out. Not really worth the trouble anymore.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-12 12:57 ` Peter Zijlstra
@ 2025-03-12 13:57 ` Ravi Bangoria
0 siblings, 0 replies; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-12 13:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
Peter,
On 12-Mar-25 6:27 PM, Peter Zijlstra wrote:
> On Mon, Mar 10, 2025 at 10:16:09PM +0530, Ravi Bangoria wrote:
>> On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
>>> Previously it was only safe to call perf_pmu_unregister() if there
>>> were no active events of that pmu around -- which was impossible to
>>> guarantee since it races all sorts against perf_init_event().
>>>
>>> Rework the whole thing by:
>>>
>>> - keeping track of all events for a given pmu
>>>
>>> - 'hiding' the pmu from perf_init_event()
>>>
>>> - waiting for the appropriate (s)rcu grace periods such that all
>>> prior references to the PMU will be completed
>>>
>>> - detaching all still existing events of that pmu (see first point)
>>> and moving them to a new REVOKED state.
>>>
>>> - actually freeing the pmu data.
>>>
>>> Where notably the new REVOKED state must inhibit all event actions
>>> from reaching code that wants to use event->pmu.
>>
>> I think perf_event_init_task() failure path can still race with
>> perf_pmu_unregister() and trigger a WARN():
>>
>> CPU 1                                          CPU 2
>>
>> perf_event_init_task()
>>   perf_event_free_task()
>>     perf_free_event(event1)
>>     /* event1->refcount is 1 */
>>                                                perf_pmu_unregister()
>>                                                  pmu_detach_events()
>>                                                    pmu_get_event(pmu) /* Picks event1 */
>>                                                      atomic_long_inc_not_zero(&event1->refcount) /* event1 */
>>     /* event1->refcount became 2 (by CPU 2) */
>>     free_event(event1)
>>       WARN()
>>
Please ignore this.
I failed to notice that you have already replaced perf_free_event()
with perf_event_exit_event(), and also replaced free_event() with
put_event() in perf_event_exit_event().
Thanks,
Ravi
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait
2025-03-07 19:33 ` [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait Peter Zijlstra
@ 2025-03-17 6:49 ` Ravi Bangoria
2025-04-02 9:15 ` Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-17 6:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
Hi Peter,
> @@ -1223,8 +1223,14 @@ static void put_ctx(struct perf_event_co
> if (refcount_dec_and_test(&ctx->refcount)) {
> if (ctx->parent_ctx)
> put_ctx(ctx->parent_ctx);
> - if (ctx->task && ctx->task != TASK_TOMBSTONE)
> - put_task_struct(ctx->task);
> + if (ctx->task) {
> + if (ctx->task == TASK_TOMBSTONE) {
> + smp_mb(); /* pairs with wait_var_event() */
> + wake_up_var(&ctx->refcount);
perf_event_free_task() waits on "ctx->refcount == 1", but moving
wake_up_var() under refcount_dec_and_test() will cause
perf_event_free_task() to wait indefinitely, right? So shouldn't
wake_up_var() be outside? Something like:
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1281,15 +1281,14 @@ static void put_ctx(struct perf_event_context *ctx)
if (refcount_dec_and_test(&ctx->refcount)) {
if (ctx->parent_ctx)
put_ctx(ctx->parent_ctx);
- if (ctx->task) {
- if (ctx->task == TASK_TOMBSTONE) {
- smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(&ctx->refcount);
- } else {
- put_task_struct(ctx->task);
- }
- }
+ if (ctx->task && ctx->task != TASK_TOMBSTONE)
+ put_task_struct(ctx->task);
call_rcu(&ctx->rcu_head, free_ctx);
+ } else {
+ if (ctx->task == TASK_TOMBSTONE) {
+ smp_mb(); /* pairs with wait_var_event() */
+ wake_up_var(&ctx->refcount);
+ }
}
}
Thanks,
Ravi
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (6 preceding siblings ...)
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
@ 2025-03-17 6:54 ` Ravi Bangoria
2025-04-08 19:05 ` [tip: perf/core] perf: Rename perf_event_exit_task(.child) tip-bot2 for Peter Zijlstra
` (2 subsequent siblings)
10 siblings, 0 replies; 43+ messages in thread
From: Ravi Bangoria @ 2025-03-17 6:54 UTC (permalink / raw)
To: Peter Zijlstra, mingo
Cc: lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
Ravi Bangoria
Hi Peter,
On 08-Mar-25 1:03 AM, Peter Zijlstra wrote:
> Hi!
>
> Much smaller this time because Ingo merged a whole lot of the preparatory
> patches, thanks!
>
> This appears to survive a few hours of perf_fuzzer combined with the tinypmu testcase.
>
> Patches also at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/pmu-unregister
>
With all the reported issues addressed, for the series:
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Thanks,
Ravi
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait
2025-03-17 6:49 ` Ravi Bangoria
@ 2025-04-02 9:15 ` Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-04-02 9:15 UTC (permalink / raw)
To: Ravi Bangoria
Cc: mingo, lucas.demarchi, linux-kernel, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
On Mon, Mar 17, 2025 at 12:19:07PM +0530, Ravi Bangoria wrote:
> Hi Peter,
>
> > @@ -1223,8 +1223,14 @@ static void put_ctx(struct perf_event_co
> > if (refcount_dec_and_test(&ctx->refcount)) {
> > if (ctx->parent_ctx)
> > put_ctx(ctx->parent_ctx);
> > - if (ctx->task && ctx->task != TASK_TOMBSTONE)
> > - put_task_struct(ctx->task);
> > + if (ctx->task) {
> > + if (ctx->task == TASK_TOMBSTONE) {
> > + smp_mb(); /* pairs with wait_var_event() */
> > + wake_up_var(&ctx->refcount);
>
> perf_event_free_task() waits on "ctx->refcount == 1", but moving
> wake_up_var() under refcount_dec_and_test() will cause
> perf_event_free_task() to wait indefinitely, right? So shouldn't
> wake_up_var() be outside? something like:
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1281,15 +1281,14 @@ static void put_ctx(struct perf_event_context *ctx)
> if (refcount_dec_and_test(&ctx->refcount)) {
> if (ctx->parent_ctx)
> put_ctx(ctx->parent_ctx);
> - if (ctx->task) {
> - if (ctx->task == TASK_TOMBSTONE) {
> - smp_mb(); /* pairs with wait_var_event() */
> - wake_up_var(&ctx->refcount);
> - } else {
> - put_task_struct(ctx->task);
> - }
> - }
> + if (ctx->task && ctx->task != TASK_TOMBSTONE)
> + put_task_struct(ctx->task);
> call_rcu(&ctx->rcu_head, free_ctx);
> + } else {
> + if (ctx->task == TASK_TOMBSTONE) {
> + smp_mb(); /* pairs with wait_var_event() */
> + wake_up_var(&ctx->refcount);
> + }
> }
> }
Yes, you're quite right indeed. Thanks!
^ permalink raw reply [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Make perf_pmu_unregister() useable
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
2025-03-10 15:35 ` Ravi Bangoria
2025-03-10 16:46 ` Ravi Bangoria
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
2025-04-17 8:08 ` Peter Zijlstra
2025-04-14 0:37 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Mi, Dapeng
3 siblings, 1 reply; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ravi Bangoria, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: da916e96e2dedcb2d40de77a7def833d315b81a6
Gitweb: https://git.kernel.org/tip/da916e96e2dedcb2d40de77a7def833d315b81a6
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 25 Oct 2024 10:21:41 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:48 +02:00
perf: Make perf_pmu_unregister() useable
Previously it was only safe to call perf_pmu_unregister() if there
were no active events of that pmu around -- which was impossible to
guarantee since it races all sorts against perf_init_event().
Rework the whole thing by:
- keeping track of all events for a given pmu
- 'hiding' the pmu from perf_init_event()
- waiting for the appropriate (s)rcu grace periods such that all
prior references to the PMU will be completed
- detaching all still existing events of that pmu (see first point)
and moving them to a new REVOKED state.
- actually freeing the pmu data.
Where notably the new REVOKED state must inhibit all event actions
from reaching code that wants to use event->pmu.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.525402029@infradead.org
---
include/linux/perf_event.h | 15 +-
kernel/events/core.c | 320 ++++++++++++++++++++++++++++++------
2 files changed, 280 insertions(+), 55 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0069ba6..7f49a58 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -325,6 +325,9 @@ struct perf_output_handle;
struct pmu {
struct list_head entry;
+ spinlock_t events_lock;
+ struct list_head events;
+
struct module *module;
struct device *dev;
struct device *parent;
@@ -622,9 +625,10 @@ struct perf_addr_filter_range {
* enum perf_event_state - the states of an event:
*/
enum perf_event_state {
- PERF_EVENT_STATE_DEAD = -4,
- PERF_EVENT_STATE_EXIT = -3,
- PERF_EVENT_STATE_ERROR = -2,
+ PERF_EVENT_STATE_DEAD = -5,
+ PERF_EVENT_STATE_REVOKED = -4, /* pmu gone, must not touch */
+ PERF_EVENT_STATE_EXIT = -3, /* task died, still inherit */
+ PERF_EVENT_STATE_ERROR = -2, /* scheduling error, can enable */
PERF_EVENT_STATE_OFF = -1,
PERF_EVENT_STATE_INACTIVE = 0,
PERF_EVENT_STATE_ACTIVE = 1,
@@ -865,6 +869,7 @@ struct perf_event {
void *security;
#endif
struct list_head sb_list;
+ struct list_head pmu_list;
/*
* Certain events gets forwarded to another pmu internally by over-
@@ -1155,7 +1160,7 @@ extern void perf_aux_output_flag(struct perf_output_handle *handle, u64 flags);
extern void perf_event_itrace_started(struct perf_event *event);
extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
-extern void perf_pmu_unregister(struct pmu *pmu);
+extern int perf_pmu_unregister(struct pmu *pmu);
extern void __perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task);
@@ -1760,7 +1765,7 @@ static inline bool needs_branch_stack(struct perf_event *event)
static inline bool has_aux(struct perf_event *event)
{
- return event->pmu->setup_aux;
+ return event->pmu && event->pmu->setup_aux;
}
static inline bool has_aux_action(struct perf_event *event)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 985b5c7..2eb9cd5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -208,6 +208,7 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
}
#define TASK_TOMBSTONE ((void *)-1L)
+#define EVENT_TOMBSTONE ((void *)-1L)
static bool is_kernel_event(struct perf_event *event)
{
@@ -2336,6 +2337,11 @@ static void perf_child_detach(struct perf_event *event)
sync_child_event(event);
list_del_init(&event->child_list);
+ /*
+ * Cannot set to NULL, as that would confuse the situation vs
+ * not being a child event. See for example unaccount_event().
+ */
+ event->parent = EVENT_TOMBSTONE;
}
static bool is_orphaned_event(struct perf_event *event)
@@ -2457,8 +2463,9 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
#define DETACH_GROUP 0x01UL
#define DETACH_CHILD 0x02UL
-#define DETACH_DEAD 0x04UL
-#define DETACH_EXIT 0x08UL
+#define DETACH_EXIT 0x04UL
+#define DETACH_REVOKE 0x08UL
+#define DETACH_DEAD 0x10UL
/*
* Cross CPU call to remove a performance event
@@ -2484,18 +2491,21 @@ __perf_remove_from_context(struct perf_event *event,
*/
if (flags & DETACH_EXIT)
state = PERF_EVENT_STATE_EXIT;
+ if (flags & DETACH_REVOKE)
+ state = PERF_EVENT_STATE_REVOKED;
if (flags & DETACH_DEAD) {
event->pending_disable = 1;
state = PERF_EVENT_STATE_DEAD;
}
event_sched_out(event, ctx);
- perf_event_set_state(event, min(event->state, state));
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
+ event->state = min(event->state, state);
+
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
@@ -4523,7 +4533,8 @@ out:
static void perf_remove_from_owner(struct perf_event *event);
static void perf_event_exit_event(struct perf_event *event,
- struct perf_event_context *ctx);
+ struct perf_event_context *ctx,
+ bool revoke);
/*
* Removes all events from the current task that have been marked
@@ -4550,7 +4561,7 @@ static void perf_event_remove_on_exec(struct perf_event_context *ctx)
modified = true;
- perf_event_exit_event(event, ctx);
+ perf_event_exit_event(event, ctx, false);
}
raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -5132,6 +5143,7 @@ static bool is_sb_event(struct perf_event *event)
attr->context_switch || attr->text_poke ||
attr->bpf_event)
return true;
+
return false;
}
@@ -5528,6 +5540,8 @@ static void perf_free_addr_filters(struct perf_event *event);
/* vs perf_event_alloc() error */
static void __free_event(struct perf_event *event)
{
+ struct pmu *pmu = event->pmu;
+
if (event->attach_state & PERF_ATTACH_CALLCHAIN)
put_callchain_buffers();
@@ -5557,6 +5571,7 @@ static void __free_event(struct perf_event *event)
* put_pmu_ctx() needs an event->ctx reference, because of
* epc->ctx.
*/
+ WARN_ON_ONCE(!pmu);
WARN_ON_ONCE(!event->ctx);
WARN_ON_ONCE(event->pmu_ctx->ctx != event->ctx);
put_pmu_ctx(event->pmu_ctx);
@@ -5569,8 +5584,13 @@ static void __free_event(struct perf_event *event)
if (event->ctx)
put_ctx(event->ctx);
- if (event->pmu)
- module_put(event->pmu->module);
+ if (pmu) {
+ module_put(pmu->module);
+ scoped_guard (spinlock, &pmu->events_lock) {
+ list_del(&event->pmu_list);
+ wake_up_var(pmu);
+ }
+ }
call_rcu(&event->rcu_head, free_event_rcu);
}
@@ -5606,22 +5626,6 @@ static void _free_event(struct perf_event *event)
}
/*
- * Used to free events which have a known refcount of 1, such as in error paths
- * where the event isn't exposed yet and inherited events.
- */
-static void free_event(struct perf_event *event)
-{
- if (WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1,
- "unexpected event refcount: %ld; ptr=%p\n",
- atomic_long_read(&event->refcount), event)) {
- /* leak to avoid use-after-free */
- return;
- }
-
- _free_event(event);
-}
-
-/*
* Remove user event from the owner task.
*/
static void perf_remove_from_owner(struct perf_event *event)
@@ -5724,7 +5728,11 @@ int perf_event_release_kernel(struct perf_event *event)
* Thus this guarantees that we will in fact observe and kill _ALL_
* child events.
*/
- perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+ if (event->state > PERF_EVENT_STATE_REVOKED) {
+ perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+ } else {
+ event->state = PERF_EVENT_STATE_DEAD;
+ }
perf_event_ctx_unlock(event, ctx);
@@ -6013,7 +6021,7 @@ __perf_read(struct perf_event *event, char __user *buf, size_t count)
* error state (i.e. because it was pinned but it couldn't be
* scheduled on to the CPU at some point).
*/
- if (event->state == PERF_EVENT_STATE_ERROR)
+ if (event->state <= PERF_EVENT_STATE_ERROR)
return 0;
if (count < event->read_size)
@@ -6052,8 +6060,14 @@ static __poll_t perf_poll(struct file *file, poll_table *wait)
struct perf_buffer *rb;
__poll_t events = EPOLLHUP;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return EPOLLERR;
+
poll_wait(file, &event->waitq, wait);
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return EPOLLERR;
+
if (is_event_hup(event))
return events;
@@ -6232,6 +6246,9 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
void (*func)(struct perf_event *);
u32 flags = arg;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return -ENODEV;
+
switch (cmd) {
case PERF_EVENT_IOC_ENABLE:
func = _perf_event_enable;
@@ -6607,9 +6624,22 @@ void ring_buffer_put(struct perf_buffer *rb)
call_rcu(&rb->rcu_head, rb_free_rcu);
}
+typedef void (*mapped_f)(struct perf_event *event, struct mm_struct *mm);
+
+#define get_mapped(event, func) \
+({ struct pmu *pmu; \
+ mapped_f f = NULL; \
+ guard(rcu)(); \
+ pmu = READ_ONCE(event->pmu); \
+ if (pmu) \
+ f = pmu->func; \
+ f; \
+})
+
static void perf_mmap_open(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
+ mapped_f mapped = get_mapped(event, event_mapped);
atomic_inc(&event->mmap_count);
atomic_inc(&event->rb->mmap_count);
@@ -6617,8 +6647,8 @@ static void perf_mmap_open(struct vm_area_struct *vma)
if (vma->vm_pgoff)
atomic_inc(&event->rb->aux_mmap_count);
- if (event->pmu->event_mapped)
- event->pmu->event_mapped(event, vma->vm_mm);
+ if (mapped)
+ mapped(event, vma->vm_mm);
}
static void perf_pmu_output_stop(struct perf_event *event);
@@ -6634,14 +6664,16 @@ static void perf_pmu_output_stop(struct perf_event *event);
static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
+ mapped_f unmapped = get_mapped(event, event_unmapped);
struct perf_buffer *rb = ring_buffer_get(event);
struct user_struct *mmap_user = rb->mmap_user;
int mmap_locked = rb->mmap_locked;
unsigned long size = perf_data_size(rb);
bool detach_rest = false;
- if (event->pmu->event_unmapped)
- event->pmu->event_unmapped(event, vma->vm_mm);
+ /* FIXIES vs perf_pmu_unregister() */
+ if (unmapped)
+ unmapped(event, vma->vm_mm);
/*
* The AUX buffer is strictly a sub-buffer, serialize using aux_mutex
@@ -6834,6 +6866,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long nr_pages;
long user_extra = 0, extra = 0;
int ret, flags = 0;
+ mapped_f mapped;
/*
* Don't allow mmap() of inherited per-task counters. This would
@@ -6864,6 +6897,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
mutex_lock(&event->mmap_mutex);
ret = -EINVAL;
+ /*
+ * This relies on __pmu_detach_event() taking mmap_mutex after marking
+ * the event REVOKED. Either we observe the state, or __pmu_detach_event()
+ * will detach the rb created here.
+ */
+ if (event->state <= PERF_EVENT_STATE_REVOKED) {
+ ret = -ENODEV;
+ goto unlock;
+ }
+
if (vma->vm_pgoff == 0) {
nr_pages -= 1;
@@ -7042,8 +7085,9 @@ aux_unlock:
if (!ret)
ret = map_range(rb, vma);
- if (!ret && event->pmu->event_mapped)
- event->pmu->event_mapped(event, vma->vm_mm);
+ mapped = get_mapped(event, event_mapped);
+ if (mapped)
+ mapped(event, vma->vm_mm);
return ret;
}
@@ -7054,6 +7098,9 @@ static int perf_fasync(int fd, struct file *filp, int on)
struct perf_event *event = filp->private_data;
int retval;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return -ENODEV;
+
inode_lock(inode);
retval = fasync_helper(fd, filp, on, &event->fasync);
inode_unlock(inode);
@@ -11062,6 +11109,9 @@ static int __perf_event_set_bpf_prog(struct perf_event *event,
{
bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
+ if (event->state <= PERF_EVENT_STATE_REVOKED)
+ return -ENODEV;
+
if (!perf_event_is_tracing(event))
return perf_event_set_bpf_handler(event, prog, bpf_cookie);
@@ -12245,6 +12295,9 @@ int perf_pmu_register(struct pmu *_pmu, const char *name, int type)
if (!pmu->event_idx)
pmu->event_idx = perf_event_idx_default;
+ INIT_LIST_HEAD(&pmu->events);
+ spin_lock_init(&pmu->events_lock);
+
/*
* Now that the PMU is complete, make it visible to perf_try_init_event().
*/
@@ -12258,21 +12311,143 @@ int perf_pmu_register(struct pmu *_pmu, const char *name, int type)
}
EXPORT_SYMBOL_GPL(perf_pmu_register);
-void perf_pmu_unregister(struct pmu *pmu)
+static void __pmu_detach_event(struct pmu *pmu, struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ /*
+ * De-schedule the event and mark it REVOKED.
+ */
+ perf_event_exit_event(event, ctx, true);
+
+ /*
+ * All _free_event() bits that rely on event->pmu:
+ *
+ * Notably, perf_mmap() relies on the ordering here.
+ */
+ scoped_guard (mutex, &event->mmap_mutex) {
+ WARN_ON_ONCE(pmu->event_unmapped);
+ /*
+ * Mostly an empty lock sequence, such that perf_mmap(), which
+ * relies on mmap_mutex, is sure to observe the state change.
+ */
+ }
+
+ perf_event_free_bpf_prog(event);
+ perf_free_addr_filters(event);
+
+ if (event->destroy) {
+ event->destroy(event);
+ event->destroy = NULL;
+ }
+
+ if (event->pmu_ctx) {
+ put_pmu_ctx(event->pmu_ctx);
+ event->pmu_ctx = NULL;
+ }
+
+ exclusive_event_destroy(event);
+ module_put(pmu->module);
+
+ event->pmu = NULL; /* force fault instead of UAF */
+}
+
+static void pmu_detach_event(struct pmu *pmu, struct perf_event *event)
+{
+ struct perf_event_context *ctx;
+
+ ctx = perf_event_ctx_lock(event);
+ __pmu_detach_event(pmu, event, ctx);
+ perf_event_ctx_unlock(event, ctx);
+
+ scoped_guard (spinlock, &pmu->events_lock)
+ list_del(&event->pmu_list);
+}
+
+static struct perf_event *pmu_get_event(struct pmu *pmu)
+{
+ struct perf_event *event;
+
+ guard(spinlock)(&pmu->events_lock);
+ list_for_each_entry(event, &pmu->events, pmu_list) {
+ if (atomic_long_inc_not_zero(&event->refcount))
+ return event;
+ }
+
+ return NULL;
+}
+
+static bool pmu_empty(struct pmu *pmu)
+{
+ guard(spinlock)(&pmu->events_lock);
+ return list_empty(&pmu->events);
+}
+
+static void pmu_detach_events(struct pmu *pmu)
+{
+ struct perf_event *event;
+
+ for (;;) {
+ event = pmu_get_event(pmu);
+ if (!event)
+ break;
+
+ pmu_detach_event(pmu, event);
+ put_event(event);
+ }
+
+ /*
+ * wait for pending _free_event()s
+ */
+ wait_var_event(pmu, pmu_empty(pmu));
+}
+
+int perf_pmu_unregister(struct pmu *pmu)
{
scoped_guard (mutex, &pmus_lock) {
+ if (!idr_cmpxchg(&pmu_idr, pmu->type, pmu, NULL))
+ return -EINVAL;
+
list_del_rcu(&pmu->entry);
- idr_remove(&pmu_idr, pmu->type);
}
/*
* We dereference the pmu list under both SRCU and regular RCU, so
* synchronize against both of those.
+ *
+ * Notably, the entirety of event creation, from perf_init_event()
+ * (which will now fail, because of the above) until
+ * perf_install_in_context() should be under SRCU such that
+ * this synchronizes against event creation. This avoids trying to
+ * detach events that are not fully formed.
*/
synchronize_srcu(&pmus_srcu);
synchronize_rcu();
+ if (pmu->event_unmapped && !pmu_empty(pmu)) {
+ /*
+ * Can't force remove events when pmu::event_unmapped()
+ * is used in perf_mmap_close().
+ */
+ guard(mutex)(&pmus_lock);
+ idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu);
+ list_add_rcu(&pmu->entry, &pmus);
+ return -EBUSY;
+ }
+
+ scoped_guard (mutex, &pmus_lock)
+ idr_remove(&pmu_idr, pmu->type);
+
+ /*
+ * PMU is removed from the pmus list, so no new events will
+ * be created, now take care of the existing ones.
+ */
+ pmu_detach_events(pmu);
+
+ /*
+ * PMU is unused, make it go away.
+ */
perf_pmu_free(pmu);
+ return 0;
}
EXPORT_SYMBOL_GPL(perf_pmu_unregister);
@@ -12366,7 +12541,7 @@ static struct pmu *perf_init_event(struct perf_event *event)
struct pmu *pmu;
int type, ret;
- guard(srcu)(&pmus_srcu);
+ guard(srcu)(&pmus_srcu); /* pmu idr/list access */
/*
* Save original type before calling pmu->event_init() since certain
@@ -12590,6 +12765,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->active_entry);
INIT_LIST_HEAD(&event->addr_filters.list);
INIT_HLIST_NODE(&event->hlist_entry);
+ INIT_LIST_HEAD(&event->pmu_list);
init_waitqueue_head(&event->waitq);
@@ -12768,6 +12944,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
/* symmetric to unaccount_event() in _free_event() */
account_event(event);
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ lockdep_assert_held(&pmus_srcu);
+ scoped_guard (spinlock, &pmu->events_lock)
+ list_add(&event->pmu_list, &pmu->events);
+
return_ptr(event);
}
@@ -12967,6 +13150,9 @@ set:
goto unlock;
if (output_event) {
+ if (output_event->state <= PERF_EVENT_STATE_REVOKED)
+ goto unlock;
+
/* get the rb we want to redirect to */
rb = ring_buffer_get(output_event);
if (!rb)
@@ -13148,6 +13334,11 @@ SYSCALL_DEFINE5(perf_event_open,
if (event_fd < 0)
return event_fd;
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ guard(srcu)(&pmus_srcu);
+
CLASS(fd, group)(group_fd); // group_fd == -1 => empty
if (group_fd != -1) {
if (!is_perf_file(group)) {
@@ -13155,6 +13346,10 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_fd;
}
group_leader = fd_file(group)->private_data;
+ if (group_leader->state <= PERF_EVENT_STATE_REVOKED) {
+ err = -ENODEV;
+ goto err_fd;
+ }
if (flags & PERF_FLAG_FD_OUTPUT)
output_event = group_leader;
if (flags & PERF_FLAG_FD_NO_GROUP)
@@ -13451,7 +13646,7 @@ err_cred:
if (task)
up_read(&task->signal->exec_update_lock);
err_alloc:
- free_event(event);
+ put_event(event);
err_task:
if (task)
put_task_struct(task);
@@ -13488,6 +13683,11 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
if (attr->aux_output || attr->aux_action)
return ERR_PTR(-EINVAL);
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ guard(srcu)(&pmus_srcu);
+
event = perf_event_alloc(attr, cpu, task, NULL, NULL,
overflow_handler, context, -1);
if (IS_ERR(event)) {
@@ -13559,7 +13759,7 @@ err_unlock:
perf_unpin_context(ctx);
put_ctx(ctx);
err_alloc:
- free_event(event);
+ put_event(event);
err:
return ERR_PTR(err);
}
@@ -13699,10 +13899,15 @@ static void sync_child_event(struct perf_event *child_event)
}
static void
-perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
+perf_event_exit_event(struct perf_event *event,
+ struct perf_event_context *ctx, bool revoke)
{
struct perf_event *parent_event = event->parent;
- unsigned long detach_flags = 0;
+ unsigned long detach_flags = DETACH_EXIT;
+ bool is_child = !!parent_event;
+
+ if (parent_event == EVENT_TOMBSTONE)
+ parent_event = NULL;
if (parent_event) {
/*
@@ -13717,22 +13922,29 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
* Do destroy all inherited groups, we don't care about those
* and being thorough is better.
*/
- detach_flags = DETACH_GROUP | DETACH_CHILD;
+ detach_flags |= DETACH_GROUP | DETACH_CHILD;
mutex_lock(&parent_event->child_mutex);
}
- perf_remove_from_context(event, detach_flags | DETACH_EXIT);
+ if (revoke)
+ detach_flags |= DETACH_GROUP | DETACH_REVOKE;
+ perf_remove_from_context(event, detach_flags);
/*
* Child events can be freed.
*/
- if (parent_event) {
- mutex_unlock(&parent_event->child_mutex);
- /*
- * Kick perf_poll() for is_event_hup();
- */
- perf_event_wakeup(parent_event);
- put_event(event);
+ if (is_child) {
+ if (parent_event) {
+ mutex_unlock(&parent_event->child_mutex);
+ /*
+ * Kick perf_poll() for is_event_hup();
+ */
+ perf_event_wakeup(parent_event);
+ /*
+ * pmu_detach_event() will have an extra refcount.
+ */
+ put_event(event);
+ }
return;
}
@@ -13796,7 +14008,7 @@ static void perf_event_exit_task_context(struct task_struct *task, bool exit)
perf_event_task(task, ctx, 0);
list_for_each_entry_safe(child_event, next, &ctx->event_list, event_entry)
- perf_event_exit_event(child_event, ctx);
+ perf_event_exit_event(child_event, ctx, false);
mutex_unlock(&ctx->mutex);
@@ -13949,6 +14161,14 @@ inherit_event(struct perf_event *parent_event,
if (parent_event->parent)
parent_event = parent_event->parent;
+ if (parent_event->state <= PERF_EVENT_STATE_REVOKED)
+ return NULL;
+
+ /*
+ * Event creation should be under SRCU, see perf_pmu_unregister().
+ */
+ guard(srcu)(&pmus_srcu);
+
child_event = perf_event_alloc(&parent_event->attr,
parent_event->cpu,
child,
@@ -13962,7 +14182,7 @@ inherit_event(struct perf_event *parent_event,
pmu_ctx = find_get_pmu_context(child_event->pmu, child_ctx, child_event);
if (IS_ERR(pmu_ctx)) {
- free_event(child_event);
+ put_event(child_event);
return ERR_CAST(pmu_ctx);
}
child_event->pmu_ctx = pmu_ctx;
@@ -13977,7 +14197,7 @@ inherit_event(struct perf_event *parent_event,
if (is_orphaned_event(parent_event) ||
!atomic_long_inc_not_zero(&parent_event->refcount)) {
mutex_unlock(&parent_event->child_mutex);
- free_event(child_event);
+ put_event(child_event);
return NULL;
}
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Rename perf_event_exit_task(.child)
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (7 preceding siblings ...)
2025-03-17 6:54 ` [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Ravi Bangoria
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] perf: Simplify child event tear-down tip-bot2 for Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] perf: Ensure bpf_perf_link path is properly serialized tip-bot2 for Peter Zijlstra
10 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ravi Bangoria, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 4da0600edae1cf15d12bebacc66d7237e2c33fc6
Gitweb: https://git.kernel.org/tip/4da0600edae1cf15d12bebacc66d7237e2c33fc6
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 14 Feb 2025 13:23:45 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:48 +02:00
perf: Rename perf_event_exit_task(.child)
The task passed to perf_event_exit_task() is not a child, it is
current. Fix this confusing naming, since much of the rest of the code
also relies on it being current.
Specifically, both exec() and exit() callers use it with current as
the argument.
Notably, task_ctx_sched_out() doesn't make much sense outside of
current.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
---
kernel/events/core.c | 62 ++++++++++++++++++++++---------------------
1 file changed, 32 insertions(+), 30 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 85c8b79..985b5c7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13742,13 +13742,13 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child, bool exit)
+static void perf_event_exit_task_context(struct task_struct *task, bool exit)
{
- struct perf_event_context *child_ctx, *clone_ctx = NULL;
+ struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
- child_ctx = perf_pin_task_context(child);
- if (!child_ctx)
+ ctx = perf_pin_task_context(task);
+ if (!ctx)
return;
/*
@@ -13761,28 +13761,28 @@ static void perf_event_exit_task_context(struct task_struct *child, bool exit)
* without ctx::mutex (it cannot because of the move_group double mutex
* lock thing). See the comments in perf_install_in_context().
*/
- mutex_lock(&child_ctx->mutex);
+ mutex_lock(&ctx->mutex);
/*
* In a single ctx::lock section, de-schedule the events and detach the
* context from the task such that we cannot ever get it scheduled back
* in.
*/
- raw_spin_lock_irq(&child_ctx->lock);
+ raw_spin_lock_irq(&ctx->lock);
if (exit)
- task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
+ task_ctx_sched_out(ctx, NULL, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
* and mark the context dead.
*/
- RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
- put_ctx(child_ctx); /* cannot be last */
- WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
- put_task_struct(child); /* cannot be last */
+ RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+ put_ctx(ctx); /* cannot be last */
+ WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
+ put_task_struct(task); /* cannot be last */
- clone_ctx = unclone_ctx(child_ctx);
- raw_spin_unlock_irq(&child_ctx->lock);
+ clone_ctx = unclone_ctx(ctx);
+ raw_spin_unlock_irq(&ctx->lock);
if (clone_ctx)
put_ctx(clone_ctx);
@@ -13793,12 +13793,12 @@ static void perf_event_exit_task_context(struct task_struct *child, bool exit)
* get a few PERF_RECORD_READ events.
*/
if (exit)
- perf_event_task(child, child_ctx, 0);
+ perf_event_task(task, ctx, 0);
- list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
- perf_event_exit_event(child_event, child_ctx);
+ list_for_each_entry_safe(child_event, next, &ctx->event_list, event_entry)
+ perf_event_exit_event(child_event, ctx);
- mutex_unlock(&child_ctx->mutex);
+ mutex_unlock(&ctx->mutex);
if (!exit) {
/*
@@ -13814,24 +13814,26 @@ static void perf_event_exit_task_context(struct task_struct *child, bool exit)
*
* Wait for all events to drop their context reference.
*/
- wait_var_event(&child_ctx->refcount,
- refcount_read(&child_ctx->refcount) == 1);
+ wait_var_event(&ctx->refcount,
+ refcount_read(&ctx->refcount) == 1);
}
- put_ctx(child_ctx);
+ put_ctx(ctx);
}
/*
- * When a child task exits, feed back event values to parent events.
+ * When a task exits, feed back event values to parent events.
*
* Can be called with exec_update_lock held when called from
* setup_new_exec().
*/
-void perf_event_exit_task(struct task_struct *child)
+void perf_event_exit_task(struct task_struct *task)
{
struct perf_event *event, *tmp;
- mutex_lock(&child->perf_event_mutex);
- list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+ WARN_ON_ONCE(task != current);
+
+ mutex_lock(&task->perf_event_mutex);
+ list_for_each_entry_safe(event, tmp, &task->perf_event_list,
owner_entry) {
list_del_init(&event->owner_entry);
@@ -13842,23 +13844,23 @@ void perf_event_exit_task(struct task_struct *child)
*/
smp_store_release(&event->owner, NULL);
}
- mutex_unlock(&child->perf_event_mutex);
+ mutex_unlock(&task->perf_event_mutex);
- perf_event_exit_task_context(child, true);
+ perf_event_exit_task_context(task, true);
/*
* The perf_event_exit_task_context calls perf_event_task
- * with child's task_ctx, which generates EXIT events for
- * child contexts and sets child->perf_event_ctxp[] to NULL.
+ * with task's task_ctx, which generates EXIT events for
+ * task contexts and sets task->perf_event_ctxp[] to NULL.
* At this point we need to send EXIT events to cpu contexts.
*/
- perf_event_task(child, NULL, 0);
+ perf_event_task(task, NULL, 0);
/*
* Detach the perf_ctx_data for the system-wide event.
*/
guard(percpu_read)(&global_ctx_data_rwsem);
- detach_task_ctx_data(child);
+ detach_task_ctx_data(task);
}
/*
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Unify perf_event_free_task() / perf_event_exit_task_context()
2025-03-07 19:33 ` [PATCH v3 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context() Peter Zijlstra
2025-03-10 15:35 ` [PATCH v3a " Peter Zijlstra
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ravi Bangoria, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 90661365021a6d0d7f3a2c5046ebe33e4df53b92
Gitweb: https://git.kernel.org/tip/90661365021a6d0d7f3a2c5046ebe33e4df53b92
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 13 Feb 2025 14:04:07 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:47 +02:00
perf: Unify perf_event_free_task() / perf_event_exit_task_context()
Both perf_event_free_task() and perf_event_exit_task_context() are
very similar, except perf_event_exit_task_context() is a little more
generic / makes fewer assumptions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.274039710@infradead.org
---
kernel/events/core.c | 93 +++++++++++--------------------------------
1 file changed, 25 insertions(+), 68 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f75b0d3..85c8b79 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13742,13 +13742,11 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child)
+static void perf_event_exit_task_context(struct task_struct *child, bool exit)
{
struct perf_event_context *child_ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
- WARN_ON_ONCE(child != current);
-
child_ctx = perf_pin_task_context(child);
if (!child_ctx)
return;
@@ -13771,7 +13769,8 @@ static void perf_event_exit_task_context(struct task_struct *child)
* in.
*/
raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
+ if (exit)
+ task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
@@ -13780,7 +13779,7 @@ static void perf_event_exit_task_context(struct task_struct *child)
RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
put_ctx(child_ctx); /* cannot be last */
WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
- put_task_struct(current); /* cannot be last */
+ put_task_struct(child); /* cannot be last */
clone_ctx = unclone_ctx(child_ctx);
raw_spin_unlock_irq(&child_ctx->lock);
@@ -13793,13 +13792,31 @@ static void perf_event_exit_task_context(struct task_struct *child)
* won't get any samples after PERF_RECORD_EXIT. We can however still
* get a few PERF_RECORD_READ events.
*/
- perf_event_task(child, child_ctx, 0);
+ if (exit)
+ perf_event_task(child, child_ctx, 0);
list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
perf_event_exit_event(child_event, child_ctx);
mutex_unlock(&child_ctx->mutex);
+ if (!exit) {
+ /*
+ * perf_event_release_kernel() could still have a reference on
+ * this context. In that case we must wait for these events to
+ * have been freed (in particular all their references to this
+ * task must've been dropped).
+ *
+ * Without this copy_process() will unconditionally free this
+ * task (irrespective of its reference count) and
+ * _free_event()'s put_task_struct(event->hw.target) will be a
+ * use-after-free.
+ *
+ * Wait for all events to drop their context reference.
+ */
+ wait_var_event(&child_ctx->refcount,
+ refcount_read(&child_ctx->refcount) == 1);
+ }
put_ctx(child_ctx);
}
@@ -13827,7 +13844,7 @@ void perf_event_exit_task(struct task_struct *child)
}
mutex_unlock(&child->perf_event_mutex);
- perf_event_exit_task_context(child);
+ perf_event_exit_task_context(child, true);
/*
* The perf_event_exit_task_context calls perf_event_task
@@ -13844,25 +13861,6 @@ void perf_event_exit_task(struct task_struct *child)
detach_task_ctx_data(child);
}
-static void perf_free_event(struct perf_event *event,
- struct perf_event_context *ctx)
-{
- struct perf_event *parent = event->parent;
-
- if (WARN_ON_ONCE(!parent))
- return;
-
- mutex_lock(&parent->child_mutex);
- list_del_init(&event->child_list);
- mutex_unlock(&parent->child_mutex);
-
- raw_spin_lock_irq(&ctx->lock);
- perf_group_detach(event);
- list_del_event(event, ctx);
- raw_spin_unlock_irq(&ctx->lock);
- put_event(event);
-}
-
/*
* Free a context as created by inheritance by perf_event_init_task() below,
* used by fork() in case of fail.
@@ -13872,48 +13870,7 @@ static void perf_free_event(struct perf_event *event,
*/
void perf_event_free_task(struct task_struct *task)
{
- struct perf_event_context *ctx;
- struct perf_event *event, *tmp;
-
- ctx = rcu_access_pointer(task->perf_event_ctxp);
- if (!ctx)
- return;
-
- mutex_lock(&ctx->mutex);
- raw_spin_lock_irq(&ctx->lock);
- /*
- * Destroy the task <-> ctx relation and mark the context dead.
- *
- * This is important because even though the task hasn't been
- * exposed yet the context has been (through child_list).
- */
- RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
- WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(task); /* cannot be last */
- raw_spin_unlock_irq(&ctx->lock);
-
-
- list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
- perf_free_event(event, ctx);
-
- mutex_unlock(&ctx->mutex);
-
- /*
- * perf_event_release_kernel() could've stolen some of our
- * child events and still have them on its free_list. In that
- * case we must wait for these events to have been freed (in
- * particular all their references to this task must've been
- * dropped).
- *
- * Without this copy_process() will unconditionally free this
- * task (irrespective of its reference count) and
- * _free_event()'s put_task_struct(event->hw.target) will be a
- * use-after-free.
- *
- * Wait for all events to drop their context reference.
- */
- wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
- put_ctx(ctx); /* must be last */
+ perf_event_exit_task_context(task, false);
}
void perf_event_delayed_put(struct task_struct *task)
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Simplify perf_event_release_kernel()
2025-03-07 19:33 ` [PATCH v3 4/7] perf: Simplify perf_event_release_kernel() Peter Zijlstra
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ravi Bangoria, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 3e8671e00e57b3d006ed8ae5ef055807506e44b2
Gitweb: https://git.kernel.org/tip/3e8671e00e57b3d006ed8ae5ef055807506e44b2
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 Jan 2025 15:31:49 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:47 +02:00
perf: Simplify perf_event_release_kernel()
There is no good reason to have the free list anymore. It is possible
to call free_event() after the locks have been dropped in the main
loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.151721102@infradead.org
---
kernel/events/core.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fa6dab0..f75b0d3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5696,7 +5696,6 @@ int perf_event_release_kernel(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
struct perf_event *child, *tmp;
- LIST_HEAD(free_list);
/*
* If we got here through err_alloc: free_event(event); we will not
@@ -5765,23 +5764,23 @@ again:
struct perf_event, child_list);
if (tmp == child) {
perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
- list_add(&child->child_list, &free_list);
+ } else {
+ child = NULL;
}
mutex_unlock(&event->child_mutex);
mutex_unlock(&ctx->mutex);
+
+ if (child) {
+ /* Last reference unless ->pending_task work is pending */
+ put_event(child);
+ }
put_ctx(ctx);
goto again;
}
mutex_unlock(&event->child_mutex);
- list_for_each_entry_safe(child, tmp, &free_list, child_list) {
- list_del(&child->child_list);
- /* Last reference unless ->pending_task work is pending */
- put_event(child);
- }
-
no_ctx:
/*
* Last reference unless ->pending_task work is pending on this event
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Simplify perf_event_free_task() wait
2025-03-07 19:33 ` [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait Peter Zijlstra
2025-03-17 6:49 ` Ravi Bangoria
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
2025-04-09 13:01 ` Frederic Weisbecker
1 sibling, 1 reply; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ravi Bangoria, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
Gitweb: https://git.kernel.org/tip/59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 Jan 2025 15:27:07 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:46 +02:00
perf: Simplify perf_event_free_task() wait
Simplify the code by moving the duplicated wakeup condition into
put_ctx().
Notably, wait_var_event() is in perf_event_free_task() and will have
set ctx->task = TASK_TOMBSTONE.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193723.044499344@infradead.org
---
kernel/events/core.c | 25 +++----------------------
1 file changed, 3 insertions(+), 22 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3c92b75..fa6dab0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1270,6 +1270,9 @@ static void put_ctx(struct perf_event_context *ctx)
if (ctx->task && ctx->task != TASK_TOMBSTONE)
put_task_struct(ctx->task);
call_rcu(&ctx->rcu_head, free_ctx);
+ } else if (ctx->task == TASK_TOMBSTONE) {
+ smp_mb(); /* pairs with wait_var_event() */
+ wake_up_var(&ctx->refcount);
}
}
@@ -5729,8 +5732,6 @@ int perf_event_release_kernel(struct perf_event *event)
again:
mutex_lock(&event->child_mutex);
list_for_each_entry(child, &event->child_list, child_list) {
- void *var = NULL;
-
/*
* Cannot change, child events are not migrated, see the
* comment with perf_event_ctx_lock_nested().
@@ -5765,40 +5766,20 @@ again:
if (tmp == child) {
perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
list_add(&child->child_list, &free_list);
- } else {
- var = &ctx->refcount;
}
mutex_unlock(&event->child_mutex);
mutex_unlock(&ctx->mutex);
put_ctx(ctx);
- if (var) {
- /*
- * If perf_event_free_task() has deleted all events from the
- * ctx while the child_mutex got released above, make sure to
- * notify about the preceding put_ctx().
- */
- smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(var);
- }
goto again;
}
mutex_unlock(&event->child_mutex);
list_for_each_entry_safe(child, tmp, &free_list, child_list) {
- void *var = &child->ctx->refcount;
-
list_del(&child->child_list);
/* Last reference unless ->pending_task work is pending */
put_event(child);
-
- /*
- * Wake any perf_event_free_task() waiting for this event to be
- * freed.
- */
- smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(var);
}
no_ctx:
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Simplify child event tear-down
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (8 preceding siblings ...)
2025-04-08 19:05 ` [tip: perf/core] perf: Rename perf_event_exit_task(.child) tip-bot2 for Peter Zijlstra
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] perf: Ensure bpf_perf_link path is properly serialized tip-bot2 for Peter Zijlstra
10 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ravi Bangoria, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 0a00a43b8c200df5b9ca2b3e1726479b5916264b
Gitweb: https://git.kernel.org/tip/0a00a43b8c200df5b9ca2b3e1726479b5916264b
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 Jan 2025 15:25:23 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:46 +02:00
perf: Simplify child event tear-down
Currently perf_event_release_kernel() will iterate the child events and attempt
tear-down. However, it removes them from the child_list using list_move(),
notably skipping the state management done by perf_child_detach().
Crucially, it fails to clear PERF_ATTACH_CHILD, which opens the door for a
concurrent perf_remove_from_context() to race.
Fix this by tearing the child down with perf_remove_from_context(DETACH_GROUP | DETACH_CHILD); this way child_list management stays fully serialized using child_mutex.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
---
kernel/events/core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a85d63b..3c92b75 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2325,7 +2325,11 @@ static void perf_child_detach(struct perf_event *event)
if (WARN_ON_ONCE(!parent_event))
return;
+ /*
+ * Can't check this from an IPI, the holder is likely another CPU.
+ *
lockdep_assert_held(&parent_event->child_mutex);
+ */
sync_child_event(event);
list_del_init(&event->child_list);
@@ -5759,8 +5763,8 @@ again:
tmp = list_first_entry_or_null(&event->child_list,
struct perf_event, child_list);
if (tmp == child) {
- perf_remove_from_context(child, DETACH_GROUP);
- list_move(&child->child_list, &free_list);
+ perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
+ list_add(&child->child_list, &free_list);
} else {
var = &ctx->refcount;
}
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [tip: perf/core] perf: Ensure bpf_perf_link path is properly serialized
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
` (9 preceding siblings ...)
2025-04-08 19:05 ` [tip: perf/core] perf: Simplify child event tear-down tip-bot2 for Peter Zijlstra
@ 2025-04-08 19:05 ` tip-bot2 for Peter Zijlstra
10 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-08 19:05 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ravi Bangoria, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 7ed9138a72829d2035ecbd8dbd35b1bc3c137c40
Gitweb: https://git.kernel.org/tip/7ed9138a72829d2035ecbd8dbd35b1bc3c137c40
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 Jan 2025 10:54:50 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:46 +02:00
perf: Ensure bpf_perf_link path is properly serialized
Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
---
kernel/events/core.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e93c195..a85d63b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6239,6 +6239,9 @@ static int perf_event_set_output(struct perf_event *event,
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
static int perf_copy_attr(struct perf_event_attr __user *uattr,
struct perf_event_attr *attr);
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie);
static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
@@ -6301,7 +6304,7 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
if (IS_ERR(prog))
return PTR_ERR(prog);
- err = perf_event_set_bpf_prog(event, prog, 0);
+ err = __perf_event_set_bpf_prog(event, prog, 0);
if (err) {
bpf_prog_put(prog);
return err;
@@ -11069,8 +11072,9 @@ static inline bool perf_event_is_tracing(struct perf_event *event)
return false;
}
-int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
- u64 bpf_cookie)
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie)
{
bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
@@ -11108,6 +11112,20 @@ int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
return perf_event_attach_bpf_prog(event, prog, bpf_cookie);
}
+int perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie)
+{
+ struct perf_event_context *ctx;
+ int ret;
+
+ ctx = perf_event_ctx_lock(event);
+ ret = __perf_event_set_bpf_prog(event, prog, bpf_cookie);
+ perf_event_ctx_unlock(event, ctx);
+
+ return ret;
+}
+
void perf_event_free_bpf_prog(struct perf_event *event)
{
if (!event->prog)
@@ -11130,7 +11148,15 @@ static void perf_event_free_filter(struct perf_event *event)
{
}
-int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
+static int __perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
+ u64 bpf_cookie)
+{
+ return -ENOENT;
+}
+
+int perf_event_set_bpf_prog(struct perf_event *event,
+ struct bpf_prog *prog,
u64 bpf_cookie)
{
return -ENOENT;
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [tip: perf/core] perf: Simplify perf_event_free_task() wait
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
@ 2025-04-09 13:01 ` Frederic Weisbecker
2025-04-10 9:34 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Frederic Weisbecker @ 2025-04-09 13:01 UTC (permalink / raw)
To: linux-kernel
Cc: linux-tip-commits, Peter Zijlstra (Intel), Ravi Bangoria, x86
On Tue, Apr 08, 2025 at 07:05:04PM -0000, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the perf/core branch of tip:
>
> Commit-ID: 59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
> Gitweb: https://git.kernel.org/tip/59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
> Author: Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Fri, 17 Jan 2025 15:27:07 +01:00
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Tue, 08 Apr 2025 20:55:46 +02:00
>
> perf: Simplify perf_event_free_task() wait
>
> Simplify the code by moving the duplicated wakeup condition into
> put_ctx().
>
> Notably, wait_var_event() is in perf_event_free_task() and will have
> set ctx->task = TASK_TOMBSTONE.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Link: https://lkml.kernel.org/r/20250307193723.044499344@infradead.org
> ---
> kernel/events/core.c | 25 +++----------------------
> 1 file changed, 3 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3c92b75..fa6dab0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1270,6 +1270,9 @@ static void put_ctx(struct perf_event_context *ctx)
> if (ctx->task && ctx->task != TASK_TOMBSTONE)
> put_task_struct(ctx->task);
> call_rcu(&ctx->rcu_head, free_ctx);
> + } else if (ctx->task == TASK_TOMBSTONE) {
> + smp_mb(); /* pairs with wait_var_event() */
> + wake_up_var(&ctx->refcount);
So there are three situations:
* If perf_event_free_task() has removed all the children from the parent list
before perf_event_release_kernel() got a chance to even iterate them, then
it's all good as there is no get_ctx() pending.
* If perf_event_release_kernel() iterates a child event, but it gets freed
meanwhile by perf_event_free_task() while the mutexes are temporarily
unlocked, it's all good because while locking again the ctx mutex,
perf_event_release_kernel() observes TASK_TOMBSTONE.
* But if perf_event_release_kernel() frees the child event before
perf_event_free_task() got a chance, we may face this scenario:
perf_event_release_kernel()                perf_event_free_task()
---------------------------                ----------------------
mutex_lock(&event->child_mutex)
get_ctx(child->ctx)
mutex_unlock(&event->child_mutex)

mutex_lock(ctx->mutex)
mutex_lock(&event->child_mutex)
perf_remove_from_context(child)
mutex_unlock(&event->child_mutex)
mutex_unlock(ctx->mutex)

                                           // This lock acquires ctx->refcount == 2
                                           // visibility
                                           mutex_lock(ctx->mutex)
                                           ctx->task = TASK_TOMBSTONE
                                           mutex_unlock(ctx->mutex)

                                           wait_var_event()
                                               // enters prepare_to_wait() since
                                               // ctx->refcount == 2
                                               // is guaranteed to be seen
                                               set_current_state(TASK_INTERRUPTIBLE)
                                               smp_mb()
                                               if (ctx->refcount != 1)
                                                   schedule()

put_ctx()
    // NOT fully ordered! Only RELEASE semantics
    refcount_dec_and_test()
        atomic_fetch_sub_release()
// So TASK_TOMBSTONE is not guaranteed to be seen
if (ctx->task == TASK_TOMBSTONE)
    wake_up_var()
Basically it's a broken store buffer:
perf_event_free_task()                     perf_event_release_kernel()
----------------------                     ---------------------------
ctx->task = TASK_TOMBSTONE                 smp_store_release(&ctx->refcount, ctx->refcount - 1)
smp_mb()
READ_ONCE(ctx->refcount)                   READ_ONCE(ctx->task)
So you need this:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fa6dab08be47..c4fbbe25361a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1270,9 +1270,10 @@ static void put_ctx(struct perf_event_context *ctx)
if (ctx->task && ctx->task != TASK_TOMBSTONE)
put_task_struct(ctx->task);
call_rcu(&ctx->rcu_head, free_ctx);
- } else if (ctx->task == TASK_TOMBSTONE) {
+ } else {
smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(&ctx->refcount);
+ if (ctx->task == TASK_TOMBSTONE)
+ wake_up_var(&ctx->refcount);
}
}
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [tip: perf/core] perf: Simplify perf_event_free_task() wait
2025-04-09 13:01 ` Frederic Weisbecker
@ 2025-04-10 9:34 ` Peter Zijlstra
2025-04-10 9:45 ` Frederic Weisbecker
2025-04-17 12:03 ` Ingo Molnar
2025-04-17 13:01 ` [tip: perf/core] perf/core: Fix put_ctx() ordering tip-bot2 for Frederic Weisbecker
2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-04-10 9:34 UTC (permalink / raw)
To: Frederic Weisbecker; +Cc: linux-kernel, linux-tip-commits, Ravi Bangoria, x86
On Wed, Apr 09, 2025 at 03:01:12PM +0200, Frederic Weisbecker wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 3c92b75..fa6dab0 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1270,6 +1270,9 @@ static void put_ctx(struct perf_event_context *ctx)
> > if (ctx->task && ctx->task != TASK_TOMBSTONE)
> > put_task_struct(ctx->task);
> > call_rcu(&ctx->rcu_head, free_ctx);
> > + } else if (ctx->task == TASK_TOMBSTONE) {
> > + smp_mb(); /* pairs with wait_var_event() */
> > + wake_up_var(&ctx->refcount);
>
> So there are three situations:
>
> * If perf_event_free_task() has removed all the children from the parent list
> before perf_event_release_kernel() got a chance to even iterate them, then
> it's all good as there is no get_ctx() pending.
>
> * If perf_event_release_kernel() iterates a child event, but it gets freed
> meanwhile by perf_event_free_task() while the mutexes are temporarily
> unlocked, it's all good because while locking again the ctx mutex,
> perf_event_release_kernel() observes TASK_TOMBSTONE.
>
> * But if perf_event_release_kernel() frees the child event before
> perf_event_free_task() got a chance, we may face this scenario:
>
> perf_event_release_kernel() perf_event_free_task()
> -------------------------- ------------------------
> mutex_lock(&event->child_mutex)
> get_ctx(child->ctx)
> mutex_unlock(&event->child_mutex)
>
> mutex_lock(ctx->mutex)
> mutex_lock(&event->child_mutex)
> perf_remove_from_context(child)
> mutex_unlock(&event->child_mutex)
> mutex_unlock(ctx->mutex)
>
> // This lock acquires ctx->refcount == 2
> // visibility
> mutex_lock(ctx->mutex)
> ctx->task = TASK_TOMBSTONE
> mutex_unlock(ctx->mutex)
>
> wait_var_event()
> // enters prepare_to_wait() since
> // ctx->refcount == 2
> // is guaranteed to be seen
> set_current_state(TASK_INTERRUPTIBLE)
> smp_mb()
> if (ctx->refcount != 1)
> schedule()
> put_ctx()
> // NOT fully ordered! Only RELEASE semantics
> refcount_dec_and_test()
> atomic_fetch_sub_release()
> // So TASK_TOMBSTONE is not guaranteed to be seen
> if (ctx->task == TASK_TOMBSTONE)
> wake_up_var()
>
> Basically it's a broken store buffer:
>
> perf_event_release_kernel() perf_event_free_task()
> -------------------------- ------------------------
> ctx->task = TASK_TOMBSTONE smp_store_release(&ctx->refcount, ctx->refcount - 1)
> smp_mb()
> READ_ONCE(ctx->refcount) READ_ONCE(ctx->task)
>
>
> So you need this:
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index fa6dab08be47..c4fbbe25361a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1270,9 +1270,10 @@ static void put_ctx(struct perf_event_context *ctx)
> if (ctx->task && ctx->task != TASK_TOMBSTONE)
> put_task_struct(ctx->task);
> call_rcu(&ctx->rcu_head, free_ctx);
> - } else if (ctx->task == TASK_TOMBSTONE) {
> + } else {
> smp_mb(); /* pairs with wait_var_event() */
> - wake_up_var(&ctx->refcount);
> + if (ctx->task == TASK_TOMBSTONE)
> + wake_up_var(&ctx->refcount);
> }
> }
Very good, thanks!
I'll make that smp_mb__after_atomic() instead, but yes, this barrier
needs to move before the loading of ctx->task.
I'll transform this into a patch and stuff on top.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [tip: perf/core] perf: Simplify perf_event_free_task() wait
2025-04-10 9:34 ` Peter Zijlstra
@ 2025-04-10 9:45 ` Frederic Weisbecker
0 siblings, 0 replies; 43+ messages in thread
From: Frederic Weisbecker @ 2025-04-10 9:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, linux-tip-commits, Ravi Bangoria, x86
On Thu, Apr 10, 2025 at 11:34:56AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 09, 2025 at 03:01:12PM +0200, Frederic Weisbecker wrote:
>
> > [...]
>
> Very good, thanks!
>
> I'll make that smp_mb__after_atomic() instead, but yes, this barrier
> needs to move before the loading of ctx->task.
>
> I'll transform this into a patch and stuff on top.
Sure! Or feel free to fold it, though that would imply a rebase...
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
` (2 preceding siblings ...)
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
@ 2025-04-14 0:37 ` Mi, Dapeng
2025-04-17 8:07 ` Peter Zijlstra
3 siblings, 1 reply; 43+ messages in thread
From: Mi, Dapeng @ 2025-04-14 0:37 UTC (permalink / raw)
To: Peter Zijlstra, mingo, ravi.bangoria, lucas.demarchi
Cc: linux-kernel, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang
On 3/8/2025 3:33 AM, Peter Zijlstra wrote:
> Previously it was only safe to call perf_pmu_unregister() if there
> were no active events of that pmu around -- which was impossible to
> guarantee since it races all sorts against perf_init_event().
>
> Rework the whole thing by:
>
> - keeping track of all events for a given pmu
>
> - 'hiding' the pmu from perf_init_event()
>
> - waiting for the appropriate (s)rcu grace periods such that all
> prior references to the PMU will be completed
>
> - detaching all still existing events of that pmu (see first point)
> and moving them to a new REVOKED state.
>
> - actually freeing the pmu data.
>
> Where notably the new REVOKED state must inhibit all event actions
> from reaching code that wants to use event->pmu.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/perf_event.h | 15 +-
> kernel/events/core.c | 294 ++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 274 insertions(+), 35 deletions(-)
>
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -325,6 +325,9 @@ struct perf_output_handle;
> struct pmu {
> struct list_head entry;
>
> + spinlock_t events_lock;
> + struct list_head events;
> +
> struct module *module;
> struct device *dev;
> struct device *parent;
> @@ -632,9 +635,10 @@ struct perf_addr_filter_range {
> * enum perf_event_state - the states of an event:
> */
> enum perf_event_state {
> - PERF_EVENT_STATE_DEAD = -4,
> - PERF_EVENT_STATE_EXIT = -3,
> - PERF_EVENT_STATE_ERROR = -2,
> + PERF_EVENT_STATE_DEAD = -5,
> + PERF_EVENT_STATE_REVOKED = -4, /* pmu gone, must not touch */
> + PERF_EVENT_STATE_EXIT = -3, /* task died, still inherit */
> + PERF_EVENT_STATE_ERROR = -2, /* scheduling error, can enable */
> PERF_EVENT_STATE_OFF = -1,
> PERF_EVENT_STATE_INACTIVE = 0,
> PERF_EVENT_STATE_ACTIVE = 1,
> @@ -875,6 +879,7 @@ struct perf_event {
> void *security;
> #endif
> struct list_head sb_list;
> + struct list_head pmu_list;
>
> /*
> * Certain events gets forwarded to another pmu internally by over-
> @@ -1132,7 +1137,7 @@ extern void perf_aux_output_flag(struct
> extern void perf_event_itrace_started(struct perf_event *event);
>
> extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
> -extern void perf_pmu_unregister(struct pmu *pmu);
> +extern int perf_pmu_unregister(struct pmu *pmu);
>
> extern void __perf_event_task_sched_in(struct task_struct *prev,
> struct task_struct *task);
> @@ -1734,7 +1739,7 @@ static inline bool needs_branch_stack(st
>
> static inline bool has_aux(struct perf_event *event)
> {
> - return event->pmu->setup_aux;
> + return event->pmu && event->pmu->setup_aux;
> }
>
> static inline bool has_aux_action(struct perf_event *event)
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -207,6 +207,7 @@ static void perf_ctx_unlock(struct perf_
> }
>
> #define TASK_TOMBSTONE ((void *)-1L)
> +#define EVENT_TOMBSTONE ((void *)-1L)
>
> static bool is_kernel_event(struct perf_event *event)
> {
> @@ -2348,6 +2349,11 @@ static void perf_child_detach(struct per
>
> sync_child_event(event);
> list_del_init(&event->child_list);
> + /*
> + * Cannot set to NULL, as that would confuse the situation vs
> + * not being a child event. See for example unaccount_event().
> + */
> + event->parent = EVENT_TOMBSTONE;
> }
>
> static bool is_orphaned_event(struct perf_event *event)
> @@ -2469,7 +2475,9 @@ ctx_time_update_event(struct perf_event_
>
> #define DETACH_GROUP 0x01UL
> #define DETACH_CHILD 0x02UL
> -#define DETACH_DEAD 0x04UL
> +#define DETACH_EXIT 0x04UL
> +#define DETACH_REVOKE 0x08UL
> +#define DETACH_DEAD 0x10UL
>
> /*
> * Cross CPU call to remove a performance event
> @@ -2484,6 +2492,7 @@ __perf_remove_from_context(struct perf_e
> void *info)
> {
> struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
> + enum perf_event_state state = PERF_EVENT_STATE_OFF;
> unsigned long flags = (unsigned long)info;
>
> ctx_time_update(cpuctx, ctx);
> @@ -2492,16 +2501,22 @@ __perf_remove_from_context(struct perf_e
> * Ensure event_sched_out() switches to OFF, at the very least
> * this avoids raising perf_pending_task() at this time.
> */
> - if (flags & DETACH_DEAD)
> + if (flags & DETACH_EXIT)
> + state = PERF_EVENT_STATE_EXIT;
> + if (flags & DETACH_REVOKE)
> + state = PERF_EVENT_STATE_REVOKED;
> + if (flags & DETACH_DEAD) {
> event->pending_disable = 1;
> + state = PERF_EVENT_STATE_DEAD;
> + }
> event_sched_out(event, ctx);
> if (flags & DETACH_GROUP)
> perf_group_detach(event);
> if (flags & DETACH_CHILD)
> perf_child_detach(event);
> list_del_event(event, ctx);
> - if (flags & DETACH_DEAD)
> - event->state = PERF_EVENT_STATE_DEAD;
> +
> + event->state = min(event->state, state);
>
> if (!pmu_ctx->nr_events) {
> pmu_ctx->rotate_necessary = 0;
> @@ -4560,7 +4575,8 @@ static void perf_event_enable_on_exec(st
>
> static void perf_remove_from_owner(struct perf_event *event);
> static void perf_event_exit_event(struct perf_event *event,
> - struct perf_event_context *ctx);
> + struct perf_event_context *ctx,
> + bool revoke);
>
> /*
> * Removes all events from the current task that have been marked
> @@ -4587,7 +4603,7 @@ static void perf_event_remove_on_exec(st
>
> modified = true;
>
> - perf_event_exit_event(event, ctx);
> + perf_event_exit_event(event, ctx, false);
> }
>
> raw_spin_lock_irqsave(&ctx->lock, flags);
> @@ -5187,6 +5203,7 @@ static bool is_sb_event(struct perf_even
> attr->context_switch || attr->text_poke ||
> attr->bpf_event)
> return true;
> +
> return false;
> }
>
> @@ -5388,6 +5405,8 @@ static void perf_pending_task_sync(struc
> /* vs perf_event_alloc() error */
> static void __free_event(struct perf_event *event)
> {
> + struct pmu *pmu = event->pmu;
> +
> if (event->attach_state & PERF_ATTACH_CALLCHAIN)
> put_callchain_buffers();
>
> @@ -5414,6 +5433,7 @@ static void __free_event(struct perf_eve
> * put_pmu_ctx() needs an event->ctx reference, because of
> * epc->ctx.
> */
> + WARN_ON_ONCE(!pmu);
> WARN_ON_ONCE(!event->ctx);
> WARN_ON_ONCE(event->pmu_ctx->ctx != event->ctx);
> put_pmu_ctx(event->pmu_ctx);
> @@ -5426,8 +5446,13 @@ static void __free_event(struct perf_eve
> if (event->ctx)
> put_ctx(event->ctx);
>
> - if (event->pmu)
> - module_put(event->pmu->module);
> + if (pmu) {
> + module_put(pmu->module);
> + scoped_guard (spinlock, &pmu->events_lock) {
> + list_del(&event->pmu_list);
> + wake_up_var(pmu);
> + }
> + }
>
> call_rcu(&event->rcu_head, free_event_rcu);
> }
> @@ -5575,7 +5600,11 @@ int perf_event_release_kernel(struct per
> * Thus this guarantees that we will in fact observe and kill _ALL_
> * child events.
> */
> - perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
> + if (event->state > PERF_EVENT_STATE_REVOKED) {
> + perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
> + } else {
> + event->state = PERF_EVENT_STATE_DEAD;
> + }
>
> perf_event_ctx_unlock(event, ctx);
>
> @@ -5628,7 +5657,7 @@ int perf_event_release_kernel(struct per
> mutex_unlock(&ctx->mutex);
>
> if (child)
> - free_event(child);
> + put_event(child);
> put_ctx(ctx);
>
> goto again;
> @@ -5863,7 +5892,7 @@ __perf_read(struct perf_event *event, ch
> * error state (i.e. because it was pinned but it couldn't be
> * scheduled on to the CPU at some point).
> */
> - if (event->state == PERF_EVENT_STATE_ERROR)
> + if (event->state <= PERF_EVENT_STATE_ERROR)
> return 0;
>
> if (count < event->read_size)
> @@ -5902,8 +5931,14 @@ static __poll_t perf_poll(struct file *f
> struct perf_buffer *rb;
> __poll_t events = EPOLLHUP;
>
> + if (event->state <= PERF_EVENT_STATE_REVOKED)
> + return EPOLLERR;
> +
> poll_wait(file, &event->waitq, wait);
>
> + if (event->state <= PERF_EVENT_STATE_REVOKED)
> + return EPOLLERR;
> +
> if (is_event_hup(event))
> return events;
>
> @@ -6078,6 +6113,9 @@ static long _perf_ioctl(struct perf_even
> void (*func)(struct perf_event *);
> u32 flags = arg;
>
> + if (event->state <= PERF_EVENT_STATE_REVOKED)
> + return -ENODEV;
> +
> switch (cmd) {
> case PERF_EVENT_IOC_ENABLE:
> func = _perf_event_enable;
> @@ -6453,9 +6491,22 @@ void ring_buffer_put(struct perf_buffer
> call_rcu(&rb->rcu_head, rb_free_rcu);
> }
>
> +typedef void (*mapped_f)(struct perf_event *event, struct mm_struct *mm);
> +
> +#define get_mapped(event, func) \
> +({ struct pmu *pmu; \
> + mapped_f f = NULL; \
> + guard(rcu)(); \
> + pmu = READ_ONCE(event->pmu); \
> + if (pmu) \
> + f = pmu->func; \
> + f; \
> +})
> +
> static void perf_mmap_open(struct vm_area_struct *vma)
> {
> struct perf_event *event = vma->vm_file->private_data;
> + mapped_f mapped = get_mapped(event, event_mapped);
>
> atomic_inc(&event->mmap_count);
> atomic_inc(&event->rb->mmap_count);
> @@ -6463,8 +6514,8 @@ static void perf_mmap_open(struct vm_are
> if (vma->vm_pgoff)
> atomic_inc(&event->rb->aux_mmap_count);
>
> - if (event->pmu->event_mapped)
> - event->pmu->event_mapped(event, vma->vm_mm);
> + if (mapped)
> + mapped(event, vma->vm_mm);
> }
>
> static void perf_pmu_output_stop(struct perf_event *event);
> @@ -6480,14 +6531,16 @@ static void perf_pmu_output_stop(struct
> static void perf_mmap_close(struct vm_area_struct *vma)
> {
> struct perf_event *event = vma->vm_file->private_data;
> + mapped_f unmapped = get_mapped(event, event_unmapped);
> struct perf_buffer *rb = ring_buffer_get(event);
> struct user_struct *mmap_user = rb->mmap_user;
> int mmap_locked = rb->mmap_locked;
> unsigned long size = perf_data_size(rb);
> bool detach_rest = false;
>
> - if (event->pmu->event_unmapped)
> - event->pmu->event_unmapped(event, vma->vm_mm);
> + /* FIXIES vs perf_pmu_unregister() */
> + if (unmapped)
> + unmapped(event, vma->vm_mm);
>
> /*
> * The AUX buffer is strictly a sub-buffer, serialize using aux_mutex
> @@ -6680,6 +6733,7 @@ static int perf_mmap(struct file *file,
> unsigned long nr_pages;
> long user_extra = 0, extra = 0;
> int ret, flags = 0;
> + mapped_f mapped;
>
> /*
> * Don't allow mmap() of inherited per-task counters. This would
> @@ -6710,6 +6764,16 @@ static int perf_mmap(struct file *file,
> mutex_lock(&event->mmap_mutex);
> ret = -EINVAL;
>
> + /*
> + * This relies on __pmu_detach_event() taking mmap_mutex after marking
> + * the event REVOKED. Either we observe the state, or __pmu_detach_event()
> + * will detach the rb created here.
> + */
> + if (event->state <= PERF_EVENT_STATE_REVOKED) {
> + ret = -ENODEV;
> + goto unlock;
> + }
> +
> if (vma->vm_pgoff == 0) {
> nr_pages -= 1;
>
> @@ -6888,8 +6952,9 @@ static int perf_mmap(struct file *file,
> if (!ret)
> ret = map_range(rb, vma);
>
> - if (!ret && event->pmu->event_mapped)
> - event->pmu->event_mapped(event, vma->vm_mm);
> + mapped = get_mapped(event, event_mapped);
> + if (mapped)
> + mapped(event, vma->vm_mm);
>
> return ret;
> }
> @@ -6900,6 +6965,9 @@ static int perf_fasync(int fd, struct fi
> struct perf_event *event = filp->private_data;
> int retval;
>
> + if (event->state <= PERF_EVENT_STATE_REVOKED)
> + return -ENODEV;
> +
> inode_lock(inode);
> retval = fasync_helper(fd, filp, on, &event->fasync);
> inode_unlock(inode);
> @@ -10866,6 +10934,9 @@ static int __perf_event_set_bpf_prog(str
> {
> bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
>
> + if (event->state <= PERF_EVENT_STATE_REVOKED)
> + return -ENODEV;
> +
> if (!perf_event_is_tracing(event))
> return perf_event_set_bpf_handler(event, prog, bpf_cookie);
>
> @@ -12049,6 +12120,9 @@ int perf_pmu_register(struct pmu *_pmu,
> if (!pmu->event_idx)
> pmu->event_idx = perf_event_idx_default;
>
> + INIT_LIST_HEAD(&pmu->events);
> + spin_lock_init(&pmu->events_lock);
> +
> /*
> * Now that the PMU is complete, make it visible to perf_try_init_event().
> */
> @@ -12062,21 +12136,143 @@ int perf_pmu_register(struct pmu *_pmu,
> }
> EXPORT_SYMBOL_GPL(perf_pmu_register);
>
> -void perf_pmu_unregister(struct pmu *pmu)
> +static void __pmu_detach_event(struct pmu *pmu, struct perf_event *event,
> + struct perf_event_context *ctx)
> +{
> + /*
> + * De-schedule the event and mark it REVOKED.
> + */
> + perf_event_exit_event(event, ctx, true);
> +
> + /*
> + * All _free_event() bits that rely on event->pmu:
> + *
> + * Notably, perf_mmap() relies on the ordering here.
> + */
> + scoped_guard (mutex, &event->mmap_mutex) {
> + WARN_ON_ONCE(pmu->event_unmapped);
> + /*
> + * Mostly an empty lock sequence, such that perf_mmap(), which
> + * relies on mmap_mutex, is sure to observe the state change.
> + */
> + }
> +
> + perf_event_free_bpf_prog(event);
> + perf_free_addr_filters(event);
> +
> + if (event->destroy) {
> + event->destroy(event);
> + event->destroy = NULL;
> + }
> +
> + if (event->pmu_ctx) {
> + put_pmu_ctx(event->pmu_ctx);
> + event->pmu_ctx = NULL;
> + }
> +
> + exclusive_event_destroy(event);
> + module_put(pmu->module);
> +
> + event->pmu = NULL; /* force fault instead of UAF */
> +}
> +
> +static void pmu_detach_event(struct pmu *pmu, struct perf_event *event)
> +{
> + struct perf_event_context *ctx;
> +
> + ctx = perf_event_ctx_lock(event);
> + __pmu_detach_event(pmu, event, ctx);
> + perf_event_ctx_unlock(event, ctx);
> +
> + scoped_guard (spinlock, &pmu->events_lock)
> + list_del(&event->pmu_list);
> +}
> +
> +static struct perf_event *pmu_get_event(struct pmu *pmu)
> +{
> + struct perf_event *event;
> +
> + guard(spinlock)(&pmu->events_lock);
> + list_for_each_entry(event, &pmu->events, pmu_list) {
> + if (atomic_long_inc_not_zero(&event->refcount))
> + return event;
> + }
> +
> + return NULL;
> +}
> +
> +static bool pmu_empty(struct pmu *pmu)
> +{
> + guard(spinlock)(&pmu->events_lock);
> + return list_empty(&pmu->events);
> +}
> +
> +static void pmu_detach_events(struct pmu *pmu)
> +{
> + struct perf_event *event;
> +
> + for (;;) {
> + event = pmu_get_event(pmu);
> + if (!event)
> + break;
> +
> + pmu_detach_event(pmu, event);
> + put_event(event);
> + }
> +
> + /*
> + * wait for pending _free_event()s
> + */
> + wait_var_event(pmu, pmu_empty(pmu));
> +}
> +
> +int perf_pmu_unregister(struct pmu *pmu)
> {
> scoped_guard (mutex, &pmus_lock) {
> + if (!idr_cmpxchg(&pmu_idr, pmu->type, pmu, NULL))
> + return -EINVAL;
> +
> list_del_rcu(&pmu->entry);
> - idr_remove(&pmu_idr, pmu->type);
> }
>
> /*
> * We dereference the pmu list under both SRCU and regular RCU, so
> * synchronize against both of those.
> + *
> + * Notably, the entirety of event creation, from perf_init_event()
> + * (which will now fail, because of the above) until
> + * perf_install_in_context() should be under SRCU such that
> + * this synchronizes against event creation. This avoids trying to
> + * detach events that are not fully formed.
> */
> synchronize_srcu(&pmus_srcu);
> synchronize_rcu();
>
> + if (pmu->event_unmapped && !pmu_empty(pmu)) {
> + /*
> + * Can't force remove events when pmu::event_unmapped()
> + * is used in perf_mmap_close().
> + */
> + guard(mutex)(&pmus_lock);
> + idr_cmpxchg(&pmu_idr, pmu->type, NULL, pmu);
> + list_add_rcu(&pmu->entry, &pmus);
> + return -EBUSY;
> + }
> +
> + scoped_guard (mutex, &pmus_lock)
> + idr_remove(&pmu_idr, pmu->type);
> +
> + /*
> + * PMU is removed from the pmus list, so no new events will
> + * be created, now take care of the existing ones.
> + */
> + pmu_detach_events(pmu);
> +
> + /*
> + * PMU is unused, make it go away.
> + */
> perf_pmu_free(pmu);
> + return 0;
> }
> EXPORT_SYMBOL_GPL(perf_pmu_unregister);
>
> @@ -12170,7 +12366,7 @@ static struct pmu *perf_init_event(struc
> struct pmu *pmu;
> int type, ret;
>
> - guard(srcu)(&pmus_srcu);
> + guard(srcu)(&pmus_srcu); /* pmu idr/list access */
>
> /*
> * Save original type before calling pmu->event_init() since certain
> @@ -12394,6 +12590,7 @@ perf_event_alloc(struct perf_event_attr
> INIT_LIST_HEAD(&event->active_entry);
> INIT_LIST_HEAD(&event->addr_filters.list);
> INIT_HLIST_NODE(&event->hlist_entry);
> + INIT_LIST_HEAD(&event->pmu_list);
>
>
> init_waitqueue_head(&event->waitq);
> @@ -12561,6 +12758,13 @@ perf_event_alloc(struct perf_event_attr
> /* symmetric to unaccount_event() in _free_event() */
> account_event(event);
>
> + /*
> + * Event creation should be under SRCU, see perf_pmu_unregister().
> + */
> + lockdep_assert_held(&pmus_srcu);
> + scoped_guard (spinlock, &pmu->events_lock)
> + list_add(&event->pmu_list, &pmu->events);
> +
> return_ptr(event);
> }
>
> @@ -12760,6 +12964,9 @@ perf_event_set_output(struct perf_event
> goto unlock;
>
> if (output_event) {
> + if (output_event->state <= PERF_EVENT_STATE_REVOKED)
> + goto unlock;
> +
> /* get the rb we want to redirect to */
> rb = ring_buffer_get(output_event);
> if (!rb)
> @@ -12941,6 +13148,11 @@ SYSCALL_DEFINE5(perf_event_open,
> if (event_fd < 0)
> return event_fd;
>
> + /*
> + * Event creation should be under SRCU, see perf_pmu_unregister().
> + */
> + guard(srcu)(&pmus_srcu);
> +
> CLASS(fd, group)(group_fd); // group_fd == -1 => empty
> if (group_fd != -1) {
> if (!is_perf_file(group)) {
> @@ -12948,6 +13160,10 @@ SYSCALL_DEFINE5(perf_event_open,
> goto err_fd;
> }
> group_leader = fd_file(group)->private_data;
> + if (group_leader->state <= PERF_EVENT_STATE_REVOKED) {
> + err = -ENODEV;
> + goto err_fd;
> + }
> if (flags & PERF_FLAG_FD_OUTPUT)
> output_event = group_leader;
> if (flags & PERF_FLAG_FD_NO_GROUP)
> @@ -13281,6 +13497,11 @@ perf_event_create_kernel_counter(struct
> if (attr->aux_output || attr->aux_action)
> return ERR_PTR(-EINVAL);
>
> + /*
> + * Event creation should be under SRCU, see perf_pmu_unregister().
> + */
> + guard(srcu)(&pmus_srcu);
> +
> event = perf_event_alloc(attr, cpu, task, NULL, NULL,
> overflow_handler, context, -1);
> if (IS_ERR(event)) {
> @@ -13492,10 +13713,14 @@ static void sync_child_event(struct perf
> }
>
> static void
> -perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> +perf_event_exit_event(struct perf_event *event,
> + struct perf_event_context *ctx, bool revoke)
> {
> struct perf_event *parent_event = event->parent;
> - unsigned long detach_flags = 0;
> + unsigned long detach_flags = DETACH_EXIT;
> +
> + if (parent_event == EVENT_TOMBSTONE)
> + parent_event = NULL;
>
> if (parent_event) {
> /*
> @@ -13510,16 +13735,14 @@ perf_event_exit_event(struct perf_event
> * Do destroy all inherited groups, we don't care about those
> * and being thorough is better.
> */
> - detach_flags = DETACH_GROUP | DETACH_CHILD;
> + detach_flags |= DETACH_GROUP | DETACH_CHILD;
> mutex_lock(&parent_event->child_mutex);
> }
>
> - perf_remove_from_context(event, detach_flags);
> + if (revoke)
> + detach_flags |= DETACH_GROUP | DETACH_REVOKE;
>
> - raw_spin_lock_irq(&ctx->lock);
> - if (event->state > PERF_EVENT_STATE_EXIT)
> - perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
> - raw_spin_unlock_irq(&ctx->lock);
> + perf_remove_from_context(event, detach_flags);
>
> /*
> * Child events can be freed.
> @@ -13530,7 +13753,10 @@ perf_event_exit_event(struct perf_event
> * Kick perf_poll() for is_event_hup();
> */
> perf_event_wakeup(parent_event);
> - free_event(event);
> + /*
> + * pmu_detach_event() will have an extra refcount.
> + */
> + put_event(event);
> put_event(parent_event);
> return;
> }
> @@ -13596,7 +13822,7 @@ static void perf_event_exit_task_context
> perf_event_task(task, ctx, 0);
>
> list_for_each_entry_safe(child_event, next, &ctx->event_list, event_entry)
> - perf_event_exit_event(child_event, ctx);
> + perf_event_exit_event(child_event, ctx, false);
>
> mutex_unlock(&ctx->mutex);
>
> @@ -13743,6 +13969,14 @@ inherit_event(struct perf_event *parent_
> if (parent_event->parent)
> parent_event = parent_event->parent;
>
> + if (parent_event->state <= PERF_EVENT_STATE_REVOKED)
> + return NULL;
> +
> + /*
> + * Event creation should be under SRCU, see perf_pmu_unregister().
> + */
> + guard(srcu)(&pmus_srcu);
> +
> child_event = perf_event_alloc(&parent_event->attr,
> parent_event->cpu,
> child,
>
Hi Peter,
It seems this patch breaks task-attached event counting, as the below
command shows.
sudo ./perf stat true
Performance counter stats for 'true':
<not counted> msec task-clock
<not counted> context-switches
<not counted> cpu-migrations
<not counted> page-faults
<not counted> instructions
<not counted> cycles
<not counted> branches
<not counted> branch-misses
0.002615279 seconds time elapsed
0.002771000 seconds user
0.000000000 seconds sys
Thanks,
Dapeng Mi
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-04-14 0:37 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Mi, Dapeng
@ 2025-04-17 8:07 ` Peter Zijlstra
2025-04-17 8:24 ` Mi, Dapeng
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Peter Zijlstra @ 2025-04-17 8:07 UTC (permalink / raw)
To: Mi, Dapeng
Cc: mingo, ravi.bangoria, lucas.demarchi, linux-kernel, acme,
namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, james.clark
On Mon, Apr 14, 2025 at 08:37:07AM +0800, Mi, Dapeng wrote:
> It seems this patch would break the task attached events counting like the
> below command shows.
>
Right, found another report for that yesterday.
---
Subject: perf: Fix perf-stat / read()
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Apr 16 20:50:27 CEST 2025
In the zeal to adjust all event->state checks to include the new
REVOKED state, one adjustment was made in error. Notably it resulted
in read() on the perf filedesc to stop working for any state lower
than ERROR, specifically EXIT.
This leads to problems with (among others) perf-stat, which wants to
read the counts after a program has finished execution.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
Reported-by: James Clark <james.clark@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/77036114-8723-4af9-a068-1d535f4e2e81@linaro.org
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6023,7 +6023,7 @@ __perf_read(struct perf_event *event, ch
* error state (i.e. because it was pinned but it couldn't be
* scheduled on to the CPU at some point).
*/
- if (event->state <= PERF_EVENT_STATE_ERROR)
+ if (event->state == PERF_EVENT_STATE_ERROR)
return 0;
if (count < event->read_size)
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [tip: perf/core] perf: Make perf_pmu_unregister() useable
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
@ 2025-04-17 8:08 ` Peter Zijlstra
2025-04-17 13:01 ` [tip: perf/core] perf/core: Fix event timekeeping merge tip-bot2 for Peter Zijlstra
0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-04-17 8:08 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-tip-commits, Ravi Bangoria, x86, yeoreum.yun
Subject: perf: Fix event timekeeping merge
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Apr 16 11:36:24 CEST 2025
Due to an oversight in merging da916e96e2de ("perf: Make
perf_pmu_unregister() useable") on top of a3c3c66670ce ("perf/core:
Fix child_total_time_enabled accounting bug at task exit") the
timekeeping fix from this latter patch got undone.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2500,14 +2500,14 @@ __perf_remove_from_context(struct perf_e
state = PERF_EVENT_STATE_DEAD;
}
event_sched_out(event, ctx);
+ perf_event_set_state(event, min(event->state, state));
+
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
- event->state = min(event->state, state);
-
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable
2025-04-17 8:07 ` Peter Zijlstra
@ 2025-04-17 8:24 ` Mi, Dapeng
2025-04-17 11:30 ` [tip: perf/core] perf/core: Fix perf-stat / read() tip-bot2 for Peter Zijlstra
2025-04-17 13:01 ` tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 43+ messages in thread
From: Mi, Dapeng @ 2025-04-17 8:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, ravi.bangoria, lucas.demarchi, linux-kernel, acme,
namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, james.clark
On 4/17/2025 4:07 PM, Peter Zijlstra wrote:
> On Mon, Apr 14, 2025 at 08:37:07AM +0800, Mi, Dapeng wrote:
>
>> It seems this patch breaks task-attached event counting, as the below
>> command shows.
>>
> Right, found another report for that yesterday.
>
> ---
> Subject: perf: Fix perf-stat / read()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Apr 16 20:50:27 CEST 2025
>
> In the zeal to adjust all event->state checks to include the new
> REVOKED state, one adjustment was made in error. Notably, it caused
> read() on the perf file descriptor to stop working for any state lower
> than ERROR, specifically EXIT.
>
> This leads to problems with (among others) perf-stat, which wants to
> read the counts after a program has finished execution.
>
> Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
> Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
> Reported-by: James Clark <james.clark@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/77036114-8723-4af9-a068-1d535f4e2e81@linaro.org
> ---
> kernel/events/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6023,7 +6023,7 @@ __perf_read(struct perf_event *event, ch
> * error state (i.e. because it was pinned but it couldn't be
> * scheduled on to the CPU at some point).
> */
> - if (event->state <= PERF_EVENT_STATE_ERROR)
> + if (event->state == PERF_EVENT_STATE_ERROR)
> return 0;
>
> if (count < event->read_size)
Good to know it has been fixed. Thanks.
* [tip: perf/core] perf/core: Fix perf-stat / read()
2025-04-17 8:07 ` Peter Zijlstra
2025-04-17 8:24 ` Mi, Dapeng
@ 2025-04-17 11:30 ` tip-bot2 for Peter Zijlstra
2025-04-17 13:01 ` tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-17 11:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: Mi, Dapeng, James Clark, Peter Zijlstra (Intel), Ingo Molnar, x86,
linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 75195453f7b17ae604dc28a91f19937b1906b333
Gitweb: https://git.kernel.org/tip/75195453f7b17ae604dc28a91f19937b1906b333
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 20:50:27 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 17 Apr 2025 13:24:49 +02:00
perf/core: Fix perf-stat / read()
In the zeal to adjust all event->state checks to include the new
REVOKED state, one adjustment was made in error. Notably, it caused
read() on the perf file descriptor to stop working for any state lower
than ERROR, specifically EXIT.
This leads to problems with (among others) perf-stat, which wants to
read the counts after a program has finished execution.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
Reported-by: James Clark <james.clark@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250417080725.GH38216@noisy.programming.kicks-ass.net
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2eb9cd5..e4d7a0c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6021,7 +6021,7 @@ __perf_read(struct perf_event *event, char __user *buf, size_t count)
* error state (i.e. because it was pinned but it couldn't be
* scheduled on to the CPU at some point).
*/
- if (event->state <= PERF_EVENT_STATE_ERROR)
+ if (event->state == PERF_EVENT_STATE_ERROR)
return 0;
if (count < event->read_size)
* Re: [tip: perf/core] perf: Simplify perf_event_free_task() wait
2025-04-09 13:01 ` Frederic Weisbecker
2025-04-10 9:34 ` Peter Zijlstra
@ 2025-04-17 12:03 ` Ingo Molnar
2025-04-17 13:01 ` [tip: perf/core] perf/core: Fix put_ctx() ordering tip-bot2 for Frederic Weisbecker
2 siblings, 0 replies; 43+ messages in thread
From: Ingo Molnar @ 2025-04-17 12:03 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel),
Ravi Bangoria, x86
* Frederic Weisbecker <frederic@kernel.org> wrote:
> On Tue, Apr 08, 2025 at 07:05:04PM -0000, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the perf/core branch of tip:
> >
> > Commit-ID: 59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
> > Gitweb: https://git.kernel.org/tip/59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
> > Author: Peter Zijlstra <peterz@infradead.org>
> > AuthorDate: Fri, 17 Jan 2025 15:27:07 +01:00
> > Committer: Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Tue, 08 Apr 2025 20:55:46 +02:00
> >
> > perf: Simplify perf_event_free_task() wait
> >
> > Simplify the code by moving the duplicated wakeup condition into
> > put_ctx().
> >
> > Notably, wait_var_event() is in perf_event_free_task() and will have
> > set ctx->task = TASK_TOMBSTONE.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> > Link: https://lkml.kernel.org/r/20250307193723.044499344@infradead.org
> > ---
> > kernel/events/core.c | 25 +++----------------------
> > 1 file changed, 3 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 3c92b75..fa6dab0 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1270,6 +1270,9 @@ static void put_ctx(struct perf_event_context *ctx)
> > if (ctx->task && ctx->task != TASK_TOMBSTONE)
> > put_task_struct(ctx->task);
> > call_rcu(&ctx->rcu_head, free_ctx);
> > + } else if (ctx->task == TASK_TOMBSTONE) {
> > + smp_mb(); /* pairs with wait_var_event() */
> > + wake_up_var(&ctx->refcount);
>
> So there are three situations:
>
> * If perf_event_free_task() has removed all the children from the parent list
> before perf_event_release_kernel() got a chance to even iterate them, then
> it's all good as there is no get_ctx() pending.
>
> * If perf_event_release_kernel() iterates a child event, but it gets freed
> meanwhile by perf_event_free_task() while the mutexes are temporarily
> unlocked, it's all good because when it re-acquires the ctx mutex,
> perf_event_release_kernel() observes TASK_TOMBSTONE.
>
> * But if perf_event_release_kernel() frees the child event before
> perf_event_free_task() got a chance, we may face this scenario:
>
> perf_event_release_kernel() perf_event_free_task()
> -------------------------- ------------------------
> mutex_lock(&event->child_mutex)
> get_ctx(child->ctx)
> mutex_unlock(&event->child_mutex)
>
> mutex_lock(ctx->mutex)
> mutex_lock(&event->child_mutex)
> perf_remove_from_context(child)
> mutex_unlock(&event->child_mutex)
> mutex_unlock(ctx->mutex)
>
> // This lock acquires ctx->refcount == 2
> // visibility
> mutex_lock(ctx->mutex)
> ctx->task = TASK_TOMBSTONE
> mutex_unlock(ctx->mutex)
>
> wait_var_event()
> // enters prepare_to_wait() since
> // ctx->refcount == 2
> // is guaranteed to be seen
> set_current_state(TASK_INTERRUPTIBLE)
> smp_mb()
> if (ctx->refcount != 1)
> schedule()
> put_ctx()
> // NOT fully ordered! Only RELEASE semantics
> refcount_dec_and_test()
> atomic_fetch_sub_release()
> // So TASK_TOMBSTONE is not guaranteed to be seen
> if (ctx->task == TASK_TOMBSTONE)
> wake_up_var()
>
> Basically it's a broken store buffer:
>
> perf_event_release_kernel() perf_event_free_task()
> -------------------------- ------------------------
> ctx->task = TASK_TOMBSTONE smp_store_release(&ctx->refcount, ctx->refcount - 1)
> smp_mb()
> READ_ONCE(ctx->refcount) READ_ONCE(ctx->task)
>
>
> So you need this:
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index fa6dab08be47..c4fbbe25361a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1270,9 +1270,10 @@ static void put_ctx(struct perf_event_context *ctx)
> if (ctx->task && ctx->task != TASK_TOMBSTONE)
> put_task_struct(ctx->task);
> call_rcu(&ctx->rcu_head, free_ctx);
> - } else if (ctx->task == TASK_TOMBSTONE) {
> + } else {
> smp_mb(); /* pairs with wait_var_event() */
> - wake_up_var(&ctx->refcount);
> + if (ctx->task == TASK_TOMBSTONE)
> + wake_up_var(&ctx->refcount);
> }
> }
JFYI, I've added your SOB:
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Thanks,
Ingo
* [tip: perf/core] perf/core: Fix event timekeeping merge
2025-04-17 8:08 ` Peter Zijlstra
@ 2025-04-17 13:01 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-17 13:01 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: b02b41c827de9f4b785f57e82d76d0826cc8398b
Gitweb: https://git.kernel.org/tip/b02b41c827de9f4b785f57e82d76d0826cc8398b
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 11:36:24 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 17 Apr 2025 14:21:15 +02:00
perf/core: Fix event timekeeping merge
Due to an oversight in merging:
da916e96e2de ("perf: Make perf_pmu_unregister() useable")
on top of:
a3c3c66670ce ("perf/core: Fix child_total_time_enabled accounting bug at task exit")
the timekeeping fix from this latter patch got undone.
Redo it.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250417080815.GI38216@noisy.programming.kicks-ass.net
---
kernel/events/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 43d87de..07414cb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2500,14 +2500,14 @@ __perf_remove_from_context(struct perf_event *event,
state = PERF_EVENT_STATE_DEAD;
}
event_sched_out(event, ctx);
+ perf_event_set_state(event, min(event->state, state));
+
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
- event->state = min(event->state, state);
-
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
* [tip: perf/core] perf/core: Fix put_ctx() ordering
2025-04-09 13:01 ` Frederic Weisbecker
2025-04-10 9:34 ` Peter Zijlstra
2025-04-17 12:03 ` Ingo Molnar
@ 2025-04-17 13:01 ` tip-bot2 for Frederic Weisbecker
2 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Frederic Weisbecker @ 2025-04-17 13:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: Frederic Weisbecker, Peter Zijlstra (Intel), Ingo Molnar, x86,
linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 2839f393c69456bc356738e521b2e70b82977f46
Gitweb: https://git.kernel.org/tip/2839f393c69456bc356738e521b2e70b82977f46
Author: Frederic Weisbecker <frederic@kernel.org>
AuthorDate: Wed, 09 Apr 2025 15:01:12 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 17 Apr 2025 14:21:15 +02:00
perf/core: Fix put_ctx() ordering
So there are three situations:
* If perf_event_free_task() has removed all the children from the parent list
before perf_event_release_kernel() got a chance to even iterate them, then
it's all good as there is no get_ctx() pending.
* If perf_event_release_kernel() iterates a child event, but it gets freed
meanwhile by perf_event_free_task() while the mutexes are temporarily
unlocked, it's all good because when it re-acquires the ctx mutex,
perf_event_release_kernel() observes TASK_TOMBSTONE.
* But if perf_event_release_kernel() frees the child event before
perf_event_free_task() got a chance, we may face this scenario:
perf_event_release_kernel() perf_event_free_task()
-------------------------- ------------------------
mutex_lock(&event->child_mutex)
get_ctx(child->ctx)
mutex_unlock(&event->child_mutex)
mutex_lock(ctx->mutex)
mutex_lock(&event->child_mutex)
perf_remove_from_context(child)
mutex_unlock(&event->child_mutex)
mutex_unlock(ctx->mutex)
// This lock acquires ctx->refcount == 2
// visibility
mutex_lock(ctx->mutex)
ctx->task = TASK_TOMBSTONE
mutex_unlock(ctx->mutex)
wait_var_event()
// enters prepare_to_wait() since
// ctx->refcount == 2
// is guaranteed to be seen
set_current_state(TASK_INTERRUPTIBLE)
smp_mb()
if (ctx->refcount != 1)
schedule()
put_ctx()
// NOT fully ordered! Only RELEASE semantics
refcount_dec_and_test()
atomic_fetch_sub_release()
// So TASK_TOMBSTONE is not guaranteed to be seen
if (ctx->task == TASK_TOMBSTONE)
wake_up_var()
Basically it's a broken store buffer:
perf_event_release_kernel() perf_event_free_task()
-------------------------- ------------------------
ctx->task = TASK_TOMBSTONE smp_store_release(&ctx->refcount, ctx->refcount - 1)
smp_mb()
READ_ONCE(ctx->refcount) READ_ONCE(ctx->task)
So we need a smp_mb__after_atomic() before looking at ctx->task.
Fixes: 59f3aa4a3ee2 ("perf: Simplify perf_event_free_task() wait")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/Z_ZvmEhjkAhplCBE@localhost.localdomain
---
kernel/events/core.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e4d7a0c..1a19df9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1271,9 +1271,10 @@ static void put_ctx(struct perf_event_context *ctx)
if (ctx->task && ctx->task != TASK_TOMBSTONE)
put_task_struct(ctx->task);
call_rcu(&ctx->rcu_head, free_ctx);
- } else if (ctx->task == TASK_TOMBSTONE) {
- smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(&ctx->refcount);
+ } else {
+ smp_mb__after_atomic(); /* pairs with wait_var_event() */
+ if (ctx->task == TASK_TOMBSTONE)
+ wake_up_var(&ctx->refcount);
}
}
* [tip: perf/core] perf/core: Fix perf-stat / read()
2025-04-17 8:07 ` Peter Zijlstra
2025-04-17 8:24 ` Mi, Dapeng
2025-04-17 11:30 ` [tip: perf/core] perf/core: Fix perf-stat / read() tip-bot2 for Peter Zijlstra
@ 2025-04-17 13:01 ` tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 43+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-04-17 13:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: Mi, Dapeng, James Clark, Peter Zijlstra (Intel), Ingo Molnar, x86,
linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: f6938a562a6249000de211a710807ebf0b8fdf26
Gitweb: https://git.kernel.org/tip/f6938a562a6249000de211a710807ebf0b8fdf26
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 20:50:27 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 17 Apr 2025 14:21:15 +02:00
perf/core: Fix perf-stat / read()
In the zeal to adjust all event->state checks to include the new
REVOKED state, one adjustment was made in error. Notably, it caused
read() on the perf file descriptor to stop working for any state lower
than ERROR, specifically EXIT.
This leads to problems with (among others) perf-stat, which wants to
read the counts after a program has finished execution.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
Reported-by: James Clark <james.clark@linaro.org>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/77036114-8723-4af9-a068-1d535f4e2e81@linaro.org
Link: https://lore.kernel.org/r/20250417080725.GH38216@noisy.programming.kicks-ass.net
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2eb9cd5..e4d7a0c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6021,7 +6021,7 @@ __perf_read(struct perf_event *event, char __user *buf, size_t count)
* error state (i.e. because it was pinned but it couldn't be
* scheduled on to the CPU at some point).
*/
- if (event->state <= PERF_EVENT_STATE_ERROR)
+ if (event->state == PERF_EVENT_STATE_ERROR)
return 0;
if (count < event->read_size)
Thread overview: 43+ messages
2025-03-07 19:33 [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 1/7] perf: Ensure bpf_perf_link path is properly serialized Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 2/7] perf: Simplify child event tear-down Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 3/7] perf: Simplify perf_event_free_task() wait Peter Zijlstra
2025-03-17 6:49 ` Ravi Bangoria
2025-04-02 9:15 ` Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-04-09 13:01 ` Frederic Weisbecker
2025-04-10 9:34 ` Peter Zijlstra
2025-04-10 9:45 ` Frederic Weisbecker
2025-04-17 12:03 ` Ingo Molnar
2025-04-17 13:01 ` [tip: perf/core] perf/core: Fix put_ctx() ordering tip-bot2 for Frederic Weisbecker
2025-03-07 19:33 ` [PATCH v3 4/7] perf: Simplify perf_event_release_kernel() Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 5/7] perf: Unify perf_event_free_task() / perf_event_exit_task_context() Peter Zijlstra
2025-03-10 15:35 ` [PATCH v3a " Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 6/7] perf: Rename perf_event_exit_task(.child) Peter Zijlstra
2025-03-10 11:08 ` Ravi Bangoria
2025-03-10 14:47 ` Peter Zijlstra
2025-03-10 15:20 ` Ravi Bangoria
2025-03-10 15:27 ` Peter Zijlstra
2025-03-10 15:37 ` [PATCH v3a " Peter Zijlstra
2025-03-12 6:31 ` Ravi Bangoria
2025-03-12 10:16 ` Peter Zijlstra
2025-03-07 19:33 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Peter Zijlstra
2025-03-10 15:35 ` Ravi Bangoria
2025-03-10 16:14 ` Peter Zijlstra
2025-03-10 16:46 ` Ravi Bangoria
2025-03-12 12:57 ` Peter Zijlstra
2025-03-12 13:57 ` Ravi Bangoria
2025-04-08 19:05 ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2025-04-17 8:08 ` Peter Zijlstra
2025-04-17 13:01 ` [tip: perf/core] perf/core: Fix event timekeeping merge tip-bot2 for Peter Zijlstra
2025-04-14 0:37 ` [PATCH v3 7/7] perf: Make perf_pmu_unregister() useable Mi, Dapeng
2025-04-17 8:07 ` Peter Zijlstra
2025-04-17 8:24 ` Mi, Dapeng
2025-04-17 11:30 ` [tip: perf/core] perf/core: Fix perf-stat / read() tip-bot2 for Peter Zijlstra
2025-04-17 13:01 ` tip-bot2 for Peter Zijlstra
2025-03-17 6:54 ` [PATCH v3 0/7] perf: Make perf_pmu_unregister() usable Ravi Bangoria
2025-04-08 19:05 ` [tip: perf/core] perf: Rename perf_event_exit_task(.child) tip-bot2 for Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] perf: Simplify child event tear-down tip-bot2 for Peter Zijlstra
2025-04-08 19:05 ` [tip: perf/core] perf: Ensure bpf_perf_link path is properly serialized tip-bot2 for Peter Zijlstra