* [PATCH 0/4] perf fixes
@ 2025-04-24 16:11 Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent Frederic Weisbecker
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-24 16:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Frederic Weisbecker, Namhyung Kim, Adrian Hunter,
Ravi Bangoria, Jiri Olsa, Ian Rogers, linux-perf-users,
Mark Rutland, Alexander Shishkin, Liang, Kan,
Arnaldo Carvalho de Melo
Hi,
A bunch of fixes against perf/core, found while staring at the latest perf changes.
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
perf/core
HEAD: 623d84ef5ee7d4cefa546de630fec805e4634335
Thanks,
Frederic
---
Frederic Weisbecker (4):
perf: Fix failing inherit_event() doing extra refcount decrement on parent
perf: Fix irq work dereferencing garbage
perf: Remove too early and redundant CPU hotplug handling
perf: Fix confusing aux iteration
include/linux/cpuhotplug.h | 1 -
kernel/cpu.c | 5 ---
kernel/events/core.c | 109 ++++++++++++++++++++++++++++-----------------
3 files changed, 67 insertions(+), 48 deletions(-)
* [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent
2025-04-24 16:11 [PATCH 0/4] perf fixes Frederic Weisbecker
@ 2025-04-24 16:11 ` Frederic Weisbecker
2025-04-24 16:16 ` Peter Zijlstra
2025-04-24 16:11 ` [PATCH 2/4] perf: Fix irq work dereferencing garbage Frederic Weisbecker
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-24 16:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Frederic Weisbecker, Liang, Kan, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
Jiri Olsa, Mark Rutland, Namhyung Kim, Ravi Bangoria,
linux-perf-users
When inherit_event() fails after the child allocation but before the
parent refcount has been incremented, calling put_event() wrongly
decrements the reference to the parent, risking freeing it too early.
Also pmu_get_event() can't be holding a reference to the child
concurrently at this point, since that path runs under the pmus_srcu
critical section.
Fix this by restoring the deleted free_event() function and calling it
on the failing child in order to free it directly, under the verified
assumption that its refcount is only 1. The refcount decrement on the
parent is then deliberately omitted.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/events/core.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 07414cb1279b..7bcb02ffb93a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5627,6 +5627,22 @@ static void _free_event(struct perf_event *event)
__free_event(event);
}
+/*
+ * Used to free events which have a known refcount of 1, such as in error paths
+ * of inherited events.
+ */
+static void free_event(struct perf_event *event)
+{
+ if (WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1,
+ "unexpected event refcount: %ld; ptr=%p\n",
+ atomic_long_read(&event->refcount), event)) {
+ /* leak to avoid use-after-free */
+ return;
+ }
+
+ _free_event(event);
+}
+
/*
* Remove user event from the owner task.
*/
@@ -14184,7 +14200,7 @@ inherit_event(struct perf_event *parent_event,
pmu_ctx = find_get_pmu_context(child_event->pmu, child_ctx, child_event);
if (IS_ERR(pmu_ctx)) {
- put_event(child_event);
+ free_event(child_event);
return ERR_CAST(pmu_ctx);
}
child_event->pmu_ctx = pmu_ctx;
@@ -14199,7 +14215,7 @@ inherit_event(struct perf_event *parent_event,
if (is_orphaned_event(parent_event) ||
!atomic_long_inc_not_zero(&parent_event->refcount)) {
mutex_unlock(&parent_event->child_mutex);
- put_event(child_event);
+ free_event(child_event);
return NULL;
}
--
2.48.1
* [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-04-24 16:11 [PATCH 0/4] perf fixes Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent Frederic Weisbecker
@ 2025-04-24 16:11 ` Frederic Weisbecker
2025-04-24 16:30 ` Peter Zijlstra
2025-04-24 16:11 ` [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 4/4] perf: Fix confusing aux iteration Frederic Weisbecker
3 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-24 16:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Frederic Weisbecker, Liang, Kan, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
Jiri Olsa, Mark Rutland, Namhyung Kim, Ravi Bangoria,
linux-perf-users
The following commit:
da916e96e2de ("perf: Make perf_pmu_unregister() useable")
has introduced two significant changes to an event's parent lifecycle:
1) An event that has exited now has EVENT_TOMBSTONE as a parent.
This can result in a situation where the delayed wakeup irq_work can
accidentally dereference EVENT_TOMBSTONE on:
CPU 0 CPU 1
----- -----
__schedule()
local_irq_disable()
rq_lock()
<NMI>
perf_event_overflow()
irq_work_queue(&child->pending_irq)
</NMI>
perf_event_task_sched_out()
raw_spin_lock(&ctx->lock)
ctx_sched_out()
ctx->is_active = 0
event_sched_out(child)
raw_spin_unlock(&ctx->lock)
perf_event_release_kernel(parent)
perf_remove_from_context(child)
raw_spin_lock_irq(&ctx->lock)
// Sees !ctx->is_active
// Removes from context inline
__perf_remove_from_context(child)
perf_child_detach(child)
event->parent = EVENT_TOMBSTONE
raw_spin_rq_unlock_irq(rq);
<IRQ>
perf_pending_irq()
perf_event_wakeup(child)
ring_buffer_wakeup(child)
rcu_dereference(child->parent->rb) <--- CRASH
This also concerns the call to kill_fasync() on parent->fasync.
2) The final parent reference count decrement can now happen before the
final child reference count decrement, i.e. the parent can now
be freed before its child. On PREEMPT_RT, this can result in a
situation where the delayed wakeup irq_work can accidentally
dereference a freed parent:
CPU 0 CPU 1 CPU 2
----- ----- ------
perf_pmu_unregister()
pmu_detach_events()
pmu_get_event()
atomic_long_inc_not_zero(&child->refcount)
<NMI>
perf_event_overflow()
irq_work_queue(&child->pending_irq);
</NMI>
<IRQ>
irq_work_run()
wake_irq_workd()
</IRQ>
preempt_schedule_irq()
=========> SWITCH to workd
irq_work_run_list()
perf_pending_irq()
perf_event_wakeup(child)
ring_buffer_wakeup(child)
event = child->parent
perf_event_release_kernel(parent)
// Not last ref, PMU holds it
put_event(child)
// Last ref
put_event(parent)
free_event()
call_rcu(...)
rcu_core()
free_event_rcu()
rcu_dereference(event->rb) <--- CRASH
This also concerns the call to kill_fasync() on parent->fasync.
The "easy" solution to 1) is to check that event->parent is not
EVENT_TOMBSTONE on perf_event_wakeup() (including both ring buffer
and fasync uses).
The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
run under rcu_read_lock().
However because of 2), sanity would prescribe to make event::parent
an __rcu pointer and annotate each and every users to prove they are
reliable.
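For illustration, that rejected rcu_read_lock() variant would have looked
roughly like this (a sketch only, with the body of perf_event_wakeup()
abridged):

	static void perf_event_wakeup(struct perf_event *event)
	{
		/*
		 * Pin event->parent (and therefore parent->rb) for the whole
		 * wakeup, covering both the ring buffer and fasync uses.
		 */
		rcu_read_lock();
		ring_buffer_wakeup(event);

		if (event->pending_kill) {
			kill_fasync(perf_event_fasync(event), SIGIO,
				    event->pending_kill);
			event->pending_kill = 0;
		}
		rcu_read_unlock();
	}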
Propose an alternate solution instead: restore the stable pointer to the
parent until all its children have called _free_event() themselves, to
avoid any further accident. Also revert the EVENT_TOMBSTONE design
that is mostly here to determine which caller of perf_event_exit_event()
must perform the refcount decrement on a child event matching the
increment in inherit_event().
Arrange instead for checking the attach state of an event prior to its
removal and decrementing the refcount of the child accordingly.
Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/events/core.c | 87 ++++++++++++++++++++++++--------------------
1 file changed, 48 insertions(+), 39 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7bcb02ffb93a..968a1d14bc8b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -208,7 +208,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
}
#define TASK_TOMBSTONE ((void *)-1L)
-#define EVENT_TOMBSTONE ((void *)-1L)
static bool is_kernel_event(struct perf_event *event)
{
@@ -2338,12 +2337,6 @@ static void perf_child_detach(struct perf_event *event)
sync_child_event(event);
list_del_init(&event->child_list);
- /*
- * Cannot set to NULL, as that would confuse the situation vs
- * not being a child event. See for example unaccount_event().
- */
- event->parent = EVENT_TOMBSTONE;
- put_event(parent_event);
}
static bool is_orphaned_event(struct perf_event *event)
@@ -2469,6 +2462,11 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
#define DETACH_REVOKE 0x08UL
#define DETACH_DEAD 0x10UL
+struct perf_remove_data {
+ unsigned int detach_flags;
+ unsigned int old_state;
+};
+
/*
* Cross CPU call to remove a performance event
*
@@ -2483,28 +2481,30 @@ __perf_remove_from_context(struct perf_event *event,
{
struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
enum perf_event_state state = PERF_EVENT_STATE_OFF;
- unsigned long flags = (unsigned long)info;
+ struct perf_remove_data *prd = info;
ctx_time_update(cpuctx, ctx);
+ prd->old_state = event->attach_state;
+
/*
* Ensure event_sched_out() switches to OFF, at the very least
* this avoids raising perf_pending_task() at this time.
*/
- if (flags & DETACH_EXIT)
+ if (prd->detach_flags & DETACH_EXIT)
state = PERF_EVENT_STATE_EXIT;
- if (flags & DETACH_REVOKE)
+ if (prd->detach_flags & DETACH_REVOKE)
state = PERF_EVENT_STATE_REVOKED;
- if (flags & DETACH_DEAD) {
+ if (prd->detach_flags & DETACH_DEAD) {
event->pending_disable = 1;
state = PERF_EVENT_STATE_DEAD;
}
event_sched_out(event, ctx);
perf_event_set_state(event, min(event->state, state));
- if (flags & DETACH_GROUP)
+ if (prd->detach_flags & DETACH_GROUP)
perf_group_detach(event);
- if (flags & DETACH_CHILD)
+ if (prd->detach_flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
@@ -2541,7 +2541,7 @@ __perf_remove_from_context(struct perf_event *event,
* When called from perf_event_exit_task, it's OK because the
* context has been detached from its task.
*/
-static void perf_remove_from_context(struct perf_event *event, unsigned long flags)
+static void perf_remove_from_context(struct perf_event *event, struct perf_remove_data *prd)
{
struct perf_event_context *ctx = event->ctx;
@@ -2555,13 +2555,13 @@ static void perf_remove_from_context(struct perf_event *event, unsigned long fla
raw_spin_lock_irq(&ctx->lock);
if (!ctx->is_active) {
__perf_remove_from_context(event, this_cpu_ptr(&perf_cpu_context),
- ctx, (void *)flags);
+ ctx, (void *)prd);
raw_spin_unlock_irq(&ctx->lock);
return;
}
raw_spin_unlock_irq(&ctx->lock);
- event_function_call(event, __perf_remove_from_context, (void *)flags);
+ event_function_call(event, __perf_remove_from_context, (void *)prd);
}
/*
@@ -5705,7 +5705,7 @@ static void put_event(struct perf_event *event)
_free_event(event);
/* Matches the refcount bump in inherit_event() */
- if (parent && parent != EVENT_TOMBSTONE)
+ if (parent)
put_event(parent);
}
@@ -5718,6 +5718,7 @@ int perf_event_release_kernel(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
struct perf_event *child, *tmp;
+ struct perf_remove_data prd = { .old_state = 0 };
/*
* If we got here through err_alloc: free_event(event); we will not
@@ -5747,7 +5748,8 @@ int perf_event_release_kernel(struct perf_event *event)
* child events.
*/
if (event->state > PERF_EVENT_STATE_REVOKED) {
- perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+ prd.detach_flags = DETACH_GROUP | DETACH_DEAD;
+ perf_remove_from_context(event, &prd);
} else {
event->state = PERF_EVENT_STATE_DEAD;
}
@@ -5789,7 +5791,8 @@ int perf_event_release_kernel(struct perf_event *event)
tmp = list_first_entry_or_null(&event->child_list,
struct perf_event, child_list);
if (tmp == child) {
- perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
+ prd.detach_flags = DETACH_GROUP | DETACH_CHILD;
+ perf_remove_from_context(child, &prd);
} else {
child = NULL;
}
@@ -13583,11 +13586,12 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (move_group) {
- perf_remove_from_context(group_leader, 0);
+ struct perf_remove_data prd = { 0 };
+ perf_remove_from_context(group_leader, &prd);
put_pmu_ctx(group_leader->pmu_ctx);
for_each_sibling_event(sibling, group_leader) {
- perf_remove_from_context(sibling, 0);
+ perf_remove_from_context(sibling, &prd);
put_pmu_ctx(sibling->pmu_ctx);
}
@@ -13789,14 +13793,15 @@ static void __perf_pmu_remove(struct perf_event_context *ctx,
struct list_head *events)
{
struct perf_event *event, *sibling;
+ struct perf_remove_data prd = { 0 };
perf_event_groups_for_cpu_pmu(event, groups, cpu, pmu) {
- perf_remove_from_context(event, 0);
+ perf_remove_from_context(event, &prd);
put_pmu_ctx(event->pmu_ctx);
list_add(&event->migrate_entry, events);
for_each_sibling_event(sibling, event) {
- perf_remove_from_context(sibling, 0);
+ perf_remove_from_context(sibling, &prd);
put_pmu_ctx(sibling->pmu_ctx);
list_add(&sibling->migrate_entry, events);
}
@@ -13921,11 +13926,7 @@ perf_event_exit_event(struct perf_event *event,
struct perf_event_context *ctx, bool revoke)
{
struct perf_event *parent_event = event->parent;
- unsigned long detach_flags = DETACH_EXIT;
- bool is_child = !!parent_event;
-
- if (parent_event == EVENT_TOMBSTONE)
- parent_event = NULL;
+ struct perf_remove_data prd = { .detach_flags = DETACH_EXIT };
if (parent_event) {
/*
@@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
* Do destroy all inherited groups, we don't care about those
* and being thorough is better.
*/
- detach_flags |= DETACH_GROUP | DETACH_CHILD;
+ prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
mutex_lock(&parent_event->child_mutex);
}
if (revoke)
- detach_flags |= DETACH_GROUP | DETACH_REVOKE;
+ prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
- perf_remove_from_context(event, detach_flags);
+ perf_remove_from_context(event, &prd);
/*
* Child events can be freed.
*/
- if (is_child) {
- if (parent_event) {
- mutex_unlock(&parent_event->child_mutex);
- /*
- * Kick perf_poll() for is_event_hup();
- */
- perf_event_wakeup(parent_event);
+ if (parent_event) {
+ mutex_unlock(&parent_event->child_mutex);
+ /*
+ * Kick perf_poll() for is_event_hup();
+ */
+ perf_event_wakeup(parent_event);
+
+ /*
+ * Match the refcount initialization. Make sure it doesn't happen
+ * twice if pmu_detach_event() calls it on an already exited task.
+ */
+ if (prd.old_state & PERF_ATTACH_CHILD) {
/*
* pmu_detach_event() will have an extra refcount.
+ * perf_pending_task() might have one too.
*/
put_event(event);
}
+
return;
}
@@ -14532,13 +14540,14 @@ static void perf_swevent_init_cpu(unsigned int cpu)
static void __perf_event_exit_context(void *__info)
{
struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+ struct perf_remove_data prd = { .detach_flags = DETACH_GROUP };
struct perf_event_context *ctx = __info;
struct perf_event *event;
raw_spin_lock(&ctx->lock);
ctx_sched_out(ctx, NULL, EVENT_TIME);
list_for_each_entry(event, &ctx->event_list, event_entry)
- __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
+ __perf_remove_from_context(event, cpuctx, ctx, (void *)&prd);
raw_spin_unlock(&ctx->lock);
}
--
2.48.1
* [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling
2025-04-24 16:11 [PATCH 0/4] perf fixes Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 2/4] perf: Fix irq work dereferencing garbage Frederic Weisbecker
@ 2025-04-24 16:11 ` Frederic Weisbecker
2025-04-24 16:32 ` Peter Zijlstra
2025-04-24 16:11 ` [PATCH 4/4] perf: Fix confusing aux iteration Frederic Weisbecker
3 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-24 16:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Frederic Weisbecker, Liang, Kan, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
Jiri Olsa, Mark Rutland, Namhyung Kim, Ravi Bangoria,
linux-perf-users
The CPU hotplug handlers are called twice: at the prepare and the online stages.
Their role is to:
1) Enable/disable a CPU context. This is irrelevant and even buggy at
the prepare stage because the CPU is still offline. On early
secondary CPU up, creating an event attached to that CPU might
silently fail because the CPU context is observed as online but the
context installation's IPI failure is ignored.
2) Update the scope cpumasks and re-migrate the events accordingly in
the CPU down case. This is irrelevant at the prepare stage.
3) Remove the events attached to the context of the offlining CPU. It
even uses an (unnecessary) IPI for it. This is also irrelevant at the
prepare stage.
Also, none of the *_PREPARE and *_STARTING architecture perf-related CPU
hotplug callbacks rely on CPUHP_PERF_PREPARE.
CPUHP_AP_PERF_ONLINE is enough and the right place to perform the work.
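For reference, the surviving hook is the online-stage entry in kernel/cpu.c,
which looks roughly like this:

	[CPUHP_AP_PERF_ONLINE] = {
		.name			= "perf:online",
		.startup.single		= perf_event_init_cpu,
		.teardown.single	= perf_event_exit_cpu,
	},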
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuhotplug.h | 1 -
kernel/cpu.c | 5 -----
2 files changed, 6 deletions(-)
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 1987400000b4..df366ee15456 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -60,7 +60,6 @@ enum cpuhp_state {
/* PREPARE section invoked on a control CPU */
CPUHP_OFFLINE = 0,
CPUHP_CREATE_THREADS,
- CPUHP_PERF_PREPARE,
CPUHP_PERF_X86_PREPARE,
CPUHP_PERF_X86_AMD_UNCORE_PREP,
CPUHP_PERF_POWER,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b08bb34b1718..a59e009e0be4 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2069,11 +2069,6 @@ static struct cpuhp_step cpuhp_hp_states[] = {
.teardown.single = NULL,
.cant_stop = true,
},
- [CPUHP_PERF_PREPARE] = {
- .name = "perf:prepare",
- .startup.single = perf_event_init_cpu,
- .teardown.single = perf_event_exit_cpu,
- },
[CPUHP_RANDOM_PREPARE] = {
.name = "random:prepare",
.startup.single = random_prepare_cpu,
--
2.48.1
* [PATCH 4/4] perf: Fix confusing aux iteration
2025-04-24 16:11 [PATCH 0/4] perf fixes Frederic Weisbecker
` (2 preceding siblings ...)
2025-04-24 16:11 ` [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling Frederic Weisbecker
@ 2025-04-24 16:11 ` Frederic Weisbecker
3 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-24 16:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Frederic Weisbecker, Liang, Kan, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
Jiri Olsa, Mark Rutland, Namhyung Kim, Ravi Bangoria,
linux-perf-users
When an event tears down all the links to it as an aux, the iteration
happens over the siblings of the event's group leader instead of the
event's own siblings.
If the event is a group leader, this has no effect because the event is
also its own group leader. But otherwise there would be a risk of
detaching the sibling events from the wrong group leader.
It just happens to work because each sibling's aux link is tested
against the right event before proceeding. Also the ctx lock is the same
for the events and their group leader so the iteration is safe.
Yet the iteration is confusing. Clarify the actual intent.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 968a1d14bc8b..0d25bde536c9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2171,7 +2171,7 @@ static void perf_put_aux_event(struct perf_event *event)
* If the event is an aux_event, tear down all links to
* it from other events.
*/
- for_each_sibling_event(iter, event->group_leader) {
+ for_each_sibling_event(iter, event) {
if (iter->aux_event != event)
continue;
--
2.48.1
* Re: [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent
2025-04-24 16:11 ` [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent Frederic Weisbecker
@ 2025-04-24 16:16 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2025-04-24 16:16 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
On Thu, Apr 24, 2025 at 06:11:25PM +0200, Frederic Weisbecker wrote:
> When inherit_event() fails after the child allocation but before the
> parent refcount has been incremented, calling put_event() wrongly
> decrements the reference to the parent, risking freeing it too early.
>
> Also pmu_get_event() can't be holding a reference to the child
> concurrently at this point, since that path runs under the pmus_srcu
> critical section.
>
> Fix this by restoring the deleted free_event() function and calling it
> on the failing child in order to free it directly, under the verified
> assumption that its refcount is only 1. The refcount decrement on the
> parent is then deliberately omitted.
>
> Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Bah, yes, another bad interaction between your fix and my patch :/
* Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-04-24 16:11 ` [PATCH 2/4] perf: Fix irq work dereferencing garbage Frederic Weisbecker
@ 2025-04-24 16:30 ` Peter Zijlstra
2025-04-28 11:11 ` Frederic Weisbecker
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2025-04-24 16:30 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> The following commit:
>
> da916e96e2de ("perf: Make perf_pmu_unregister() useable")
>
> has introduced two significant changes to an event's parent lifecycle:
>
> 1) An event that has exited now has EVENT_TOMBSTONE as a parent.
> This can result in a situation where the delayed wakeup irq_work can
> accidentally dereference EVENT_TOMBSTONE on:
>
> CPU 0 CPU 1
> ----- -----
>
> __schedule()
> local_irq_disable()
> rq_lock()
> <NMI>
> perf_event_overflow()
> irq_work_queue(&child->pending_irq)
> </NMI>
> perf_event_task_sched_out()
> raw_spin_lock(&ctx->lock)
> ctx_sched_out()
> ctx->is_active = 0
> event_sched_out(child)
> raw_spin_unlock(&ctx->lock)
> perf_event_release_kernel(parent)
> perf_remove_from_context(child)
> raw_spin_lock_irq(&ctx->lock)
> // Sees !ctx->is_active
> // Removes from context inline
> __perf_remove_from_context(child)
> perf_child_detach(child)
> event->parent = EVENT_TOMBSTONE
> raw_spin_rq_unlock_irq(rq);
> <IRQ>
> perf_pending_irq()
> perf_event_wakeup(child)
> ring_buffer_wakeup(child)
> rcu_dereference(child->parent->rb) <--- CRASH
>
> This also concerns the call to kill_fasync() on parent->fasync.
Argh, I actually looked for this case and didn't find it in one of the
earlier fixes :/
> 2) The final parent reference count decrement can now happen before the
> final child reference count decrement, i.e. the parent can now
> be freed before its child. On PREEMPT_RT, this can result in a
> situation where the delayed wakeup irq_work can accidentally
> dereference a freed parent:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- ------
>
> perf_pmu_unregister()
> pmu_detach_events()
> pmu_get_event()
> atomic_long_inc_not_zero(&child->refcount)
>
> <NMI>
> perf_event_overflow()
> irq_work_queue(&child->pending_irq);
> </NMI>
> <IRQ>
> irq_work_run()
> wake_irq_workd()
> </IRQ>
> preempt_schedule_irq()
> =========> SWITCH to workd
> irq_work_run_list()
> perf_pending_irq()
> perf_event_wakeup(child)
> ring_buffer_wakeup(child)
> event = child->parent
>
> perf_event_release_kernel(parent)
> // Not last ref, PMU holds it
> put_event(child)
> // Last ref
> put_event(parent)
> free_event()
> call_rcu(...)
> rcu_core()
> free_event_rcu()
>
> rcu_dereference(event->rb) <--- CRASH
>
> This also concerns the call to kill_fasync() on parent->fasync.
>
> The "easy" solution to 1) is to check that event->parent is not
> EVENT_TOMBSTONE on perf_event_wakeup() (including both ring buffer
> and fasync uses).
>
> The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
> run under rcu_read_lock().
>
> However, because of 2), sanity would prescribe making event::parent
> an __rcu pointer and annotating each and every user to prove they are
> reliable.
>
> Propose an alternate solution instead: restore the stable pointer to the
> parent until all its children have called _free_event() themselves, to
> avoid any further accident. Also revert the EVENT_TOMBSTONE design
> that is mostly here to determine which caller of perf_event_exit_event()
> must perform the refcount decrement on a child event matching the
> increment in inherit_event().
>
> Arrange instead for checking the attach state of an event prior to its
> removal and decrementing the refcount of the child accordingly.
Urgh, brain hurts, will have to look again tomorrow.
> @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
> * Do destroy all inherited groups, we don't care about those
> * and being thorough is better.
> */
> - detach_flags |= DETACH_GROUP | DETACH_CHILD;
> + prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
> mutex_lock(&parent_event->child_mutex);
> }
>
> if (revoke)
> - detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> + prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
>
> - perf_remove_from_context(event, detach_flags);
> + perf_remove_from_context(event, &prd);
Isn't all this waay too complicated?
That is, to modify state we need both ctx->mutex and ctx->lock, and this
is what __perf_remove_from_context() has, but because of this, holding
either one of those locks is sufficient to read the state -- it cannot
change.
And here we already hold ctx->mutex.
So can't we simply do:
old_state = event->attach_state;
perf_remove_from_context(event, detach_flags);
// do whatever with old_state
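Concretely, a rough and untested sketch of that idea for
perf_event_exit_event(), assuming ctx->mutex is held as described:

	unsigned int old_state = event->attach_state;

	perf_remove_from_context(event, detach_flags);

	if (parent_event && (old_state & PERF_ATTACH_CHILD)) {
		/* Matches the refcount bump in inherit_event() */
		put_event(event);
	}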
> /*
> * Child events can be freed.
> */
> - if (is_child) {
> - if (parent_event) {
> - mutex_unlock(&parent_event->child_mutex);
> - /*
> - * Kick perf_poll() for is_event_hup();
> - */
> - perf_event_wakeup(parent_event);
> + if (parent_event) {
> + mutex_unlock(&parent_event->child_mutex);
> + /*
> + * Kick perf_poll() for is_event_hup();
> + */
> + perf_event_wakeup(parent_event);
> +
> + /*
> + * Match the refcount initialization. Make sure it doesn't happen
> + * twice if pmu_detach_event() calls it on an already exited task.
> + */
> + if (prd.old_state & PERF_ATTACH_CHILD) {
> /*
> * pmu_detach_event() will have an extra refcount.
> + * perf_pending_task() might have one too.
> */
> put_event(event);
> }
> +
> return;
> }
>
* Re: [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling
2025-04-24 16:11 ` [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling Frederic Weisbecker
@ 2025-04-24 16:32 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2025-04-24 16:32 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
On Thu, Apr 24, 2025 at 06:11:27PM +0200, Frederic Weisbecker wrote:
> The CPU hotplug handlers are called twice: at the prepare and the online stages.
>
> Their role is to:
>
> 1) Enable/disable a CPU context. This is irrelevant and even buggy at
> the prepare stage because the CPU is still offline. On early
> secondary CPU up, creating an event attached to that CPU might
> silently fail because the CPU context is observed as online but the
> context installation's IPI failure is ignored.
>
> 2) Update the scope cpumasks and re-migrate the events accordingly in
> the CPU down case. This is irrelevant at the prepare stage.
>
> 3) Remove the events attached to the context of the offlining CPU. It
> even uses an (unnecessary) IPI for it. This is also irrelevant at the
> prepare stage.
>
> Also, none of the *_PREPARE and *_STARTING architecture perf-related CPU
> hotplug callbacks rely on CPUHP_PERF_PREPARE.
>
> CPUHP_AP_PERF_ONLINE is enough and the right place to perform the work.
Oh hey, that's curious indeed.
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/cpuhotplug.h | 1 -
> kernel/cpu.c | 5 -----
> 2 files changed, 6 deletions(-)
>
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 1987400000b4..df366ee15456 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -60,7 +60,6 @@ enum cpuhp_state {
> /* PREPARE section invoked on a control CPU */
> CPUHP_OFFLINE = 0,
> CPUHP_CREATE_THREADS,
> - CPUHP_PERF_PREPARE,
> CPUHP_PERF_X86_PREPARE,
> CPUHP_PERF_X86_AMD_UNCORE_PREP,
> CPUHP_PERF_POWER,
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index b08bb34b1718..a59e009e0be4 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2069,11 +2069,6 @@ static struct cpuhp_step cpuhp_hp_states[] = {
> .teardown.single = NULL,
> .cant_stop = true,
> },
> - [CPUHP_PERF_PREPARE] = {
> - .name = "perf:prepare",
> - .startup.single = perf_event_init_cpu,
> - .teardown.single = perf_event_exit_cpu,
> - },
> [CPUHP_RANDOM_PREPARE] = {
> .name = "random:prepare",
> .startup.single = random_prepare_cpu,
> --
> 2.48.1
>
* Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-04-24 16:30 ` Peter Zijlstra
@ 2025-04-28 11:11 ` Frederic Weisbecker
2025-05-02 10:29 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2025-04-28 11:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
Le Thu, Apr 24, 2025 at 06:30:24PM +0200, Peter Zijlstra a écrit :
> On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> > @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
> > * Do destroy all inherited groups, we don't care about those
> > * and being thorough is better.
> > */
> > - detach_flags |= DETACH_GROUP | DETACH_CHILD;
> > + prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
> > mutex_lock(&parent_event->child_mutex);
> > }
> >
> > if (revoke)
> > - detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> > + prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> >
> > - perf_remove_from_context(event, detach_flags);
> > + perf_remove_from_context(event, &prd);
>
> Isn't all this waay too complicated?
>
> That is, to modify state we need both ctx->mutex and ctx->lock, and this
> is what __perf_remove_from_context() has, but because of this, holding
> either one of those locks is sufficient to read the state -- it cannot
> change.
>
> And here we already hold ctx->mutex.
>
> So can't we simply do:
>
> old_state = event->attach_state;
> perf_remove_from_context(event, detach_flags);
>
> // do whatever with old_state
Right, the locking scenario is just a bit more complicated.
Most flags are set on init or with both ctx mutex and lock.
But:
_ PERF_ATTACH_CHILD is set instead with the parent's child_mutex and the ctx lock held.
_ PERF_ATTACH_ITRACE is set from pmu::start(), thus from the event context
with just interrupts disabled. It's probably enough to synchronize against
initialization and remove_from_context IPIs but perf_event_exit_event() needs
some care.
So we must hold both the ctx mutex and the child_mutex (although the pmus_srcu
thing should make that PERF_ATTACH_CHILD state visible, but let's keep things
obvious). And also use WRITE_ONCE() / READ_ONCE() to take care of PERF_ATTACH_ITRACE,
which we don't care about anyway.
Now this looks like this:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7bcb02ffb93a..7278ca731a55 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -208,7 +208,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
}
#define TASK_TOMBSTONE ((void *)-1L)
-#define EVENT_TOMBSTONE ((void *)-1L)
static bool is_kernel_event(struct perf_event *event)
{
@@ -2338,12 +2337,6 @@ static void perf_child_detach(struct perf_event *event)
sync_child_event(event);
list_del_init(&event->child_list);
- /*
- * Cannot set to NULL, as that would confuse the situation vs
- * not being a child event. See for example unaccount_event().
- */
- event->parent = EVENT_TOMBSTONE;
- put_event(parent_event);
}
static bool is_orphaned_event(struct perf_event *event)
@@ -5705,7 +5698,7 @@ static void put_event(struct perf_event *event)
_free_event(event);
/* Matches the refcount bump in inherit_event() */
- if (parent && parent != EVENT_TOMBSTONE)
+ if (parent)
put_event(parent);
}
@@ -9998,7 +9991,7 @@ void perf_event_text_poke(const void *addr, const void *old_bytes,
void perf_event_itrace_started(struct perf_event *event)
{
- event->attach_state |= PERF_ATTACH_ITRACE;
+ WRITE_ONCE(event->attach_state, event->attach_state | PERF_ATTACH_ITRACE);
}
static void perf_log_itrace_start(struct perf_event *event)
@@ -13922,10 +13915,7 @@ perf_event_exit_event(struct perf_event *event,
{
struct perf_event *parent_event = event->parent;
unsigned long detach_flags = DETACH_EXIT;
- bool is_child = !!parent_event;
-
- if (parent_event == EVENT_TOMBSTONE)
- parent_event = NULL;
+ unsigned int attach_state;
if (parent_event) {
/*
@@ -13942,6 +13932,8 @@ perf_event_exit_event(struct perf_event *event,
*/
detach_flags |= DETACH_GROUP | DETACH_CHILD;
mutex_lock(&parent_event->child_mutex);
+ /* PERF_ATTACH_ITRACE might be set concurrently */
+ attach_state = READ_ONCE(event->attach_state);
}
if (revoke)
@@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
/*
* Child events can be freed.
*/
- if (is_child) {
- if (parent_event) {
- mutex_unlock(&parent_event->child_mutex);
- /*
- * Kick perf_poll() for is_event_hup();
- */
- perf_event_wakeup(parent_event);
+ if (parent_event) {
+ mutex_unlock(&parent_event->child_mutex);
+ /*
+ * Kick perf_poll() for is_event_hup();
+ */
+ perf_event_wakeup(parent_event);
+
+ /*
+ * Match the refcount initialization. Make sure it doesn't happen
+ * twice if pmu_detach_event() calls it on an already exited task.
+ */
+ if (attach_state & PERF_ATTACH_CHILD) {
/*
* pmu_detach_event() will have an extra refcount.
+ * perf_pending_task() might have one too.
*/
put_event(event);
}
+
return;
}
* Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-04-28 11:11 ` Frederic Weisbecker
@ 2025-05-02 10:29 ` Peter Zijlstra
2025-05-02 11:30 ` Peter Zijlstra
2025-05-02 11:58 ` Frederic Weisbecker
0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2025-05-02 10:29 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
On Mon, Apr 28, 2025 at 01:11:47PM +0200, Frederic Weisbecker wrote:
> Le Thu, Apr 24, 2025 at 06:30:24PM +0200, Peter Zijlstra a écrit :
> > On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> > > @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
> > > * Do destroy all inherited groups, we don't care about those
> > > * and being thorough is better.
> > > */
> > > - detach_flags |= DETACH_GROUP | DETACH_CHILD;
> > > + prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
> > > mutex_lock(&parent_event->child_mutex);
> > > }
> > >
> > > if (revoke)
> > > - detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> > > + prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> > >
> > > - perf_remove_from_context(event, detach_flags);
> > > + perf_remove_from_context(event, &prd);
> >
> > Isn't all this waay too complicated?
> >
> > That is, to modify state we need both ctx->mutex and ctx->lock, and this
> > is what __perf_remove_from_context() has, but because of this, holding
> > either one of those locks is sufficient to read the state -- it cannot
> > change.
> >
> > And here we already hold ctx->mutex.
> >
> > So can't we simply do:
> >
> > old_state = event->attach_state;
> > perf_remove_from_context(event, detach_flags);
> >
> > // do whatever with old_state
>
> Right, the locking scenario is just a bit more complicated.
> Most flags are set on init or with both ctx mutex and lock.
> But:
>
> _ PERF_ATTACH_CHILD is set instead with the parent's child_mutex and the ctx lock held.
Looks trivial to add ctx->mutex to the mix here. It's not like that's a
fast path. But let me go read your patch before deciding if that's
actually needed :-)
> _ PERF_ATTACH_ITRACE is set from pmu::start(), thus from the event context
> with just interrupts disabled. It's probably enough to synchronize against
> initialization and remove_from_context IPIs but perf_event_exit_event() needs
> some care.
Right, that's a little tricky indeed. As stated, we don't care about the
bit, but the write shouldn't mess things up.
> So we must hold both the ctx mutex and the child_mutex (although the pmus_srcu
> thing should make that PERF_ATTACH_CHILD state visible, but let's keep things
> obvious). And also use WRITE_ONCE() / READ_ONCE() to take care of PERF_ATTACH_ITRACE,
> which we don't care about anyway.
>
> Now this looks like this:
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7bcb02ffb93a..7278ca731a55 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -208,7 +208,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
> }
>
> #define TASK_TOMBSTONE ((void *)-1L)
> -#define EVENT_TOMBSTONE ((void *)-1L)
>
> static bool is_kernel_event(struct perf_event *event)
> {
> @@ -2338,12 +2337,6 @@ static void perf_child_detach(struct perf_event *event)
>
> sync_child_event(event);
> list_del_init(&event->child_list);
> - /*
> - * Cannot set to NULL, as that would confuse the situation vs
> - * not being a child event. See for example unaccount_event().
> - */
> - event->parent = EVENT_TOMBSTONE;
> - put_event(parent_event);
> }
>
> static bool is_orphaned_event(struct perf_event *event)
> @@ -5705,7 +5698,7 @@ static void put_event(struct perf_event *event)
> _free_event(event);
>
> /* Matches the refcount bump in inherit_event() */
> - if (parent && parent != EVENT_TOMBSTONE)
> + if (parent)
> put_event(parent);
> }
>
> @@ -9998,7 +9991,7 @@ void perf_event_text_poke(const void *addr, const void *old_bytes,
>
> void perf_event_itrace_started(struct perf_event *event)
> {
> - event->attach_state |= PERF_ATTACH_ITRACE;
> + WRITE_ONCE(event->attach_state, event->attach_state | PERF_ATTACH_ITRACE);
> }
>
> static void perf_log_itrace_start(struct perf_event *event)
> @@ -13922,10 +13915,7 @@ perf_event_exit_event(struct perf_event *event,
> {
> struct perf_event *parent_event = event->parent;
> unsigned long detach_flags = DETACH_EXIT;
> - bool is_child = !!parent_event;
> -
> - if (parent_event == EVENT_TOMBSTONE)
> - parent_event = NULL;
> + unsigned int attach_state;
>
> if (parent_event) {
> /*
> @@ -13942,6 +13932,8 @@ perf_event_exit_event(struct perf_event *event,
> */
> detach_flags |= DETACH_GROUP | DETACH_CHILD;
> mutex_lock(&parent_event->child_mutex);
> + /* PERF_ATTACH_ITRACE might be set concurrently */
> + attach_state = READ_ONCE(event->attach_state);
> }
>
> if (revoke)
> @@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
> /*
> * Child events can be freed.
> */
> - if (is_child) {
> - if (parent_event) {
> - mutex_unlock(&parent_event->child_mutex);
> - /*
> - * Kick perf_poll() for is_event_hup();
> - */
> - perf_event_wakeup(parent_event);
> + if (parent_event) {
> + mutex_unlock(&parent_event->child_mutex);
> + /*
> + * Kick perf_poll() for is_event_hup();
> + */
> + perf_event_wakeup(parent_event);
Should not this perf_event_wakeup() be inside the next if() as well?
Doing anything on parent_event when !ATTACH_CHILD seems dodgy.
> +
> + /*
> + * Match the refcount initialization. Make sure it doesn't happen
> + * twice if pmu_detach_event() calls it on an already exited task.
> + */
> + if (attach_state & PERF_ATTACH_CHILD) {
> /*
> * pmu_detach_event() will have an extra refcount.
> + * perf_pending_task() might have one too.
> */
> put_event(event);
> }
> +
> return;
> }
This is a *much* saner patch, thank you!
So the thing I worried about... which is why I went for the TOMBSTONE
thing, is that this second invocation will now dereference parent_event,
even though we've already released our reference count on it.
This is essentially a use-after-free.
The thing that makes it work is RCU. And I think we're good, since the
fail case is two perf_event_exit_event() invocations on the same event,
separated by an RCU grace period, and I don't think this can happen.
But its a shame we can't reliably detect that.. Oh well.
* Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-05-02 10:29 ` Peter Zijlstra
@ 2025-05-02 11:30 ` Peter Zijlstra
2025-05-02 12:04 ` Frederic Weisbecker
2025-05-02 11:58 ` Frederic Weisbecker
1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2025-05-02 11:30 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
On Fri, May 02, 2025 at 12:29:18PM +0200, Peter Zijlstra wrote:
> > @@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
> > /*
> > * Child events can be freed.
> > */
> > - if (is_child) {
> > - if (parent_event) {
> > - mutex_unlock(&parent_event->child_mutex);
> > - /*
> > - * Kick perf_poll() for is_event_hup();
> > - */
> > - perf_event_wakeup(parent_event);
> > + if (parent_event) {
> > + mutex_unlock(&parent_event->child_mutex);
> > + /*
> > + * Kick perf_poll() for is_event_hup();
> > + */
> > + perf_event_wakeup(parent_event);
>
> Should not this perf_event_wakeup() be inside the next if() as well?
> Doing anything on parent_event when !ATTACH_CHILD seems dodgy.
I made this change, and munged the original changelog on top and stuffed
the patches into queue/perf/core.
> > +
> > + /*
> > + * Match the refcount initialization. Make sure it doesn't happen
> > + * twice if pmu_detach_event() calls it on an already exited task.
> > + */
> > + if (attach_state & PERF_ATTACH_CHILD) {
> > /*
> > * pmu_detach_event() will have an extra refcount.
> > + * perf_pending_task() might have one too.
> > */
> > put_event(event);
> > }
> > +
> > return;
> > }
* Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-05-02 10:29 ` Peter Zijlstra
2025-05-02 11:30 ` Peter Zijlstra
@ 2025-05-02 11:58 ` Frederic Weisbecker
1 sibling, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-05-02 11:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
Le Fri, May 02, 2025 at 12:29:18PM +0200, Peter Zijlstra a écrit :
> > @@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
> > /*
> > * Child events can be freed.
> > */
> > - if (is_child) {
> > - if (parent_event) {
> > - mutex_unlock(&parent_event->child_mutex);
> > - /*
> > - * Kick perf_poll() for is_event_hup();
> > - */
> > - perf_event_wakeup(parent_event);
> > + if (parent_event) {
> > + mutex_unlock(&parent_event->child_mutex);
> > + /*
> > + * Kick perf_poll() for is_event_hup();
> > + */
> > + perf_event_wakeup(parent_event);
>
> Should not this perf_event_wakeup() be inside the next if() as well?
> Doing anything on parent_event when !ATTACH_CHILD seems dodgy.
Good point!
>
> > +
> > + /*
> > + * Match the refcount initialization. Make sure it doesn't happen
> > + * twice if pmu_detach_event() calls it on an already exited task.
> > + */
> > + if (attach_state & PERF_ATTACH_CHILD) {
> > /*
> > * pmu_detach_event() will have an extra refcount.
> > + * perf_pending_task() might have one too.
> > */
> > put_event(event);
> > }
> > +
> > return;
> > }
>
> This is a *much* saner patch, thank you!
>
> So the thing I worried about... which is why I went for the TOMBSTONE
> thing, is that this second invocation will now dereference parent_event,
> even though we've already released our reference count on it.
>
> This is essentially a use-after-free.
>
> The thing that makes it work is RCU. And I think we're good, since the
> fail case is two perf_event_exit_event() invocations on the same event,
> separated by an RCU grace period, and I don't think this can happen.
>
> But its a shame we can't reliably detect that.. Oh well.
It's not RCU but the reference count of the child that protects it.
In a second invocation, pmu_unregister() still holds a reference to
the child and that protects the parent as well because the reference
to the parent is only dropped once the child has dropped its own.
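A minimal sketch of the resulting ordering, following the put_event()
from this series:

	put_event(child)		// pmu_get_event()'s reference
	  _free_event(child)		// child refcount hits zero here
	  put_event(parent)		// the parent is dropped only afterwards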
Hopefully that is one less opportunity for a headache :-)
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
2025-05-02 11:30 ` Peter Zijlstra
@ 2025-05-02 12:04 ` Frederic Weisbecker
0 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-05-02 12:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Liang, Kan, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Mark Rutland,
Namhyung Kim, Ravi Bangoria, linux-perf-users
Le Fri, May 02, 2025 at 01:30:02PM +0200, Peter Zijlstra a écrit :
> On Fri, May 02, 2025 at 12:29:18PM +0200, Peter Zijlstra wrote:
>
> > > @@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
> > > /*
> > > * Child events can be freed.
> > > */
> > > - if (is_child) {
> > > - if (parent_event) {
> > > - mutex_unlock(&parent_event->child_mutex);
> > > - /*
> > > - * Kick perf_poll() for is_event_hup();
> > > - */
> > > - perf_event_wakeup(parent_event);
> > > + if (parent_event) {
> > > + mutex_unlock(&parent_event->child_mutex);
> > > + /*
> > > + * Kick perf_poll() for is_event_hup();
> > > + */
> > > + perf_event_wakeup(parent_event);
> >
> > Should not this perf_event_wakeup() be inside the next if() as well?
> > doing anything on parent_event when !ATTACH_CHILD seems dodgy.
>
> I made this change, and munged the original changelog on top and stuffed
> the patches into queue/perf/core.
Looks good, but it looks like you trimmed the race-windows part out of
the changelog. Though I must confess, who wants to read that anyway? ;-)
--
Frederic Weisbecker
SUSE Labs