From: Peter Zijlstra <peterz@infradead.org>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
LKML <linux-kernel@vger.kernel.org>,
"Liang, Kan" <kan.liang@linux.intel.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Ian Rogers <irogers@google.com>, Jiri Olsa <jolsa@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Namhyung Kim <namhyung@kernel.org>,
Ravi Bangoria <ravi.bangoria@amd.com>,
linux-perf-users@vger.kernel.org
Subject: Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
Date: Thu, 24 Apr 2025 18:30:24 +0200 [thread overview]
Message-ID: <20250424163024.GC18306@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <20250424161128.29176-3-frederic@kernel.org>
On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> The following commit:
>
> da916e96e2de ("perf: Make perf_pmu_unregister() useable")
>
> has introduced two significant event's parent lifecycle changes:
>
> 1) An event that has exited now has EVENT_TOMBSTONE as a parent.
> This can result in a situation where the delayed wakeup irq_work can
> accidentally dereference EVENT_TOMBSTONE on:
>
> CPU 0 CPU 1
> ----- -----
>
> __schedule()
> local_irq_disable()
> rq_lock()
> <NMI>
> perf_event_overflow()
> irq_work_queue(&child->pending_irq)
> </NMI>
> perf_event_task_sched_out()
> raw_spin_lock(&ctx->lock)
> ctx_sched_out()
> ctx->is_active = 0
> event_sched_out(child)
> raw_spin_unlock(&ctx->lock)
> perf_event_release_kernel(parent)
> perf_remove_from_context(child)
> raw_spin_lock_irq(&ctx->lock)
> // Sees !ctx->is_active
> // Removes from context inline
> __perf_remove_from_context(child)
> perf_child_detach(child)
> event->parent = EVENT_TOMBSTONE
> raw_spin_rq_unlock_irq(rq);
> <IRQ>
> perf_pending_irq()
> perf_event_wakeup(child)
> ring_buffer_wakeup(child)
> rcu_dereference(child->parent->rb) <--- CRASH
>
> This also concerns the call to kill_fasync() on parent->fasync.
Argh, I actually looked for this case and didn't find it in one of the
earlier fixes :/
> 2) The final parent reference count decrement can now happen before the
> the final child reference count decrement. ie: the parent can now
> be freed before its child. On PREEMPT_RT, this can result in a
> situation where the delayed wakeup irq_work can accidentally
> dereference a freed parent:
>
> CPU 0 CPU 1 CPU 2
> ----- ----- ------
>
> perf_pmu_unregister()
> pmu_detach_events()
> pmu_get_event()
> atomic_long_inc_not_zero(&child->refcount)
>
> <NMI>
> perf_event_overflow()
> irq_work_queue(&child->pending_irq);
> </NMI>
> <IRQ>
> irq_work_run()
> wake_irq_workd()
> </IRQ>
> preempt_schedule_irq()
> =========> SWITCH to workd
> irq_work_run_list()
> perf_pending_irq()
> perf_event_wakeup(child)
> ring_buffer_wakeup(child)
> event = child->parent
>
> perf_event_release_kernel(parent)
> // Not last ref, PMU holds it
> put_event(child)
> // Last ref
> put_event(parent)
> free_event()
> call_rcu(...)
> rcu_core()
> free_event_rcu()
>
> rcu_dereference(event->rb) <--- CRASH
>
> This also concerns the call to kill_fasync() on parent->fasync.
>
> The "easy" solution to 1) is to check that event->parent is not
> EVENT_TOMBSTONE on perf_event_wakeup() (including both ring buffer
> and fasync uses).
>
> The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
> run under rcu_read_lock().
>
> However because of 2), sanity would prescribe to make event::parent
> an __rcu pointer and annotate each and every users to prove they are
> reliable.
>
> Propose an alternate solution and restore the stable pointer to the
> parent until all its children have called _free_event() themselves to
> avoid any further accident. Also revert the EVENT_TOMBSTONE design
> that is mostly here to determine which caller of perf_event_exit_event()
> must perform the refcount decrement on a child event matching the
> increment in inherit_event().
>
> Arrange instead for checking the attach state of an event prior to its
> removal and decrement the refcount of the child accordingly.
Urgh, brain hurts, will have to look again tomorrow.
> @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
> * Do destroy all inherited groups, we don't care about those
> * and being thorough is better.
> */
> - detach_flags |= DETACH_GROUP | DETACH_CHILD;
> + prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
> mutex_lock(&parent_event->child_mutex);
> }
>
> if (revoke)
> - detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> + prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
>
> - perf_remove_from_context(event, detach_flags);
> + perf_remove_from_context(event, &prd);
Isn't all this waay to complicated?
That is, to modify state we need both ctx->mutex and ctx->lock, and this
is what __perf_remove_from_context() has, but because of this, holding
either one of those locks is sufficient to read the state -- it cannot
change.
And here we already hold ctx->mutex.
So can't we simply do:
old_state = event->attach_state;
perf_remove_from_context(event, detach_flags);
// do whatever with old_state
> /*
> * Child events can be freed.
> */
> - if (is_child) {
> - if (parent_event) {
> - mutex_unlock(&parent_event->child_mutex);
> - /*
> - * Kick perf_poll() for is_event_hup();
> - */
> - perf_event_wakeup(parent_event);
> + if (parent_event) {
> + mutex_unlock(&parent_event->child_mutex);
> + /*
> + * Kick perf_poll() for is_event_hup();
> + */
> + perf_event_wakeup(parent_event);
> +
> + /*
> + * Match the refcount initialization. Make sure it doesn't happen
> + * twice if pmu_detach_event() calls it on an already exited task.
> + */
> + if (prd.old_state & PERF_ATTACH_CHILD) {
> /*
> * pmu_detach_event() will have an extra refcount.
> + * perf_pending_task() might have one too.
> */
> put_event(event);
> }
> +
> return;
> }
>
next prev parent reply other threads:[~2025-04-24 16:30 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-24 16:11 [PATCH 0/4] perf fixes Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent Frederic Weisbecker
2025-04-24 16:16 ` Peter Zijlstra
2025-05-08 10:34 ` [tip: perf/core] " tip-bot2 for Frederic Weisbecker
2025-05-08 19:55 ` tip-bot2 for Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 2/4] perf: Fix irq work dereferencing garbage Frederic Weisbecker
2025-04-24 16:30 ` Peter Zijlstra [this message]
2025-04-28 11:11 ` Frederic Weisbecker
2025-05-02 10:29 ` Peter Zijlstra
2025-05-02 11:30 ` Peter Zijlstra
2025-05-02 12:04 ` Frederic Weisbecker
2025-05-02 11:58 ` Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling Frederic Weisbecker
2025-04-24 16:32 ` Peter Zijlstra
2025-05-08 10:34 ` [tip: perf/core] " tip-bot2 for Frederic Weisbecker
2025-05-08 19:55 ` tip-bot2 for Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 4/4] perf: Fix confusing aux iteration Frederic Weisbecker
2025-05-08 10:34 ` [tip: perf/core] " tip-bot2 for Frederic Weisbecker
2025-05-08 19:55 ` tip-bot2 for Frederic Weisbecker
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250424163024.GC18306@noisy.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=acme@kernel.org \
--cc=adrian.hunter@intel.com \
--cc=alexander.shishkin@linux.intel.com \
--cc=frederic@kernel.org \
--cc=irogers@google.com \
--cc=jolsa@kernel.org \
--cc=kan.liang@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-perf-users@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=mingo@redhat.com \
--cc=namhyung@kernel.org \
--cc=ravi.bangoria@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.