public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Liang, Kan" <kan.liang@linux.intel.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Ian Rogers <irogers@google.com>, Jiri Olsa <jolsa@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Ravi Bangoria <ravi.bangoria@amd.com>,
	linux-perf-users@vger.kernel.org
Subject: Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
Date: Thu, 24 Apr 2025 18:30:24 +0200	[thread overview]
Message-ID: <20250424163024.GC18306@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <20250424161128.29176-3-frederic@kernel.org>

On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> The following commit:
> 
> 	da916e96e2de ("perf: Make perf_pmu_unregister() useable")
> 
> has introduced two significant event's parent lifecycle changes:
> 
> 1) An event that has exited now has EVENT_TOMBSTONE as a parent.
>    This can result in a situation where the delayed wakeup irq_work can
>    accidentally dereference EVENT_TOMBSTONE on:
> 
> CPU 0                                          CPU 1
> -----                                          -----
> 
> __schedule()
>     local_irq_disable()
>     rq_lock()
>     <NMI>
>     perf_event_overflow()
>         irq_work_queue(&child->pending_irq)
>     </NMI>
>     perf_event_task_sched_out()
>         raw_spin_lock(&ctx->lock)
>         ctx_sched_out()
>         ctx->is_active = 0
>         event_sched_out(child)
>         raw_spin_unlock(&ctx->lock)
>                                               perf_event_release_kernel(parent)
>                                                   perf_remove_from_context(child)
>                                                   raw_spin_lock_irq(&ctx->lock)
>                                                   // Sees !ctx->is_active
>                                                   // Removes from context inline
>                                                   __perf_remove_from_context(child)
>                                                       perf_child_detach(child)
>                                                           event->parent = EVENT_TOMBSTONE
>     raw_spin_rq_unlock_irq(rq);
>     <IRQ>
>     perf_pending_irq()
>         perf_event_wakeup(child)
>             ring_buffer_wakeup(child)
>                 rcu_dereference(child->parent->rb) <--- CRASH
> 
> This also concerns the call to kill_fasync() on parent->fasync.

Argh, I actually looked for this case and didn't find it in one of the
earlier fixes :/

> 2) The final parent reference count decrement can now happen before the
>    the final child reference count decrement. ie: the parent can now
>    be freed before its child. On PREEMPT_RT, this can result in a
>    situation where the delayed wakeup irq_work can accidentally
>    dereference a freed parent:
> 
> CPU 0                                          CPU 1                              CPU 2
> -----                                          -----                              ------
> 
> perf_pmu_unregister()
>     pmu_detach_events()
>        pmu_get_event()
>            atomic_long_inc_not_zero(&child->refcount)
> 
>                                                <NMI>
>                                                perf_event_overflow()
>                                                    irq_work_queue(&child->pending_irq);
>                                                </NMI>
>                                                <IRQ>
>                                                irq_work_run()
>                                                    wake_irq_workd()
>                                                </IRQ>
>                                                preempt_schedule_irq()
>                                                =========> SWITCH to workd
>                                                irq_work_run_list()
>                                                    perf_pending_irq()
>                                                        perf_event_wakeup(child)
>                                                            ring_buffer_wakeup(child)
>                                                                event = child->parent
> 
>                                                                                   perf_event_release_kernel(parent)
>                                                                                       // Not last ref, PMU holds it
>                                                                                       put_event(child)
>                                                                                       // Last ref
>                                                                                       put_event(parent)
>                                                                                           free_event()
>                                                                                               call_rcu(...)
>                                                                                   rcu_core()
>                                                                                       free_event_rcu()
> 
>                                                                rcu_dereference(event->rb) <--- CRASH
> 
> This also concerns the call to kill_fasync() on parent->fasync.
> 
> The "easy" solution to 1) is to check that event->parent is not
> EVENT_TOMBSTONE on perf_event_wakeup() (including both ring buffer
> and fasync uses).
> 
> The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
> run under rcu_read_lock().
> 
> However because of 2), sanity would prescribe to make event::parent
> an __rcu pointer and annotate each and every users to prove they are
> reliable.
> 
> Propose an alternate solution and restore the stable pointer to the
> parent until all its children have called _free_event() themselves to
> avoid any further accident. Also revert the EVENT_TOMBSTONE design
> that is mostly here to determine which caller of perf_event_exit_event()
> must perform the refcount decrement on a child event matching the
> increment in inherit_event().
> 
> Arrange instead for checking the attach state of an event prior to its
> removal and decrement the refcount of the child accordingly.

Urgh, brain hurts, will have to look again tomorrow.


> @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
>  		 * Do destroy all inherited groups, we don't care about those
>  		 * and being thorough is better.
>  		 */
> -		detach_flags |= DETACH_GROUP | DETACH_CHILD;
> +		prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
>  		mutex_lock(&parent_event->child_mutex);
>  	}
>  
>  	if (revoke)
> -		detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> +		prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
>  
> -	perf_remove_from_context(event, detach_flags);
> +	perf_remove_from_context(event, &prd);

Isn't all this waay to complicated?

That is, to modify state we need both ctx->mutex and ctx->lock, and this
is what __perf_remove_from_context() has, but because of this, holding
either one of those locks is sufficient to read the state -- it cannot
change.

And here we already hold ctx->mutex.

So can't we simply do:

	old_state = event->attach_state;
	perf_remove_from_context(event, detach_flags);

	// do whatever with old_state

>  	/*
>  	 * Child events can be freed.
>  	 */
> -	if (is_child) {
> -		if (parent_event) {
> -			mutex_unlock(&parent_event->child_mutex);
> -			/*
> -			 * Kick perf_poll() for is_event_hup();
> -			 */
> -			perf_event_wakeup(parent_event);
> +	if (parent_event) {
> +		mutex_unlock(&parent_event->child_mutex);
> +		/*
> +		 * Kick perf_poll() for is_event_hup();
> +		 */
> +		perf_event_wakeup(parent_event);
> +
> +		/*
> +		 * Match the refcount initialization. Make sure it doesn't happen
> +		 * twice if pmu_detach_event() calls it on an already exited task.
> +		 */
> +		if (prd.old_state & PERF_ATTACH_CHILD) {
>  			/*
>  			 * pmu_detach_event() will have an extra refcount.
> +			 * perf_pending_task() might have one too.
>  			 */
>  			put_event(event);
>  		}
> +
>  		return;
>  	}
>  

  reply	other threads:[~2025-04-24 16:30 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24 16:11 [PATCH 0/4] perf fixes Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 1/4] perf: Fix failing inherit_event() doing extra refcount decrement on parent Frederic Weisbecker
2025-04-24 16:16   ` Peter Zijlstra
2025-05-08 10:34   ` [tip: perf/core] " tip-bot2 for Frederic Weisbecker
2025-05-08 19:55   ` tip-bot2 for Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 2/4] perf: Fix irq work dereferencing garbage Frederic Weisbecker
2025-04-24 16:30   ` Peter Zijlstra [this message]
2025-04-28 11:11     ` Frederic Weisbecker
2025-05-02 10:29       ` Peter Zijlstra
2025-05-02 11:30         ` Peter Zijlstra
2025-05-02 12:04           ` Frederic Weisbecker
2025-05-02 11:58         ` Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 3/4] perf: Remove too early and redundant CPU hotplug handling Frederic Weisbecker
2025-04-24 16:32   ` Peter Zijlstra
2025-05-08 10:34   ` [tip: perf/core] " tip-bot2 for Frederic Weisbecker
2025-05-08 19:55   ` tip-bot2 for Frederic Weisbecker
2025-04-24 16:11 ` [PATCH 4/4] perf: Fix confusing aux iteration Frederic Weisbecker
2025-05-08 10:34   ` [tip: perf/core] " tip-bot2 for Frederic Weisbecker
2025-05-08 19:55   ` tip-bot2 for Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250424163024.GC18306@noisy.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=frederic@kernel.org \
    --cc=irogers@google.com \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=ravi.bangoria@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox