From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Adrian Hunter <adrian.hunter@intel.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Ian Rogers <irogers@google.com>, Ingo Molnar <mingo@redhat.com>,
	Jiri Olsa <jolsa@kernel.org>, Marco Elver <elver@google.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@redhat.com>
Subject: Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Date: Tue, 9 Apr 2024 10:57:32 +0200
Message-ID: <20240409085732.FBItbOSO@linutronix.de>
In-Reply-To: <ZhRhn1B0rMSNv6mV@pavilion.home>

On 2024-04-08 23:29:03 [+0200], Frederic Weisbecker wrote:
> > index c7a0274c662c8..e0b2da8de485f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> >  		state = PERF_EVENT_STATE_OFF;
> >  	}
> >  
> > -	if (event->pending_sigtrap) {
> > -		bool dec = true;
> > -
> > -		event->pending_sigtrap = 0;
> > -		if (state != PERF_EVENT_STATE_OFF &&
> > -		    !event->pending_work) {
> > -			event->pending_work = 1;
> > -			dec = false;
> > -			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > -			task_work_add(current, &event->pending_task, TWA_RESUME);
> > -		}
> > -		if (dec)
> > -			local_dec(&event->ctx->nr_pending);
> > -	}
> > -
> >  	perf_event_set_state(event, state);
> >  
> >  	if (!is_software_event(event))
> > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> >  	 * Yay, we hit home and are in the context of the event.
> >  	 */
> >  	if (cpu == smp_processor_id()) {
> > -		if (event->pending_sigtrap) {
> > -			event->pending_sigtrap = 0;
> > -			perf_sigtrap(event);
> > -			local_dec(&event->ctx->nr_pending);
> > -		}
> >  		if (event->pending_disable) {
> >  			event->pending_disable = 0;
> >  			perf_event_disable_local(event);
> > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
> >  
> >  		if (regs)
> >  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > -		if (!event->pending_sigtrap) {
> > -			event->pending_sigtrap = pending_id;
> > +		if (!event->pending_work) {
> > +			event->pending_work = pending_id;
> >  			local_inc(&event->ctx->nr_pending);
> > -			irq_work_queue(&event->pending_irq);
> > +			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > +			task_work_add(current, &event->pending_task, TWA_RESUME);
> 
> If the overflow happens between exit_task_work() and perf_event_exit_task(),
> you're leaking the event. (This was there before this patch).
> See:
> 	https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

Okay.
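
To make the window concrete, here is a minimal userspace sketch of the
reference counting involved. It is not kernel code: every model_* name
is hypothetical, and it assumes that a callback queued after the final
task-work flush is never run, so nothing drops the reference taken in
the overflow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_event {
	long refcount;      /* stands in for event->refcount */
	bool pending_work;  /* stands in for event->pending_work */
};

struct model_task {
	struct model_event *queued; /* at most one queued callback, for brevity */
	bool works_exited;          /* set once the final flush has run */
};

/* Overflow path: take a reference, then try to queue the callback. */
static void model_overflow(struct model_task *t, struct model_event *e)
{
	if (e->pending_work)
		return;
	e->pending_work = true;
	e->refcount++;
	/* Assumption: after the final flush the callback is never queued. */
	if (!t->works_exited)
		t->queued = e;
}

/* Final flush at exit: run what is queued and drop its reference. */
static void model_exit_task_work(struct model_task *t)
{
	if (t->queued) {
		t->queued->pending_work = false;
		t->queued->refcount--;
		t->queued = NULL;
	}
	t->works_exited = true;
}

/* Exit path: drop the initial reference; true if the event would be freed. */
static bool model_exit_event(struct model_event *e)
{
	assert(e->refcount > 0);
	return --e->refcount == 0;
}
```

If the overflow comes before the flush, the reference is dropped by the
callback and the exit path frees the event. If it comes after the flush,
the extra reference stays and the event in this model is never freed.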

> > +			/*
> > +			 * The NMI path returns directly to userland. The
> > +			 * irq_work is raised as a dummy interrupt to ensure
> > +			 * regular return path to user is taken and task_work
> > +			 * is processed.
> > +			 */
> > +			if (in_nmi())
> > +				irq_work_queue(&event->pending_irq);
> >  		} else if (event->attr.exclude_kernel && valid_sample) {
> >  			/*
> >  			 * Should not be able to return to user space without
> > -			 * consuming pending_sigtrap; with exceptions:
> > +			 * consuming pending_work; with exceptions:
> >  			 *
> >  			 *  1. Where !exclude_kernel, events can overflow again
> >  			 *     in the kernel without returning to user space.
> > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> >  			 *     To approximate progress (with false negatives),
> >  			 *     check 32-bit hash of the current IP.
> >  			 */
> > -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> > +			WARN_ON_ONCE(event->pending_work != pending_id);
> >  		}
> >  
> >  		event->pending_addr = 0;
> > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> >  		     &parent_event->child_total_time_running);
> >  }
> >  
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > +	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > +	return event == data;
> > +}
> 
> I suggest we introduce a proper API to cancel an actual callback head, see:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640

This rework would work.
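
As a rough illustration of the difference between cancelling via a match
callback and cancelling a specific callback head, here is a userspace
sketch over a plain singly linked list. The names cb_head,
cb_cancel_match() and cb_cancel() are made up for this example and are
not the API proposed in the linked messages; locking is left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a pending-callback list; not kernel code. */
struct cb_head {
	struct cb_head *next;
};

typedef bool (*cb_match_fn)(struct cb_head *cb, void *data);

/* Unlink and return the first entry accepted by match(), or NULL. */
static struct cb_head *cb_cancel_match(struct cb_head **list,
				       cb_match_fn match, void *data)
{
	struct cb_head **pp;

	for (pp = list; *pp; pp = &(*pp)->next) {
		if (match(*pp, data)) {
			struct cb_head *found = *pp;

			*pp = found->next;
			found->next = NULL;
			return found;
		}
	}
	return NULL;
}

/* Matcher that accepts exactly the callback head passed as data. */
static bool cb_same_head(struct cb_head *cb, void *data)
{
	return cb == data;
}

/* Cancel one specific callback head; true if it was still queued. */
static bool cb_cancel(struct cb_head **list, struct cb_head *cb)
{
	return cb_cancel_match(list, cb_same_head, cb) != NULL;
}
```

With a helper like cb_cancel() the caller no longer needs its own match
function plus container_of() just to compare against a pointer it
already holds.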

> >  static void
> >  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  {
> > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  		 * Kick perf_poll() for is_event_hup();
> >  		 */
> >  		perf_event_wakeup(parent_event);
> > +		/*
> > +		 * Cancel pending task_work and update counters if it has not
> > +		 * yet been delivered to userland. free_event() expects the
> > +		 * reference counter at one and keeping the event around until
> > +		 * the task returns to userland can be unexpected if there is
> > +		 * no signal handler registered.
> > +		 */
> > +		if (event->pending_work &&
> > +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> > +			put_event(event);
> > +			local_dec(&event->ctx->nr_pending);
> > +		}
> 
> So exiting task, privileged exec and also exit on exec call into this before
> releasing the children.
> 
> And parents rely on put_event() from file close + the task work.
> 
> But what about remote release of children on file close?
> See perf_event_release_kernel() directly calling free_event() on them.

Interesting points. I saw events popping up at random even after the
task had decided that it would not return to userland to handle them,
so letting the event be freed looked like the only option…

> One possible fix is to avoid the reference count game around task work
> and flush them on free_event().
> 
> See here:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

That wake_up() within a preempt_disable() section breaks on RT: the
wait queue lock is a spinlock_t, which is a sleeping lock on PREEMPT_RT.

How do we go on from here?

> Thanks.

Sebastian
