From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Adrian Hunter <adrian.hunter@intel.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Ian Rogers <irogers@google.com>, Ingo Molnar <mingo@redhat.com>,
	Jiri Olsa <jolsa@kernel.org>, Marco Elver <elver@google.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@redhat.com>
Subject: Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.
Date: Tue, 9 Apr 2024 10:57:32 +0200
Message-ID: <20240409085732.FBItbOSO@linutronix.de>
In-Reply-To: <ZhRhn1B0rMSNv6mV@pavilion.home>

On 2024-04-08 23:29:03 [+0200], Frederic Weisbecker wrote:
> > index c7a0274c662c8..e0b2da8de485f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> >  		state = PERF_EVENT_STATE_OFF;
> >  	}
> >  
> > -	if (event->pending_sigtrap) {
> > -		bool dec = true;
> > -
> > -		event->pending_sigtrap = 0;
> > -		if (state != PERF_EVENT_STATE_OFF &&
> > -		    !event->pending_work) {
> > -			event->pending_work = 1;
> > -			dec = false;
> > -			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > -			task_work_add(current, &event->pending_task, TWA_RESUME);
> > -		}
> > -		if (dec)
> > -			local_dec(&event->ctx->nr_pending);
> > -	}
> > -
> >  	perf_event_set_state(event, state);
> >  
> >  	if (!is_software_event(event))
> > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> >  	 * Yay, we hit home and are in the context of the event.
> >  	 */
> >  	if (cpu == smp_processor_id()) {
> > -		if (event->pending_sigtrap) {
> > -			event->pending_sigtrap = 0;
> > -			perf_sigtrap(event);
> > -			local_dec(&event->ctx->nr_pending);
> > -		}
> >  		if (event->pending_disable) {
> >  			event->pending_disable = 0;
> >  			perf_event_disable_local(event);
> > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
> >  
> >  		if (regs)
> >  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > -		if (!event->pending_sigtrap) {
> > -			event->pending_sigtrap = pending_id;
> > +		if (!event->pending_work) {
> > +			event->pending_work = pending_id;
> >  			local_inc(&event->ctx->nr_pending);
> > -			irq_work_queue(&event->pending_irq);
> > +			WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > +			task_work_add(current, &event->pending_task, TWA_RESUME);
> 
> If the overflow happens between exit_task_work() and perf_event_exit_task(),
> you're leaking the event. (This was there before this patch).
> See:
> 	https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

Okay.

> > +			/*
> > +			 * The NMI path returns directly to userland. The
> > +			 * irq_work is raised as a dummy interrupt to ensure
> > +			 * regular return path to user is taken and task_work
> > +			 * is processed.
> > +			 */
> > +			if (in_nmi())
> > +				irq_work_queue(&event->pending_irq);
> >  		} else if (event->attr.exclude_kernel && valid_sample) {
> >  			/*
> >  			 * Should not be able to return to user space without
> > -			 * consuming pending_sigtrap; with exceptions:
> > +			 * consuming pending_work; with exceptions:
> >  			 *
> >  			 *  1. Where !exclude_kernel, events can overflow again
> >  			 *     in the kernel without returning to user space.
> > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> >  			 *     To approximate progress (with false negatives),
> >  			 *     check 32-bit hash of the current IP.
> >  			 */
> > -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> > +			WARN_ON_ONCE(event->pending_work != pending_id);
> >  		}
> >  
> >  		event->pending_addr = 0;
> > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> >  		     &parent_event->child_total_time_running);
> >  }
> >  
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > +	struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > +	return event == data;
> > +}
> 
> I suggest we introduce a proper API to cancel an actual callback head, see:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640

This rework would work.

> >  static void
> >  perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  {
> > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> >  		 * Kick perf_poll() for is_event_hup();
> >  		 */
> >  		perf_event_wakeup(parent_event);
> > +		/*
> > +		 * Cancel pending task_work and update counters if it has not
> > +		 * yet been delivered to userland. free_event() expects the
> > +		 * reference counter at one and keeping the event around until
> > +		 * the task returns to userland can be unexpected if there is
> > +		 * no signal handler registered.
> > +		 */
> > +		if (event->pending_work &&
> > +		    task_work_cancel_match(current, task_work_cb_match, event)) {
> > +			put_event(event);
> > +			local_dec(&event->ctx->nr_pending);
> > +		}
> 
> So exiting task, privileged exec and also exit on exec call into this before
> releasing the children.
> 
> And parents rely on put_event() from file close + the task work.
> 
> But what about remote release of children on file close?
> See perf_event_release_kernel() directly calling free_event() on them.

Interesting points. I had events popping up at random even after the
task had decided that it would not go back to userland to handle them,
so freeing the event looked like the only option…
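
For illustration only, the cancel-and-put logic from the hunk above can
be modelled in plain userspace C. This is a sketch under assumptions, not
kernel code: struct task, struct event, queue_sigtrap(), cancel_match()
and exit_event() are hypothetical stand-ins for task_struct, perf_event,
the overflow path, task_work_cancel_match() and perf_event_exit_event(),
and the model ignores locking, atomics and NMI context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
struct callback_head {
	struct callback_head *next;
};

struct event {
	long refcount;              /* models event->refcount */
	unsigned int pending_work;  /* models event->pending_work */
	long *nr_pending;           /* models event->ctx->nr_pending */
	struct callback_head pending_task;
};

struct task {
	struct callback_head *works; /* singly linked list of queued work */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Overflow path: take a reference and queue the work at most once. */
static void queue_sigtrap(struct task *t, struct event *ev, unsigned int id)
{
	if (ev->pending_work)
		return;
	ev->pending_work = id;
	(*ev->nr_pending)++;
	assert(ev->refcount > 0);   /* models the inc_not_zero warning */
	ev->refcount++;
	ev->pending_task.next = t->works;
	t->works = &ev->pending_task;
}

/* Unlink and return the first queued work for which match() is true. */
static struct callback_head *
cancel_match(struct task *t, bool (*match)(struct callback_head *, void *),
	     void *data)
{
	struct callback_head **pp;

	for (pp = &t->works; *pp; pp = &(*pp)->next) {
		if (match(*pp, data)) {
			struct callback_head *cb = *pp;

			*pp = cb->next;
			cb->next = NULL;
			return cb;
		}
	}
	return NULL;
}

static bool cb_match(struct callback_head *cb, void *data)
{
	return container_of(cb, struct event, pending_task) == data;
}

/* Exit path: undo queue_sigtrap() if the queued work never ran. */
static bool exit_event(struct task *t, struct event *ev)
{
	if (ev->pending_work && cancel_match(t, cb_match, ev)) {
		ev->refcount--;         /* models put_event() */
		(*ev->nr_pending)--;    /* models local_dec() */
		return true;
	}
	return false;
}
```

The model only covers the path that goes through exit_event(). A caller
that frees the event directly, without cancelling the queued work first,
would still see the extra reference, which is the case raised below.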

> One possible fix is to avoid the reference count game around task work
> and flush them on free_event().
> 
> See here:
> 
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

That wake_up() within a preempt_disable() section breaks on RT.

How do we go on from here?

> Thanks.

Sebastian
