* perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Vince Weaver @ 2024-12-04 21:20 UTC
To: linux-perf-users; +Cc: Peter Zijlstra
Hello
so the PAPI team is working on trying to get Intel top-down event support
working.
We ran into a problem where on hybrid machines (Alder/Raptor Lake) topdown
events are only supported on P-cores but not E-cores.
So you have code that is happily using rdpmc to read the data on P-cores
but if you have bad luck and get rescheduled to an E-core then the rdpmc
instruction will segfault/gpf the whole program.
Is there any way, short of setting up a complex segfault signal handler,
to avoid this happening?
In theory you could try to check what core type you are on before doing
the rdpmc but there's a race there if you get rescheduled after the check
but before the actual rdpmc instruction.
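In pseudo-C the racy pattern looks something like this (cpu_is_pcore()
and topdown_idx are hypothetical stand-ins, not actual PAPI code):
	unsigned int cpu;

	getcpu(&cpu, NULL);			/* glibc >= 2.29 wrapper */
	if (cpu_is_pcore(cpu)) {		/* hypothetical core-type lookup */
		/* window: the scheduler may migrate us to an E-core here */
		val = rdpmc(topdown_idx);	/* #GP if we land on an E-core */
	}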
Vince Weaver
vincent.weaver@maine.edu
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Ian Rogers @ 2024-12-05 0:57 UTC
To: Vince Weaver; +Cc: linux-perf-users, Peter Zijlstra
On Wed, Dec 4, 2024 at 2:06 PM Vince Weaver <vincent.weaver@maine.edu> wrote:
>
> Hello
>
> so the PAPI team is working on trying to get Intel top-down event support
> working.
>
> We ran into a problem where on hybrid machines (Alder/Raptor Lake) topdown
> events are only supported on P-cores but not E-cores.
>
> So you have code that is happily using rdpmc to read the data on P-cores
> but if you have bad luck and get rescheduled to an E-core then the rdpmc
> instruction will segfault/gpf the whole program.
>
> Is there any way, short of setting up a complex segfault signal handler,
> to avoid this happening?
>
> In theory you could try to check what core type you are on before doing
> the rdpmc but there's a race there if you get rescheduled after the check
> but before the actual rdpmc instruction.
Perhaps this is a use-case for restartable sequences? The current
logic in libperf doesn't handle this, nor hybrid:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/perf/tests/test-evsel.c?h=perf-tools-next#n127
which is a shame.
Thanks,
Ian
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Ian Rogers @ 2025-06-15 23:23 UTC
To: Vince Weaver, Liang, Kan, Peter Zijlstra; +Cc: linux-perf-users
On Wed, Dec 4, 2024 at 4:57 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Dec 4, 2024 at 2:06 PM Vince Weaver <vincent.weaver@maine.edu> wrote:
> >
> > Hello
> >
> > so the PAPI team is working on trying to get Intel top-down event support
> > working.
> >
> > We ran into a problem where on hybrid machines (Alder/Raptor Lake) topdown
> > events are only supported on P-cores but not E-cores.
> >
> > So you have code that is happily using rdpmc to read the data on P-cores
> > but if you have bad luck and get rescheduled to an E-core then the rdpmc
> > instruction will segfault/gpf the whole program.
> >
> > Is there any way, short of setting up a complex segfault signal handler,
> > to avoid this happening?
> >
> > In theory you could try to check what core type you are on before doing
> > the rdpmc but there's a race there if you get rescheduled after the check
> > but before the actual rdpmc instruction.
>
> Perhaps this is a use-case for restartable sequences? The current
> logic in libperf doesn't handle this, nor hybrid:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/perf/tests/test-evsel.c?h=perf-tools-next#n127
> which is a shame.
So I spoke to Kan and posted a revised perf test that works on hybrid:
https://lore.kernel.org/lkml/20250614004528.1652860-1-irogers@google.com/
Basically on hybrid before doing the rdpmc instruction the test now
ensures the affinity matches that of the CPUs the perf event can be
scheduled upon.
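A minimal sketch of that idea (pcore_cpu is a stand-in for a CPU parsed
from the core PMU's cpumask, e.g. /sys/devices/cpu_core/cpus, not the
actual test code):
	/* needs: #define _GNU_SOURCE and #include <sched.h> */
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(pcore_cpu, &mask);	/* pcore_cpu: a CPU the event can run on */
	if (sched_setaffinity(0, sizeof(mask), &mask) == 0) {
		/* rdpmc will now execute on a CPU where the event is schedulable */
	}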
I do think that there is still a race, and the race is there even
without hybrid. One thought is that an event may get scheduled on
fixed or generic counters depending on what was previously scheduled
during sched_in. The mmap has an "index" value described as the
"hardware event identifier":
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/perf_event.h#n632
and this value minus 1 gets passed to rdpmc in the rdpmc loop (as in
libperf and shown in perf_event.h).
If a pinned event were using a fixed counter on certain CPUs and the
same event were being used by rdpmc on all/any CPUs, then the
particular counter used for the event may vary: the "index" may show
the fixed counter, while on the CPU where the rdpmc is executed the
event was scheduled on a generic counter, which is the one that should
be read. I think something similar could happen if an event were
deleted on a CPU between the read of "index" and its use by rdpmc.
Perhaps I'm ignorant of the inner workings of the user page and
scheduling, but it seems a restartable sequence is needed to make this
somewhat atomic. Even inside a restartable sequence, I think a remote
delete of an event could cause the counter/"index" to change while the
reader stays on the same CPU (ie the restartable sequence needn't
restart but the counters changed). Perhaps there needs to be more
"buyer beware" language around the rdpmc instruction in perf_event.h
and the associated man pages, while the perf tool should avoid rdpmc
given that it needs at least thread-affinity calls.
Thanks,
Ian
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Vince Weaver @ 2025-06-16 19:34 UTC
To: Ian Rogers; +Cc: Vince Weaver, Liang, Kan, Peter Zijlstra, linux-perf-users
On Sun, 15 Jun 2025, Ian Rogers wrote:
> So I spoke to Kan and posted a revised perf test that works on hybrid:
> https://lore.kernel.org/lkml/20250614004528.1652860-1-irogers@google.com/
> Basically on hybrid before doing the rdpmc instruction the test now
> ensures the affinity matches that of the CPUs the perf event can be
> scheduled upon.
for PAPI we ended up using restartable sequences which more or less work
for our case (though we had to get some fixes merged with the upstream
rseq developers)
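(As a sketch of the first step only: with glibc 2.35+ the rseq area is
registered at thread start and exposed via <sys/rseq.h>; the critical
section that actually wraps the rdpmc needs a hand-written rseq_cs
descriptor plus abort handler and is omitted here.)
	#include <sys/rseq.h>	/* glibc 2.35+: __rseq_offset, __rseq_size */
	#include <stdbool.h>

	static bool thread_has_rseq(void)
	{
		return __rseq_size != 0;	/* 0 => kernel rseq not registered */
	}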
I don't always track the libperf code as its license makes it
incompatible with PAPI (which is BSD licensed). Does it handle the
rdpmc case where by default on x86 you're supposed to sign-extend from 48
bits (or whatever is in the mmap page) but some of the top-down event
rdpmc registers use all 64-bits so the sign-extension corrupts the value?
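(For illustration, with made-up numbers: blindly applying a 48-bit
sign-extension to a full-width value sets all the high bits whenever
bit 47 happens to be set:)
	uint64_t raw   = (1ULL << 47) | 0x1234;		/* full 64-bit value, bit 47 set */
	int64_t  count = (int64_t)(raw << 16) >> 16;	/* 48-bit sign-extend */
	/* count now has bits 63..48 forced to 1 -- the value is corrupted;
	 * for topdown registers the mmap page must report pmc_width == 64,
	 * which makes the shift pair a no-op */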
Vince Weaver
vincent.weaver@maine.edu
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Ian Rogers @ 2025-06-17 4:36 UTC
To: Vince Weaver; +Cc: Liang, Kan, Peter Zijlstra, linux-perf-users
On Mon, Jun 16, 2025 at 12:34 PM Vince Weaver <vincent.weaver@maine.edu> wrote:
>
> On Sun, 15 Jun 2025, Ian Rogers wrote:
>
> > So I spoke to Kan and posted a revised perf test that works on hybrid:
> > https://lore.kernel.org/lkml/20250614004528.1652860-1-irogers@google.com/
> > Basically on hybrid before doing the rdpmc instruction the test now
> > ensures the affinity matches that of the CPUs the perf event can be
> > scheduled upon.
>
> for PAPI we ended up using restartable sequences which more or less work
> for our case (though we had to get some fixes merged with the upstream
> rseq developers)
>
> I don't always track the libperf code as its license makes it
> incompatible with PAPI (which is BSD licensed). Does it handle the
> rdpmc case where by default on x86 you're supposed to sign-extend from 48
> bits (or whatever is in the mmap page) but some of the top-down event
> rdpmc registers use all 64-bits so the sign-extension corrupts the value?
Yep:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/perf/mmap.c?h=perf-tools-next#n514
The same code is in UAPI perf_event.h which has the Linux-syscall-note:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n651
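In outline, that read loop looks like this (a sketch following the
pattern documented in perf_event.h and perf_event_open(2), not the
exact libperf code; pc points at the event's perf_event_mmap_page and
barrier() is a compiler barrier such as asm volatile("" ::: "memory")):
	static uint64_t rdpmc(uint32_t counter)
	{
		uint32_t low, high;

		asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
		return (uint64_t)high << 32 | low;
	}

	uint32_t seq, idx, width;
	uint64_t count;
	int64_t pmc;

	do {
		seq = pc->lock;
		barrier();
		idx = pc->index;
		count = pc->offset;
		if (pc->cap_user_rdpmc && idx) {	/* idx == 0: not scheduled */
			width = pc->pmc_width;
			pmc = rdpmc(idx - 1);
			pmc <<= 64 - width;
			pmc >>= 64 - width;	/* sign-extend; a no-op when width == 64 */
			count += pmc;
		}
		barrier();
	} while (pc->lock != seq);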
Fwiw, I don't think restartable sequences would solve this problem:
- 2 events are programmed; 1 gets the fixed counter and 1 a generic counter,
- the thread gets ready to do rdpmc for the generic counter,
- another CPU/thread requests deleting the fixed-counter event,
- the event is deleted and the generic event is now scheduled onto
the fixed counter,
- the rdpmc executes, but the "index" is still set for the generic
counter while the kernel moved the event onto the fixed counter,
which I believe would have a different index.
But to be honest I don't understand how the mmap page works for, say,
only opening the event on a subset of the PMU's CPUs, or if the
"index" is loaded and the thread is then preempted onto a different
CPU with events scheduled differently (as with the delete case above).
Thanks,
Ian
> Vince Weaver
> vincent.weaver@maine.edu
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Ian Rogers @ 2025-06-17 16:37 UTC
To: Vince Weaver; +Cc: Liang, Kan, Peter Zijlstra, linux-perf-users
On Mon, Jun 16, 2025 at 9:36 PM Ian Rogers <irogers@google.com> wrote:
>
> On Mon, Jun 16, 2025 at 12:34 PM Vince Weaver <vincent.weaver@maine.edu> wrote:
> >
> > On Sun, 15 Jun 2025, Ian Rogers wrote:
> >
> > > So I spoke to Kan and posted a revised perf test that works on hybrid:
> > > https://lore.kernel.org/lkml/20250614004528.1652860-1-irogers@google.com/
> > > Basically on hybrid before doing the rdpmc instruction the test now
> > > ensures the affinity matches that of the CPUs the perf event can be
> > > scheduled upon.
> >
> > for PAPI we ended up using restartable sequences which more or less work
> > for our case (though we had to get some fixes merged with the upstream
> > rseq developers)
> >
> > I don't always track the libperf code as its license makes it
> > incompatible with PAPI (which is BSD licensed). Does it handle the
> > rdpmc case where by default on x86 you're supposed to sign-extend from 48
> > bits (or whatever is in the mmap page) but some of the top-down event
> > rdpmc registers use all 64-bits so the sign-extension corrupts the value?
>
> Yep:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/perf/mmap.c?h=perf-tools-next#n514
> The same code is in UAPI perf_event.h which has the Linux-syscall-note:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n651
>
> Fwiw, I don't think restartable sequences would solve this problem:
> - 2 events are programmed; 1 gets the fixed counter and 1 a generic counter,
> - the thread gets ready to do rdpmc for the generic counter,
> - another CPU/thread requests deleting the fixed-counter event,
> - the event is deleted and the generic event is now scheduled onto
> the fixed counter,
> - the rdpmc executes, but the "index" is still set for the generic
> counter while the kernel moved the event onto the fixed counter,
> which I believe would have a different index.
To answer my own question: the sequence number will vary after the
counter is read, and so the read loop will be retried with a different
index value loaded on the next iteration. Perhaps this is the right fix
for hybrid: if you have a hybrid system, a per-thread event with
cap_user_rdpmc, and the thread is rescheduled (say, P-core to E-core),
then the sequence number needs changing in the user-page mmap. This
will trigger a second loop iteration where the index, when loaded, will
be 0 for disabled.
Thanks,
Ian
> But to be honest I don't understand how the mmap page works for, say,
> only opening the event on a subset of the PMU's CPUs, or if the
> "index" is loaded and the thread is then preempted onto a different
> CPU with events scheduled differently (as with the delete case above).
>
> Thanks,
> Ian
>
> > Vince Weaver
> > vincent.weaver@maine.edu
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Peter Zijlstra @ 2025-06-17 18:17 UTC
To: Ian Rogers; +Cc: Vince Weaver, Liang, Kan, linux-perf-users
On Tue, Jun 17, 2025 at 09:37:39AM -0700, Ian Rogers wrote:
> To answer my own question: the sequence number will vary after the
> counter is read, and so the read loop will be retried with a different
> index value loaded on the next iteration. Perhaps this is the right fix
> for hybrid: if you have a hybrid system, a per-thread event with
> cap_user_rdpmc, and the thread is rescheduled (say, P-core to E-core),
> then the sequence number needs changing in the user-page mmap. This
> will trigger a second loop iteration where the index, when loaded, will
> be 0 for disabled.
It does; migrating the counter will change the sequence number. But this
is not sufficient, because meanwhile you will have executed the RDPMC --
which comes before double-checking the sequence number.
Issuing the RDPMC with a wrong index will cause a fault.
RSEQ is more strict, it will abort and redirect the instruction stream
on preemption/migration and ensure the RDPMC instruction will not be
issued.
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Ian Rogers @ 2025-06-17 20:36 UTC
To: Peter Zijlstra; +Cc: Vince Weaver, Liang, Kan, linux-perf-users
On Tue, Jun 17, 2025 at 11:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jun 17, 2025 at 09:37:39AM -0700, Ian Rogers wrote:
>
> > To answer my own question: the sequence number will vary after the
> > counter is read, and so the read loop will be retried with a different
> > index value loaded on the next iteration. Perhaps this is the right fix
> > for hybrid: if you have a hybrid system, a per-thread event with
> > cap_user_rdpmc, and the thread is rescheduled (say, P-core to E-core),
> > then the sequence number needs changing in the user-page mmap. This
> > will trigger a second loop iteration where the index, when loaded, will
> > be 0 for disabled.
>
> It does; migrating the counter will change the sequence number. But this
> is not sufficient, because meanwhile you will have executed the RDPMC --
> which comes before double-checking the sequence number.
>
> Issuing the RDPMC with a wrong index will cause a fault.
>
> RSEQ is more strict, it will abort and redirect the instruction stream
> on preemption/migration and ensure the RDPMC instruction will not be
> issued.
Thanks Peter, out of curiosity, how does the RDPMC fail? If an event
were rescheduled (even with RSEQ assuming the CPU the thread is
running on doesn't change) could the RDPMC with the incorrect index
also fail?
Thanks,
Ian
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Peter Zijlstra @ 2025-06-18 8:45 UTC
To: Ian Rogers, x86; +Cc: Vince Weaver, Liang, Kan, linux-perf-users
On Tue, Jun 17, 2025 at 01:36:40PM -0700, Ian Rogers wrote:
> Thanks Peter, out of curiosity, how does the RDPMC fail? If an event
> were rescheduled (even with RSEQ assuming the CPU the thread is
> running on doesn't change) could the RDPMC with the incorrect index
> also fail?
So the problem with RDPMC is that it will #GP when used with an invalid
index.
Normally, when counting using fixed or general purpose events, this is
not a problem, because every CPU in the machine has those. So while the
index might not be the event you were after, the instruction doesn't
trap, we observe the sequence changed and retry the loop. No harm done.
But for the top-down thingies, the P-cores will have a RDPMC idx that
the E-cores do not support. Using this index on an E-core will #GP.
So the following situation:
	CPU-P					CPU-E

	do {
		seq = pc->lock;
		barrier();

		index = pc->index;
		count = pc->count;
		if (index) {
			width = pc->pmc_width;

						<migrate task to CPU-E>

			rdpmc(index - 1);	<-- #GP
		}

		barrier();
	} while (pc->lock != seq);
If the task is migrated from a P-core to an E-core after reading the
index but before doing RDPMC, this turns fatal.
When we use RSEQ, the migration will abort the sequence and the RDPMC
will not be executed.
Aand... while writing this I wondered why we don't simply fix up the
fault in kernel space. Something like so...
---
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9f88b8a78e50..90391526acf7 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -677,6 +677,43 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
 
 #define GPFSTR "general protection fault"
 
+static bool fixup_rdpmc_exception(struct pt_regs *regs)
+{
+	struct mm_struct *mm = current->mm;
+	u8 buf[MAX_INSN_SIZE];
+	struct insn insn;
+	int len;
+
+	len = insn_fetch_from_user(regs, buf);
+	if (len <= 0)
+		return false;
+
+	if (!insn_decode_from_regs(&insn, regs, buf, len))
+		return false;
+
+	/* RDPMC */
+	if (insn.opcode.bytes[0] != 0x0f || insn.opcode.bytes[1] != 0x33)
+		return false;
+
+	if (!atomic_read(&mm->context.perf_rdpmc_allowed))
+		return false;
+
+	/*
+	 * So userspace RDPMC is allowed and took #GP.
+	 *
+	 * This means they got the index wrong. But per the ABI described in
+	 * struct perf_event_mmap_page; this means they'll also fail the
+	 * sequence lock and will retry the operation after re-reading the
+	 * index.
+	 *
+	 * Fake out the RDPMC by returning all zeros and continue.
+	 */
+	regs->ax = 0;
+	regs->dx = 0;
+	regs->ip += insn.length;
+	return true;
+}
+
 static bool fixup_iopl_exception(struct pt_regs *regs)
 {
 	struct thread_struct *t = &current->thread;
@@ -812,6 +849,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	}
 
 	if (user_mode(regs)) {
+		if (fixup_rdpmc_exception(regs))
+			goto exit;
+
 		if (fixup_iopl_exception(regs))
 			goto exit;
 
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Peter Zijlstra @ 2025-06-18 11:55 UTC
To: Ian Rogers, x86; +Cc: Vince Weaver, Liang, Kan, linux-perf-users
On Wed, Jun 18, 2025 at 10:45:22AM +0200, Peter Zijlstra wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 9f88b8a78e50..90391526acf7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -677,6 +677,43 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>  
>  #define GPFSTR "general protection fault"
>  
> +static bool fixup_rdpmc_exception(struct pt_regs *regs)
> +{
> +	struct mm_struct *mm = current->mm;
> +	u8 buf[MAX_INSN_SIZE];
> +	struct insn insn;
> +	int len;
> +
> +	len = insn_fetch_from_user(regs, buf);
> +	if (len <= 0)
> +		return false;
> +
> +	if (!insn_decode_from_regs(&insn, regs, buf, len))
> +		return false;
> +
> +	/* RDPMC */
> +	if (insn.opcode.bytes[0] != 0x0f || insn.opcode.bytes[1] != 0x33)
> +		return false;
> +
> +	if (!atomic_read(&mm->context.perf_rdpmc_allowed))
> +		return false;
> +
> +	/*
> +	 * So userspace RDPMC is allowed and took #GP.
> +	 *
> +	 * This means they got the index wrong. But per the ABI described in
> +	 * struct perf_event_mmap_page; this means they'll also fail the
> +	 * sequence lock and will retry the operation after re-reading the
> +	 * index.
> +	 *
> +	 * Fake out the RDPMC by returning all zeros and continue.
> +	 */
> +	regs->ax = 0;
> +	regs->dx = 0;
An alternative might be to return ~0UL in both registers. All 32-bit
chips have limited counter width and this will set the high bits. All
64-bit chips do not expect the high words of the registers to be set.
That way one could recognise the fail case, if that were so desired.
> +	regs->ip += insn.length;
> +	return true;
> +}
> +
>  static bool fixup_iopl_exception(struct pt_regs *regs)
>  {
>  	struct thread_struct *t = &current->thread;
> @@ -812,6 +849,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  	}
>  
>  	if (user_mode(regs)) {
> +		if (fixup_rdpmc_exception(regs))
> +			goto exit;
> +
>  		if (fixup_iopl_exception(regs))
>  			goto exit;
>  
* Re: perf_event: avoiding gpf on rdpmc for top-down-events on hybrid
From: Vince Weaver @ 2025-06-18 13:57 UTC
To: Peter Zijlstra
Cc: Ian Rogers, x86, Vince Weaver, Liang, Kan, linux-perf-users
On Wed, 18 Jun 2025, Peter Zijlstra wrote:
> On Tue, Jun 17, 2025 at 01:36:40PM -0700, Ian Rogers wrote:
>
> > Thanks Peter, out of curiosity, how does the RDPMC fail? If an event
> > were rescheduled (even with RSEQ assuming the CPU the thread is
> > running on doesn't change) could the RDPMC with the incorrect index
> > also fail?
>
> So the problem with RDPMC is that it will #GP when used with an invalid
> index.
>
> Normally, when counting using fixed or general purpose events, this is
> not a problem, because every CPU in the machine has those. So while the
> index might not be the event you were after, the instruction doesn't
> trap, we observe the sequence changed and retry the loop. No harm done.
>
> But for the top-down thingies, the P-cores will have a RDPMC idx that
> the E-cores do not support. Using this index on an E-core will #GP.
>
> So the following situation:
>
> 	CPU-P					CPU-E
>
> 	do {
> 		seq = pc->lock;
> 		barrier();
>
> 		index = pc->index;
> 		count = pc->count;
> 		if (index) {
> 			width = pc->pmc_width;
>
> 						<migrate task to CPU-E>
>
> 			rdpmc(index - 1);	<-- #GP
> 		}
>
> 		barrier();
> 	} while (pc->lock != seq);
>
> If the task is migrated from a P-core to an E-core after reading the
> index but before doing RDPMC, this turns fatal.
>
> When we use RSEQ, the migration will abort the sequence and the RDPMC
> will not be executed.
>
>
> Aand... while writing this I wondered why we don't simply fix up the
> fault in kernel space. Something like so...
hah, we went through a lot of trouble in PAPI to implement the rseq
solution because we were told it wasn't possible to fix this in the
kernel. I knew that couldn't really be true.
One thing to note about the sample mmap RDPMC code above: it doesn't
work with topdown counters, because you need to read all 64 bits with
rdpmc and the default code sign-extends from 48 bits, corrupting the
top bits. Though maybe it's been fixed since then to set pmc_width
properly if it's a topdown event.
Vince