linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
@ 2015-10-19 22:58 Andi Kleen
  2015-10-20 11:36 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2015-10-19 22:58 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Switch the cycles:pp alias from UOPS_RETITRED to INST_RETIRED.PREC_DIST.
The basic mechanism of abusing the inverse cmask to get all cycles
works the same as before.

PREC_DIST has special support for avoiding shadow effects, which
can give better results compare to UOPS_RETIRED. The drawback is
that PREC_DIST can only schedule on counter 1, but that is ok for
cycle sampling, as there is normally no need to do multiple cycle
sampling runs in parallel. It is still possible to run perf top
in parallel, as that doesn't use precise mode. Also of course
the multiplexing can still allow parallel operation.

The UOPS_RETIRED old alias is still available in raw form.

Example:

Sample a loop with 10 sqrt with old cycles:pp

  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
  9.13 │      sqrtps %xmm1,%xmm0
 11.58 │      sqrtps %xmm1,%xmm0
 11.51 │      sqrtps %xmm1,%xmm0
  6.27 │      sqrtps %xmm1,%xmm0
 10.38 │      sqrtps %xmm1,%xmm0
 12.20 │      sqrtps %xmm1,%xmm0
 12.74 │      sqrtps %xmm1,%xmm0
  5.40 │      sqrtps %xmm1,%xmm0
 10.14 │      sqrtps %xmm1,%xmm0
 10.51 │    ↑ jmp    10

We expect all 10 sqrt to get roughly the sample number of samples.

But you can see that the instruction directly after the jmp is
systematically underestimated in the result, due to sampling shadow
effects.

With the new PREC_DIST based sampling this problem is gone
and all instructions show up roughly evenly:

  9.51 │10:   sqrtps %xmm1,%xmm0
 11.74 │      sqrtps %xmm1,%xmm0
 11.84 │      sqrtps %xmm1,%xmm0
  6.05 │      sqrtps %xmm1,%xmm0
 10.46 │      sqrtps %xmm1,%xmm0
 12.25 │      sqrtps %xmm1,%xmm0
 12.18 │      sqrtps %xmm1,%xmm0
  5.26 │      sqrtps %xmm1,%xmm0
 10.13 │      sqrtps %xmm1,%xmm0
 10.43 │      sqrtps %xmm1,%xmm0
  0.16 │    ↑ jmp    10

Even with PREC_DIST there is still sampling skid and the result
is not completely even, but systematic shadow effects are
significantly reduced.

The improvements are mainly expected to make a difference
in high IPC code. With low IPC it should be similar.

The PREC_DIST event is supported back to IvyBridge, but I only tested
it on Skylake for now, so I only enabled it on Skylake.

On earlier parts there were various hardware bugs in it
(but no show stopper on IvyBridge and up I believe),
so it could be enabled there after sufficient testing.
On Sandy Bridge PREC_DIST can only be scheduled as a single
event on the PMU, which is too limiting. Before Sandy
Bridge it was not supported.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c    | 27 ++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  4 +++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f8aab48..f17772a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2487,6 +2487,31 @@ static void intel_pebs_aliases_snb(struct perf_event *event)
 	}
 }
 
+static void intel_pebs_aliases_skl(struct perf_event *event)
+{
+	if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
+		/*
+		 * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
+		 * (0x003c) so that we can use it with PEBS.
+		 *
+		 * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
+		 * PEBS capable. However we can use INST_RETIRED.PREC_DIST
+		 * (0x01c0), which is a PEBS capable event, to get the same
+		 * count.
+		 *
+		 * The PREC_DIST event has special support to minimize sample
+		 * shadowing effects. One drawback is that it can be
+		 * only programmed on counter 1, but that seems like an
+		 * acceptable trade off.
+		 */
+		u64 alt_config = X86_CONFIG(.event=0xc0, .umask=0x01, .inv=1, .cmask=16);
+
+		alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
+		event->hw.config = alt_config;
+	}
+}
+
+
 static unsigned long intel_pmu_free_running_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.free_running_flags;
@@ -3535,7 +3560,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_skl;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 5db1c77..acaebc7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -719,8 +719,10 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
-	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
+	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (old style cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:p). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-19 22:58 [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake Andi Kleen
@ 2015-10-20 11:36 ` Peter Zijlstra
  2015-10-20 22:28   ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2015-10-20 11:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

On Mon, Oct 19, 2015 at 03:58:16PM -0700, Andi Kleen wrote:

> Switch the cycles:pp alias from UOPS_RETITRED to INST_RETIRED.PREC_DIST.
> The basic mechanism of abusing the inverse cmask to get all cycles
> works the same as before.
> 
> PREC_DIST has special support for avoiding shadow effects, which
> can give better results compare to UOPS_RETIRED. The drawback is
> that PREC_DIST can only schedule on counter 1, but that is ok for
> cycle sampling, as there is normally no need to do multiple cycle
> sampling runs in parallel. It is still possible to run perf top
> in parallel, as that doesn't use precise mode. Also of course
> the multiplexing can still allow parallel operation.

So the worry I have with this is that there might indeed be people
wanting to use this in parallel.

Typically on workstations you do not, because there's only a single
user, but on servers it might be more common.  The thing I expect to be
most common is having both a CPU wide and a per task cycle counter
enabled.

This means a fairly visible change in behaviour depending on uarch.

And you having killed the flag bits for PEBS events precludes people
from using this manually, right?  I think we want to exempt .inv=1
.cmask=16 from that general rule on general utility value.

We could maybe abuse .precise_ip = 3 for this?

> On earlier parts there were various hardware bugs in it
> (but no show stopper on IvyBridge and up I believe),
> so it could be enabled there after sufficient testing.

Just enable it for IVB+ then.

> On Sandy Bridge PREC_DIST can only be scheduled as a single
> event on the PMU, which is too limiting. Before Sandy
> Bridge it was not supported.

Right, that was a bit cumbersome :-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-20 11:36 ` Peter Zijlstra
@ 2015-10-20 22:28   ` Andi Kleen
  2015-10-21  8:09     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2015-10-20 22:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, linux-kernel, Andi Kleen, Ingo Molnar,
	Thomas Gleixner

> Typically on workstations you do not, because there's only a single
> user, but on servers it might be more common.  The thing I expect to be
> most common is having both a CPU wide and a per task cycle counter
> enabled.

With multiple users they would sample only their own processes.
That works fine. The only thing that needs multiplexing is overlapping
sampling.

perf top also works fine because it doesn't use PEBS by default.

> And you having killed the flag bits for PEBS events precludes people
> from using this manually, right?  I think we want to exempt .inv=1
> .cmask=16 from that general rule on general utility value.

I don't want to do this for every event. It has caused problems
in the past. There's also no reason to use the other events.

> We could maybe abuse .precise_ip = 3 for this?

I implemented this now, even though it's ugly.

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-20 22:28   ` Andi Kleen
@ 2015-10-21  8:09     ` Peter Zijlstra
  2015-10-21 15:26       ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2015-10-21  8:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

On Wed, Oct 21, 2015 at 12:28:44AM +0200, Andi Kleen wrote:
> > Typically on workstations you do not, because there's only a single
> > user, but on servers it might be more common.  The thing I expect to be
> > most common is having both a CPU wide and a per task cycle counter
> > enabled.
> 
> With multiple users they would sample only their own processes.
> That works fine. The only thing that needs multiplexing is overlapping
> sampling.

Right, but suppose one user (root) would be sampling machine wide with
PEBS whilst another user is using it just on his one task.

I realize this is unlikely, but I bet there's people out there doing
this and cycles:pp is probably the most useful event by a long way.

> > And you having killed the flag bits for PEBS events precludes people
> > from using this manually, right?  I think we want to exempt .inv=1
> > .cmask=16 from that general rule on general utility value.
> 
> I don't want to do this for every event. It has caused problems
> in the past. There's also no reason to use the other events.

But it does allow users to play around with alternative encodings for
cycles etc.. We've made use of this in the past when SNB turned out to
have broken the event.

ISTR also having proposed a sysfs flag to reenable the pebs flags for
people who need this. That avoids the things being exposed by default.

> > We could maybe abuse .precise_ip = 3 for this?
> 
> I implemented this now, even though it's ugly.

Agreed, its not ideal.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-21  8:09     ` Peter Zijlstra
@ 2015-10-21 15:26       ` Andi Kleen
  2015-10-21 16:52         ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2015-10-21 15:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, linux-kernel, Ingo Molnar, Thomas Gleixner

> Right, but suppose one user (root) would be sampling machine wide with
> PEBS whilst another user is using it just on his one task.

The root case is likely perf top which does not use PEBS.

> > > And you having killed the flag bits for PEBS events precludes people
> > > from using this manually, right?  I think we want to exempt .inv=1
> > > .cmask=16 from that general rule on general utility value.
> > 
> > I don't want to do this for every event. It has caused problems
> > in the past. There's also no reason to use the other events.
> 
> But it does allow users to play around with alternative encodings for
> cycles etc.. We've made use of this in the past when SNB turned out to
> have broken the event.

If they want to violate the SDM they can hack the kernel.
It's risky to do as you well know.

> > > We could maybe abuse .precise_ip = 3 for this?
> > 
> > I implemented this now, even though it's ugly.
> 
> Agreed, its not ideal.

So it turns out that UOPS_RETIRED.ALL+pebs only works by accident
on Skylake. It's not in the specification.

So on Skylake it has to be unconditional for any cycles:p
On earlier parts it can be only for three p to handle your 
unlikely scenario.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-21 15:26       ` Andi Kleen
@ 2015-10-21 16:52         ` Peter Zijlstra
  2015-10-21 16:55           ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2015-10-21 16:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, linux-kernel, Ingo Molnar, Thomas Gleixner

On Wed, Oct 21, 2015 at 08:26:08AM -0700, Andi Kleen wrote:
> So it turns out that UOPS_RETIRED.ALL+pebs only works by accident
> on Skylake. It's not in the specification.

Accident or not, it works and you've tested it. Might as well keep it.
Specs can be updated.

Also, even with the SNB 'hickup' they still haven't got the memo that
cycles is important for PEBS ?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-21 16:52         ` Peter Zijlstra
@ 2015-10-21 16:55           ` Andi Kleen
  2015-10-21 18:57             ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2015-10-21 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Andi Kleen, linux-kernel, Ingo Molnar,
	Thomas Gleixner

On Wed, Oct 21, 2015 at 06:52:52PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 08:26:08AM -0700, Andi Kleen wrote:
> > So it turns out that UOPS_RETIRED.ALL+pebs only works by accident
> > on Skylake. It's not in the specification.
> 
> Accident or not, it works and you've tested it. Might as well keep it.
> Specs can be updated.

No, it's better not to use it.

> Also, even with the SNB 'hickup' they still haven't got the memo that
> cycles is important for PEBS ?

INST_RETIRED.PREC_DIST is a far better cycles than UOPS_RETIRED.ALL
(see my earlier example)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-21 16:55           ` Andi Kleen
@ 2015-10-21 18:57             ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2015-10-21 18:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, linux-kernel, Ingo Molnar, Thomas Gleixner

On Wed, Oct 21, 2015 at 06:55:42PM +0200, Andi Kleen wrote:
> On Wed, Oct 21, 2015 at 06:52:52PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 21, 2015 at 08:26:08AM -0700, Andi Kleen wrote:
> > > So it turns out that UOPS_RETIRED.ALL+pebs only works by accident
> > > on Skylake. It's not in the specification.
> > 
> > Accident or not, it works and you've tested it. Might as well keep it.
> > Specs can be updated.
> 
> No, it's better not to use it.

This is really rather unfortunate; what is the failure mode?

> > Also, even with the SNB 'hickup' they still haven't got the memo that
> > cycles is important for PEBS ?
> 
> INST_RETIRED.PREC_DIST is a far better cycles than UOPS_RETIRED.ALL
> (see my earlier example)

If you only want the one counter, yes.

And since you made me look at the SDM; it looks like our skl pebs table
isn't anything like the table there. Also, do the events BR_INST_RETIRED
have a composable umask? That is, .event=0xc4, .umask=41 is valid and
would be 'conditional far branches'.

The same would be true for the HSW/BDW table I suppose, the memops
retired stuff looks composable but we currently do not allow that.



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-10-21 18:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-19 22:58 [PATCH] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake Andi Kleen
2015-10-20 11:36 ` Peter Zijlstra
2015-10-20 22:28   ` Andi Kleen
2015-10-21  8:09     ` Peter Zijlstra
2015-10-21 15:26       ` Andi Kleen
2015-10-21 16:52         ` Peter Zijlstra
2015-10-21 16:55           ` Andi Kleen
2015-10-21 18:57             ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).