public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
@ 2010-12-22  8:12 Lin Ming
  2010-12-22  8:33 ` Peter Zijlstra
  2010-12-22  9:00 ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Lin Ming @ 2010-12-22  8:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Andi Kleen, Stephane Eranian,
	Frederic Weisbecker, Arjan van de Ven
  Cc: lkml

Hi, all

This patch adds load latency monitoring on Intel Nehalem/Westmere.
It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
patchsets are needed.

Updated perf offcore patchkit
http://marc.info/?l=linux-kernel&m=129103647731356&w=2

The load latency on Intel Nehalem/Westmere is monitored by event
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD(0x100b). It measures latency
from micro-operation (uop) dispatch to when data is globally observable
(GO).

To monitor load latency, both the PEBS_EN_CTRx and LL_EN_CTRx bits must
be set in the IA32_PEBS_ENABLE register, and an extra MSR,
MSR_PEBS_LD_LAT_THRESHOLD, must be programmed with the desired latency
threshold in core clock cycles. Loads with latencies greater than this
value are counted.

The latency threshold is encoded in the upper 32 bits of
perf_event_attr::config, and the 'p' modifier must be used to enable PEBS.

The default latency threshold is 3; as the Intel manual says, "The minimum
value that may be programmed in this register is 3 (the minimum
detectable load latency is 4 core clock cycles)."

Here are some example outputs.

# perf top -e r100b:p

----------------------------------------------------------------------------------------------------
   PerfTop:    1800 irqs/sec  kernel:41.9%  exact:  0.0% [1000Hz raw 0x100b],  (all, 4 CPUs)
----------------------------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ _________________________________________

              643.00 13.0% __lock_acquire                  [kernel.kallsyms]                        
              328.00  6.6% check_chain_key                 [kernel.kallsyms]                        
              308.00  6.2% mark_lock                       [kernel.kallsyms]                        
              286.00  5.8% lock_release                    [kernel.kallsyms]                        
              280.00  5.7% check_flags                     [kernel.kallsyms]                        
              239.00  4.8% use_config                      /home/mlin/linux-2.6/scripts/basic/fixdep
              201.00  4.1% trace_hardirqs_on_caller        [kernel.kallsyms]                        
              193.00  3.9% lock_acquire                    [kernel.kallsyms]                        
              182.00  3.7% validate_chain                  [kernel.kallsyms]                        
              162.00  3.3% __gconv_transform_utf8_internal /lib64/libc-2.8.so                       
              145.00  2.9% trace_hardirqs_off_caller       [kernel.kallsyms]                        
              115.00  2.3% mark_held_locks                 [kernel.kallsyms]                        
              102.00  2.1% __GI_mbrtowc                    /lib64/libc-2.8.so                       
               76.00  1.5% do_raw_spin_lock                [kernel.kallsyms]                        
               73.00  1.5% parse_dep_file                  /home/mlin/linux-2.6/scripts/basic/fixdep
               71.00  1.4% restore                         [kernel.kallsyms]                        
               59.00  1.2% _int_malloc                     /lib64/libc-2.8.so  

# Monitor load latency > 51 cycles
# perf top -e r510000100b:p

----------------------------------------------------------------------------------------------------
   PerfTop:    2055 irqs/sec  kernel:47.0%  exact:  0.0% [1000Hz raw 0x510000100b],  (all, 4 CPUs)
----------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ _________________________________________

              419.00  9.8% __lock_acquire            [kernel.kallsyms]                        
              278.00  6.5% parse_dep_file            /home/mlin/linux-2.6/scripts/basic/fixdep
              195.00  4.6% find_get_page             [kernel.kallsyms]                        
              195.00  4.6% get_page_from_freelist    [kernel.kallsyms]                        
              184.00  4.3% __d_lookup                [kernel.kallsyms]                        
              180.00  4.2% memcmp                    [kernel.kallsyms]                        
              135.00  3.2% __rmqueue                 [kernel.kallsyms]                        
              123.00  2.9% __GI_strcmp               /lib64/libc-2.8.so                       
               92.00  2.2% unmap_vmas                [kernel.kallsyms]                        
               88.00  2.1% validate_chain            [kernel.kallsyms]                        
               85.00  2.0% do_raw_spin_lock          [kernel.kallsyms]                        
               78.00  1.8% __wake_up_bit             [kernel.kallsyms]                        
               77.00  1.8% __GI_strlen               /lib64/libc-2.8.so                       
               73.00  1.7% copy_user_generic_string  [kernel.kallsyms]                        
               71.00  1.7% page_remove_rmap          [kernel.kallsyms]                        
               70.00  1.6% kmem_cache_free           [kernel.kallsyms]                        
               67.00  1.6% free_pcppages_bulk        [kernel.kallsyms]                        
               62.00  1.5% kmem_cache_alloc          [kernel.kallsyms]                        
               58.00  1.4% ext3_release_file         [kernel.kallsyms]                        
               54.00  1.3% radix_tree_lookup_element [kernel.kallsyms]                        
               47.00  1.1% generic_fillattr          [kernel.kallsyms] 

Any comments are appreciated.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c          |   18 +++++++++++++++---
 arch/x86/kernel/cpu/perf_event_intel.c    |    4 ++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    5 +++++
 include/linux/perf_event.h                |    1 +
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ed6ff11..2a02529 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -197,18 +197,25 @@ struct extra_reg {
 	unsigned int		extra_shift;
 	u64			config_mask;
 	u64			valid_mask;
+	u64			flags;
 };
 
-#define EVENT_EXTRA_REG(e, ms, m, vm, es) {	\
+#define EVENT_EXTRA_REG(e, ms, m, vm, es, f) {	\
 	.event = (e),		\
 	.msr = (ms),		\
 	.config_mask = (m),	\
 	.valid_mask = (vm),	\
 	.extra_shift = (es),	\
+	.flags = (f),	\
 	}
 #define INTEL_EVENT_EXTRA_REG(event, msr, vm, es)	\
-	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es)
-#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0)
+	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es, 0)
+#define INTEL_EVENT_EXTRA_REG2(event, msr, vm, es, f)	\
+	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
+			ARCH_PERFMON_EVENTSEL_UMASK, vm, es, f)
+#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0, 0)
+
+#define EXTRA_REG_LD_LAT 0x1
 
 union perf_capabilities {
 	struct {
@@ -384,6 +391,11 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
 		if (extra & ~er->valid_mask)
 			return -EINVAL;
 		event->hw.extra_config = extra;
+		event->hw.extra_flags = er->flags;
+
+		/* The minimum value that may be programmed into MSR_PEBS_LD_LAT is 3 */
+		if ((er->flags & EXTRA_REG_LD_LAT) && extra < 3)
+			event->hw.extra_config = 3;
 		break;
 	}
 	return 0;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index bc4afb1..7e2b873 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -89,6 +89,8 @@ static struct event_constraint intel_nehalem_event_constraints[] =
 static struct extra_reg intel_nehalem_extra_regs[] =
 {
 	INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE */
+	/* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
+	INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
 	EVENT_EXTRA_END
 };
 
@@ -114,6 +116,8 @@ static struct extra_reg intel_westmere_extra_regs[] =
 {
 	INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE_0 */
 	INTEL_EVENT_EXTRA_REG(0xbb, 0x1a7, 0xffff, 32), /* OFFCORE_RESPONSE_1 */
+	/* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
+	INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
 	EVENT_EXTRA_END
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b7dcd9f..d008c40 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -376,6 +376,7 @@ static struct event_constraint intel_core_pebs_events[] = {
 };
 
 static struct event_constraint intel_nehalem_pebs_events[] = {
+	PEBS_EVENT_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
 	PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
 	PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
 	PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
@@ -414,6 +415,8 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
+	if (hwc->extra_flags & EXTRA_REG_LD_LAT)
+		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
 	WARN_ON_ONCE(cpuc->enabled);
 
 	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
@@ -426,6 +429,8 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
 	struct hw_perf_event *hwc = &event->hw;
 
 	cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+	if (hwc->extra_flags & EXTRA_REG_LD_LAT)
+		cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
 	if (cpuc->enabled)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d24d9ab..38bffa4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -541,6 +541,7 @@ struct hw_perf_event {
 			int		last_cpu;
 			unsigned int	extra_reg;
 			u64		extra_config;
+			u64		extra_flags;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22  8:12 [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere Lin Ming
@ 2010-12-22  8:33 ` Peter Zijlstra
  2010-12-22  8:47   ` Lin Ming
  2010-12-22  9:04   ` Peter Zijlstra
  2010-12-22  9:00 ` Peter Zijlstra
  1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22  8:33 UTC (permalink / raw)
  To: Lin Ming
  Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
	Arjan van de Ven, lkml

On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> patchsets are needed. 

I still haven't gotten a version of that I'm willing to apply.


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22  8:33 ` Peter Zijlstra
@ 2010-12-22  8:47   ` Lin Ming
  2010-12-22  9:04   ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Lin Ming @ 2010-12-22  8:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
	Arjan van de Ven, lkml

On Wed, 2010-12-22 at 16:33 +0800, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > patchsets are needed. 
> 
> I still haven't gotten a version of that I'm willing to apply.

OK, then I'll update this patch after the offcore patches are applied.




* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22  8:12 [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere Lin Ming
  2010-12-22  8:33 ` Peter Zijlstra
@ 2010-12-22  9:00 ` Peter Zijlstra
  2010-12-22 10:08   ` Stephane Eranian
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22  9:00 UTC (permalink / raw)
  To: Lin Ming
  Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index ed6ff11..2a02529 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -197,18 +197,25 @@ struct extra_reg {
>         unsigned int            extra_shift;
>         u64                     config_mask;
>         u64                     valid_mask;
> +       u64                     flags;
>  };
>  
> -#define EVENT_EXTRA_REG(e, ms, m, vm, es) {    \
> +#define EVENT_EXTRA_REG(e, ms, m, vm, es, f) { \
>         .event = (e),           \
>         .msr = (ms),            \
>         .config_mask = (m),     \
>         .valid_mask = (vm),     \
>         .extra_shift = (es),    \
> +       .flags = (f),   \
>         }
>  #define INTEL_EVENT_EXTRA_REG(event, msr, vm, es)      \
> -       EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es)
> -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0)
> +       EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es, 0)
> +#define INTEL_EVENT_EXTRA_REG2(event, msr, vm, es, f)  \
> +       EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
> +                       ARCH_PERFMON_EVENTSEL_UMASK, vm, es, f)
> +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0, 0)

You'll need to increment MAX_EXTRA_REGS to 3 I think.

> +#define EXTRA_REG_LD_LAT 0x1

I'm not quite sure we actually need the whole flags business.

>  union perf_capabilities {
>         struct {
> @@ -384,6 +391,11 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
>                 if (extra & ~er->valid_mask)
>                         return -EINVAL;
>                 event->hw.extra_config = extra;
> +               event->hw.extra_flags = er->flags;
> +
> +               /* The minimum value that may be programmed into MSR_PEBS_LD_LAT is 3 */
> +               if ((er->flags & EXTRA_REG_LD_LAT) && extra < 3)
> +                       event->hw.extra_config = 3;

	if (er->msr == MSR_PEBS_LD_LAT_THRESHOLD && extra < 3)
		event->hw.extra_config = 3;

>                 break;
>         }
>         return 0;
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index bc4afb1..7e2b873 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -89,6 +89,8 @@ static struct event_constraint intel_nehalem_event_constraints[] =
>  static struct extra_reg intel_nehalem_extra_regs[] =
>  {
>         INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE */
> +       /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
> +       INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
>         EVENT_EXTRA_END
>  };

Maybe use the MSR names instead of the numbers.


> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index b7dcd9f..d008c40 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -376,6 +376,7 @@ static struct event_constraint intel_core_pebs_events[] = {
>  };
>  
>  static struct event_constraint intel_nehalem_pebs_events[] = {
> +       PEBS_EVENT_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
>         PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
>         PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
>         PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
> @@ -414,6 +415,8 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
>         hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
>  
>         cpuc->pebs_enabled |= 1ULL << hwc->idx;
> +       if (hwc->extra_flags & EXTRA_REG_LD_LAT)
> +               cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);

	if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);

>         WARN_ON_ONCE(cpuc->enabled);
>  
>         if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
> @@ -426,6 +429,8 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
>         struct hw_perf_event *hwc = &event->hw;
>  
>         cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
> +       if (hwc->extra_flags & EXTRA_REG_LD_LAT)
> +               cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));

	if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
		cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));

>         if (cpuc->enabled)
>                 wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>  
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d24d9ab..38bffa4 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -541,6 +541,7 @@ struct hw_perf_event {
>                         int             last_cpu;
>                         unsigned int    extra_reg;
>                         u64             extra_config;
> +                       u64             extra_flags;
>                 };
>                 struct { /* software */
>                         struct hrtimer  hrtimer;
> 

Which then also obviates the need for this extra field.

You also need some extra goo in intel_pmu_drain_pebs_nhm(); we can
already use PERF_SAMPLE_ADDR for the linear data address provided by
the pebs-ll thing, and we might need to add:

 PERF_SAMPLE_LATENCY -- Stephane said other archs can also use this

Not quite sure what to do for the source bits, POWER also has some extra
bits, but I'm not sure they qualify as purely source bits. And
interpreting them is going to be inherently arch specific, which
sucks :/




* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22  8:33 ` Peter Zijlstra
  2010-12-22  8:47   ` Lin Ming
@ 2010-12-22  9:04   ` Peter Zijlstra
  2010-12-22 10:14     ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22  9:04 UTC (permalink / raw)
  To: Lin Ming
  Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
	Arjan van de Ven, lkml

On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > patchsets are needed. 
> 
> I still haven't gotten a version of that I'm willing to apply.

That is, maybe you could pick up his patches and finish them, Andi
doesn't seem interested in getting them sorted.

I think the last posting lost the needs_percore logic, which would avoid
the percore allocations on platforms that don't need it (if we can
detect if HT is enabled that would be nice too).

I think he also lost the topology iteration (which can be used for both
the percore and the AMD northbridge stuff); currently we iterate the
full machine looking for a matching nb/core_id, and using a smaller
topology mask makes sense.

Also, after reading your patch, I think we want the MSR names in the
lists, not the numbers.



* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22  9:00 ` Peter Zijlstra
@ 2010-12-22 10:08   ` Stephane Eranian
  2010-12-22 10:45     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Stephane Eranian @ 2010-12-22 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

Hi,

On Wed, Dec 22, 2010 at 10:00 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> index ed6ff11..2a02529 100644
>> --- a/arch/x86/kernel/cpu/perf_event.c
>> +++ b/arch/x86/kernel/cpu/perf_event.c
>> @@ -197,18 +197,25 @@ struct extra_reg {
>>         unsigned int            extra_shift;
>>         u64                     config_mask;
>>         u64                     valid_mask;
>> +       u64                     flags;
>>  };
>>
>> -#define EVENT_EXTRA_REG(e, ms, m, vm, es) {    \
>> +#define EVENT_EXTRA_REG(e, ms, m, vm, es, f) { \
>>         .event = (e),           \
>>         .msr = (ms),            \
>>         .config_mask = (m),     \
>>         .valid_mask = (vm),     \
>>         .extra_shift = (es),    \
>> +       .flags = (f),   \
>>         }
>>  #define INTEL_EVENT_EXTRA_REG(event, msr, vm, es)      \
>> -       EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es)
>> -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0)
>> +       EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es, 0)
>> +#define INTEL_EVENT_EXTRA_REG2(event, msr, vm, es, f)  \
>> +       EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
>> +                       ARCH_PERFMON_EVENTSEL_UMASK, vm, es, f)
>> +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0, 0)
>
> You'll need to increment MAX_EXTRA_REGS to 3 I think.
>
>> +#define EXTRA_REG_LD_LAT 0x1
>
> I'm not quite sure we actually need the whole flags business.
>
>>  union perf_capabilities {
>>         struct {
>> @@ -384,6 +391,11 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
>>                 if (extra & ~er->valid_mask)
>>                         return -EINVAL;
>>                 event->hw.extra_config = extra;
>> +               event->hw.extra_flags = er->flags;
>> +
>> +               /* The minimum value that may be programmed into MSR_PEBS_LD_LAT is 3 */
>> +               if ((er->flags & EXTRA_REG_LD_LAT) && extra < 3)
>> +                       event->hw.extra_config = 3;
>
>        if (er->msr == MSR_PEBS_LD_LAT_THRESHOLD && extra < 3)
>                event->hw.extra_config = 3;
>
>>                 break;
>>         }
>>         return 0;
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
>> index bc4afb1..7e2b873 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -89,6 +89,8 @@ static struct event_constraint intel_nehalem_event_constraints[] =
>>  static struct extra_reg intel_nehalem_extra_regs[] =
>>  {
>>         INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE */
>> +       /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
>> +       INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
>>         EVENT_EXTRA_END
>>  };
>
> Maybe use the MSR names instead of the numbers.
>
>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> index b7dcd9f..d008c40 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> @@ -376,6 +376,7 @@ static struct event_constraint intel_core_pebs_events[] = {
>>  };
>>
>>  static struct event_constraint intel_nehalem_pebs_events[] = {
>> +       PEBS_EVENT_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
>>         PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
>>         PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
>>         PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
>> @@ -414,6 +415,8 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
>>         hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
>>
>>         cpuc->pebs_enabled |= 1ULL << hwc->idx;
>> +       if (hwc->extra_flags & EXTRA_REG_LD_LAT)
>> +               cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
>
>        if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
>                cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
>
>>         WARN_ON_ONCE(cpuc->enabled);
>>
>>         if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
>> @@ -426,6 +429,8 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
>>         struct hw_perf_event *hwc = &event->hw;
>>
>>         cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
>> +       if (hwc->extra_flags & EXTRA_REG_LD_LAT)
>> +               cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
>
>        if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
>                cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
>
>>         if (cpuc->enabled)
>>                 wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index d24d9ab..38bffa4 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -541,6 +541,7 @@ struct hw_perf_event {
>>                         int             last_cpu;
>>                         unsigned int    extra_reg;
>>                         u64             extra_config;
>> +                       u64             extra_flags;
>>                 };
>>                 struct { /* software */
>>                         struct hrtimer  hrtimer;
>>
>
> Which then also obviates the need for this extra field.
>
> You also need some extra goo in intel_pmu_drain_pebs_nhm(), we can
> already use the PERF_SAMPLE_ADDR for the linear data address provided by
> the pebs-ll thing, and we might need to add:
>
>  PERF_SAMPLE_LATENCY -- Stephane said other archs can also use this
>
Extracting the instruction address is not so useful. You need the
instruction and data addresses, the latency and data source. As Peter
pointed out, you can use PERF_SAMPLE_ADDR for the data address.

True. And also we would need a PERF_SAMPLE_DATA_SRC to extract
the data source information. Other archs also have that.

Note that PEBS-Load latency needs the IP+1 correction. It points to the
instruction address after the load/lfetch. But I suspect your patch already
takes care of that.

> Not quite sure what to do for the source bits, POWER also has some extra
> bits, but I'm not sure they qualify as purely source bits. And
> interpreting them is going to be inherently arch specific, which
> sucks :/
>
>
Yes, I think there is more to it than just data source, unfortunately.
If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
and so on.


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22  9:04   ` Peter Zijlstra
@ 2010-12-22 10:14     ` Ingo Molnar
  2010-12-23  1:14       ` Lin Ming
  2010-12-23  7:35       ` Andi Kleen
  0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2010-12-22 10:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lin Ming, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
	Arjan van de Ven, lkml


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > > patchsets are needed. 
> > 
> > I still haven't gotten a version of that I'm willing to apply.
> 
> That is, maybe you could pick up his patches and finish them, Andi
> doesn't seem interested in getting them sorted.

FYI, this is one of Peter's reviews that went unaddressed:

  https://lkml.org/lkml/2010/12/1/213

and he got no reply to that feedback in the past 3 weeks. (There may be other issues 
outstanding, i have not checked.)

Thanks,

	Ingo


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22 10:08   ` Stephane Eranian
@ 2010-12-22 10:45     ` Peter Zijlstra
  2010-12-22 10:49       ` Peter Zijlstra
  2010-12-23  8:28       ` Lin Ming
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 10:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> Yes, I think there is more to it than just data source, unfortunately.
> If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> and so on. 

I guess we can do things like:

Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
encode "Pending core cache HIT" as L2-snoop or something, whatever is
most appropriate.

But does that cover every architecture?

Also, since that doesn't require more than 4 bits to encode, we could
try and categorize what else is around and try and create a well
specified _EXTRA register, I mean, we still got 60 bits left after this.


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22 10:45     ` Peter Zijlstra
@ 2010-12-22 10:49       ` Peter Zijlstra
  2010-12-23  8:59         ` Lin Ming
  2010-12-23  8:28       ` Lin Ming
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 10:49 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Wed, 2010-12-22 at 11:45 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > Yes, I think there is more to it than just data source, unfortunately.
> > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > and so on. 
> 
> I guess we can do things like:
> 
> Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> encode "Pending core cache HIT" as L2-snoop or something, whatever is
> most appropriate.

Ah, I just saw my email window covered part of the spec and we can also
have x{shared,exclusive}, so we end up with:

{L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
uncached, IO}

Which takes all of 5 bits to encode.

> But does that cover every architecture?
> 
> Also, since that doesn't require more than 4 bits to encode, we could
> try and categorize what else is around and try and create a well
> specified _EXTRA register, I mean, we still got 60 bits left after this.

Leaving us with 59 bits to consider.


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22 10:14     ` Ingo Molnar
@ 2010-12-23  1:14       ` Lin Ming
  2010-12-23  7:35       ` Andi Kleen
  1 sibling, 0 replies; 19+ messages in thread
From: Lin Ming @ 2010-12-23  1:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
	Arjan van de Ven, lkml

On Wed, 2010-12-22 at 18:14 +0800, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > > > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > > > patchsets are needed. 
> > > 
> > > I still haven't gotten a version of that I'm willing to apply.
> > 
> > That is, maybe you could pick up his patches and finish them, Andi
> > doesn't seem interested in getting them sorted.
> 
> FYI, this is one of Peter's reviews that went unaddressed:
> 
>   https://lkml.org/lkml/2010/12/1/213
> 
> and he got no reply to that feedback in the past 3 weeks. (There may be
> other issues outstanding, I have not checked.)

Andi mentioned he would be on vacation in December; see the end of the
mail below.
http://marc.info/?l=linux-kernel&m=129136895415202&w=2

> 
> Thanks,
> 
> 	Ingo




* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22 10:14     ` Ingo Molnar
  2010-12-23  1:14       ` Lin Ming
@ 2010-12-23  7:35       ` Andi Kleen
  1 sibling, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2010-12-23  7:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Lin Ming, Andi Kleen, Stephane Eranian,
	Frederic Weisbecker, Arjan van de Ven, lkml

>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
>> > On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
>> > > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
>> > > patchsets are needed.
>> >
>> > I still haven't gotten a version of that I'm willing to apply.
>>
>> That is, maybe you could pick up his patches and finish them, Andi
>> doesn't seem interested in getting them sorted.
>
> FYI, this is one of Peter's reviews that went unaddressed:
>
>   https://lkml.org/lkml/2010/12/1/213
>
> and he got no reply to that feedback in the past 3 weeks. (There may be
> other issues outstanding, I have not checked.)

Sorry, I didn't see that email. I dropped the variable intentionally
to simplify the code; there is no fastpath overhead for the other
CPUs, and allocating a few bytes per core didn't seem worth having a
special variable to guard it. I can re-add it if people really
want it.

-Andi



* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22 10:45     ` Peter Zijlstra
  2010-12-22 10:49       ` Peter Zijlstra
@ 2010-12-23  8:28       ` Lin Ming
  2010-12-23 10:11         ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Lin Ming @ 2010-12-23  8:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Wed, 2010-12-22 at 18:45 +0800, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > Yes, I think there is more to it than just data source, unfortunately.
> > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > and so on. 
> 
> I guess we can do things like:
> 
> Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> encode "Pending core cache HIT" as L2-snoop or something, whatever is
> most appropriate.
> 
> But does that cover every architecture?
> 
> Also, since that doesn't require more than 4 bits to encode, we could
> try and categorize what else is around and try and create a
> well-specified _EXTRA register; I mean, we still got 60 bits left after this.

Could you tell me more about this well-specified _EXTRA register?




* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-22 10:49       ` Peter Zijlstra
@ 2010-12-23  8:59         ` Lin Ming
  2010-12-23 10:18           ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Lin Ming @ 2010-12-23  8:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Wed, 2010-12-22 at 18:49 +0800, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:45 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > > Yes, I think there is more to it than just data source, unfortunately.
> > > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > > and so on. 
> > 
> > I guess we can do things like:
> > 
> > Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> > encode "Pending core cache HIT" as L2-snoop or something, whatever is
> > most appropriate.
> 
> Ah, I just saw my email window covered part of the spec and we can also
> have x{shared,exclusive}, so we end up with:
> 
> {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
> uncached, IO}
> 
> Which takes all of 5 bits to encode.

Do you mean the encoding below?

bits4 3 2 1 0
    + + + + +
    | | | | |
    | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
    | | |
    | {snoop, local, remote, OTHER}
    |
    {shared, exclusive}

If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
uncached, IO}.

> 
> > But does that cover every architecture?
> > 
> > Also, since that doesn't require more than 4 bits to encode, we could
> > try and categorize what else is around and try and create a
> > well-specified _EXTRA register; I mean, we still got 60 bits left after this.
> 
> Leaving us with 59 bits to consider.




* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-23  8:28       ` Lin Ming
@ 2010-12-23 10:11         ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 10:11 UTC (permalink / raw)
  To: Lin Ming
  Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Thu, 2010-12-23 at 16:28 +0800, Lin Ming wrote:
> On Wed, 2010-12-22 at 18:45 +0800, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > > Yes, I think there is more to it than just data source, unfortunately.
> > > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > > and so on. 
> > 
> > I guess we can do things like:
> > 
> > Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> > encode "Pending core cache HIT" as L2-snoop or something, whatever is
> > most appropriate.
> > 
> > But does that cover every architecture?
> > 
> > Also, since that doesn't require more than 4 bits to encode, we could
> > try and categorize what else is around and try and create a
> > well-specified _EXTRA register; I mean, we still got 60 bits left after this.
> 
> Could you tell me more about this well-specified _EXTRA register?

Like how we should look at other archs (like POWER) and try and figure
out what other useful information is available and come up with a
generic format for that?

We could of course leave the 59 bits as reserved and fill them in later,
and go with the data source bits only for now.


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-23  8:59         ` Lin Ming
@ 2010-12-23 10:18           ` Peter Zijlstra
  2010-12-23 10:31             ` Stephane Eranian
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 10:18 UTC (permalink / raw)
  To: Lin Ming
  Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
> > uncached, IO}
> > 
> > Which takes all of 5 bits to encode.
> 
> Do you mean the encoding below?
> 
> bits4 3 2 1 0
>     + + + + +
>     | | | | |
>     | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
>     | | |
>     | {snoop, local, remote, OTHER}
>     |
>     {shared, exclusive}
> 
> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
> uncached, IO}. 

That is most certainly a very valid encoding, and a rather nice one at
that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)

If you also make OTHER=0, then a valid encoding for unknown is also 0,
which is a nice meaning for 0...


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-23 10:18           ` Peter Zijlstra
@ 2010-12-23 10:31             ` Stephane Eranian
  2010-12-23 10:48               ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Stephane Eranian @ 2010-12-23 10:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Thu, Dec 23, 2010 at 11:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
>> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
>> > uncached, IO}
>> >
>> > Which takes all of 5 bits to encode.
>>
>> Do you mean the encoding below?
>>
>> bits4 3 2 1 0
>>     + + + + +
>>     | | | | |
>>     | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
>>     | | |
>>     | {snoop, local, remote, OTHER}
>>     |
>>     {shared, exclusive}
>>
>> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
>> uncached, IO}.
>
> That is most certainly a very valid encoding, and a rather nice one at
> that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
>
> If you also make OTHER=0, then a valid encoding for unknown is also 0,
> which is a nice meaning for 0...
>
I am not sure how you would cover the 9 possibilities for data source as
shown in Table 10-13 using this encoding. Could you show me?


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-23 10:31             ` Stephane Eranian
@ 2010-12-23 10:48               ` Peter Zijlstra
  2010-12-23 11:05                 ` Stephane Eranian
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 10:48 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Thu, 2010-12-23 at 11:31 +0100, Stephane Eranian wrote:
> On Thu, Dec 23, 2010 at 11:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
> >> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
> >> > uncached, IO}
> >> >
> >> > Which takes all of 5 bits to encode.
> >>
> >> Do you mean the encoding below?
> >>
> >> bits4 3 2 1 0
> >>     + + + + +
> >>     | | | | |
> >>     | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
> >>     | | |
> >>     | {snoop, local, remote, OTHER}
> >>     |
> >>     {shared, exclusive}
> >>
> >> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
> >> uncached, IO}.
> >
> > That is most certainly a very valid encoding, and a rather nice one at
> > that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
> >
> > If you also make OTHER=0, then a valid encoding for unknown is also 0,
> > which is a nice meaning for 0...
> >
> I am not sure how you would cover the 9 possibilities for data source as
> shown in Table 10-13 using this encoding. Could you show me?

Ah, I think I see the problem, there are multiple L3-snoops; I guess we
can fix that by extending the {shared, exclusive} to full MESI, growing
us to 6 bits.

I'm assuming you mean "Table 30-13. Data Source Encoding for Load
Latency Record", which has 14 values defined.

Value	Intel				Perf
0x0	Unknown L3			Unknown

0x1	L1				L1-local

0x2	Pending core cache HIT		L2-snoop
	Outstanding core cache miss to
	the same line was underway
0x3	L2				L2-local

0x4	L3-snoop, no coherency actions	L3-snoop-I
0x5	L3-snoop, found no M		L3-snoop-S
0x6	L3-snoop, found M		L3-snoop-M

0x8     L3-miss, snoop, shared		RAM-snoop-S
0xA	L3-miss, local, shared		RAM-local-S
0xB	L3-miss, remote, shared		RAM-remote-S

0xC	L3-miss, local, exclusive	RAM-local-E
0xD	L3-miss, remote, exclusive	RAM-remote-E

0xE	IO				IO
0xF	uncached			uncached


Leaving us with: 

{L1, L2, L3, RAM}x{snoop, local, remote}x{modified, exclusive, shared, invalid} + {unknown, uncached, IO}

Now the question is, is this sufficient to map all data sources from
other archs as well?


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-23 10:48               ` Peter Zijlstra
@ 2010-12-23 11:05                 ` Stephane Eranian
  2010-12-23 11:37                   ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Stephane Eranian @ 2010-12-23 11:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Thu, Dec 23, 2010 at 11:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-12-23 at 11:31 +0100, Stephane Eranian wrote:
>> On Thu, Dec 23, 2010 at 11:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
>> >> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
>> >> > uncached, IO}
>> >> >
>> >> > Which takes all of 5 bits to encode.
>> >>
>> >> Do you mean the encoding below?
>> >>
>> >> bits4 3 2 1 0
>> >>     + + + + +
>> >>     | | | | |
>> >>     | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
>> >>     | | |
>> >>     | {snoop, local, remote, OTHER}
>> >>     |
>> >>     {shared, exclusive}
>> >>
>> >> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
>> >> uncached, IO}.
>> >
>> > That is most certainly a very valid encoding, and a rather nice one at
>> > that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
>> >
>> > If you also make OTHER=0, then a valid encoding for unknown is also 0,
>> > which is a nice meaning for 0...
>> >
>> I am not sure how you would cover the 9 possibilities for data source as
>> shown in Table 10-13 using this encoding. Could you show me?
>
> Ah, I think I see the problem, there's multiple L3-snoops, I guess we
> can fix that by extending the {shared, exclusive} to full MESI, growing
> us to 6 bits.
>
> I'm assuming you mean "Table 30-13. Data Source Encoding for Load
> Latency Record", which has 14 values defined.
>
Yes.

> Value   Intel                           Perf
> 0x0     Unknown L3                      Unknown
>
> 0x1     L1                              L1-local
>
> 0x2     Pending core cache HIT          L2-snoop
>        Outstanding core cache miss to

It's not clear how you know this is a snoop, or L2?

I suspect this one is saying you have a request for a line
for which there is already a pending request underway. It could be that
the first came from the prefetchers and the 2nd is actual demand.

Let me check with Intel. The table is unclear.

>        the same line was underway
> 0x3     L2                              L2-local
>
> 0x4     L3-snoop, no coherency actions  L3-snoop-I

I am not sure I understand what you mean by local vs. remote
in your terminology.


> 0x5     L3-snoop, found no M            L3-snoop-S
> 0x6     L3-snoop, found M               L3-snoop-M
>
> 0x8     L3-miss, snoop, shared          RAM-snoop-S
> 0xA     L3-miss, local, shared          RAM-local-S
> 0xB     L3-miss, remote, shared         RAM-remote-S
>
> 0xC     L3-miss, local, exclusive       RAM-local-E
> 0xD     L3-miss, remote, exclusive      RAM-remote-E
>
> 0xE     IO                              IO
> 0xF     uncached                        uncached
>
>
> Leaving us with:
>
> {L1, L2, L3, RAM}x{snoop, local, remote}x{modified, exclusive, shared, invalid} + {unknown, uncached, IO}
>
> Now the question is, is this sufficient to map all data sources from
> other archs as well?
>


* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
  2010-12-23 11:05                 ` Stephane Eranian
@ 2010-12-23 11:37                   ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 11:37 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
	Arjan van de Ven, lkml, paulus

On Thu, 2010-12-23 at 12:05 +0100, Stephane Eranian wrote:

> > Value   Intel                           Perf
> > 0x0     Unknown L3                      Unknown
> >
> > 0x1     L1                              L1-local
> >
> > 0x2     Pending core cache HIT          L2-snoop
> >        Outstanding core cache miss to
> 
> Not clear how you know this is snoop or L2?
> 
> I suspect this one is saying you have a request for a line
> for which there is already a pending request underway. Could
> be the first came from prefetchers, the 2nd is actual demand.
> 
> Let me check with Intel. The table is unclear.

Right, so cache snoops as used by Intel are data transfer operations
(not only the watching for remote modifications and local invalidation
as per the strict definition); they typically short-circuit a complete
fetch, getting the data from a neighboring cache or from otherwise
in-flight data.

Since this is a pending fetch, the data is in-flight, and snoop seemed
to apply, but I admit it is somewhat of a stretch.

The L2 came from the usage of "core cache", I might be wrong on that.

Anyway, it's a bit of an odd one out; you can have the exact same
'problem' of pending fetches on the same line at all levels, yet they
don't provide this 'source' for other levels.

Strictly speaking, this is a stall, not a source, and we could simply
map it to 'unknown' and be done with it.

> >        the same line was underway
> > 0x3     L2                              L2-local
> >
> > 0x4     L3-snoop, no coherency actions  L3-snoop-I
> 
> I am not sure I understand what you mean by local vs. remote
> in your terminology.

Local being the cache nearest to the CPU, remote being all others.

Admittedly that doesn't really make too much sense for L[12], but
imagine threads having their own L1; then I could imagine a thread
trying to peek in a sibling's L1 since it's so near. In that case it
would make sense to use local vs remote on the L1.



end of thread, other threads:[~2010-12-23 11:37 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-22  8:12 [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere Lin Ming
2010-12-22  8:33 ` Peter Zijlstra
2010-12-22  8:47   ` Lin Ming
2010-12-22  9:04   ` Peter Zijlstra
2010-12-22 10:14     ` Ingo Molnar
2010-12-23  1:14       ` Lin Ming
2010-12-23  7:35       ` Andi Kleen
2010-12-22  9:00 ` Peter Zijlstra
2010-12-22 10:08   ` Stephane Eranian
2010-12-22 10:45     ` Peter Zijlstra
2010-12-22 10:49       ` Peter Zijlstra
2010-12-23  8:59         ` Lin Ming
2010-12-23 10:18           ` Peter Zijlstra
2010-12-23 10:31             ` Stephane Eranian
2010-12-23 10:48               ` Peter Zijlstra
2010-12-23 11:05                 ` Stephane Eranian
2010-12-23 11:37                   ` Peter Zijlstra
2010-12-23  8:28       ` Lin Ming
2010-12-23 10:11         ` Peter Zijlstra
