* [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
@ 2010-12-22 8:12 Lin Ming
2010-12-22 8:33 ` Peter Zijlstra
2010-12-22 9:00 ` Peter Zijlstra
0 siblings, 2 replies; 19+ messages in thread
From: Lin Ming @ 2010-12-22 8:12 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Andi Kleen, Stephane Eranian,
Frederic Weisbecker, Arjan van de Ven
Cc: lkml
Hi, all
This patch adds load latency monitoring on Intel Nehalem/Westmere.
It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
patchsets are needed.
Updated perf offcore patchkit
http://marc.info/?l=linux-kernel&m=129103647731356&w=2
Load latency on Intel Nehalem/Westmere is monitored by the event
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD (0x100b). It measures latency
from micro-operation (uop) dispatch to when the data is globally
observable (GO).
To monitor load latency, both the PEBS_EN_CTRx and LL_EN_CTRx bits must
be set in the IA32_PEBS_ENABLE register, and an extra MSR,
MSR_PEBS_LD_LAT_THRESHOLD, must be programmed with the desired latency
threshold in core clock cycles. Loads with latencies greater than this
value are counted.
The latency threshold is encoded in the upper 32 bits of
perf_event_attr::config, and the 'p' modifier must be used to enable PEBS.
The default latency threshold is 3, as Intel manual says, "The minimum
value that may be programmed in this register is 3 (the minimum
detectable load latency is 4 core clock cycles)."
Here are some example outputs.
# perf top -e r100b:p
----------------------------------------------------------------------------------------------------
PerfTop: 1800 irqs/sec kernel:41.9% exact: 0.0% [1000Hz raw 0x100b], (all, 4 CPUs)
----------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _______________________________ _________________________________________
643.00 13.0% __lock_acquire [kernel.kallsyms]
328.00 6.6% check_chain_key [kernel.kallsyms]
308.00 6.2% mark_lock [kernel.kallsyms]
286.00 5.8% lock_release [kernel.kallsyms]
280.00 5.7% check_flags [kernel.kallsyms]
239.00 4.8% use_config /home/mlin/linux-2.6/scripts/basic/fixdep
201.00 4.1% trace_hardirqs_on_caller [kernel.kallsyms]
193.00 3.9% lock_acquire [kernel.kallsyms]
182.00 3.7% validate_chain [kernel.kallsyms]
162.00 3.3% __gconv_transform_utf8_internal /lib64/libc-2.8.so
145.00 2.9% trace_hardirqs_off_caller [kernel.kallsyms]
115.00 2.3% mark_held_locks [kernel.kallsyms]
102.00 2.1% __GI_mbrtowc /lib64/libc-2.8.so
76.00 1.5% do_raw_spin_lock [kernel.kallsyms]
73.00 1.5% parse_dep_file /home/mlin/linux-2.6/scripts/basic/fixdep
71.00 1.4% restore [kernel.kallsyms]
59.00 1.2% _int_malloc /lib64/libc-2.8.so
# Monitor load latency > 0x51 (81) cycles
# perf top -e r510000100b:p
----------------------------------------------------------------------------------------------------
PerfTop: 2055 irqs/sec kernel:47.0% exact: 0.0% [1000Hz raw 0x510000100b], (all, 4 CPUs)
----------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _________________________ _________________________________________
419.00 9.8% __lock_acquire [kernel.kallsyms]
278.00 6.5% parse_dep_file /home/mlin/linux-2.6/scripts/basic/fixdep
195.00 4.6% find_get_page [kernel.kallsyms]
195.00 4.6% get_page_from_freelist [kernel.kallsyms]
184.00 4.3% __d_lookup [kernel.kallsyms]
180.00 4.2% memcmp [kernel.kallsyms]
135.00 3.2% __rmqueue [kernel.kallsyms]
123.00 2.9% __GI_strcmp /lib64/libc-2.8.so
92.00 2.2% unmap_vmas [kernel.kallsyms]
88.00 2.1% validate_chain [kernel.kallsyms]
85.00 2.0% do_raw_spin_lock [kernel.kallsyms]
78.00 1.8% __wake_up_bit [kernel.kallsyms]
77.00 1.8% __GI_strlen /lib64/libc-2.8.so
73.00 1.7% copy_user_generic_string [kernel.kallsyms]
71.00 1.7% page_remove_rmap [kernel.kallsyms]
70.00 1.6% kmem_cache_free [kernel.kallsyms]
67.00 1.6% free_pcppages_bulk [kernel.kallsyms]
62.00 1.5% kmem_cache_alloc [kernel.kallsyms]
58.00 1.4% ext3_release_file [kernel.kallsyms]
54.00 1.3% radix_tree_lookup_element [kernel.kallsyms]
47.00 1.1% generic_fillattr [kernel.kallsyms]
Any comments are appreciated.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
---
arch/x86/kernel/cpu/perf_event.c | 18 +++++++++++++++---
arch/x86/kernel/cpu/perf_event_intel.c | 4 ++++
arch/x86/kernel/cpu/perf_event_intel_ds.c | 5 +++++
include/linux/perf_event.h | 1 +
4 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ed6ff11..2a02529 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -197,18 +197,25 @@ struct extra_reg {
unsigned int extra_shift;
u64 config_mask;
u64 valid_mask;
+ u64 flags;
};
-#define EVENT_EXTRA_REG(e, ms, m, vm, es) { \
+#define EVENT_EXTRA_REG(e, ms, m, vm, es, f) { \
.event = (e), \
.msr = (ms), \
.config_mask = (m), \
.valid_mask = (vm), \
.extra_shift = (es), \
+ .flags = (f), \
}
#define INTEL_EVENT_EXTRA_REG(event, msr, vm, es) \
- EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es)
-#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0)
+ EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es, 0)
+#define INTEL_EVENT_EXTRA_REG2(event, msr, vm, es, f) \
+ EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
+ ARCH_PERFMON_EVENTSEL_UMASK, vm, es, f)
+#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0, 0)
+
+#define EXTRA_REG_LD_LAT 0x1
union perf_capabilities {
struct {
@@ -384,6 +391,11 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
if (extra & ~er->valid_mask)
return -EINVAL;
event->hw.extra_config = extra;
+ event->hw.extra_flags = er->flags;
+
+ /* The minimum value that may be programmed into MSR_PEBS_LD_LAT is 3 */
+ if ((er->flags & EXTRA_REG_LD_LAT) && extra < 3)
+ event->hw.extra_config = 3;
break;
}
return 0;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index bc4afb1..7e2b873 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -89,6 +89,8 @@ static struct event_constraint intel_nehalem_event_constraints[] =
static struct extra_reg intel_nehalem_extra_regs[] =
{
INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE */
+ /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
+ INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
EVENT_EXTRA_END
};
@@ -114,6 +116,8 @@ static struct extra_reg intel_westmere_extra_regs[] =
{
INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE_0 */
INTEL_EVENT_EXTRA_REG(0xbb, 0x1a7, 0xffff, 32), /* OFFCORE_RESPONSE_1 */
+ /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
+ INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
EVENT_EXTRA_END
};
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b7dcd9f..d008c40 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -376,6 +376,7 @@ static struct event_constraint intel_core_pebs_events[] = {
};
static struct event_constraint intel_nehalem_pebs_events[] = {
+ PEBS_EVENT_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
@@ -414,6 +415,8 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
cpuc->pebs_enabled |= 1ULL << hwc->idx;
+ if (hwc->extra_flags & EXTRA_REG_LD_LAT)
+ cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
WARN_ON_ONCE(cpuc->enabled);
if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
@@ -426,6 +429,8 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ if (hwc->extra_flags & EXTRA_REG_LD_LAT)
+ cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d24d9ab..38bffa4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -541,6 +541,7 @@ struct hw_perf_event {
int last_cpu;
unsigned int extra_reg;
u64 extra_config;
+ u64 extra_flags;
};
struct { /* software */
struct hrtimer hrtimer;
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 8:12 [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere Lin Ming
@ 2010-12-22 8:33 ` Peter Zijlstra
2010-12-22 8:47 ` Lin Ming
2010-12-22 9:04 ` Peter Zijlstra
2010-12-22 9:00 ` Peter Zijlstra
1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 8:33 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
Arjan van de Ven, lkml
On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> patchsets are needed.
I still haven't gotten a version of that I'm willing to apply.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 8:33 ` Peter Zijlstra
@ 2010-12-22 8:47 ` Lin Ming
2010-12-22 9:04 ` Peter Zijlstra
1 sibling, 0 replies; 19+ messages in thread
From: Lin Ming @ 2010-12-22 8:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
Arjan van de Ven, lkml
On Wed, 2010-12-22 at 16:33 +0800, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > patchsets are needed.
>
> I still haven't gotten a version of that I'm willing to apply.
OK, then I'll update this patch after the offcore patches are applied.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 8:12 [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere Lin Ming
2010-12-22 8:33 ` Peter Zijlstra
@ 2010-12-22 9:00 ` Peter Zijlstra
2010-12-22 10:08 ` Stephane Eranian
1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 9:00 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index ed6ff11..2a02529 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -197,18 +197,25 @@ struct extra_reg {
> unsigned int extra_shift;
> u64 config_mask;
> u64 valid_mask;
> + u64 flags;
> };
>
> -#define EVENT_EXTRA_REG(e, ms, m, vm, es) { \
> +#define EVENT_EXTRA_REG(e, ms, m, vm, es, f) { \
> .event = (e), \
> .msr = (ms), \
> .config_mask = (m), \
> .valid_mask = (vm), \
> .extra_shift = (es), \
> + .flags = (f), \
> }
> #define INTEL_EVENT_EXTRA_REG(event, msr, vm, es) \
> - EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es)
> -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0)
> + EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es, 0)
> +#define INTEL_EVENT_EXTRA_REG2(event, msr, vm, es, f) \
> + EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
> + ARCH_PERFMON_EVENTSEL_UMASK, vm, es, f)
> +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0, 0)
You'll need to increment MAX_EXTRA_REGS to 3 I think.
> +#define EXTRA_REG_LD_LAT 0x1
I'm not quite sure we actually need the whole flags business.
> union perf_capabilities {
> struct {
> @@ -384,6 +391,11 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
> if (extra & ~er->valid_mask)
> return -EINVAL;
> event->hw.extra_config = extra;
> + event->hw.extra_flags = er->flags;
> +
> + /* The minimum value that may be programmed into MSR_PEBS_LD_LAT is 3 */
> + if ((er->flags & EXTRA_REG_LD_LAT) && extra < 3)
> + event->hw.extra_config = 3;
if (er->msr == MSR_PEBS_LD_LAT_THRESHOLD && extra < 3)
event->hw.extra_config = 3;
> break;
> }
> return 0;
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index bc4afb1..7e2b873 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -89,6 +89,8 @@ static struct event_constraint intel_nehalem_event_constraints[] =
> static struct extra_reg intel_nehalem_extra_regs[] =
> {
> INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE */
> + /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
> + INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
> EVENT_EXTRA_END
> };
Maybe use the MSR names instead of the numbers.
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index b7dcd9f..d008c40 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -376,6 +376,7 @@ static struct event_constraint intel_core_pebs_events[] = {
> };
>
> static struct event_constraint intel_nehalem_pebs_events[] = {
> + PEBS_EVENT_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
> PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
> PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
> PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
> @@ -414,6 +415,8 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
> hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
>
> cpuc->pebs_enabled |= 1ULL << hwc->idx;
> + if (hwc->extra_flags & EXTRA_REG_LD_LAT)
> + cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
> WARN_ON_ONCE(cpuc->enabled);
>
> if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
> @@ -426,6 +429,8 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
> struct hw_perf_event *hwc = &event->hw;
>
> cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
> + if (hwc->extra_flags & EXTRA_REG_LD_LAT)
> + cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
> if (cpuc->enabled)
> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d24d9ab..38bffa4 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -541,6 +541,7 @@ struct hw_perf_event {
> int last_cpu;
> unsigned int extra_reg;
> u64 extra_config;
> + u64 extra_flags;
> };
> struct { /* software */
> struct hrtimer hrtimer;
>
Which then also obviates the need for this extra field.
You also need some extra goo in intel_pmu_drain_pebs_nhm(); we can
already use PERF_SAMPLE_ADDR for the linear data address provided by
the pebs-ll thing, and we might need to add:
PERF_SAMPLE_LATENCY -- Stephane said other archs can also use this
Not quite sure what to do for the source bits, POWER also has some extra
bits, but I'm not sure they qualify as purely source bits. And
interpreting them is going to be inherently arch specific, which
sucks :/
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 8:33 ` Peter Zijlstra
2010-12-22 8:47 ` Lin Ming
@ 2010-12-22 9:04 ` Peter Zijlstra
2010-12-22 10:14 ` Ingo Molnar
1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 9:04 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
Arjan van de Ven, lkml
On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > patchsets are needed.
>
> I still haven't gotten a version of that I'm willing to apply.
That is, maybe you could pick up his patches and finish them, Andi
doesn't seem interested in getting them sorted.
I think the last posting lost the needs_percore logic, which would avoid
the percore allocations on platforms that don't need it (if we can
detect if HT is enabled that would be nice too).
I think he also lost the topology iteration (which can be used for both
the percore and the amd northbridge stuff), currently we iterate the
full machine looking for a matching nb/core_id, using a smaller topology
mask makes sense.
Also, after reading your patch, I think we want the MSR names in the
lists, not the numbers.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 9:00 ` Peter Zijlstra
@ 2010-12-22 10:08 ` Stephane Eranian
2010-12-22 10:45 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Stephane Eranian @ 2010-12-22 10:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
Hi,
On Wed, Dec 22, 2010 at 10:00 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> index ed6ff11..2a02529 100644
>> --- a/arch/x86/kernel/cpu/perf_event.c
>> +++ b/arch/x86/kernel/cpu/perf_event.c
>> @@ -197,18 +197,25 @@ struct extra_reg {
>> unsigned int extra_shift;
>> u64 config_mask;
>> u64 valid_mask;
>> + u64 flags;
>> };
>>
>> -#define EVENT_EXTRA_REG(e, ms, m, vm, es) { \
>> +#define EVENT_EXTRA_REG(e, ms, m, vm, es, f) { \
>> .event = (e), \
>> .msr = (ms), \
>> .config_mask = (m), \
>> .valid_mask = (vm), \
>> .extra_shift = (es), \
>> + .flags = (f), \
>> }
>> #define INTEL_EVENT_EXTRA_REG(event, msr, vm, es) \
>> - EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es)
>> -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0)
>> + EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, es, 0)
>> +#define INTEL_EVENT_EXTRA_REG2(event, msr, vm, es, f) \
>> + EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
>> + ARCH_PERFMON_EVENTSEL_UMASK, vm, es, f)
>> +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, 0, 0)
>
> You'll need to increment MAX_EXTRA_REGS to 3 I think.
>
>> +#define EXTRA_REG_LD_LAT 0x1
>
> I'm not quite sure we actually need the whole flags business.
>
>> union perf_capabilities {
>> struct {
>> @@ -384,6 +391,11 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
>> if (extra & ~er->valid_mask)
>> return -EINVAL;
>> event->hw.extra_config = extra;
>> + event->hw.extra_flags = er->flags;
>> +
>> + /* The minimum value that may be programmed into MSR_PEBS_LD_LAT is 3 */
>> + if ((er->flags & EXTRA_REG_LD_LAT) && extra < 3)
>> + event->hw.extra_config = 3;
>
> if (er->msr == MSR_PEBS_LD_LAT_THRESHOLD && extra < 3)
> event->hw.extra_config = 3;
>
>> break;
>> }
>> return 0;
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
>> index bc4afb1..7e2b873 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -89,6 +89,8 @@ static struct event_constraint intel_nehalem_event_constraints[] =
>> static struct extra_reg intel_nehalem_extra_regs[] =
>> {
>> INTEL_EVENT_EXTRA_REG(0xb7, 0x1a6, 0xffff, 32), /* OFFCORE_RESPONSE */
>> + /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
>> + INTEL_EVENT_EXTRA_REG2(0x100b, 0x3f6, 0xffff, 32, EXTRA_REG_LD_LAT),
>> EVENT_EXTRA_END
>> };
>
> Maybe use the MSR names instead of the numbers.
>
>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> index b7dcd9f..d008c40 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
>> @@ -376,6 +376,7 @@ static struct event_constraint intel_core_pebs_events[] = {
>> };
>>
>> static struct event_constraint intel_nehalem_pebs_events[] = {
>> + PEBS_EVENT_CONSTRAINT(0x100b, 0xf), /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
>> PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
>> PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
>> PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
>> @@ -414,6 +415,8 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
>> hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
>>
>> cpuc->pebs_enabled |= 1ULL << hwc->idx;
>> + if (hwc->extra_flags & EXTRA_REG_LD_LAT)
>> + cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
>
> if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
> cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
>
>> WARN_ON_ONCE(cpuc->enabled);
>>
>> if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
>> @@ -426,6 +429,8 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
>> struct hw_perf_event *hwc = &event->hw;
>>
>> cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
>> + if (hwc->extra_flags & EXTRA_REG_LD_LAT)
>> + cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
>
> if (hwc->extra_reg == MSR_PEBS_LD_LAT_THRESHOLD)
> cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
>
>> if (cpuc->enabled)
>> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index d24d9ab..38bffa4 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -541,6 +541,7 @@ struct hw_perf_event {
>> int last_cpu;
>> unsigned int extra_reg;
>> u64 extra_config;
>> + u64 extra_flags;
>> };
>> struct { /* software */
>> struct hrtimer hrtimer;
>>
>
> Which then also obviates the need for this extra field.
>
> You also need some extra goo in intel_pmu_drain_pebs_nhm(); we can
> already use PERF_SAMPLE_ADDR for the linear data address provided by
> the pebs-ll thing, and we might need to add:
>
> PERF_SAMPLE_LATENCY -- Stephane said other archs can also use this
>
Extracting the instruction address is not so useful. You need the
instruction and data addresses, the latency and data source. As Peter
pointed out, you can use PERF_SAMPLE_ADDR for the data address.
True. And also we would need a PERF_SAMPLE_DATA_SRC to extract
the data source information. Other archs also have that.
Note that PEBS-Load latency needs the IP+1 correction. It points to the
instruction address after the load/lfetch. But I suspect your patch already
takes care of that.
> Not quite sure what to do for the source bits, POWER also has some extra
> bits, but I'm not sure they qualify as purely source bits. And
> interpreting them is going to be inherently arch specific, which
> sucks :/
>
>
Yes, I think there is more to it than just data source, unfortunately.
If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
and so on.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 9:04 ` Peter Zijlstra
@ 2010-12-22 10:14 ` Ingo Molnar
2010-12-23 1:14 ` Lin Ming
2010-12-23 7:35 ` Andi Kleen
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2010-12-22 10:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
Arjan van de Ven, lkml
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > > patchsets are needed.
> >
> > I still haven't gotten a version of that I'm willing to apply.
>
> That is, maybe you could pick up his patches and finish them, Andi
> doesn't seem interested in getting them sorted.
FYI, this is one of Peter's reviews that went unaddressed:
https://lkml.org/lkml/2010/12/1/213
and he got no reply to that feedback in the past 3 weeks. (There may be other issues
outstanding, i have not checked.)
Thanks,
Ingo
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 10:08 ` Stephane Eranian
@ 2010-12-22 10:45 ` Peter Zijlstra
2010-12-22 10:49 ` Peter Zijlstra
2010-12-23 8:28 ` Lin Ming
0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 10:45 UTC (permalink / raw)
To: Stephane Eranian
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> Yes, I think there is more to it than just data source, unfortunately.
> If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> and so on.
I guess we can do things like:
Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
encode "Pending core cache HIT" as L2-snoop or something, whatever is
most appropriate.
But does that cover every architecture?
Also, since that doesn't require more than 4 bits to encode, we could
try and categorize what else is around and try and create a well
specified _EXTRA register, I mean, we still got 60 bits left after this.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 10:45 ` Peter Zijlstra
@ 2010-12-22 10:49 ` Peter Zijlstra
2010-12-23 8:59 ` Lin Ming
2010-12-23 8:28 ` Lin Ming
1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-22 10:49 UTC (permalink / raw)
To: Stephane Eranian
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Wed, 2010-12-22 at 11:45 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > Yes, I think there is more to it than just data source, unfortunately.
> > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > and so on.
>
> I guess we can do things like:
>
> Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> encode "Pending core cache HIT" as L2-snoop or something, whatever is
> most appropriate.
Ah, I just saw my email window covered part of the spec and we can also
have x{shared,exclusive}, so we end up with:
{L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
uncached, IO}
Which takes all of 5 bits to encode.
> But does that cover every architecture?
>
> Also, since that doesn't require more than 4 bits to encode, we could
> try and categorize what else is around and try and create a well
> specified _EXTRA register, I mean, we still got 60 bits left after this.
Leaving us with 59 bits to consider.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 10:14 ` Ingo Molnar
@ 2010-12-23 1:14 ` Lin Ming
2010-12-23 7:35 ` Andi Kleen
1 sibling, 0 replies; 19+ messages in thread
From: Lin Ming @ 2010-12-23 1:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Andi Kleen, Stephane Eranian, Frederic Weisbecker,
Arjan van de Ven, lkml
On Wed, 2010-12-22 at 18:14 +0800, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
> > > > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
> > > > patchsets are needed.
> > >
> > > I still haven't gotten a version of that I'm willing to apply.
> >
> > That is, maybe you could pick up his patches and finish them, Andi
> > doesn't seem interested in getting them sorted.
>
> FYI, this is one of Peter's reviews that went unaddressed:
>
> https://lkml.org/lkml/2010/12/1/213
>
> and he got no reply to that feedback in the past 3 weeks. (There may be other issues
> outstanding, i have not checked.)
Andi mentioned he will be on vacation in December, see the end of below
mail.
http://marc.info/?l=linux-kernel&m=129136895415202&w=2
>
> Thanks,
>
> Ingo
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 10:14 ` Ingo Molnar
2010-12-23 1:14 ` Lin Ming
@ 2010-12-23 7:35 ` Andi Kleen
1 sibling, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2010-12-23 7:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Lin Ming, Andi Kleen, Stephane Eranian,
Frederic Weisbecker, Arjan van de Ven, lkml
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, 2010-12-22 at 09:33 +0100, Peter Zijlstra wrote:
>> > On Wed, 2010-12-22 at 16:12 +0800, Lin Ming wrote:
>> > > It's applied on top of tip/master(3ea1f4f89) and Andi's offcore
>> > > patchsets are needed.
>> >
>> > I still haven't gotten a version of that I'm willing to apply.
>>
>> That is, maybe you could pick up his patches and finish them, Andi
>> doesn't seem interested in getting them sorted.
>
> FYI, this is one of Peter's reviews that went unaddressed:
>
> https://lkml.org/lkml/2010/12/1/213
>
> and he got no reply to that feedback in the past 3 weeks. (There may be
> other issues outstanding, i have not checked.)
Sorry, I didn't see that email. I dropped the variable intentionally
to simplify the code; there is no fastpath overhead for the other
CPUs, and allocating a few bytes per core didn't seem worth having a
special variable to guard it. I can re-add it if people really
want it.
-Andi
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 10:45 ` Peter Zijlstra
2010-12-22 10:49 ` Peter Zijlstra
@ 2010-12-23 8:28 ` Lin Ming
2010-12-23 10:11 ` Peter Zijlstra
1 sibling, 1 reply; 19+ messages in thread
From: Lin Ming @ 2010-12-23 8:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Wed, 2010-12-22 at 18:45 +0800, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > Yes, I think there is more to it than just data source, unfortunately.
> > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > and so on.
>
> I guess we can do things like:
>
> Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> encode "Pending core cache HIT" as L2-snoop or something, whatever is
> most appropriate.
>
> But does that cover every architecture?
>
> Also, since that doesn't require more than 4 bits to encode, we could
> try and categorize what else is around and try and create a well
> specified _EXTRA register, I mean, we still got 60 bits left after this.
Could you tell more about this well specified _EXTRA register?
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-22 10:49 ` Peter Zijlstra
@ 2010-12-23 8:59 ` Lin Ming
2010-12-23 10:18 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Lin Ming @ 2010-12-23 8:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Wed, 2010-12-22 at 18:49 +0800, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 11:45 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > > Yes, I think there is more to it than just data source, unfortunately.
> > > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > > and so on.
> >
> > I guess we can do things like:
> >
> > Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> > encode "Pending core cache HIT" as L2-snoop or something, whatever is
> > most appropriate.
>
> Ah, I just saw my email window covered part of the spec and we can also
> have x{shared,exclusive}, so we end up with:
>
> {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
> uncached, IO}
>
> Which takes all of 5 bits to encode.
Do you mean the encoding below?

bits 4 3 2 1 0
     + + + + +
     | | | | |
     | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
     | | |
     | {snoop, local, remote, OTHER}
     |
     {shared, exclusive}

If bits(2-3) are OTHER, then bits(0-1) are the encoding of {unknown,
uncached, IO}.
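One way to read that diagram is the sketch below. The enumerator order follows the list order in the diagram and is an assumption; none of these names come from a posted patch.

```c
#include <assert.h>

/* Sketch of the 5-bit data-source encoding in the diagram above.
 * Enumerator values follow the list order shown; they are an
 * assumption, not from any posted patch. */
enum ds_level   { DS_L1, DS_L2, DS_L3, DS_RAM };             /* bits 0-1 */
enum ds_special { DS_UNKNOWN, DS_UNCACHED, DS_IO };          /* bits 0-1 when origin is OTHER */
enum ds_origin  { DS_SNOOP, DS_LOCAL, DS_REMOTE, DS_OTHER }; /* bits 2-3 */
enum ds_state   { DS_SHARED, DS_EXCLUSIVE };                 /* bit 4 */

/* Pack state/origin/level (or special source) into the 5-bit value. */
static inline unsigned int ds_encode(unsigned int state, unsigned int origin,
				     unsigned int low)
{
	return (state << 4) | (origin << 2) | low;
}

/* True when bits 2-3 select OTHER, i.e. bits 0-1 hold a special source. */
static inline int ds_is_special(unsigned int v)
{
	return ((v >> 2) & 0x3) == DS_OTHER;
}
```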
>
> > But does that cover every architecture?
> >
> > Also, since that doesn't require more that 4 bits to encode, we could
> > try and categorize what else is around and try and create a well
> > specified _EXTRA register, I mean, we still got 60bits left after this.
>
> Leaving us with 59 bits to consider.
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-23 8:28 ` Lin Ming
@ 2010-12-23 10:11 ` Peter Zijlstra
0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 10:11 UTC (permalink / raw)
To: Lin Ming
Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Thu, 2010-12-23 at 16:28 +0800, Lin Ming wrote:
> On Wed, 2010-12-22 at 18:45 +0800, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 11:08 +0100, Stephane Eranian wrote:
> > > Yes, I think there is more to it than just data source, unfortunately.
> > > If you want to avoid returning an opaque u64 (PERF_SAMPLE_EXTRA), then
> > > you need to break it down: PERF_SAMPLE_DATA_SRC, PERF_SAMPLE_XX
> > > and so on.
> >
> > I guess we can do things like:
> >
> > Satisfied by {L1, L2, L3, RAM}x{snoop, local, remote} + unknown, and
> > encode "Pending core cache HIT" as L2-snoop or something, whatever is
> > most appropriate.
> >
> > But does that cover every architecture?
> >
> > Also, since that doesn't require more that 4 bits to encode, we could
> > try and categorize what else is around and try and create a well
> > specified _EXTRA register, I mean, we still got 60bits left after this.
>
> Could you tell more about this well specified _EXTRA register?
Like how we should look at other archs (like POWER) and try and figure
out what other useful information is available and come up with a
generic format for that?
We could of course leave the 59 bits as reserved and fill them in later,
and go with the data-source bits only for now...
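That reserved-bits idea could look like the union below, with 5 data-source bits now and the remaining 59 bits held back until a generic format is agreed on. The type and field names are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the proposed 64-bit _EXTRA sample field:
 * 5 data-source bits for now, 59 bits reserved for future generic
 * fields. Names are illustrative, not from any posted patch. */
union perf_sample_extra {
	struct {
		uint64_t data_src : 5;
		uint64_t reserved : 59;
	} f;
	uint64_t val;
};

/* Build a raw _EXTRA value carrying only the data-source bits. */
static inline uint64_t extra_pack(unsigned int data_src)
{
	union perf_sample_extra e = { .val = 0 };

	e.f.data_src = data_src & 0x1f;
	return e.val;
}
```

Note the bitfield placement assumes the usual little-endian layout used by gcc on x86.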
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-23 8:59 ` Lin Ming
@ 2010-12-23 10:18 ` Peter Zijlstra
2010-12-23 10:31 ` Stephane Eranian
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 10:18 UTC (permalink / raw)
To: Lin Ming
Cc: Stephane Eranian, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
> > uncached, IO}
> >
> > Which takes all of 5 bits to encode.
>
> Do you mean below encoding?
>
> bits4 3 2 1 0
> + + + + +
> | | | | |
> | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
> | | |
> | {snoop, local, remote, OTHER}
> |
> {shared, exclusive}
>
> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
> uncached, IO}.
That is most certainly a very valid encoding, and a rather nice one at
that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
If you also make OTHER=0, then a valid encoding for unknown is also 0,
which is a nice meaning for 0...
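Making OTHER and unknown both 0 means a zeroed field decodes to "unknown", which is a convenient default; a minimal check (names hypothetical):

```c
#include <assert.h>

/* With OTHER encoded as 0 in bits 2-3 and unknown as 0 in bits 0-1,
 * an all-zero value decodes to "unknown". Names are hypothetical. */
#define DS_ORIGIN_OTHER    0u
#define DS_SPECIAL_UNKNOWN 0u

/* True when the value decodes to the special source "unknown". */
static inline int ds_is_unknown(unsigned int v)
{
	return ((v >> 2) & 0x3) == DS_ORIGIN_OTHER &&
	       (v & 0x3) == DS_SPECIAL_UNKNOWN;
}
```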
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-23 10:18 ` Peter Zijlstra
@ 2010-12-23 10:31 ` Stephane Eranian
2010-12-23 10:48 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Stephane Eranian @ 2010-12-23 10:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Thu, Dec 23, 2010 at 11:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
>> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
>> > uncached, IO}
>> >
>> > Which takes all of 5 bits to encode.
>>
>> Do you mean below encoding?
>>
>> bits4 3 2 1 0
>> + + + + +
>> | | | | |
>> | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
>> | | |
>> | {snoop, local, remote, OTHER}
>> |
>> {shared, exclusive}
>>
>> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
>> uncached, IO}.
>
> That is most certainly a very valid encoding, and a rather nice one at
> that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
>
> If you also make OTHER=0, then a valid encoding for unknown is also 0,
> which is a nice meaning for 0...
>
I am not sure how you would cover the 9 possibilities for data source as
shown in Table 10-13 using this encoding. Could you show me?
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-23 10:31 ` Stephane Eranian
@ 2010-12-23 10:48 ` Peter Zijlstra
2010-12-23 11:05 ` Stephane Eranian
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 10:48 UTC (permalink / raw)
To: Stephane Eranian
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Thu, 2010-12-23 at 11:31 +0100, Stephane Eranian wrote:
> On Thu, Dec 23, 2010 at 11:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
> >> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
> >> > uncached, IO}
> >> >
> >> > Which takes all of 5 bits to encode.
> >>
> >> Do you mean below encoding?
> >>
> >> bits4 3 2 1 0
> >> + + + + +
> >> | | | | |
> >> | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
> >> | | |
> >> | {snoop, local, remote, OTHER}
> >> |
> >> {shared, exclusive}
> >>
> >> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
> >> uncached, IO}.
> >
> > That is most certainly a very valid encoding, and a rather nice one at
> > that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
> >
> > If you also make OTHER=0, then a valid encoding for unknown is also 0,
> > which is a nice meaning for 0...
> >
> I am not sure how you would cover the 9 possibilities for data source as
> shown in Table 10-13 using this encoding. Could you show me?
Ah, I think I see the problem: there are multiple L3-snoops. I guess we
can fix that by extending the {shared, exclusive} to full MESI, growing
us to 6 bits.
I'm assuming you mean "Table 30-13. Data Source Encoding for Load
Latency Record", which has 14 values defined.
Value  Intel                              Perf
0x0    Unknown L3                         Unknown
0x1    L1                                 L1-local
0x2    Pending core cache HIT             L2-snoop
       (outstanding core cache miss to
       the same line was underway)
0x3    L2                                 L2-local
0x4    L3-snoop, no coherency actions     L3-snoop-I
0x5    L3-snoop, found no M               L3-snoop-S
0x6    L3-snoop, found M                  L3-snoop-M
0x8    L3-miss, snoop, shared             RAM-snoop-S
0xA    L3-miss, local, shared             RAM-local-S
0xB    L3-miss, remote, shared            RAM-remote-S
0xC    L3-miss, local, exclusive          RAM-local-E
0xD    L3-miss, remote, exclusive         RAM-remote-E
0xE    IO                                 IO
0xF    uncached                           uncached
Leaving us with:
{L1, L2, L3, RAM}x{snoop, local, remote}x{modified, exclusive, shared, invalid} + {unknown, uncached, IO}
Now the question is, is this sufficient to map all data sources from
other archs as well?
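The table above amounts to a simple lookup. A sketch of the mapping from the Intel Table 30-13 values to the generic names in the "Perf" column (illustrative only, not from the posted patch):

```c
#include <assert.h>
#include <string.h>

/* Sketch: map the Intel load-latency data-source values from
 * Table 30-13 to the generic names in the "Perf" column above.
 * Illustrative only; not from the posted patch. */
static const char *pebs_data_src_name(unsigned int val)
{
	switch (val) {
	case 0x0: return "Unknown";
	case 0x1: return "L1-local";
	case 0x2: return "L2-snoop";	/* pending core cache hit */
	case 0x3: return "L2-local";
	case 0x4: return "L3-snoop-I";
	case 0x5: return "L3-snoop-S";
	case 0x6: return "L3-snoop-M";
	case 0x8: return "RAM-snoop-S";
	case 0xA: return "RAM-local-S";
	case 0xB: return "RAM-remote-S";
	case 0xC: return "RAM-local-E";
	case 0xD: return "RAM-remote-E";
	case 0xE: return "IO";
	case 0xF: return "uncached";
	default:  return "reserved";	/* 0x7 and 0x9 are not defined */
	}
}
```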
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-23 10:48 ` Peter Zijlstra
@ 2010-12-23 11:05 ` Stephane Eranian
2010-12-23 11:37 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Stephane Eranian @ 2010-12-23 11:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Thu, Dec 23, 2010 at 11:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-12-23 at 11:31 +0100, Stephane Eranian wrote:
>> On Thu, Dec 23, 2010 at 11:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, 2010-12-23 at 16:59 +0800, Lin Ming wrote:
>> >> > {L1, L2, L3, RAM}x{snoop, local, remote}x{shared, exclusive} + {unknown,
>> >> > uncached, IO}
>> >> >
>> >> > Which takes all of 5 bits to encode.
>> >>
>> >> Do you mean below encoding?
>> >>
>> >> bits4 3 2 1 0
>> >> + + + + +
>> >> | | | | |
>> >> | | | {L1, L2, L3, RAM} or {unknown, uncached, IO}
>> >> | | |
>> >> | {snoop, local, remote, OTHER}
>> >> |
>> >> {shared, exclusive}
>> >>
>> >> If bits(2-3) is OTHER, then bits(0-1) is the encoding of {unknown,
>> >> uncached, IO}.
>> >
>> > That is most certainly a very valid encoding, and a rather nice one at
>> > that. I hadn't really gone further than: 4*3*2 + 3 < 2^5 :-)
>> >
>> > If you also make OTHER=0, then a valid encoding for unknown is also 0,
>> > which is a nice meaning for 0...
>> >
>> I am not sure how you would cover the 9 possibilities for data source as
>> shown in Table 10-13 using this encoding. Could you show me?
>
> Ah, I think I see the problem, there's multiple L3-snoops, I guess we
> can fix that by extending the {shared, exclusive} to full MESI, growing
> us to 6 bits.
>
> I'm assuming you mean "Table 30-13. Data Source Encoding for Load
> Latency Record", which has 14 values defined.
>
Yes.
> Value Intel Perf
> 0x0 Unknown L3 Unknown
>
> 0x1 L1 L1-local
>
> 0x2 Pending core cache HIT L2-snoop
> Outstanding core cache miss to
Not clear how you know this is snoop or L2?
I suspect this one is saying you have a request for a line
for which there is already a pending request underway. Could
be the first came from prefetchers, the 2nd is actual demand.
Let me check with Intel. The table is unclear.
> the same line was underway
> 0x3 L2 L2-local
>
> 0x4 L3-snoop, no coherency actions L3-snoop-I
I am not sure I understand what you mean by local vs. remote
in your terminology.
> 0x5 L3-snoop, found no M L3-snoop-S
> 0x6 L3-snoop, found M L3-snoop-M
>
> 0x8 L3-miss, snoop, shared RAM-snoop-S
> 0xA L3-miss, local, shared RAM-local-S
> 0xB L3-miss, remote, shared RAM-remote-S
>
> 0xC L3-miss, local, exclusive RAM-local-E
> 0xD L3-miss, remote, exclusive RAM-remote-E
>
> 0xE IO IO
> 0xF uncached uncached
>
>
> Leaving us with:
>
> {L1, L2, L3, RAM}x{snoop, local, remote}x{modified, exclusive, shared, invalid} + {unknown, uncached, IO}
>
> Now the question is, is this sufficient to map all data sources from
> other archs as well?
>
* Re: [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere
2010-12-23 11:05 ` Stephane Eranian
@ 2010-12-23 11:37 ` Peter Zijlstra
0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2010-12-23 11:37 UTC (permalink / raw)
To: Stephane Eranian
Cc: Lin Ming, Ingo Molnar, Andi Kleen, Frederic Weisbecker,
Arjan van de Ven, lkml, paulus
On Thu, 2010-12-23 at 12:05 +0100, Stephane Eranian wrote:
> > Value Intel Perf
> > 0x0 Unknown L3 Unknown
> >
> > 0x1 L1 L1-local
> >
> > 0x2 Pending core cache HIT L2-snoop
> > Outstanding core cache miss to
>
> Not clear how you know this is snoop or L2?
>
> I suspect this one is saying you have a request for a line
> for which there is already a pending request underway. Could
> be the first came from prefetchers, the 2nd is actual demand.
>
> Let me check with Intel. The table is unclear.
Right, so cache snoops as used by Intel are data transfer operations
(not only the watching for remote modifications and local invalidation
as per the strict definition); they typically short-circuit a complete
fetch, getting the data from a neighboring cache or otherwise from
in-flight data.

Since this is a pending fetch, the data is in-flight, and snoop seemed
to apply, but I admit it is somewhat of a stretch.

The L2 came from the usage of "core cache"; I might be wrong on that.

Anyway, it's a bit of an odd one out: you can have the exact same
'problem' of pending fetches on the same line at all levels, yet they
don't provide this 'source' for other levels.

Strictly speaking, this is a stall, not a source, and we could simply
map it to 'unknown' and be done with it.
> > the same line was underway
> > 0x3 L2 L2-local
> >
> > 0x4 L3-snoop, no coherency actions L3-snoop-I
>
> I am not sure I understand what you mean by local vs. remote
> in your terminology.
Local being the cache nearest to the CPU, remote being all others.

Admittedly that doesn't really make too much sense for L[12], but
imagine threads having their own L1; then I could imagine a thread
trying to peek in a sibling's L1 since it's so near. In that case it
would make sense to use local vs remote on the L1.
Thread overview: 19+ messages
-- links below jump to the message on this page --
2010-12-22 8:12 [RFC PATCH] perf: Add load latency monitoring on Intel Nehalem/Westmere Lin Ming
2010-12-22 8:33 ` Peter Zijlstra
2010-12-22 8:47 ` Lin Ming
2010-12-22 9:04 ` Peter Zijlstra
2010-12-22 10:14 ` Ingo Molnar
2010-12-23 1:14 ` Lin Ming
2010-12-23 7:35 ` Andi Kleen
2010-12-22 9:00 ` Peter Zijlstra
2010-12-22 10:08 ` Stephane Eranian
2010-12-22 10:45 ` Peter Zijlstra
2010-12-22 10:49 ` Peter Zijlstra
2010-12-23 8:59 ` Lin Ming
2010-12-23 10:18 ` Peter Zijlstra
2010-12-23 10:31 ` Stephane Eranian
2010-12-23 10:48 ` Peter Zijlstra
2010-12-23 11:05 ` Stephane Eranian
2010-12-23 11:37 ` Peter Zijlstra
2010-12-23 8:28 ` Lin Ming
2010-12-23 10:11 ` Peter Zijlstra