[PATCH V5 00/16] perf, x86: Haswell LBR call stack support

* [PATCH V5 00/16] perf, x86: Haswell LBR call stack support
@ 2014-09-10 14:08 kan.liang
  2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
                   ` (15 more replies)
  0 siblings, 16 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:08 UTC (permalink / raw)
  To: a.p.zijlstra, eranian; +Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang

From: Kan Liang <kan.liang@intel.com>

(The email and patch dates of the last post were incorrect,
so this is a re-post to fix the time.)
(This re-posts the Haswell LBR call stack patches on behalf of Yan, Zheng.
The patch set is rebased on tip.git commit ID 8b5df2f.
I've tested the patches on my Haswell platform.)

For many profiling tasks we need the callgraph. For example, we often
need to see the caller of a lock, a memcpy, or another library function
to actually tune the program. Frame pointer unwinding is efficient and
works well, but frame pointers are off by default in 64-bit code (and
with modern 32-bit gccs), so many binaries do not use them. Profiling
unchanged production code is very useful in practice. On some CPUs,
frame pointers also have a high cost. DWARF2 unwinding does not always
work either and is extremely slow (up to 20% overhead).

Haswell has a new feature that uses the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to
get the callgraph. It has some limitations too, but should work in most
cases and is significantly faster than DWARF unwinding. Frame pointer
unwinding is still the best default, but LBR call stack is a good
alternative when nothing else works.

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]
     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]
     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start
If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8
     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back
The LBR call stack has the following known limitations:
 - Zero-length calls are not filtered out by hardware
 - Exception handling such as setjmp/longjmp leaves calls and returns
   unmatched
 - Pushing a different return address onto the stack leaves calls and
   returns unmatched
 - If the call stack is deeper than the LBR, only the last entries are
   captured
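
The push-on-call, pop-on-return behaviour and the depth limitation
described above can be sketched in user space. This is only a toy model
under stated assumptions: the depth of 16 (LBR_NR) and every name in it
(lbr_stack, lbr_on_call, lbr_on_return, lbr_frame) are hypothetical and
are not taken from the patches or from the hardware interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LBR_NR 16	/* assumed LBR depth, purely illustrative */

struct lbr_entry {
	uint64_t from;	/* address of the call instruction */
	uint64_t to;	/* address of the callee */
};

struct lbr_stack {
	struct lbr_entry e[LBR_NR];
	size_t depth;	/* number of valid entries, oldest at e[0] */
};

/* A call pushes a record; when the stack is full the oldest one is lost. */
static void lbr_on_call(struct lbr_stack *s, uint64_t from, uint64_t to)
{
	if (s->depth == LBR_NR) {
		for (size_t i = 1; i < LBR_NR; i++)
			s->e[i - 1] = s->e[i];
		s->depth--;
	}
	s->e[s->depth].from = from;
	s->e[s->depth].to = to;
	s->depth++;
}

/* A return pops the most recently captured record, if there is one. */
static void lbr_on_return(struct lbr_stack *s)
{
	if (s->depth > 0)
		s->depth--;
}

/* Callee address of frame i (0 = innermost); 0 if i is out of range. */
static uint64_t lbr_frame(const struct lbr_stack *s, size_t i)
{
	assert(s->depth <= LBR_NR);
	return i < s->depth ? s->e[s->depth - 1 - i].to : 0;
}
```

In this model, a call chain deeper than LBR_NR keeps only the innermost
LBR_NR frames, which matches the last limitation in the list. A
setjmp/longjmp style unwind would skip lbr_on_return() for the abandoned
frames and leave stale records behind.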

Changes since v1
 - split change into more patches
 - introduce context switch callback and use it to flush LBR
 - use the context switch callback to save/restore LBR
 - dynamically allocate the memory area for storing the LBR stack, and
   always switch the memory area during context switch
 - disable this feature by default
 - more description in change logs

Changes since v2
 - don't use xchg to switch PMU specific data
 - remove nr_branch_stack from struct perf_event_context
 - simplify the save/restore LBR stack logic
 - remove unnecessary 'has_branch_stack -> needs_branch_stack'
   conversion
 - more description in change logs

Changes since v3
 - remove the sysfs attribute file that disables this feature

Changes since v4
 - re-organize the code that saves/restores the LBR stack
 - allocate pmu specific data when it's needed
 - update code comments

Yan, Zheng (16):
  perf, x86: Reduce lbr_sel_map size
  perf, core: introduce pmu context switch callback
  perf, x86: use context switch callback to flush LBR stack
  perf, x86: Basic Haswell LBR call stack support
  perf, core: pmu specific data for perf task context
  perf, core: always switch pmu specific data during context switch
  perf, x86: allocate space for storing LBR stack
  perf, x86: track number of events that use LBR callstack
  perf, x86: Save/resotre LBR stack during context switch
  perf, core: simplify need branch stack check
  perf, core: Pass perf_sample_data to perf_callchain()
  perf, x86: use LBR call stack to get user callchain
  perf, x86: re-organize code that implicitly enables LBR/PEBS
  perf, x86: enable LBR callstack when recording callchain
  perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack  
      mode
  perf, x86: Discard zero length call entries in LBR call stack

 arch/arm/kernel/perf_event.c               |   4 +-
 arch/powerpc/perf/callchain.c              |   4 +-
 arch/sparc/kernel/perf_event.c             |   4 +-
 arch/x86/kernel/cpu/perf_event.c           | 120 +++++++----
 arch/x86/kernel/cpu/perf_event.h           |  28 ++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  38 +---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 316 ++++++++++++++++++++++-------
 include/linux/perf_event.h                 |  26 ++-
 include/uapi/linux/perf_event.h            |  49 +++--
 kernel/events/callchain.c                  |   8 +-
 kernel/events/core.c                       | 182 +++++++++--------
 kernel/events/internal.h                   |   3 +-
 12 files changed, 528 insertions(+), 254 deletions(-)

-- 
1.8.3.2



* [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
@ 2014-09-10 14:08 ` kan.liang
  2014-09-24 10:50   ` Peter Zijlstra
  2014-09-10 14:08 ` [PATCH V5 02/16] perf, core: introduce pmu context switch callback kan.liang
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:08 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

The index of lbr_sel_map is the bit value of the perf branch_sample_type.
PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
4096 bytes. By using the bit shift as the index, we can reduce the
lbr_sel_map size to 40 bytes. This patch defines a 'bit shift' for each
branch type and uses it to define the lbr_sel_maps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  4 +++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 54 ++++++++++++++----------------
 include/uapi/linux/perf_event.h            | 49 +++++++++++++++++++--------
 3 files changed, 64 insertions(+), 43 deletions(-)
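
As a rough illustration of the size arithmetic and of the shift-indexed
lookup, here is a stand-alone sketch. It is not the kernel code: the
names (map_bytes, build_mask, demo_sel_map, SEL_NOT_SUPP) and the
selector values are made up for the example. With a 4-byte int, a map
indexed by bit value needs 1024 * 4 = 4096 bytes when the maximum value
is 1024, while a map indexed by shift needs 10 * 4 = 40 bytes. With the
enum as it appears in this diff (PERF_SAMPLE_BRANCH_MAX_SHIFT is 11),
the same arithmetic gives 2048 * 4 = 8192 bytes versus 11 * 4 = 44 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEL_NOT_SUPP (-1)	/* hypothetical "filter not supported" marker */

/* Shift positions mirroring a few entries of the *_SHIFT enum in the diff. */
enum { USER_SHIFT = 0, KERNEL_SHIFT = 1, HV_SHIFT = 2, ANY_SHIFT = 3,
       MAX_SHIFT = 11 };

/* Bytes used by an int map with the given number of entries. */
static size_t map_bytes(size_t entries)
{
	return entries * sizeof(int);
}

/* Shift-indexed map: one slot per branch type, selector values made up. */
static const int demo_sel_map[MAX_SHIFT] = {
	[USER_SHIFT]	= 0x1,
	[KERNEL_SHIFT]	= 0x2,
	[HV_SHIFT]	= SEL_NOT_SUPP,
	[ANY_SHIFT]	= 0x4,
};

/*
 * Walk the set bits of br_type and OR together the selector values,
 * the same shape as the loop the patch introduces. Returns -1 if any
 * requested type is marked unsupported.
 */
static int build_mask(uint64_t br_type, const int *sel_map, int nr,
		      uint64_t *mask)
{
	*mask = 0;
	for (int i = 0; i < nr; i++) {
		if (!(br_type & (1ULL << i)))
			continue;
		if (sel_map[i] == SEL_NOT_SUPP)
			return -1;
		*mask |= (uint64_t)sel_map[i];
	}
	return 0;
}
```

The point of the design is that the array length tracks the number of
branch types rather than the largest flag value, so adding one more
type grows the map by one int instead of doubling it.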

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fc5eb39..86c675c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -509,6 +509,10 @@ struct x86_pmu {
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
+enum {
+	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 4af1061..7344f05 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
 #define LBR_FROM_FLAG_IN_TX    (1ULL << 62)
 #define LBR_FROM_FLAG_ABORT    (1ULL << 61)
 
-#define for_each_branch_sample_type(x) \
-	for ((x) = PERF_SAMPLE_BRANCH_USER; \
-	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
 /*
  * x86control flow change classification
  * x86control flow changes include branches, interrupts, traps, faults
@@ -403,14 +399,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 {
 	struct hw_perf_event_extra *reg;
 	u64 br_type = event->attr.branch_sample_type;
-	u64 mask = 0, m;
-	u64 v;
+	u64 mask = 0, v;
+	int i;
 
-	for_each_branch_sample_type(m) {
-		if (!(br_type & m))
+	for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+		if (!(br_type & (1ULL << i)))
 			continue;
 
-		v = x86_pmu.lbr_sel_map[m];
+		v = x86_pmu.lbr_sel_map[i];
 		if (v == LBR_NOT_SUPP)
 			return -EOPNOTSUPP;
 
@@ -665,35 +661,35 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 /*
  * Map interface branch filters onto LBR filters
  */
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
-	[PERF_SAMPLE_BRANCH_ANY]	= LBR_ANY,
-	[PERF_SAMPLE_BRANCH_USER]	= LBR_USER,
-	[PERF_SAMPLE_BRANCH_KERNEL]	= LBR_KERNEL,
-	[PERF_SAMPLE_BRANCH_HV]		= LBR_IGN,
-	[PERF_SAMPLE_BRANCH_ANY_RETURN]	= LBR_RETURN | LBR_REL_JMP
-					| LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_REL_JMP
+						| LBR_IND_JMP | LBR_FAR,
 	/*
 	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
 	 */
-	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
 	 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
 	/*
 	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
 	 */
-	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
-	[PERF_SAMPLE_BRANCH_COND]     = LBR_JCC,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
+	[PERF_SAMPLE_BRANCH_COND_SHIFT]     = LBR_JCC,
 };
 
-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
-	[PERF_SAMPLE_BRANCH_ANY]	= LBR_ANY,
-	[PERF_SAMPLE_BRANCH_USER]	= LBR_USER,
-	[PERF_SAMPLE_BRANCH_KERNEL]	= LBR_KERNEL,
-	[PERF_SAMPLE_BRANCH_HV]		= LBR_IGN,
-	[PERF_SAMPLE_BRANCH_ANY_RETURN]	= LBR_RETURN | LBR_FAR,
-	[PERF_SAMPLE_BRANCH_ANY_CALL]	= LBR_REL_CALL | LBR_IND_CALL
-					| LBR_FAR,
-	[PERF_SAMPLE_BRANCH_IND_CALL]	= LBR_IND_CALL,
-	[PERF_SAMPLE_BRANCH_COND]       = LBR_JCC,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
+	[PERF_SAMPLE_BRANCH_COND_SHIFT]		= LBR_JCC,
 };
 
 /* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2..7d41d59 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -151,21 +151,42 @@ enum perf_event_sample_format {
  * The branch types can be combined, however BRANCH_ANY covers all types
  * of branches and therefore it supersedes all the other types.
  */
+enum perf_branch_sample_type_shift {
+	PERF_SAMPLE_BRANCH_USER_SHIFT		= 0, /* user branches */
+	PERF_SAMPLE_BRANCH_KERNEL_SHIFT		= 1, /* kernel branches */
+	PERF_SAMPLE_BRANCH_HV_SHIFT		= 2, /* hypervisor branches */
+
+	PERF_SAMPLE_BRANCH_ANY_SHIFT		= 3, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT	= 4, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT	= 5, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL_SHIFT	= 6, /* indirect calls */
+	PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT	= 7, /* transaction aborts */
+	PERF_SAMPLE_BRANCH_IN_TX_SHIFT		= 8, /* in transaction */
+	PERF_SAMPLE_BRANCH_NO_TX_SHIFT		= 9, /* not in transaction */
+	PERF_SAMPLE_BRANCH_COND_SHIFT		= 10, /* conditional branches */
+
+	PERF_SAMPLE_BRANCH_MAX_SHIFT		/* non-ABI */
+};
+
 enum perf_branch_sample_type {
-	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user branches */
-	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel branches */
-	PERF_SAMPLE_BRANCH_HV		= 1U << 2, /* hypervisor branches */
-
-	PERF_SAMPLE_BRANCH_ANY		= 1U << 3, /* any branch types */
-	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 4, /* any call branch */
-	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 5, /* any return branch */
-	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 6, /* indirect calls */
-	PERF_SAMPLE_BRANCH_ABORT_TX	= 1U << 7, /* transaction aborts */
-	PERF_SAMPLE_BRANCH_IN_TX	= 1U << 8, /* in transaction */
-	PERF_SAMPLE_BRANCH_NO_TX	= 1U << 9, /* not in transaction */
-	PERF_SAMPLE_BRANCH_COND		= 1U << 10, /* conditional branches */
-
-	PERF_SAMPLE_BRANCH_MAX		= 1U << 11, /* non-ABI */
+	PERF_SAMPLE_BRANCH_USER		= 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+	PERF_SAMPLE_BRANCH_KERNEL	= 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+	PERF_SAMPLE_BRANCH_HV		= 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+	PERF_SAMPLE_BRANCH_ANY		= 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+	PERF_SAMPLE_BRANCH_ANY_CALL	=
+				1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+	PERF_SAMPLE_BRANCH_ANY_RETURN	=
+				1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+	PERF_SAMPLE_BRANCH_IND_CALL	=
+				1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+	PERF_SAMPLE_BRANCH_ABORT_TX	=
+				1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_IN_TX	= 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_NO_TX	= 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_COND		= 1U << PERF_SAMPLE_BRANCH_COND_SHIFT,
+
+	PERF_SAMPLE_BRANCH_MAX		= 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
-- 
1.8.3.2



* Re: [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size
  2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
@ 2014-09-24 10:50   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 10:50 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:08:58AM -0400, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>

Uhm, I'm very sure Zheng Yan wrote this.. you cannot just add your from
on that.

> 
> The index of lbr_sel_map is bit value of perf branch_sample_type.
> PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
> 4096 bytes. By using bit shift as index, we can reduce lbr_sel_map
> size to 40 bytes. This patch defines 'bit shift' for branch types,
> and use 'bit shift' to define lbr_sel_maps.
> 
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> Reviewed-by: Stephane Eranian <eranian@google.com>

And you're sending me this patch, so I would have expected a SoB from
you here.


* [PATCH V5 02/16] perf, core: introduce pmu context switch callback
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
  2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
@ 2014-09-10 14:08 ` kan.liang
  2014-09-24 11:23   ` Peter Zijlstra
  2014-09-24 13:13   ` Peter Zijlstra
  2014-09-10 14:09 ` [PATCH V5 03/16] perf, x86: use context switch callback to flush LBR stack kan.liang
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:08 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. For the schedule-in case, the callback is invoked at the
same place that the flush branch stack callback is invoked, so it
can also replace the flush branch stack callback. To avoid
unnecessary overhead, the callback is enabled only when there are
events that use the LBR stack.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c |  7 +++++
 arch/x86/kernel/cpu/perf_event.h |  2 ++
 include/linux/perf_event.h       | 10 +++++++
 kernel/events/core.c             | 59 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 78 insertions(+)
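
The gating idea (only pay for the callback when something has asked for
it) can be sketched in user space as follows. This is a simplified
model, not the kernel implementation: a plain int stands in for the
per-CPU perf_sched_cb_usages counter, there is no locking or PMU
disable/enable, and all names (toy_pmu, toy_sched_task, context_switch)
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pmu {
	/* optional context switch callback, as in struct pmu */
	void (*sched_task)(void *ctx, bool sched_in);
};

static int sched_cb_usages;	/* stands in for the per-CPU counter */
static int calls_in, calls_out;	/* how often the callback actually ran */

static void sched_cb_enable(void)
{
	sched_cb_usages++;
}

static void sched_cb_disable(void)
{
	assert(sched_cb_usages > 0);
	sched_cb_usages--;
}

/* Example callback: just count the invocations per direction. */
static void toy_sched_task(void *ctx, bool sched_in)
{
	(void)ctx;
	if (sched_in)
		calls_in++;
	else
		calls_out++;
}

/* One context switch: schedule prev out, then next in. */
static void context_switch(struct toy_pmu *pmu, int prev, int next)
{
	if (prev == next)
		return;
	if (sched_cb_usages && pmu->sched_task)
		pmu->sched_task(NULL, false);	/* prev goes out */
	if (sched_cb_usages && pmu->sched_task)
		pmu->sched_task(NULL, true);	/* next comes in */
}
```

While the counter is zero a context switch costs a single integer test;
the callback runs only between a sched_cb_enable() and the matching
sched_cb_disable().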

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 0646d3b..4c572e8 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1877,6 +1877,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	if (x86_pmu.sched_task)
+		x86_pmu.sched_task(ctx, sched_in);
+}
+
 static void x86_pmu_flush_branch_stack(void)
 {
 	if (x86_pmu.flush_branch_stack)
@@ -1910,6 +1916,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.sched_task		= x86_pmu_sched_task,
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 86c675c..0617abb 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -467,6 +467,8 @@ struct x86_pmu {
 
 	void		(*check_microcode)(void);
 	void		(*flush_branch_stack)(void);
+	void		(*sched_task)(struct perf_event_context *ctx,
+				      bool sched_in);
 
 	/*
 	 * Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..be0e870 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,14 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * context-switches callback for CPU PMU. Other PMUs shouldn't set
+	 * this callback
+	 */
+	void (*sched_task)		(struct perf_event_context *ctx,
+					bool sched_in);
+
 };
 
 /**
@@ -562,6 +570,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_disable(struct pmu *pmu);
+extern void perf_sched_cb_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
 extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1212cc4..15d640e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,6 +154,7 @@ enum event_type_t {
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
 static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -2427,6 +2428,58 @@ unlock:
 	}
 }
 
+void perf_sched_cb_disable(struct pmu *pmu)
+{
+	this_cpu_dec(perf_sched_cb_usages);
+}
+
+void perf_sched_cb_enable(struct pmu *pmu)
+{
+	this_cpu_inc(perf_sched_cb_usages);
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+				struct task_struct *next,
+				bool sched_in)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	if (prev == next)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pmu, &pmus, entry) {
+		if (pmu->sched_task) {
+			cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+			perf_pmu_disable(pmu);
+
+			pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+			perf_pmu_enable(pmu);
+
+			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+			/* only CPU PMU has context switch callback */
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
 #define for_each_task_context_nr(ctxn)					\
 	for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
 
@@ -2446,6 +2499,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
 {
 	int ctxn;
 
+	if (__get_cpu_var(perf_sched_cb_usages))
+		perf_pmu_sched_task(task, next, false);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_context_sched_out(task, ctxn, next);
 
@@ -2703,6 +2759,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	/* check for system-wide branch_stack events */
 	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
 		perf_branch_stack_sched_in(prev, task);
+
+	if (__get_cpu_var(perf_sched_cb_usages))
+		perf_pmu_sched_task(prev, task, true);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
-- 
1.8.3.2



* Re: [PATCH V5 02/16] perf, core: introduce pmu context switch callback
  2014-09-10 14:08 ` [PATCH V5 02/16] perf, core: introduce pmu context switch callback kan.liang
@ 2014-09-24 11:23   ` Peter Zijlstra
  2014-09-24 13:13   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 11:23 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:08:59AM -0400, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> The callback is invoked when process is scheduled in or out.
> It provides mechanism for later patches to save/store the LBR
> stack. For the schedule in case, the callback is invoked at
> the same place that flush branch stack callback is invoked.
> So it also can replace the flush branch stack callback. To
> avoid unnecessary overhead, the callback is enabled only when
> there are events use the LBR stack.
> 
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Same broken attribution and SoB chain.


> +void perf_sched_cb_disable(struct pmu *pmu)
> +{
> +	this_cpu_dec(perf_sched_cb_usages);
> +}
> +
> +void perf_sched_cb_enable(struct pmu *pmu)
> +{
> +	this_cpu_inc(perf_sched_cb_usages);
> +}

lkml.kernel.org/r/20140715113957.GD9918@twins.programming.kicks-ass.net

> +/*
> + * This function provides the context switch callback to the lower code
> + * layer. It is invoked ONLY when the context switch callback is enabled.
> + */
> +static void perf_pmu_sched_task(struct task_struct *prev,
> +				struct task_struct *next,
> +				bool sched_in)
> +{
> +	struct perf_cpu_context *cpuctx;
> +	struct pmu *pmu;
> +	unsigned long flags;
> +
> +	if (prev == next)
> +		return;
> +
> +	local_irq_save(flags);
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(pmu, &pmus, entry) {
> +		if (pmu->sched_task) {
> +			cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> +
> +			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +			perf_pmu_disable(pmu);
> +
> +			pmu->sched_task(cpuctx->task_ctx, sched_in);
> +
> +			perf_pmu_enable(pmu);
> +
> +			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +			/* only CPU PMU has context switch callback */
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +
> +	local_irq_restore(flags);
> +}

lkml.kernel.org/r/20140702084833.GT6758@twins.programming.kicks-ass.net

Maybe you should have read back the previous postings before taking over
this series :-)


* Re: [PATCH V5 02/16] perf, core: introduce pmu context switch callback
  2014-09-10 14:08 ` [PATCH V5 02/16] perf, core: introduce pmu context switch callback kan.liang
  2014-09-24 11:23   ` Peter Zijlstra
@ 2014-09-24 13:13   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 13:13 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:08:59AM -0400, kan.liang@intel.com wrote:
> @@ -2446,6 +2499,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>  {
>  	int ctxn;
>  
> +	if (__get_cpu_var(perf_sched_cb_usages))
> +		perf_pmu_sched_task(task, next, false);
> +
>  	for_each_task_context_nr(ctxn)
>  		perf_event_context_sched_out(task, ctxn, next);
>  
> @@ -2703,6 +2759,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
>  	/* check for system-wide branch_stack events */
>  	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
>  		perf_branch_stack_sched_in(prev, task);
> +
> +	if (__get_cpu_var(perf_sched_cb_usages))
> +		perf_pmu_sched_task(prev, task, true);
>  }
>  

I think the general idea was to get rid of __get_cpu_var() and co,
please consider using __this_cpu_read().


* [PATCH V5 03/16] perf, x86: use context switch callback to flush LBR stack
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
  2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
  2014-09-10 14:08 ` [PATCH V5 02/16] perf, core: introduce pmu context switch callback kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 04/16] perf, x86: Basic Haswell LBR call stack support kan.liang
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

The previous commit introduced the context switch callback, whose
function overlaps with the flush branch stack callback. So we can use
the context switch callback to flush the LBR stack.

This patch adds code that uses the context switch callback to
flush the LBR stack when a task is being scheduled in. The callback
is enabled only when there are events that use the LBR hardware. This
patch also removes all the old flush branch stack code.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           |  7 ---
 arch/x86/kernel/cpu/perf_event.h           |  3 +-
 arch/x86/kernel/cpu/perf_event_intel.c     | 14 +-----
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 38 +++++++++++++--
 include/linux/perf_event.h                 |  6 ---
 kernel/events/core.c                       | 77 ------------------------------
 6 files changed, 36 insertions(+), 109 deletions(-)
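
To make the intended behaviour concrete, here is a small user-space
model of the reference counting and the flush on schedule-in. It is a
sketch under assumptions, not the patch itself: toy_cpu and the toy_*
functions are hypothetical names, the counter check is folded into the
callback for brevity, and the "last user disables the callback" half is
an assumption about the symmetric path rather than something taken from
the hunks shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cpu {
	int lbr_users;		/* events currently using the LBR */
	int sched_cb_usages;	/* users of the context switch callback */
	int lbr_entries;	/* number of valid LBR records */
	const void *lbr_context;/* context the records belong to */
};

/* The first LBR user turns the context switch callback on. */
static void toy_lbr_enable(struct toy_cpu *c)
{
	if (++c->lbr_users == 1)
		c->sched_cb_usages++;
}

/* Assumed symmetric path: the last LBR user turns it off again. */
static void toy_lbr_disable(struct toy_cpu *c)
{
	assert(c->lbr_users > 0);
	if (--c->lbr_users == 0)
		c->sched_cb_usages--;
}

/*
 * Context switch callback: LBR records carry no task tag, so they are
 * dropped when a new task is scheduled in. Nothing is done on the way out.
 */
static void toy_lbr_sched_task(struct toy_cpu *c, const void *ctx,
			       bool sched_in)
{
	if (!c->sched_cb_usages)
		return;
	if (sched_in) {
		c->lbr_entries = 0;
		c->lbr_context = ctx;
	}
}
```

With two LBR events on one CPU the callback is registered once, and the
stack is cleared on every schedule-in until both events are gone.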

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4c572e8..1bbcd59 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,12 +1883,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 		x86_pmu.sched_task(ctx, sched_in);
 }
 
-static void x86_pmu_flush_branch_stack(void)
-{
-	if (x86_pmu.flush_branch_stack)
-		x86_pmu.flush_branch_stack();
-}
-
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -1915,7 +1909,6 @@ static struct pmu pmu = {
 	.commit_txn		= x86_pmu_commit_txn,
 
 	.event_idx		= x86_pmu_event_idx,
-	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 	.sched_task		= x86_pmu_sched_task,
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 0617abb..3d6d533 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -466,7 +466,6 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*flush_branch_stack)(void);
 	void		(*sched_task)(struct perf_event_context *ctx,
 				      bool sched_in);
 
@@ -727,6 +726,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_pmu_lbr_reset(void);
 
 void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 89bc750..61463f7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2044,18 +2044,6 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
-static void intel_pmu_flush_branch_stack(void)
-{
-	/*
-	 * Intel LBR does not tag entries with the
-	 * PID of the current task, then we need to
-	 * flush it on ctxsw
-	 * For now, we simply reset it
-	 */
-	if (x86_pmu.lbr_nr)
-		intel_pmu_lbr_reset();
-}
-
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2107,7 +2095,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.flush_branch_stack	= intel_pmu_flush_branch_stack,
+	.sched_task		= intel_pmu_lbr_sched_task,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 7344f05..4e1e6a5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -177,13 +177,36 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
 	if (!x86_pmu.lbr_nr)
 		return;
+	/*
+	 * When sampling the branch stack in system-wide, it may be
+	 * necessary to flush the stack on context switch. This happens
+	 * when the branch stack does not tag its entries with the pid
+	 * of the current task. Otherwise it becomes impossible to
+	 * associate a branch entry with a task. This ambiguity is more
+	 * likely to appear when the branch stack supports priv level
+	 * filtering and the user sets it to monitor only at the user
+	 * level (which could be a useful measurement in system-wide
+	 * mode). In that case, the risk is high of having a branch
+	 * stack with branches from multiple tasks.
+	 */
+	if (sched_in) {
+		intel_pmu_lbr_reset();
+		cpuc->lbr_context = ctx;
+	}
+}
+
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
+	if (!x86_pmu.lbr_nr)
+		return;
 	/*
 	 * Reset the LBR stack if we changed task context to
 	 * avoid data leaks.
@@ -195,6 +218,8 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
+	if (cpuc->lbr_users == 1)
+		perf_sched_cb_enable(event->ctx->pmu);
 }
 
 void intel_pmu_lbr_disable(struct perf_event *event)
@@ -207,10 +232,13 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users) {
-		__intel_pmu_lbr_disable();
-		/* avoid stale pointer */
-		cpuc->lbr_context = NULL;
+	if (!cpuc->lbr_users) {
+		perf_sched_cb_disable(event->ctx->pmu);
+		if (cpuc->enabled) {
+			__intel_pmu_lbr_disable();
+			/* avoid stale pointer */
+			cpuc->lbr_context = NULL;
+		}
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index be0e870..fc355c4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -260,11 +260,6 @@ struct pmu {
 	int (*event_idx)		(struct perf_event *event); /*optional */
 
 	/*
-	 * flush branch stack on context-switches (needed in cpu-wide mode)
-	 */
-	void (*flush_branch_stack)	(void);
-
-	/*
 	 * context-switches callback for CPU PMU. Other PMUs shouldn't set
 	 * this callback
 	 */
@@ -515,7 +510,6 @@ struct perf_event_context {
 	u64				generation;
 	int				pin_count;
 	int				nr_cgroups;	 /* cgroup evts */
-	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 
 	struct delayed_work		orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 15d640e..3248f46 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -153,7 +153,6 @@ enum event_type_t {
  */
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 static atomic_t nr_mmap_events __read_mostly;
@@ -1147,9 +1146,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
-	if (has_branch_stack(event))
-		ctx->nr_branch_stack++;
-
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1312,9 +1308,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
-	if (has_branch_stack(event))
-		ctx->nr_branch_stack--;
-
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2667,64 +2660,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
-				       struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
-	unsigned long flags;
-
-	/* no need to flush branch stack if not changing task */
-	if (prev == task)
-		return;
-
-	local_irq_save(flags);
-
-	rcu_read_lock();
-
-	list_for_each_entry_rcu(pmu, &pmus, entry) {
-		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
-		/*
-		 * check if the context has at least one
-		 * event using PERF_SAMPLE_BRANCH_STACK
-		 */
-		if (cpuctx->ctx.nr_branch_stack > 0
-		    && pmu->flush_branch_stack) {
-
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
-			perf_pmu_disable(pmu);
-
-			pmu->flush_branch_stack();
-
-			perf_pmu_enable(pmu);
-
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-		}
-	}
-
-	rcu_read_unlock();
-
-	local_irq_restore(flags);
-}
-
-/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2756,10 +2691,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
 
-	/* check for system-wide branch_stack events */
-	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
-		perf_branch_stack_sched_in(prev, task);
-
 	if (__get_cpu_var(perf_sched_cb_usages))
 		perf_pmu_sched_task(prev, task, true);
 }
@@ -3346,10 +3277,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
 	if (event->parent)
 		return;
 
-	if (has_branch_stack(event)) {
-		if (!(event->attach_state & PERF_ATTACH_TASK))
-			atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
-	}
 	if (is_cgroup_event(event))
 		atomic_dec(&per_cpu(perf_cgroup_events, cpu));
 }
@@ -6910,10 +6837,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
 	if (event->parent)
 		return;
 
-	if (has_branch_stack(event)) {
-		if (!(event->attach_state & PERF_ATTACH_TASK))
-			atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
-	}
 	if (is_cgroup_event(event))
 		atomic_inc(&per_cpu(perf_cgroup_events, cpu));
 }
-- 
1.8.3.2



* [PATCH V5 04/16] perf, x86: Basic Haswell LBR call stack support
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (2 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 03/16] perf, x86: use context switch callback to flush LBR stack kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 05/16] perf, core: pmu specific data for perf task context kan.liang
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

Haswell has a new feature that utilizes the existing LBR facility to
record call chains. To enable this feature, bits (JCC, NEAR_IND_JMP,
NEAR_REL_JMP, FAR_BRANCH, EN_CALLSTACK) in LBR_SELECT must be set to 1,
and bits (NEAR_REL_CALL, NEAR_IND_CALL, NEAR_RET) must be cleared. Due to
a hardware bug in Haswell, this feature doesn't work well with
FREEZE_LBRS_ON_PMI.

When the call stack feature is enabled, the LBR stack will capture
unfiltered call data normally, but as return instructions are executed,
the last captured branch record is flushed from the on-chip registers
in a last-in first-out (LIFO) manner. Thus, branch information related
to leaf functions will not be captured, while the call stack
information of the main line execution path is preserved.
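
As a rough illustration of the LIFO behaviour described above, here is
a small user-space model. It is only a sketch: struct lbr_model and the
lbr_model_*() helpers are made-up names, not kernel code, and dropping
the oldest record on overflow is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

#define LBR_DEPTH 16	/* Haswell exposes 16 LBR entries */

/* Hypothetical model of the LBR in call-stack mode; not kernel code. */
struct lbr_model {
	unsigned long from[LBR_DEPTH];
	unsigned long to[LBR_DEPTH];
	size_t depth;		/* number of valid records */
};

/* A call pushes a record; when full, the oldest record is dropped. */
static void lbr_model_call(struct lbr_model *lbr,
			   unsigned long from, unsigned long to)
{
	size_t i;

	if (lbr->depth == LBR_DEPTH) {
		for (i = 1; i < LBR_DEPTH; i++) {
			lbr->from[i - 1] = lbr->from[i];
			lbr->to[i - 1] = lbr->to[i];
		}
		lbr->depth--;
	}
	lbr->from[lbr->depth] = from;
	lbr->to[lbr->depth] = to;
	lbr->depth++;
}

/* A return pops the most recently captured record (LIFO). */
static void lbr_model_ret(struct lbr_model *lbr)
{
	if (lbr->depth)
		lbr->depth--;
}

/* main -> foo -> leaf, then leaf returns: only main -> foo remains. */
static void lbr_model_demo(void)
{
	struct lbr_model lbr = { .depth = 0 };

	lbr_model_call(&lbr, 0x100, 0x200);	/* main calls foo */
	lbr_model_call(&lbr, 0x210, 0x300);	/* foo calls leaf */
	lbr_model_ret(&lbr);			/* leaf returns */

	assert(lbr.depth == 1);
	assert(lbr.to[0] == 0x200);
}
```

After the leaf returns, its record is gone, which is why a sample taken
in foo shows the chain leading to foo but nothing about completed leaf
calls.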

This patch defines a separate lbr_sel map for Haswell. The map contains
a new entry for the call stack feature.
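
The LBR_SELECT value is now computed with an XOR instead of an
invert-and-mask, because the new call stack bit is an enable bit while
the low bits are suppress bits. The sketch below checks that the two
forms agree. SEL_MASK = 0x1ff is assumed to be the value of
LBR_SEL_MASK (the "first 9 bits"); the lbr_select_*() helpers are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SEL_MASK	0x1ffULL	/* bits 0-8, suppress mode (assumed LBR_SEL_MASK) */
#define CALL_STACK	(1ULL << 9)	/* LBR_CALL_STACK_BIT from this patch */

/* Long form from the code comment: invert low bits, pass high bits through. */
static uint64_t lbr_select_long(uint64_t mask)
{
	return (~mask & SEL_MASK) | (mask & ~SEL_MASK);
}

/* Short form used by the patch. */
static uint64_t lbr_select_xor(uint64_t mask)
{
	return mask ^ SEL_MASK;
}

/* Check both forms agree for every combination of the ten bits involved. */
static void lbr_select_check(void)
{
	uint64_t mask;

	for (mask = 0; mask <= (SEL_MASK | CALL_STACK); mask++)
		assert(lbr_select_long(mask) == lbr_select_xor(mask));
}
```

For example, a mask with bits 3-5 and bit 9 set (0x238) gives 0x3c7:
bits 3-5 are cleared, the remaining low bits are set, and bit 9 stays
set.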

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           | 14 ++++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 91 ++++++++++++++++++++++--------
 3 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3d6d533..13464e4 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -511,7 +511,11 @@ struct x86_pmu {
 };
 
 enum {
-	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+	PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+	PERF_SAMPLE_BRANCH_CALL_STACK =
+				1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
 };
 
 #define x86_add_quirk(func_)						\
@@ -545,6 +549,12 @@ static struct perf_pmu_events_attr event_attr_##v = {			\
 
 extern struct x86_pmu x86_pmu __read_mostly;
 
+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+	return  x86_pmu.lbr_sel_map &&
+		x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
 DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 int x86_perf_event_set_period(struct perf_event *event);
@@ -748,6 +758,8 @@ void intel_pmu_lbr_init_atom(void);
 
 void intel_pmu_lbr_init_snb(void);
 
+void intel_pmu_lbr_init_hsw(void);
+
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 61463f7..1242314 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2537,7 +2537,7 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 
-		intel_pmu_lbr_init_snb();
+		intel_pmu_lbr_init_hsw();
 
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 4e1e6a5..3a63a25 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
 #define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
 #define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
 #define LBR_FAR_BIT		8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT	9 /* enable call stack */
 
 #define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
 #define LBR_USER	(1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
 #define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
 #define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
 #define LBR_FAR		(1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK	(1 << LBR_CALL_STACK_BIT)
 
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
@@ -74,24 +76,25 @@ static enum {
  * x86control flow changes include branches, interrupts, traps, faults
  */
 enum {
-	X86_BR_NONE     = 0,      /* unknown */
-
-	X86_BR_USER     = 1 << 0, /* branch target is user */
-	X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
-
-	X86_BR_CALL     = 1 << 2, /* call */
-	X86_BR_RET      = 1 << 3, /* return */
-	X86_BR_SYSCALL  = 1 << 4, /* syscall */
-	X86_BR_SYSRET   = 1 << 5, /* syscall return */
-	X86_BR_INT      = 1 << 6, /* sw interrupt */
-	X86_BR_IRET     = 1 << 7, /* return from interrupt */
-	X86_BR_JCC      = 1 << 8, /* conditional */
-	X86_BR_JMP      = 1 << 9, /* jump */
-	X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
-	X86_BR_IND_CALL = 1 << 11,/* indirect calls */
-	X86_BR_ABORT    = 1 << 12,/* transaction abort */
-	X86_BR_IN_TX    = 1 << 13,/* in transaction */
-	X86_BR_NO_TX    = 1 << 14,/* not in transaction */
+	X86_BR_NONE		= 0,      /* unknown */
+
+	X86_BR_USER		= 1 << 0, /* branch target is user */
+	X86_BR_KERNEL		= 1 << 1, /* branch target is kernel */
+
+	X86_BR_CALL		= 1 << 2, /* call */
+	X86_BR_RET		= 1 << 3, /* return */
+	X86_BR_SYSCALL		= 1 << 4, /* syscall */
+	X86_BR_SYSRET		= 1 << 5, /* syscall return */
+	X86_BR_INT		= 1 << 6, /* sw interrupt */
+	X86_BR_IRET		= 1 << 7, /* return from interrupt */
+	X86_BR_JCC		= 1 << 8, /* conditional */
+	X86_BR_JMP		= 1 << 9, /* jump */
+	X86_BR_IRQ		= 1 << 10,/* hw interrupt or trap or fault */
+	X86_BR_IND_CALL		= 1 << 11,/* indirect calls */
+	X86_BR_ABORT		= 1 << 12,/* transaction abort */
+	X86_BR_IN_TX		= 1 << 13,/* in transaction */
+	X86_BR_NO_TX		= 1 << 14,/* not in transaction */
+	X86_BR_CALL_STACK	= 1 << 15,/* call stack */
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -374,7 +377,7 @@ void intel_pmu_lbr_read(void)
  * - in case there is no HW filter
  * - in case the HW filter has errata or limitations
  */
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 {
 	u64 br_type = event->attr.branch_sample_type;
 	int mask = 0;
@@ -411,11 +414,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 	if (br_type & PERF_SAMPLE_BRANCH_COND)
 		mask |= X86_BR_JCC;
 
+	if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+		if (!x86_pmu_has_lbr_callstack())
+			return -EOPNOTSUPP;
+		if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+			return -EINVAL;
+		mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+			X86_BR_CALL_STACK;
+	}
+
 	/*
 	 * stash actual user request into reg, it may
 	 * be used by fixup code for some CPU
 	 */
 	event->hw.branch_reg.reg = mask;
+	return 0;
 }
 
 /*
@@ -444,8 +457,12 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	reg = &event->hw.branch_reg;
 	reg->idx = EXTRA_REG_LBR;
 
-	/* LBR_SELECT operates in suppress mode so invert mask */
-	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+	/*
+	 * The first 9 bits (LBR_SEL_MASK) in LBR_SELECT operate
+	 * in suppress mode. So LBR_SELECT should be set to
+	 * (~mask & LBR_SEL_MASK) | (mask & ~LBR_SEL_MASK)
+	 */
+	reg->config = mask ^ x86_pmu.lbr_sel_mask;
 
 	return 0;
 }
@@ -463,7 +480,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
 	/*
 	 * setup SW LBR filter
 	 */
-	intel_pmu_setup_sw_lbr_filter(event);
+	ret = intel_pmu_setup_sw_lbr_filter(event);
+	if (ret)
+		return ret;
 
 	/*
 	 * setup HW LBR filter, if any
@@ -720,6 +739,20 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
 	[PERF_SAMPLE_BRANCH_COND_SHIFT]		= LBR_JCC,
 };
 
+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
+	[PERF_SAMPLE_BRANCH_COND_SHIFT]		= LBR_JCC,
+	[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_RETURN | LBR_CALL_STACK,
+};
+
 /* core */
 void __init intel_pmu_lbr_init_core(void)
 {
@@ -776,6 +809,20 @@ void __init intel_pmu_lbr_init_snb(void)
 	pr_cont("16-deep LBR, ");
 }
 
+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
 /* atom */
 void __init intel_pmu_lbr_init_atom(void)
 {
-- 
1.8.3.2



* [PATCH V5 05/16] perf, core: pmu specific data for perf task context
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (3 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 04/16] perf, x86: Basic Haswell LBR call stack support kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 06/16] perf, core: always switch pmu specific data during context switch kan.liang
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

Introduce a new flag PERF_ATTACH_TASK_DATA for perf event's attach
state. The flag is set by the PMU's event_init() callback and indicates
that the perf event needs PMU specific data.

The PMU specific data are initialized to zeros. Later patches will
use the PMU specific data to save the LBR stack.
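
find_get_context() allocates the data before taking the context lock,
attaches it only if the context has none yet, and frees whatever is
left over. A minimal user-space sketch of that pattern follows;
struct ctx_model and ctx_attach_data() are hypothetical names, and
calloc()/free() stand in for kzalloc()/kfree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the one field of perf_event_context used here. */
struct ctx_model {
	void *task_ctx_data;
};

/*
 * Attach preallocated data if the context has none yet.
 * Returns the pointer the caller still owns and must free
 * (NULL if ownership moved to the context).
 */
static void *ctx_attach_data(struct ctx_model *ctx, void *data)
{
	if (data && !ctx->task_ctx_data) {
		ctx->task_ctx_data = data;
		return NULL;
	}
	return data;
}

static void ctx_attach_demo(void)
{
	struct ctx_model ctx = { NULL };
	void *first = calloc(1, 64);
	void *second = calloc(1, 64);

	assert(first && second);
	assert(ctx_attach_data(&ctx, first) == NULL);		/* slot empty: attached */
	assert(ctx_attach_data(&ctx, second) == second);	/* slot taken: caller keeps it */
	free(second);
	free(ctx.task_ctx_data);
}
```

Allocating up front keeps the allocation out of the locked region; the
cost is an occasional allocation that is freed unused.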

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 include/linux/perf_event.h |  6 ++++++
 kernel/events/core.c       | 40 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fc355c4..5f857da 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -265,6 +265,10 @@ struct pmu {
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
 					bool sched_in);
+	/*
+	 * PMU specific data size
+	 */
+	size_t				task_ctx_size;
 
 };
 
@@ -301,6 +305,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_CONTEXT	0x01
 #define PERF_ATTACH_GROUP	0x02
 #define PERF_ATTACH_TASK	0x04
+#define PERF_ATTACH_TASK_DATA	0x08
 
 struct perf_cgroup;
 struct ring_buffer;
@@ -510,6 +515,7 @@ struct perf_event_context {
 	u64				generation;
 	int				pin_count;
 	int				nr_cgroups;	 /* cgroup evts */
+	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
 
 	struct delayed_work		orphans_remove;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3248f46..3a1458c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -900,6 +900,15 @@ static void get_ctx(struct perf_event_context *ctx)
 	WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
 }
 
+static void free_ctx(struct rcu_head *head)
+{
+	struct perf_event_context *ctx;
+
+	ctx = container_of(head, struct perf_event_context, rcu_head);
+	kfree(ctx->task_ctx_data);
+	kfree(ctx);
+}
+
 static void put_ctx(struct perf_event_context *ctx)
 {
 	if (atomic_dec_and_test(&ctx->refcount)) {
@@ -907,7 +916,7 @@ static void put_ctx(struct perf_event_context *ctx)
 			put_ctx(ctx->parent_ctx);
 		if (ctx->task)
 			put_task_struct(ctx->task);
-		kfree_rcu(ctx, rcu_head);
+		call_rcu(&ctx->rcu_head, free_ctx);
 	}
 }
 
@@ -3178,12 +3187,15 @@ errout:
  * Returns a matching context with refcount and pincount.
  */
 static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
+find_get_context(struct pmu *pmu, struct task_struct *task,
+		 struct perf_event *event)
 {
 	struct perf_event_context *ctx;
 	struct perf_cpu_context *cpuctx;
+	void *task_ctx_data = NULL;
 	unsigned long flags;
 	int ctxn, err;
+	int cpu = event->cpu;
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
@@ -3211,11 +3223,24 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 	if (ctxn < 0)
 		goto errout;
 
+	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+		task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+		if (!task_ctx_data) {
+			err = -ENOMEM;
+			goto errout;
+		}
+	}
+
 retry:
 	ctx = perf_lock_task_context(task, ctxn, &flags);
 	if (ctx) {
 		unclone_ctx(ctx);
 		++ctx->pin_count;
+
+		if (task_ctx_data && !ctx->task_ctx_data) {
+			ctx->task_ctx_data = task_ctx_data;
+			task_ctx_data = NULL;
+		}
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	} else {
 		ctx = alloc_perf_context(pmu, task);
@@ -3223,6 +3248,11 @@ retry:
 		if (!ctx)
 			goto errout;
 
+		if (task_ctx_data) {
+			ctx->task_ctx_data = task_ctx_data;
+			task_ctx_data = NULL;
+		}
+
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
 		/*
@@ -3249,9 +3279,11 @@ retry:
 		}
 	}
 
+	kfree(task_ctx_data);
 	return ctx;
 
 errout:
+	kfree(task_ctx_data);
 	return ERR_PTR(err);
 }
 
@@ -7319,7 +7351,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	/*
 	 * Get the target context (task or percpu):
 	 */
-	ctx = find_get_context(pmu, task, event->cpu);
+	ctx = find_get_context(pmu, task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
 		goto err_alloc;
@@ -7488,7 +7520,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 	account_event(event);
 
-	ctx = find_get_context(event->pmu, task, cpu);
+	ctx = find_get_context(event->pmu, task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
 		goto err_free;
-- 
1.8.3.2



* [PATCH V5 06/16] perf, core: always switch pmu specific data during context switch
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (4 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 05/16] perf, core: pmu specific data for perf task context kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 07/16] perf, x86: allocate space for storing LBR stack kan.liang
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

If two tasks were both forked from the same parent task, the events in
their perf task contexts can be the same. In that case perf core may
skip switching the perf event contexts.

The previous patch introduces pmu specific data. The data is for saving
the LBR stack and is task specific, so we need to switch the data
even when the context switch is optimized out.
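
To illustrate why the swap is needed: in the optimized path the two
context objects change owners, so per-task data stored in them would
end up with the wrong task unless it is swapped back. The model below
is only a sketch with made-up types (task_model, ctx_model), not the
kernel structures.

```c
#include <assert.h>
#include <stddef.h>

struct task_model;

/* Hypothetical, reduced versions of the context and task structures. */
struct ctx_model {
	struct task_model *task;
	void *task_ctx_data;
};

struct task_model {
	struct ctx_model *ctx;
};

/* Optimized switch: exchange the context objects, keep the data per task. */
static void switch_equivalent(struct task_model *prev, struct task_model *next)
{
	struct ctx_model *ctx = prev->ctx, *next_ctx = next->ctx;
	void *tmp;

	/* hand each context object to the other task */
	prev->ctx = next_ctx;
	next->ctx = ctx;
	ctx->task = next;
	next_ctx->task = prev;

	/* task specific data must follow its task, so swap it back */
	tmp = ctx->task_ctx_data;
	ctx->task_ctx_data = next_ctx->task_ctx_data;
	next_ctx->task_ctx_data = tmp;
}

static void switch_demo(void)
{
	int data_a, data_b;
	struct task_model a, b;
	struct ctx_model ctx_a = { &a, &data_a }, ctx_b = { &b, &data_b };

	a.ctx = &ctx_a;
	b.ctx = &ctx_b;
	switch_equivalent(&a, &b);

	assert(a.ctx == &ctx_b && b.ctx == &ctx_a);	/* contexts exchanged */
	assert(a.ctx->task_ctx_data == &data_a);	/* data stayed with task a */
	assert(b.ctx->task_ctx_data == &data_b);	/* data stayed with task b */
}
```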

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 kernel/events/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3a1458c..5f49df2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2412,6 +2412,9 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 			next->perf_event_ctxp[ctxn] = ctx;
 			ctx->task = next;
 			next_ctx->task = task;
+
+			swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+
 			do_switch = 0;
 
 			perf_event_sync_stat(ctx, next_ctx);
-- 
1.8.3.2



* [PATCH V5 07/16] perf, x86: allocate space for storing LBR stack
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (5 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 06/16] perf, core: always switch pmu specific data during context switch kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack kan.liang
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. We can use pmu specific data to
store the LBR stack when the task is scheduled out. This patch adds
code that allocates the pmu specific data.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c | 4 ++++
 arch/x86/kernel/cpu/perf_event.h | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1bbcd59..a18fd78 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -426,6 +426,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 		}
 	}
 
+	if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+		event->attach_state |= PERF_ATTACH_TASK_DATA;
+
 	/*
 	 * Generate PMC IRQs:
 	 * (keep 'enabled' bit clear for now)
@@ -1910,6 +1913,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
+	.task_ctx_size          = sizeof(struct x86_perf_task_context),
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 13464e4..b4568e5 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -510,6 +510,13 @@ struct x86_pmu {
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
+struct x86_perf_task_context {
+	u64 lbr_from[MAX_LBR_ENTRIES];
+	u64 lbr_to[MAX_LBR_ENTRIES];
+	int lbr_callstack_users;
+	int lbr_stack_state;
+};
+
 enum {
 	PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
 	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
-- 
1.8.3.2



* [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (6 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 07/16] perf, x86: allocate space for storing LBR stack kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-24 12:53   ` Peter Zijlstra
  2014-09-10 14:09 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

When enabling/disabling an event, check if the event uses the LBR
callstack feature and adjust the LBR callstack usage count accordingly.
A later patch will use the usage count to decide if the LBR stack should
be saved/restored.
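
A stand-alone sketch of the counting logic, for illustration only.
The X86_BR_* values are the ones defined earlier in this series;
struct task_ctx_model and the lbr_*_model() helpers are hypothetical,
and the flag test is placed before the pointer test, which is one
possible ordering rather than the exact code in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define X86_BR_USER		(1 << 0)	/* branch target is user */
#define X86_BR_CALL_STACK	(1 << 15)	/* call stack */

/* Hypothetical stand-in for the counter in x86_perf_task_context. */
struct task_ctx_model {
	int lbr_callstack_users;
};

static bool branch_user_callstack(unsigned br_sel)
{
	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
}

/* Count the event only if it is a user call stack event with task data. */
static void lbr_enable_model(struct task_ctx_model *task_ctx, unsigned br_sel)
{
	if (branch_user_callstack(br_sel) && task_ctx)
		task_ctx->lbr_callstack_users++;
}

static void lbr_disable_model(struct task_ctx_model *task_ctx, unsigned br_sel)
{
	if (branch_user_callstack(br_sel) && task_ctx)
		task_ctx->lbr_callstack_users--;
}

static void lbr_users_demo(void)
{
	struct task_ctx_model tc = { 0 };
	unsigned callstack = X86_BR_USER | X86_BR_CALL_STACK;

	lbr_enable_model(&tc, callstack);	/* counted */
	lbr_enable_model(&tc, X86_BR_USER);	/* plain branch sampling: ignored */
	lbr_enable_model(NULL, callstack);	/* no task data: ignored */
	assert(tc.lbr_callstack_users == 1);

	lbr_disable_model(&tc, callstack);
	assert(tc.lbr_callstack_users == 0);
}
```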

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 3a63a25..8c6da0f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -204,9 +204,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 	}
 }
 
+static inline bool branch_user_callstack(unsigned br_sel)
+{
+	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
+}
+
 void intel_pmu_lbr_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
@@ -220,6 +226,10 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	}
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
+	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
+	if (task_ctx && branch_user_callstack(cpuc->br_sel))
+		task_ctx->lbr_callstack_users++;
+
 	cpuc->lbr_users++;
 	if (cpuc->lbr_users == 1)
 		perf_sched_cb_enable(event->ctx->pmu);
@@ -228,10 +238,15 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 void intel_pmu_lbr_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
+	if (task_ctx && branch_user_callstack(cpuc->br_sel))
+		task_ctx->lbr_callstack_users--;
+
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-- 
1.8.3.2



* Re: [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack
  2014-09-10 14:09 ` [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack kan.liang
@ 2014-09-24 12:53   ` Peter Zijlstra
  2014-10-07  2:59     ` Liang, Kan
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 12:53 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:09:05AM -0400, kan.liang@intel.com wrote:
> @@ -204,9 +204,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>  	}
>  }
>  
> +static inline bool branch_user_callstack(unsigned br_sel)
> +{
> +	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
> +}
> +
>  void intel_pmu_lbr_enable(struct perf_event *event)
>  {
>  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +	struct x86_perf_task_context *task_ctx;
>  
>  	if (!x86_pmu.lbr_nr)
>  		return;
> @@ -220,6 +226,10 @@ void intel_pmu_lbr_enable(struct perf_event *event)
>  	}
>  	cpuc->br_sel = event->hw.branch_reg.reg;
>  
> +	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
> +	if (task_ctx && branch_user_callstack(cpuc->br_sel))
> +		task_ctx->lbr_callstack_users++;
> +

Does it make sense to flip those conditions to avoid a potentially
useless dereference?


* RE: [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack
  2014-09-24 12:53   ` Peter Zijlstra
@ 2014-10-07  2:59     ` Liang, Kan
  2014-10-07 15:19       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Liang, Kan @ 2014-10-07  2:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng


> 
> On Wed, Sep 10, 2014 at 10:09:05AM -0400, kan.liang@intel.com wrote:
> > @@ -204,9 +204,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
> >  	}
> >  }
> >
> > +static inline bool branch_user_callstack(unsigned br_sel)
> > +{
> > +	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
> > +}
> > +
> >  void intel_pmu_lbr_enable(struct perf_event *event)
> >  {
> >  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > +	struct x86_perf_task_context *task_ctx;
> >
> >  	if (!x86_pmu.lbr_nr)
> >  		return;
> > @@ -220,6 +226,10 @@ void intel_pmu_lbr_enable(struct perf_event *event)
> >  	}
> >  	cpuc->br_sel = event->hw.branch_reg.reg;
> >
> > +	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
> > +	if (task_ctx && branch_user_callstack(cpuc->br_sel))
> > +		task_ctx->lbr_callstack_users++;
> > +
> 
> Does it make sense to flip those conditions to avoid a potentially useless
> dereference?

I'm not quite sure I understand what you mean here.
lbr_callstack_users indicates whether the LBR stack should be saved/restored on context switch.
Here, we only change lbr_callstack_users when the event uses the LBR call stack and there is space for saving the LBR stack.

Should I change the code as below?
+       if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
+               (task_ctx = event->ctx->task_ctx_data))
+               task_ctx->lbr_callstack_users++;

Kan



* Re: [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack
  2014-10-07  2:59     ` Liang, Kan
@ 2014-10-07 15:19       ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-10-07 15:19 UTC (permalink / raw)
  To: Liang, Kan
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng

On Tue, Oct 07, 2014 at 02:59:20AM +0000, Liang, Kan wrote:
> > On Wed, Sep 10, 2014 at 10:09:05AM -0400, kan.liang@intel.com wrote:
> > > @@ -204,9 +204,15 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
> > >  	}
> > >  }
> > >
> > > +static inline bool branch_user_callstack(unsigned br_sel)
> > > +{
> > > +	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
> > > +}
> > > +
> > >  void intel_pmu_lbr_enable(struct perf_event *event)
> > >  {
> > >  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > > +	struct x86_perf_task_context *task_ctx;
> > >
> > >  	if (!x86_pmu.lbr_nr)
> > >  		return;
> > > @@ -220,6 +226,10 @@ void intel_pmu_lbr_enable(struct perf_event *event)
> > >  	}
> > >  	cpuc->br_sel = event->hw.branch_reg.reg;
> > >
> > > +	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
> > > +	if (task_ctx && branch_user_callstack(cpuc->br_sel))
> > > +		task_ctx->lbr_callstack_users++;
> > > +
> > 
> > Does it make sense to flip those conditions to avoid a potentially useless
> > dereference?
> 
> I'm not quite sure I understand your meaning here.
> But lbr_callstack_users is an indicator for save/restore the LBR stack on context switch.
> Here, we only change the lbr_callstack_users, when it's LBR call stack and has space for saving LBR stack.
> 
> Should I change the code as below?
> +       if (branch_user_callstack(cpuc->br_sel) && event->ctx &&
> +               (task_ctx = event->ctx->task_ctx_data))
> +               task_ctx->lbr_callstack_users++;

Yes, that avoids the ctx->task_ctx_data deref when
!branch_user_callstack().
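
For illustration, a minimal user-space sketch of the flipped condition.
The struct names, the flag values, and the lbr_track_callstack_user()
helper are hypothetical stand-ins, not the kernel definitions; the point
is only that && short-circuits, so the context is never dereferenced
when the event is not a user call-stack event.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag values for illustration; not the kernel's. */
#define X86_BR_USER        (1u << 0)
#define X86_BR_CALL_STACK  (1u << 1)

/* Hypothetical stand-ins for x86_perf_task_context / perf_event_context. */
struct task_ctx_sketch { int lbr_callstack_users; };
struct event_ctx_sketch { struct task_ctx_sketch *task_ctx_data; };

static inline bool branch_user_callstack(unsigned br_sel)
{
	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
}

/* Returns true if the user count was incremented. */
static bool lbr_track_callstack_user(unsigned br_sel,
				     struct event_ctx_sketch *ctx)
{
	struct task_ctx_sketch *task_ctx;

	/*
	 * The cheap flag test comes first. Because && short-circuits,
	 * ctx->task_ctx_data is only read when the flag test passes
	 * and ctx is non-NULL.
	 */
	if (branch_user_callstack(br_sel) && ctx &&
	    (task_ctx = ctx->task_ctx_data)) {
		task_ctx->lbr_callstack_users++;
		return true;
	}
	return false;
}
```

With br_sel lacking X86_BR_CALL_STACK the function returns before
touching ctx at all, which is the saving discussed above.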

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (7 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-24 13:33   ` Peter Zijlstra
  2014-09-10 14:09 ` [PATCH V5 10/16] perf, core: simplify need branch stack check kan.liang
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is to save/restore
the LBR stack to/from the task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses the LBR call stack, the LBR stack
is reset when the task is scheduled in.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 88 ++++++++++++++++++++++++++----
 1 file changed, 76 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8c6da0f..6aabbb4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -180,13 +180,89 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+	u64 tos;
+
+	rdmsrl(x86_pmu.lbr_tos, tos);
+	return tos;
+}
+
+enum {
+	LBR_NONE,
+	LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask;
+	u64 tos;
+
+	if (task_ctx->lbr_callstack_users == 0 ||
+	    task_ctx->lbr_stack_state == LBR_NONE) {
+		intel_pmu_lbr_reset();
+		return;
+	}
+
+	mask = x86_pmu.lbr_nr - 1;
+	tos = intel_pmu_lbr_tos();
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask;
+	u64 tos;
+
+	if (task_ctx->lbr_callstack_users == 0) {
+		task_ctx->lbr_stack_state = LBR_NONE;
+		return;
+	}
+
+	mask = x86_pmu.lbr_nr - 1;
+	tos = intel_pmu_lbr_tos();
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_VALID;
+}
+
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 	/*
+	 * If LBR callstack feature is enabled and the stack was saved when
+	 * the task was scheduled out, restore the stack. Otherwise flush
+	 * the LBR stack.
+	 */
+	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+	if (task_ctx) {
+		if (sched_in) {
+			__intel_pmu_lbr_restore(task_ctx);
+			cpuc->lbr_context = ctx;
+		} else {
+			__intel_pmu_lbr_save(task_ctx);
+		}
+	}
+
+	/*
 	 * When sampling the branck stack in system-wide, it may be
 	 * necessary to flush the stack on context switch. This happens
 	 * when the branch stack does not tag its entries with the pid
@@ -276,18 +352,6 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
-	u64 tos;
-
-	rdmsrl(x86_pmu.lbr_tos, tos);
-
-	return tos;
-}
-
 static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
-- 
1.8.3.2
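
The indexing used by __intel_pmu_lbr_save() and __intel_pmu_lbr_restore()
in the patch above can be sketched in user space as follows. This is
only a model under stated assumptions: plain arrays stand in for the
LBR MSRs, the struct names and the LBR_NR value are made up, and the
number of entries is assumed to be a power of two, which the mask
arithmetic relies on.

```c
#include <stdint.h>

#define LBR_NR 16	/* assumed power of two */

/* Hypothetical stand-in for the LBR MSR bank plus the TOS register. */
struct lbr_hw_sketch {
	uint64_t tos;
	uint64_t from[LBR_NR];
	uint64_t to[LBR_NR];
};

/* Hypothetical stand-in for the per-task save area. */
struct lbr_saved_sketch {
	uint64_t from[LBR_NR];
	uint64_t to[LBR_NR];
};

/* Save entries in logical order: slot 0 is the most recent branch. */
static void lbr_save_sketch(const struct lbr_hw_sketch *hw,
			    struct lbr_saved_sketch *s)
{
	unsigned mask = LBR_NR - 1;

	for (int i = 0; i < LBR_NR; i++) {
		unsigned idx = (hw->tos - i) & mask;

		s->from[i] = hw->from[idx];
		s->to[i] = hw->to[idx];
	}
}

/*
 * Restore relative to whatever TOS the hardware has now. The TOS
 * itself is not rewritten, mirroring the loops in the patch.
 */
static void lbr_restore_sketch(struct lbr_hw_sketch *hw,
			       const struct lbr_saved_sketch *s)
{
	unsigned mask = LBR_NR - 1;

	for (int i = 0; i < LBR_NR; i++) {
		unsigned idx = (hw->tos - i) & mask;

		hw->from[idx] = s->from[i];
		hw->to[idx] = s->to[i];
	}
}
```

Because the save area is ordered by age rather than by register index,
the entries come back in the right order even if the TOS at sched-in
differs from the TOS at sched-out.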


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch
  2014-09-10 14:09 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
@ 2014-09-24 13:33   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 13:33 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:09:06AM -0400, kan.liang@intel.com wrote:
>  void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>  {
>  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +	struct x86_perf_task_context *task_ctx;
>  
>  	if (!x86_pmu.lbr_nr)
>  		return;
>  	/*
> +	 * If LBR callstack feature is enabled and the stack was saved when
> +	 * the task was scheduled out, restore the stack. Otherwise flush
> +	 * the LBR stack.
> +	 */
> +	task_ctx = ctx ? ctx->task_ctx_data : NULL;
> +	if (task_ctx) {
> +		if (sched_in) {
> +			__intel_pmu_lbr_restore(task_ctx);
> +			cpuc->lbr_context = ctx;
> +		} else {
> +			__intel_pmu_lbr_save(task_ctx);
> +		}
> +	}
> +
> +	/*
>  	 * When sampling the branck stack in system-wide, it may be
>  	 * necessary to flush the stack on context switch. This happens
>  	 * when the branch stack does not tag its entries with the pid

Why would we still need to reset if we did a save/restore on the branch
stack?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH V5 10/16] perf, core: simplify need branch stack check
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (8 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-24 13:55   ` Peter Zijlstra
  2014-09-10 14:09 ` [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain() kan.liang
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

event->attr.branch_sample_type is non-zero whether the branch stack
is enabled explicitly or implicitly, so we can use it to replace
intel_pmu_needs_lbr_smpl(). This avoids duplicating the code
that implicitly enables the LBR.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
 include/linux/perf_event.h             |  5 +++++
 kernel/events/core.c                   |  3 +++
 3 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 1242314..49e7d14 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1029,20 +1029,6 @@ static __initconst const u64 slm_hw_cache_event_ids
  },
 };
 
-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
-	/* user explicitly requested branch sampling */
-	if (has_branch_stack(event))
-		return true;
-
-	/* implicit branch sampling to correct PEBS skid */
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
-	    x86_pmu.intel_cap.pebs_format < 2)
-		return true;
-
-	return false;
-}
-
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1207,7 +1193,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	 * must disable before any actual event
 	 * because any event may be combined with LBR
 	 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
 		intel_pmu_lbr_disable(event);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1268,7 +1254,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	 * must enabled before any actual event
 	 * because any event may be combined with LBR
 	 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
 		intel_pmu_lbr_enable(event);
 
 	if (event->attr.exclude_host)
@@ -1747,7 +1733,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (event->attr.precise_ip && x86_pmu.pebs_aliases)
 		x86_pmu.pebs_aliases(event);
 
-	if (intel_pmu_needs_lbr_smpl(event)) {
+	if (needs_branch_stack(event)) {
 		ret = intel_pmu_setup_lbr_filter(event);
 		if (ret)
 			return ret;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5f857da..a190e91 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -792,6 +792,11 @@ static inline bool has_branch_stack(struct perf_event *event)
 	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline bool needs_branch_stack(struct perf_event *event)
+{
+	return event->attr.branch_sample_type != 0;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5f49df2..b37f2f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7003,6 +7003,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
 		goto err_ns;
 
+	if (!has_branch_stack(event))
+		event->attr.branch_sample_type = 0;
+
 	pmu = perf_init_event(event);
 	if (!pmu)
 		goto err_ns;
-- 
1.8.3.2
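
A rough sketch of the flow the changelog relies on, written as a
user-space model. The attr_sketch struct, the SK_* bit values and
hw_config_sketch() are hypothetical. In particular, it is an assumption
here (implied by the changelog, not shown in this patch) that the arch
hw_config path is what sets branch_sample_type in the implicit PEBS
case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only. */
#define SK_SAMPLE_BRANCH_STACK	(1ull << 0)
#define SK_BRANCH_ANY		(1ull << 3)

struct attr_sketch {
	uint64_t sample_type;
	uint64_t branch_sample_type;
	unsigned precise_ip;
};

static bool has_branch_stack(const struct attr_sketch *a)
{
	return a->sample_type & SK_SAMPLE_BRANCH_STACK;
}

static bool needs_branch_stack(const struct attr_sketch *a)
{
	return a->branch_sample_type != 0;
}

/* Mirrors the perf_event_alloc() hunk: drop stale user-supplied bits. */
static void alloc_sanitize_sketch(struct attr_sketch *a)
{
	if (!has_branch_stack(a))
		a->branch_sample_type = 0;
}

/*
 * ASSUMPTION: models the implicit case, where the arch code turns on
 * branch sampling itself to correct PEBS skid.
 */
static void hw_config_sketch(struct attr_sketch *a, bool pebs_needs_lbr)
{
	if (pebs_needs_lbr && a->precise_ip > 1 && !has_branch_stack(a))
		a->branch_sample_type = SK_BRANCH_ANY;
}
```

Under these assumptions, needs_branch_stack() is true for both the
explicit and the implicit case, and false when user space left garbage
in branch_sample_type without requesting PERF_SAMPLE_BRANCH_STACK.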


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 10/16] perf, core: simplify need branch stack check
  2014-09-10 14:09 ` [PATCH V5 10/16] perf, core: simplify need branch stack check kan.liang
@ 2014-09-24 13:55   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 13:55 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:09:07AM -0400, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> event->attr.branch_sample_type is non-zero no matter branch stack
> is enabled explicitly or is enabled implicitly. we can use it to
> replace intel_pmu_needs_lbr_smpl(). This avoids duplicating code
> that implicitly enables the LBR.
> 

Please extend the changelog such that it's obvious this is correct. This
is the second time I've had to look how this works out for PEBS.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain()
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (9 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 10/16] perf, core: simplify need branch stack check kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-24 14:15   ` Peter Zijlstra
  2014-09-10 14:09 ` [PATCH V5 12/16] perf, x86: use LBR call stack to get user callchain kan.liang
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers.
The LBR call stack facility can help perf get the call chains of programs
without frame pointers.

This patch modifies perf_callchain() and the various architectures'
perf_callchain_user() to accept perf sample data. A later patch will add
code that uses the sample data to get call chains.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/arm/kernel/perf_event.c     | 4 ++--
 arch/powerpc/perf/callchain.c    | 4 ++--
 arch/sparc/kernel/perf_event.c   | 4 ++--
 arch/x86/kernel/cpu/perf_event.c | 4 ++--
 include/linux/perf_event.h       | 4 +++-
 kernel/events/callchain.c        | 8 +++++---
 kernel/events/core.c             | 2 +-
 kernel/events/internal.h         | 3 ++-
 8 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 266cba4..9532bd0 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -584,8 +584,8 @@ user_backtrace(struct frame_tail __user *tail,
 	return buftail.fp - 1;
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	struct frame_tail __user *tail;
 
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 74d1e78..b379ebc 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -482,8 +482,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
 	}
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	if (current_is_64bit())
 		perf_callchain_user_64(entry, regs);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index d35c490..9078fe2 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1791,8 +1791,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
 	} while (entry->nr < PERF_MAX_STACK_DEPTH);
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	perf_callchain_store(entry, regs->tpc);
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index a18fd78..71e293a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2049,8 +2049,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 }
 #endif
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	struct stack_frame frame;
 	const void __user *fp;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a190e91..8db3520 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -735,7 +735,9 @@ extern void perf_event_fork(struct task_struct *tsk);
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
-extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs);
+extern void perf_callchain_user(struct perf_callchain_entry *entry,
+				struct pt_regs *regs,
+				struct perf_sample_data *data);
 extern void perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs);
 
 static inline void perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index f2a88de..4a18e1e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -30,7 +30,8 @@ __weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
 }
 
 __weak void perf_callchain_user(struct perf_callchain_entry *entry,
-				struct pt_regs *regs)
+				struct pt_regs *regs,
+				struct perf_sample_data *data)
 {
 }
 
@@ -157,7 +158,8 @@ put_callchain_entry(int rctx)
 }
 
 struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+	       struct perf_sample_data *data)
 {
 	int rctx;
 	struct perf_callchain_entry *entry;
@@ -198,7 +200,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 				goto exit_put;
 
 			perf_callchain_store(entry, PERF_CONTEXT_USER);
-			perf_callchain_user(entry, regs);
+			perf_callchain_user(entry, regs, data);
 		}
 	}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b37f2f3..eed0424 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4885,7 +4885,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 
-		data->callchain = perf_callchain(event, regs);
+		data->callchain = perf_callchain(event, regs, data);
 
 		if (data->callchain)
 			size += data->callchain->nr;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..cd18b64 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -147,7 +147,8 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
 
 /* Callchain handling */
 extern struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs);
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+	       struct perf_sample_data *data);
 extern int get_callchain_buffers(void);
 extern void put_callchain_buffers(void);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain()
  2014-09-10 14:09 ` [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain() kan.liang
@ 2014-09-24 14:15   ` Peter Zijlstra
  2014-10-07  3:00     ` Liang, Kan
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 14:15 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:09:08AM -0400, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function
> call will be collected as normal, but as return instructions are
> executed the last captured branch record is popped from the on-chip LBR
> registers.
> The LBR call stack facility can help perf to get call chains of progam
> without frame pointer.
> 
> This patch modifies various architectures' perf_callchain() to accept
> perf sample data. Later patch will add code that use the sample data to
> get call chains.

So I don't like this. Why not use the regular PERF_SAMPLE_BRANCH_STACK
output to generate the stuff from? We already have two different means,
with different transport, for callchains anyhow, so a third really won't
matter.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain()
  2014-09-24 14:15   ` Peter Zijlstra
@ 2014-10-07  3:00     ` Liang, Kan
  2014-10-07 15:24       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Liang, Kan @ 2014-10-07  3:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Wednesday, September 24, 2014 10:15 AM
> To: Liang, Kan
> Cc: eranian@google.com; linux-kernel@vger.kernel.org; mingo@redhat.com;
> paulus@samba.org; acme@kernel.org; ak@linux.intel.com; Yan, Zheng
> Subject: Re: [PATCH V5 11/16] perf, core: Pass perf_sample_data to
> perf_callchain()
> 
> On Wed, Sep 10, 2014 at 10:09:08AM -0400, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > Haswell has a new feature that utilizes the existing Last Branch
> > Record facility to record call chains. When the feature is enabled,
> > function call will be collected as normal, but as return instructions
> > are executed the last captured branch record is popped from the
> > on-chip LBR registers.
> > The LBR call stack facility can help perf to get call chains of progam
> > without frame pointer.
> >
> > This patch modifies various architectures' perf_callchain() to accept
> > perf sample data. Later patch will add code that use the sample data
> > to get call chains.
> 
> So I don't like this. Why not use the regular PERF_SAMPLE_BRANCH_STACK
> output to generate the stuff from? We already have two different means,
> with different transport, for callchains anyhow, so a third really won't matter.

I'm not sure what you mean by using the regular PERF_SAMPLE_BRANCH_STACK output to generate the stuff from.
But we don't need to modify the various architectures' perf_callchain_user(), if that's your concern.
An alternative is to generate the callchain output at a higher level, such as perf_callchain().
If there is no frame pointer, entry->nr will be set to MAX+1, so perf_callchain() knows that it needs to try the LBR callstack if possible.
perf_callchain() then resets entry->nr to the old value and calls perf_callchain_lbr_callstack() to check and fill the callchain struct if possible.
The patch is below.

What do you think?
 

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index f2a88de..677f8af 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -156,11 +156,28 @@ put_callchain_entry(int rctx)
        put_recursion_context(__get_cpu_var(callchain_recursion), rctx);
 }

+static inline void
+perf_callchain_lbr_callstack(struct perf_callchain_entry *entry,
+                            struct perf_sample_data *data)
+{
+       struct perf_branch_stack *br_stack = data->br_stack;
+
+       if (br_stack && br_stack->user_callstack) {
+               int i = 0;
+               while (i < br_stack->nr && entry->nr < PERF_MAX_STACK_DEPTH) {
+                       perf_callchain_store(entry, br_stack->entries[i].from);
+                       i++;
+               }
+       }
+}
+
 struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+               struct perf_sample_data *data)
 {
        int rctx;
        struct perf_callchain_entry *entry;
+       __u64 old_nr;

        int kernel = !event->attr.exclude_callchain_kernel;
        int user   = !event->attr.exclude_callchain_user;
@@ -198,7 +215,13 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
                                goto exit_put;

                        perf_callchain_store(entry, PERF_CONTEXT_USER);
+                       old_nr = entry->nr;
                        perf_callchain_user(entry, regs);
+                       if (entry->nr == (PERF_MAX_STACK_DEPTH + 1)) {
+                               entry->nr = old_nr;
+                               perf_callchain_lbr_callstack(entry, data);
+                       } else
+                               entry->nr = old_nr;
                }
        }
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 185fa03..0439c8f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2061,6 +2061,15 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
                perf_callchain_store(entry, cs_base + frame.return_address);
                fp = compat_ptr(ss_base + frame.next_frame);
        }
+
+       /*
+        * try LBR callstack if there is no frame pointer
+        * Set entry->nr to MAX + 1 to notify the perf_callchain.
+        * perf_callchain finally try LBR callstack and reset entry->nr
+        */
+       if (fp == compat_ptr(regs->bp))
+               entry->nr = PERF_MAX_STACK_DEPTH + 1;
+
        return 1;
 }
 #else
@@ -2113,6 +2122,14 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
                perf_callchain_store(entry, frame.return_address);
                fp = frame.next_frame;
        }
+
+       /*
+        * try LBR callstack if there is no frame pointer
+        * Set entry->nr to MAX + 1 to notify the perf_callchain.
+        * perf_callchain finally try LBR callstack and reset entry->nr
+        */
+       if (fp == (void __user *)regs->bp)
+               entry->nr = PERF_MAX_STACK_DEPTH + 1;
 }

 /*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 50bb51d..2808267 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1533,7 +1533,7 @@ again:

                perf_sample_data_init(&data, 0, event->hw.last_period);

-               if (has_branch_stack(event))
+               if (needs_branch_stack(event))
                        data.br_stack = &cpuc->lbr_stack;

                if (perf_event_overflow(event, &data, regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8e6c88f..6c995b7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -758,6 +758,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
        int i, j, type;
        bool compress = false;

+       cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
        /* if sampling all branches, then nothing to filter */
        if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
                return;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 285776a..84840cc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -75,6 +75,7 @@ struct perf_raw_record {
  * recent branch.
  */
 struct perf_branch_stack {
+       bool                            user_callstack;
        __u64                           nr;
        struct perf_branch_entry        entries[0];
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4dd5700..825b487 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4915,7 +4915,7 @@ void perf_prepare_sample(struct perf_event_header *header,
        if (sample_type & PERF_SAMPLE_CALLCHAIN) {
                int size = 1;

-               data->callchain = perf_callchain(event, regs);
+               data->callchain = perf_callchain(event, regs, data);

                if (data->callchain)
                        size += data->callchain->nr;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..3a0239e 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -147,7 +147,8 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

 /* Callchain handling */
 extern struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs);
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+               struct perf_sample_data *data);
 extern int get_callchain_buffers(void);
 extern void put_callchain_buffers(void);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain()
  2014-10-07  3:00     ` Liang, Kan
@ 2014-10-07 15:24       ` Peter Zijlstra
  2014-10-07 15:50         ` Liang, Kan
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2014-10-07 15:24 UTC (permalink / raw)
  To: Liang, Kan
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng

I think you're going to have to stop using outlook or whatnot, this is
horrible.

On Tue, Oct 07, 2014 at 03:00:00AM +0000, Liang, Kan wrote:
> > -----Original Message-----
> > From: Peter Zijlstra [mailto:peterz@infradead.org]

> > So I don't like this. Why not use the regular PERF_SAMPLE_BRANCH_STACK
> > output to generate the stuff from? We already have two different means,
> > with different transport, for callchains anyhow, so a third really won't matter.
> 
> I'm not sure what you mean by using the regular
> PERF_SAMPLE_BRANCH_STACK output to generate the stuff from.  But we
> don't need to modify various architectures' perf_callchain_user, if
> that's your concern.  An alternative way is to generate the callchain
> output in a higher level, like perf_callchain.  If there is no frame
> pointer, the entry->nr will be set to MAX+1. So  the perf_callchain
> knows that we need to try LBR callstack if possible.  In
> perf_callchain, it resets entry->nr to old value, and call
> perf_callchain_lbr_callstack to check and fill the callchain struct if
> possible.  The patch is as below.

Please instruct your MUA to wrap at 78 chars.

What I meant was: why can't we use the regular PERF_SAMPLE_BRANCH_STACK
output to generate user traces from?

PERF_SAMPLE_BRANCH_STACK is the 'normal' LBR output format. Clobbering
the callstack output is bad.

> What do you think?

I think it still sucks.. you're still clobbering potentially more useful
data.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain()
  2014-10-07 15:24       ` Peter Zijlstra
@ 2014-10-07 15:50         ` Liang, Kan
  2014-10-07 16:29           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Liang, Kan @ 2014-10-07 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng



> > > So I don't like this. Why not use the regular
> > > PERF_SAMPLE_BRANCH_STACK output to generate the stuff from? We
> > > already have two different means, with different transport, for
> > > callchains anyhow, so a third really won't matter.
> >
> > I'm not sure what you mean by using the regular
> > PERF_SAMPLE_BRANCH_STACK output to generate the stuff from.  But we
> > don't need to modify various architectures' perf_callchain_user, if
> > that's your concern.  An alternative way is to generate the callchain
> > output in a higher level, like perf_callchain.  If there is no frame
> > pointer, the entry->nr will be set to MAX+1. So  the perf_callchain
> > knows that we need to try LBR callstack if possible.  In
> > perf_callchain, it resets entry->nr to old value, and call
> > perf_callchain_lbr_callstack to check and fill the callchain struct if
> > possible.  The patch is as below.
> 
> Please instruct your MUA to wrap at 78 chars.
> 
> What I meant was: why can't we use the regular
> PERF_SAMPLE_BRANCH_STACK output to generate user traces from?
> 
> PERF_SAMPLE_BRANCH_STACK is the 'normal' LBR output format.

The data is originally from br_stack which is LBR format.
What the patch did is to convert it to CALLCHAIN output format in kernel.
So you'd like the kernel to pass the LBR output format data to the user space
perf tool, and let the perf tool generate the callchain information?


Kan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain()
  2014-10-07 15:50         ` Liang, Kan
@ 2014-10-07 16:29           ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2014-10-07 16:29 UTC (permalink / raw)
  To: Liang, Kan
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng

On Tue, Oct 07, 2014 at 03:50:45PM +0000, Liang, Kan wrote:
> The data is originally from br_stack which is LBR format.

Right, it's read from the LBR, therefore this must be.

> What the patch did is to convert it to CALLCHAIN output format in kernel.

I saw that.

> So you'd like to let the kernel pass the LBR output format data to user space
> perf tool,  and let perf tool to generate the callchain information?

Right, then we can also compare the two when they're both available.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH V5 12/16] perf, x86: use LBR call stack to get user callchain
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (10 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain() kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 13/16] perf, x86: re-organize code that implicitly enables LBR/PEBS kan.liang
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility can help perf get call chains of
programs without frame pointers.

This patch makes x86's perf_callchain_user() fall back to LBR call
stack data when there is no frame pointer in the user program. The
'from' address of each branch entry is used as the 'return' address of
the function call.
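
The fallback described above can be sketched in plain userspace C. This
is only an illustration of the idea, not the kernel code: the struct
names, callchain_from_lbr() and MAX_STACK_DEPTH (a stand-in for
PERF_MAX_STACK_DEPTH) are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STACK_DEPTH 127	/* stand-in value, assumed for this sketch */

/* Hypothetical mirror of one LBR record: a branch from 'from' to 'to'. */
struct branch_entry {
	uint64_t from;
	uint64_t to;
};

/* Hypothetical callchain buffer, filled front to back. */
struct callchain {
	size_t nr;
	uint64_t ip[MAX_STACK_DEPTH];
};

/*
 * Append the 'from' address of each branch record to the callchain,
 * in the order the records are given, until the chain is full.
 * Returns the number of entries appended.
 */
size_t callchain_from_lbr(struct callchain *chain,
			  const struct branch_entry *br, size_t nr_br)
{
	size_t added = 0;

	assert(chain != NULL && chain->nr <= MAX_STACK_DEPTH);

	for (size_t i = 0; i < nr_br && chain->nr < MAX_STACK_DEPTH; i++) {
		chain->ip[chain->nr++] = br[i].from;
		added++;
	}
	return added;
}
```

In call stack mode each surviving record is a call that has not yet
returned, so its 'from' address sits at the call site and serves as an
approximation of the return address.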

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           | 34 ++++++++++++++++++++++++++----
 arch/x86/kernel/cpu/perf_event_intel.c     |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  2 ++
 include/linux/perf_event.h                 |  1 +
 4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 71e293a..0a71f04 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2005,12 +2005,29 @@ static unsigned long get_segment_base(unsigned int segment)
 	return get_desc_base(desc + idx);
 }
 
+static inline void
+perf_callchain_lbr_callstack(struct perf_callchain_entry *entry,
+			     struct perf_sample_data *data)
+{
+	struct perf_branch_stack *br_stack = data->br_stack;
+
+	if (br_stack && br_stack->user_callstack) {
+		int i = 0;
+
+		while (i < br_stack->nr && entry->nr < PERF_MAX_STACK_DEPTH) {
+			perf_callchain_store(entry, br_stack->entries[i].from);
+			i++;
+		}
+	}
+}
+
 #ifdef CONFIG_COMPAT
 
 #include <asm/compat.h>
 
 static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+		      struct pt_regs *regs, struct perf_sample_data *data)
 {
 	/* 32-bit process in 64-bit kernel. */
 	unsigned long ss_base, cs_base;
@@ -2039,11 +2056,16 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
+
+	if (fp == compat_ptr(regs->bp))
+		perf_callchain_lbr_callstack(entry, data);
+
 	return 1;
 }
 #else
 static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+		      struct pt_regs *regs, struct perf_sample_data *data)
 {
     return 0;
 }
@@ -2073,12 +2095,12 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
 	if (!current->mm)
 		return;
 
-	if (perf_callchain_user32(regs, entry))
+	if (perf_callchain_user32(entry, regs, data))
 		return;
 
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
-		frame.next_frame	     = NULL;
+		frame.next_frame = NULL;
 		frame.return_address = 0;
 
 		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
@@ -2091,6 +2113,10 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
 		perf_callchain_store(entry, frame.return_address);
 		fp = frame.next_frame;
 	}
+
+	/* try LBR callstack if there is no frame pointer */
+	if (fp == (void __user *)regs->bp)
+		perf_callchain_lbr_callstack(entry, data);
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 49e7d14..93e8038 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1404,7 +1404,7 @@ again:
 
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
-		if (has_branch_stack(event))
+		if (needs_branch_stack(event))
 			data.br_stack = &cpuc->lbr_stack;
 
 		if (perf_event_overflow(event, &data, regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 6aabbb4..5afb21b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -743,6 +743,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 	int i, j, type;
 	bool compress = false;
 
+	cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
 	/* if sampling all branches, then nothing to filter */
 	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
 		return;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8db3520..4d38d5e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -75,6 +75,7 @@ struct perf_raw_record {
  * recent branch.
  */
 struct perf_branch_stack {
+	bool				user_callstack;
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
 };
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V5 13/16] perf, x86: re-organize code that implicitly enables LBR/PEBS
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (11 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 12/16] perf, x86: use LBR call stack to get user callchain kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain kan.liang
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

Make a later patch more readable; no logic change.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 59 ++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 0a71f04..418e953 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -393,36 +393,35 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
-		/*
-		 * check that PEBS LBR correction does not conflict with
-		 * whatever the user is asking with attr->branch_sample_type
-		 */
-		if (event->attr.precise_ip > 1 &&
-		    x86_pmu.intel_cap.pebs_format < 2) {
-			u64 *br_type = &event->attr.branch_sample_type;
-
-			if (has_branch_stack(event)) {
-				if (!precise_br_compat(event))
-					return -EOPNOTSUPP;
-
-				/* branch_sample_type is compatible */
-
-			} else {
-				/*
-				 * user did not specify  branch_sample_type
-				 *
-				 * For PEBS fixups, we capture all
-				 * the branches at the priv level of the
-				 * event.
-				 */
-				*br_type = PERF_SAMPLE_BRANCH_ANY;
-
-				if (!event->attr.exclude_user)
-					*br_type |= PERF_SAMPLE_BRANCH_USER;
-
-				if (!event->attr.exclude_kernel)
-					*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
-			}
+	}
+	/*
+	 * check that PEBS LBR correction does not conflict with
+	 * whatever the user is asking with attr->branch_sample_type
+	 */
+	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+		u64 *br_type = &event->attr.branch_sample_type;
+
+		if (has_branch_stack(event)) {
+			if (!precise_br_compat(event))
+				return -EOPNOTSUPP;
+
+			/* branch_sample_type is compatible */
+
+		} else {
+			/*
+			 * user did not specify  branch_sample_type
+			 *
+			 * For PEBS fixups, we capture all
+			 * the branches at the priv level of the
+			 * event.
+			 */
+			*br_type = PERF_SAMPLE_BRANCH_ANY;
+
+			if (!event->attr.exclude_user)
+				*br_type |= PERF_SAMPLE_BRANCH_USER;
+
+			if (!event->attr.exclude_kernel)
+				*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
 		}
 	}
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (12 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 13/16] perf, x86: re-organize code that implicitly enables LBR/PEBS kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-24 14:21   ` Peter Zijlstra
  2014-09-10 14:09 ` [PATCH V5 15/16] perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode kan.liang
  2014-09-10 14:09 ` [PATCH V5 16/16] perf, x86: Discard zero length call entries in LBR call stack kan.liang
  15 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

If a task specific event wants user space callchain but does not want
branch stack sampling, enable the LBR call stack facility implicitly.
The LBR call stack facility can help perf to get user space callchain
when there is no frame pointer.

Note: this feature only affects how the user callchain is obtained. The
kernel callchain is always obtained via frame pointers.
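
The condition for the implicit enable can be sketched as a standalone
predicate. This is a simplified model: struct event_request and
implicit_lbr_callstack() are hypothetical names, and the sketch leaves
out the PEBS/precise_ip branch that takes priority in
x86_pmu_hw_config().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what an event asked for. */
struct event_request {
	bool has_lbr_callstack;		/* hardware supports LBR call stack mode */
	bool wants_callchain;		/* callchain sampling was requested */
	bool wants_branch_stack;	/* user asked for branch stack sampling */
	bool exclude_user;		/* user space is excluded from the event */
	bool task_event;		/* event is attached to a task */
};

/*
 * Return true when the LBR call stack should be enabled implicitly:
 * a task event that wants a user space callchain, on capable hardware,
 * where the user did not claim the LBR for branch stack sampling.
 */
bool implicit_lbr_callstack(const struct event_request *req)
{
	assert(req != NULL);

	return req->has_lbr_callstack &&
	       req->wants_callchain &&
	       !req->wants_branch_stack &&
	       !req->exclude_user &&
	       req->task_event;
}
```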

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 418e953..186f909 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -423,10 +423,23 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (!event->attr.exclude_kernel)
 				*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
 		}
-	}
+	} else if (x86_pmu_has_lbr_callstack() &&
+		   (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+		   !has_branch_stack(event) &&
+		   !event->attr.exclude_user &&
+		   (event->attach_state & PERF_ATTACH_TASK)) {
+		/*
+		 * user did not specify branch_sample_type,
+		 * try using the LBR call stack facility to
+		 * record call chains of user program.
+		 */
+		event->attr.branch_sample_type =
+			PERF_SAMPLE_BRANCH_USER |
+			PERF_SAMPLE_BRANCH_CALL_STACK;
 
-	if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK)
+		/* needs PMU specific data to save LBR stack */
 		event->attach_state |= PERF_ATTACH_TASK_DATA;
+	}
 
 	/*
 	 * Generate PMC IRQs:
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain
  2014-09-10 14:09 ` [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain kan.liang
@ 2014-09-24 14:21   ` Peter Zijlstra
  2014-10-07  3:00     ` Liang, Kan
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2014-09-24 14:21 UTC (permalink / raw)
  To: kan.liang; +Cc: eranian, linux-kernel, mingo, paulus, acme, ak, Yan, Zheng

On Wed, Sep 10, 2014 at 10:09:11AM -0400, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> If a task specific event wants user space callchain but does not want
> branch stack sampling, enable the LBR call stack facility implicitly.
> The LBR call stack facility can help perf to get user space callchain
> in case of there is no frame pointer.
> 
> Note: this feature only affects how to get user callchain. The kernel
> callchain is always got by frame pointers.

Yeah, don't like this either. Suppose you have sane userspace (with
framepointers enabled) then you're now losing the better option.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain
  2014-09-24 14:21   ` Peter Zijlstra
@ 2014-10-07  3:00     ` Liang, Kan
  2014-10-07 15:25       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Liang, Kan @ 2014-10-07  3:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng



>
> On Wed, Sep 10, 2014 at 10:09:11AM -0400, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > If a task specific event wants user space callchain but does not want
> > branch stack sampling, enable the LBR call stack facility implicitly.
> > The LBR call stack facility can help perf to get user space callchain
> > in case of there is no frame pointer.
> >
> > Note: this feature only affects how to get user callchain. The kernel
> > callchain is always got by frame pointers.
>
> Yeah, don't like this either. Suppose you have sane userspace (with
> framepointers enabled) then you're now loosing the better option.

FP is the first option. This patch tries to enable the LBR call stack
facility implicitly. Only when FP is disabled or fails do we try to use
the LBR call stack.
Please refer to the previous patch https://lkml.org/lkml/2014/9/10/376

Kan


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain
  2014-10-07  3:00     ` Liang, Kan
@ 2014-10-07 15:25       ` Peter Zijlstra
  2014-10-07 16:04         ` Liang, Kan
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2014-10-07 15:25 UTC (permalink / raw)
  To: Liang, Kan
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng

On Tue, Oct 07, 2014 at 03:00:43AM +0000, Liang, Kan wrote:
> 
> 
> >
> > On Wed, Sep 10, 2014 at 10:09:11AM -0400, kan.liang@intel.com wrote:
> > > From: Kan Liang <kan.liang@intel.com>
> > >
> > > If a task specific event wants user space callchain but does not want
> > > branch stack sampling, enable the LBR call stack facility implicitly.
> > > The LBR call stack facility can help perf to get user space callchain
> > > in case of there is no frame pointer.
> > >
> > > Note: this feature only affects how to get user callchain. The kernel
> > > callchain is always got by frame pointers.
> >
> > Yeah, don't like this either. Suppose you have sane userspace (with
> > framepointers enabled) then you're now loosing the better option.
> 
> FP is the first option. This patch tries to enable LBR call stack facility implicitly.
> Only when FP disabled or failed, we try to use LBR call stack.
> Please refer to the previous patch https://lkml.org/lkml/2014/9/10/376

Still makes for an entirely unpredictable situation. That way you never
quite know where your data came from.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain
  2014-10-07 15:25       ` Peter Zijlstra
@ 2014-10-07 16:04         ` Liang, Kan
  0 siblings, 0 replies; 36+ messages in thread
From: Liang, Kan @ 2014-10-07 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian@google.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, paulus@samba.org, acme@kernel.org,
	ak@linux.intel.com, Yan, Zheng



> 
> On Tue, Oct 07, 2014 at 03:00:43AM +0000, Liang, Kan wrote:
> >
> >
> > >
> > > On Wed, Sep 10, 2014 at 10:09:11AM -0400, kan.liang@intel.com wrote:
> > > > From: Kan Liang <kan.liang@intel.com>
> > > >
> > > > If a task specific event wants user space callchain but does not
> > > > want branch stack sampling, enable the LBR call stack facility implicitly.
> > > > The LBR call stack facility can help perf to get user space
> > > > callchain in case of there is no frame pointer.
> > > >
> > > > Note: this feature only affects how to get user callchain. The
> > > > kernel callchain is always got by frame pointers.
> > >
> > > Yeah, don't like this either. Suppose you have sane userspace (with
> > > framepointers enabled) then you're now loosing the better option.
> >
> > FP is the first option. This patch tries to enable LBR call stack facility implicitly.
> > Only when FP disabled or failed, we try to use LBR call stack.
> > Please refer to the previous patch https://lkml.org/lkml/2014/9/10/376
> 
> Still makes for an entirely unpredictable situation. That way you never quite
> know where your data came from.

But the problem is that we don't know in advance whether FP works. We can
only check at runtime, so we have to keep the LBR running.
The process is fixed. The FP is checked first.
Only if FP fails is the LBR data used.
If the user wants to know where the data comes from, we may implement
a flag for the user space perf tool.

Kan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH V5 15/16] perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (13 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain kan.liang
@ 2014-09-10 14:09 ` kan.liang
  2014-09-10 14:09 ` [PATCH V5 16/16] perf, x86: Discard zero length call entries in LBR call stack kan.liang
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

LBR callstack is designed for PEBS. It does not work well with
FREEZE_LBRS_ON_PMI for non-PEBS events. If FREEZE_LBRS_ON_PMI is set for
a non-PEBS event, PMIs near call/return instructions may cause
superfluous increase/decrease of LBR_TOS.

This patch modifies __intel_pmu_lbr_enable() to not enable
FREEZE_LBRS_ON_PMI when LBR operates in callstack mode. We currently
don't use LBR callstack to capture kernel space callchain, so disabling
FREEZE_LBRS_ON_PMI should not be a problem.
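
The resulting DEBUGCTL computation can be sketched as a pure function
that is easy to reason about outside the kernel. The macro names and
bit positions below are assumptions made for this sketch, and
lbr_enable_debugctl() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Bit positions assumed for illustration only. */
#define DEBUGCTL_LBR			(1ULL << 0)
#define DEBUGCTL_FREEZE_LBRS_ON_PMI	(1ULL << 11)
#define LBR_SEL_CALL_STACK		(1ULL << 9)

/*
 * Given the current DEBUGCTL value and the LBR_SELECT configuration,
 * return the DEBUGCTL value to write when enabling the LBR.
 * The freeze-on-PMI bit is only added when call stack mode is off.
 */
uint64_t lbr_enable_debugctl(uint64_t debugctl, uint64_t lbr_select)
{
	debugctl |= DEBUGCTL_LBR;

	if (!(lbr_select & LBR_SEL_CALL_STACK))
		debugctl |= DEBUGCTL_FREEZE_LBRS_ON_PMI;

	return debugctl;
}
```

As in the patch, the sketch never clears a freeze bit that is already
set in the value it is given. It only avoids setting one.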

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 5afb21b..fd8fdfa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -131,14 +131,23 @@ static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
 
 static void __intel_pmu_lbr_enable(void)
 {
-	u64 debugctl;
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	u64 debugctl, lbr_select = 0;
 
-	if (cpuc->lbr_sel)
-		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
+	if (cpuc->lbr_sel) {
+		lbr_select = cpuc->lbr_sel->config;
+		wrmsrl(MSR_LBR_SELECT, lbr_select);
+	}
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
-	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+	debugctl |= DEBUGCTLMSR_LBR;
+	/*
+	 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+	 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+	 * may cause superfluous increase/decrease of LBR_TOS.
+	 */
+	if (!(lbr_select & LBR_CALL_STACK))
+		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
 	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V5 16/16] perf, x86: Discard zero length call entries in LBR call stack
  2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
                   ` (14 preceding siblings ...)
  2014-09-10 14:09 ` [PATCH V5 15/16] perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode kan.liang
@ 2014-09-10 14:09 ` kan.liang
  15 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:09 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

"Zero length call" uses the attribute of the call instruction to push
the immediate instruction pointer on to the stack and then pops off
that address into a register. This is accomplished without any matching
return instruction. It confuses the hardware and makes the recorded call
stack incorrect.

We can partially resolve this issue by decoding call instructions and
discarding any zero length call entry in the LBR stack.
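
The decode step can be sketched in userspace C as a check on the raw
instruction bytes. This is a minimal sketch that assumes an unprefixed
near relative call (opcode 0xe8 followed by a 32-bit displacement);
is_zero_length_call() is a hypothetical helper and does not use the
kernel instruction decoder.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true if the bytes at 'insn' encode "call rel32" with a zero
 * displacement, i.e. a call whose target is the next instruction.
 * Assumes no instruction prefixes and a 32-bit displacement.
 */
bool is_zero_length_call(const uint8_t *insn, size_t len)
{
	int32_t rel;

	if (len < 5 || insn[0] != 0xe8)
		return false;

	/* memcpy avoids unaligned access; byte order is irrelevant for 0 */
	memcpy(&rel, insn + 1, sizeof(rel));
	return rel == 0;
}
```

Such a call pushes the address of the following instruction and jumps to
it, which is why it is typically paired with a pop rather than a return.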

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index fd8fdfa..0bd4f5c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
 	X86_BR_ABORT		= 1 << 12,/* transaction abort */
 	X86_BR_IN_TX		= 1 << 13,/* in transaction */
 	X86_BR_NO_TX		= 1 << 14,/* not in transaction */
-	X86_BR_CALL_STACK	= 1 << 15,/* call stack */
+	X86_BR_ZERO_CALL	= 1 << 15,/* zero length call */
+	X86_BR_CALL_STACK	= 1 << 16,/* call stack */
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
 	 X86_BR_JMP	 |\
 	 X86_BR_IRQ	 |\
 	 X86_BR_ABORT	 |\
-	 X86_BR_IND_CALL)
+	 X86_BR_IND_CALL |\
+	 X86_BR_ZERO_CALL)
 
 #define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
 
 #define X86_BR_ANY_CALL		 \
 	(X86_BR_CALL		|\
 	 X86_BR_IND_CALL	|\
+	 X86_BR_ZERO_CALL	|\
 	 X86_BR_SYSCALL		|\
 	 X86_BR_IRQ		|\
 	 X86_BR_INT)
@@ -686,6 +689,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
 		ret = X86_BR_INT;
 		break;
 	case 0xe8: /* call near rel */
+		insn_get_immediate(&insn);
+		if (insn.immediate1.value == 0) {
+			/* zero length call */
+			ret = X86_BR_ZERO_CALL;
+			break;
+		}
 	case 0x9a: /* call far absolute */
 		ret = X86_BR_CALL;
 		break;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v5 00/16] perf, x86: Haswell LBR call stack support
@ 2014-07-07  6:28 Yan, Zheng
  2014-07-07  6:28 ` [PATCH v5 09/16] perf, x86: Save/resotre LBR stack during context switch Yan, Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: Yan, Zheng @ 2014-07-07  6:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs frame pointer also has a high
cost. Dwarf2 unwinding also does not always work and is extremely slow
(up to 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative to get
callgraph. It has some limitations too, but should work in most cases
and is significantly faster than dwarf. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.

When profiling bc(1) on Fedora 19:
 echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd

If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start

    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start

     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]

     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]

     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start

If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide

    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult

     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8

     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub

     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back

The LBR call stack has the following known limitations:
 - Zero length calls are not filtered out by hardware
 - Exception handling such as setjmp/longjmp causes calls/returns not
   to match
 - Pushing a different return address onto the stack causes
   calls/returns not to match
 - If the callstack is deeper than the LBR, only the last entries are
   captured

Changes since v1
 - split change into more patches
 - introduce context switch callback and use it to flush LBR
 - use the context switch callback to save/restore LBR
 - dynamic allocate memory area for storing LBR stack, always switch the
   memory area during context switch
 - disable this feature by default
 - more description in change logs

Changes since v2
 - don't use xchg to switch PMU specific data
 - remove nr_branch_stack from struct perf_event_context
 - simplify the save/restore LBR stack logic
 - remove unnecessary 'has_branch_stack -> needs_branch_stack'
   conversion
 - more description in change logs

Changes since v3
 - remove sysfs attribute file that disable this feature

Changes since v4
 - re-organize code that saves/restores the LBR stack
 - allocate pmu specific data when it's needed
 - update code comments

These patches are also available at:
 https://github.com/ukernel/linux.git perf-lbr-callstack

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v5 09/16] perf, x86: Save/resotre LBR stack during context switch
  2014-07-07  6:28 [PATCH v5 00/16] perf, x86: Haswell LBR call stack support Yan, Zheng
@ 2014-07-07  6:28 ` Yan, Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Yan, Zheng @ 2014-07-07  6:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is saving/restoring
the LBR stack to/from task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses LBR call stack, the LBR stack
is reset when task is scheduled in.
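
The save/restore indexing can be modelled in userspace with plain
arrays standing in for the LBR MSRs. This is a sketch under stated
assumptions: LBR_NR, struct lbr_regs, struct lbr_saved and both helper
functions are hypothetical, and the reset path simply zeroes the
modelled registers.

```c
#include <stdint.h>
#include <string.h>

#define LBR_NR 16	/* assumed LBR depth; must be a power of two */

_Static_assert((LBR_NR & (LBR_NR - 1)) == 0, "LBR_NR must be a power of two");

/* Models the hardware: a ring of from/to registers plus top-of-stack. */
struct lbr_regs {
	uint64_t from[LBR_NR];
	uint64_t to[LBR_NR];
	unsigned int tos;	/* index of the most recent record */
};

/* Models the per-task save area: entry 0 is the most recent record. */
struct lbr_saved {
	uint64_t from[LBR_NR];
	uint64_t to[LBR_NR];
	int valid;
};

/* Copy the ring out, newest first, walking backwards from TOS. */
void lbr_save(const struct lbr_regs *hw, struct lbr_saved *ctx)
{
	unsigned int mask = LBR_NR - 1;

	for (unsigned int i = 0; i < LBR_NR; i++) {
		unsigned int idx = (hw->tos - i) & mask;

		ctx->from[i] = hw->from[idx];
		ctx->to[i] = hw->to[idx];
	}
	ctx->valid = 1;
}

/* Copy the saved records back relative to the current TOS, or reset. */
void lbr_restore(struct lbr_regs *hw, struct lbr_saved *ctx)
{
	unsigned int mask = LBR_NR - 1;

	if (!ctx->valid) {
		/* nothing saved: model a reset by clearing all records */
		memset(hw->from, 0, sizeof(hw->from));
		memset(hw->to, 0, sizeof(hw->to));
		return;
	}

	for (unsigned int i = 0; i < LBR_NR; i++) {
		unsigned int idx = (hw->tos - i) & mask;

		hw->from[idx] = ctx->from[i];
		hw->to[idx] = ctx->to[i];
	}
	ctx->valid = 0;
}
```

Storing the records relative to TOS means the saved copy does not depend
on where TOS happened to point when the task was scheduled out.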

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 87 +++++++++++++++++++++++++-----
 1 file changed, 75 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9ae1875..eee93df 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -180,13 +180,88 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+	u64 tos;
+	rdmsrl(x86_pmu.lbr_tos, tos);
+	return tos;
+}
+
+enum {
+	LBR_NONE,
+	LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask;
+	u64 tos;
+
+	if (task_ctx->lbr_callstack_users == 0 ||
+	    task_ctx->lbr_stack_state == LBR_NONE) {
+		intel_pmu_lbr_reset();
+		return;
+	}
+
+	mask = x86_pmu.lbr_nr - 1;
+	tos = intel_pmu_lbr_tos();
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask;
+	u64 tos;
+
+	if (task_ctx->lbr_callstack_users == 0) {
+		task_ctx->lbr_stack_state = LBR_NONE;
+		return;
+	}
+
+	mask = x86_pmu.lbr_nr - 1;
+	tos = intel_pmu_lbr_tos();
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_VALID;
+}
+
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 	/*
+	 * If LBR callstack feature is enabled and the stack was saved when
+	 * the task was scheduled out, restore the stack. Otherwise flush
+	 * the LBR stack.
+	 */
+	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+	if (task_ctx) {
+		if (sched_in) {
+			__intel_pmu_lbr_restore(task_ctx);
+			cpuc->lbr_context = ctx;
+		} else {
+			__intel_pmu_lbr_save(task_ctx);
+		}
+	}
+
+	/*
 	 * When sampling the branck stack in system-wide, it may be
 	 * necessary to flush the stack on context switch. This happens
 	 * when the branch stack does not tag its entries with the pid
@@ -276,18 +351,6 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
-	u64 tos;
-
-	rdmsrl(x86_pmu.lbr_tos, tos);
-
-	return tos;
-}
-
 static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
-- 
1.9.3



* [PATCH V5 00/16] perf, x86: Haswell LBR call stack support
@ 2001-01-08  2:29 kan.liang
  2001-01-08  2:29 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
  0 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2001-01-08  2:29 UTC (permalink / raw)
  To: a.p.zijlstra, eranian; +Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang

From: Kan Liang <kan.liang@intel.com>

(Re-post the haswell LBR call stack patch on behalf of Yan, Zheng.
The patch set is rebased on tip.git commit ID 8b5df2f.
I've tested them on my Haswell platform.)

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs frame pointers also have a high
cost. DWARF2 unwinding does not always work either and is extremely slow
(up to 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to get
the callgraph. It has some limitations too, but should work in most cases
and is significantly faster than DWARF. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.
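
The push-on-call, pop-on-return behaviour described above can be modelled in
a few lines of user-space C. This is a sketch of the idea only, not of the
hardware or of the patches: struct lbr_model and the lbr_model_*() helpers
are hypothetical names, and the depth of 16 entries is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define LBR_NR 16u /* assumed depth; must be a power of two for the mask */

struct lbr_entry {
	uint64_t from; /* address of the call instruction */
	uint64_t to;   /* address of the callee */
};

/* Hypothetical software model of the LBR registers in call-stack mode. */
struct lbr_model {
	struct lbr_entry entry[LBR_NR];
	unsigned int tos;   /* index of the most recently recorded branch */
	unsigned int depth; /* number of valid entries, at most LBR_NR */
};

static void lbr_model_init(struct lbr_model *lbr)
{
	lbr->tos = LBR_NR - 1;
	lbr->depth = 0;
}

/* A call pushes a new record; the oldest one is overwritten when full. */
static void lbr_model_call(struct lbr_model *lbr, uint64_t from, uint64_t to)
{
	lbr->tos = (lbr->tos + 1) & (LBR_NR - 1);
	lbr->entry[lbr->tos].from = from;
	lbr->entry[lbr->tos].to = to;
	if (lbr->depth < LBR_NR)
		lbr->depth++;
}

/* A return pops the last captured record. */
static void lbr_model_ret(struct lbr_model *lbr)
{
	if (lbr->depth == 0)
		return;
	lbr->tos = (lbr->tos - 1) & (LBR_NR - 1);
	lbr->depth--;
}

/* Copy the call chain, most recent call first; returns the entry count. */
static size_t lbr_model_callchain(const struct lbr_model *lbr,
				  struct lbr_entry *out, size_t max)
{
	size_t i;

	for (i = 0; i < lbr->depth && i < max; i++)
		out[i] = lbr->entry[(lbr->tos - i) & (LBR_NR - 1)];
	return i;
}
```

After three calls and one return, the model holds two records, which read
back as the chain of callers of the currently running function.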

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]
     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]
     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start
If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8
     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back
The LBR call stack has the following known limitations:
 - Zero length calls are not filtered out by hardware
 - Exception handling such as setjmp/longjmp will cause calls/returns to
   not match
 - Pushing a different return address onto the stack will cause
   calls/returns to not match
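
To illustrate the first limitation: a zero length call targets the
instruction right after itself and is never matched by a return, so its
record stays on the LBR stack and has to be dropped in software. The sketch
below is a simplification and not the code from this series. It assumes every
call is the 5-byte near relative form (opcode 0xE8 plus a 32-bit
displacement), whereas a real implementation has to decode the instruction at
the "from" address. The names is_zero_length_call() and
discard_zero_length_calls() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct lbr_entry {
	uint64_t from; /* address of the call instruction */
	uint64_t to;   /* branch target */
};

/* Assumption: all calls use the 5-byte "call rel32" encoding. */
#define CALL_REL32_LEN 5

/* A zero length call lands on the instruction directly after itself. */
static int is_zero_length_call(const struct lbr_entry *e)
{
	return e->to == e->from + CALL_REL32_LEN;
}

/* Compact the array in place; returns the number of entries kept. */
static size_t discard_zero_length_calls(struct lbr_entry *e, size_t nr)
{
	size_t in, out = 0;

	for (in = 0; in < nr; in++) {
		if (is_zero_length_call(&e[in]))
			continue;
		e[out++] = e[in];
	}
	return out;
}
```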
 - If callstack is deeper than the LBR, only the last entries are captured

Changes since v1
 - split change into more patches
 - introduce context switch callback and use it to flush LBR
 - use the context switch callback to save/restore LBR
 - dynamically allocate the memory area for storing the LBR stack, and
   always switch the memory area during context switch
 - disable this feature by default
 - more description in change logs

Changes since v2
 - don't use xchg to switch PMU specific data
 - remove nr_branch_stack from struct perf_event_context
 - simplify the save/restore LBR stack logic
 - remove unnecessary 'has_branch_stack -> needs_branch_stack'
   conversion
 - more description in change logs

Changes since v3
 - remove sysfs attribute file that disables this feature

Changes since v4
 - re-organize code that saves/restores the LBR stack
 - allocate pmu specific data when it's needed
 - update code comments


Yan, Zheng (16):
  perf, x86: Reduce lbr_sel_map size
  perf, core: introduce pmu context switch callback
  perf, x86: use context switch callback to flush LBR stack
  perf, x86: Basic Haswell LBR call stack support
  perf, core: pmu specific data for perf task context
  perf, core: always switch pmu specific data during context switch
  perf, x86: allocate space for storing LBR stack
  perf, x86: track number of events that use LBR callstack
  perf, x86: Save/resotre LBR stack during context switch
  perf, core: simplify need branch stack check
  perf, core: Pass perf_sample_data to perf_callchain()
  perf, x86: use LBR call stack to get user callchain
  perf, x86: re-organize code that implicitly enables LBR/PEBS
  perf, x86: enable LBR callstack when recording callchain
  perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack  
      mode
  perf, x86: Discard zero length call entries in LBR call stack

 arch/arm/kernel/perf_event.c               |   4 +-
 arch/powerpc/perf/callchain.c              |   4 +-
 arch/sparc/kernel/perf_event.c             |   4 +-
 arch/x86/kernel/cpu/perf_event.c           | 120 +++++++----
 arch/x86/kernel/cpu/perf_event.h           |  28 ++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  38 +---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 316 ++++++++++++++++++++++-------
 include/linux/perf_event.h                 |  26 ++-
 include/uapi/linux/perf_event.h            |  49 +++--
 kernel/events/callchain.c                  |   8 +-
 kernel/events/core.c                       | 182 +++++++++--------
 kernel/events/internal.h                   |   3 +-
 12 files changed, 528 insertions(+), 254 deletions(-)

-- 
1.8.3.2



* [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch
  2001-01-08  2:29 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
@ 2001-01-08  2:29 ` kan.liang
  0 siblings, 0 replies; 36+ messages in thread
From: kan.liang @ 2001-01-08  2:29 UTC (permalink / raw)
  To: a.p.zijlstra, eranian
  Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang, Yan, Zheng

From: Kan Liang <kan.liang@intel.com>

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is to save/restore
the LBR stack to/from the task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses LBR call stack, the LBR stack
is reset when task is scheduled in.
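
The decision described above (save, restore, or reset) can be written as a
small state function. This is a user-space sketch for illustration, not the
patch itself: enum lbr_action, struct task_lbr_state and lbr_sched_action()
are hypothetical names, while LBR_NONE/LBR_VALID and the two fields mirror
the ones used in the diff below.

```c
#include <stdbool.h>

enum lbr_stack_state { LBR_NONE, LBR_VALID };

/* Hypothetical result type: what the context-switch hook should do. */
enum lbr_action {
	LBR_ACT_NOTHING,
	LBR_ACT_SAVE,
	LBR_ACT_RESTORE,
	LBR_ACT_RESET,
};

/* Hypothetical subset of the per-task LBR bookkeeping. */
struct task_lbr_state {
	int lbr_callstack_users;
	enum lbr_stack_state lbr_stack_state;
};

static enum lbr_action lbr_sched_action(struct task_lbr_state *t,
					bool sched_in)
{
	if (sched_in) {
		/* Nothing usable was saved: start from a clean LBR stack. */
		if (t->lbr_callstack_users == 0 ||
		    t->lbr_stack_state == LBR_NONE)
			return LBR_ACT_RESET;
		/* A saved stack is consumed by the restore. */
		t->lbr_stack_state = LBR_NONE;
		return LBR_ACT_RESTORE;
	}

	/* Scheduling out: save only if some event uses the call stack. */
	if (t->lbr_callstack_users == 0) {
		t->lbr_stack_state = LBR_NONE;
		return LBR_ACT_NOTHING;
	}
	t->lbr_stack_state = LBR_VALID;
	return LBR_ACT_SAVE;
}
```

A task with one call stack user goes reset, save, restore across its first
schedule in, schedule out and next schedule in; a task with no users is
always reset on schedule in.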

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 88 ++++++++++++++++++++++++++----
 1 file changed, 76 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8c6da0f..6aabbb4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -180,13 +180,89 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+	u64 tos;
+
+	rdmsrl(x86_pmu.lbr_tos, tos);
+	return tos;
+}
+
+enum {
+	LBR_NONE,
+	LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask;
+	u64 tos;
+
+	if (task_ctx->lbr_callstack_users == 0 ||
+	    task_ctx->lbr_stack_state == LBR_NONE) {
+		intel_pmu_lbr_reset();
+		return;
+	}
+
+	mask = x86_pmu.lbr_nr - 1;
+	tos = intel_pmu_lbr_tos();
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask;
+	u64 tos;
+
+	if (task_ctx->lbr_callstack_users == 0) {
+		task_ctx->lbr_stack_state = LBR_NONE;
+		return;
+	}
+
+	mask = x86_pmu.lbr_nr - 1;
+	tos = intel_pmu_lbr_tos();
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_VALID;
+}
+
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 	/*
+	 * If LBR callstack feature is enabled and the stack was saved when
+	 * the task was scheduled out, restore the stack. Otherwise flush
+	 * the LBR stack.
+	 */
+	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+	if (task_ctx) {
+		if (sched_in) {
+			__intel_pmu_lbr_restore(task_ctx);
+			cpuc->lbr_context = ctx;
+		} else {
+			__intel_pmu_lbr_save(task_ctx);
+		}
+	}
+
+	/*
 	 * When sampling the branck stack in system-wide, it may be
 	 * necessary to flush the stack on context switch. This happens
 	 * when the branch stack does not tag its entries with the pid
@@ -276,18 +352,6 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
-	u64 tos;
-
-	rdmsrl(x86_pmu.lbr_tos, tos);
-
-	return tos;
-}
-
 static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
-- 
1.8.3.2



end of thread, other threads:[~2014-10-07 16:30 UTC | newest]

Thread overview: 36+ messages
2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
2014-09-24 10:50   ` Peter Zijlstra
2014-09-10 14:08 ` [PATCH V5 02/16] perf, core: introduce pmu context switch callback kan.liang
2014-09-24 11:23   ` Peter Zijlstra
2014-09-24 13:13   ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 03/16] perf, x86: use context switch callback to flush LBR stack kan.liang
2014-09-10 14:09 ` [PATCH V5 04/16] perf, x86: Basic Haswell LBR call stack support kan.liang
2014-09-10 14:09 ` [PATCH V5 05/16] perf, core: pmu specific data for perf task context kan.liang
2014-09-10 14:09 ` [PATCH V5 06/16] perf, core: always switch pmu specific data during context switch kan.liang
2014-09-10 14:09 ` [PATCH V5 07/16] perf, x86: allocate space for storing LBR stack kan.liang
2014-09-10 14:09 ` [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack kan.liang
2014-09-24 12:53   ` Peter Zijlstra
2014-10-07  2:59     ` Liang, Kan
2014-10-07 15:19       ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
2014-09-24 13:33   ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 10/16] perf, core: simplify need branch stack check kan.liang
2014-09-24 13:55   ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain() kan.liang
2014-09-24 14:15   ` Peter Zijlstra
2014-10-07  3:00     ` Liang, Kan
2014-10-07 15:24       ` Peter Zijlstra
2014-10-07 15:50         ` Liang, Kan
2014-10-07 16:29           ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 12/16] perf, x86: use LBR call stack to get user callchain kan.liang
2014-09-10 14:09 ` [PATCH V5 13/16] perf, x86: re-organize code that implicitly enables LBR/PEBS kan.liang
2014-09-10 14:09 ` [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain kan.liang
2014-09-24 14:21   ` Peter Zijlstra
2014-10-07  3:00     ` Liang, Kan
2014-10-07 15:25       ` Peter Zijlstra
2014-10-07 16:04         ` Liang, Kan
2014-09-10 14:09 ` [PATCH V5 15/16] perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode kan.liang
2014-09-10 14:09 ` [PATCH V5 16/16] perf, x86: Discard zero length call entries in LBR call stack kan.liang
  -- strict thread matches above, loose matches on Subject: below --
2014-07-07  6:28 [PATCH v5 00/16] perf, x86: Haswell LBR call stack support Yan, Zheng
2014-07-07  6:28 ` [PATCH v5 09/16] perf, x86: Save/resotre LBR stack during context switch Yan, Zheng
2001-01-08  2:29 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
2001-01-08  2:29 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
