public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH V5 00/16] perf, x86: Haswell LBR call stack support
@ 2014-09-10 14:08 kan.liang
  2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
                   ` (15 more replies)
  0 siblings, 16 replies; 36+ messages in thread
From: kan.liang @ 2014-09-10 14:08 UTC (permalink / raw)
  To: a.p.zijlstra, eranian; +Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang

From: Kan Liang <kan.liang@intel.com>

(The Email and patch date of last post isn't correct,
so re-post to fix the time.)
(Re-post the haswell LBR call stack patch on behalf of Yan, Zheng.
The patch set is rebased on tip.git commit ID 8b5df2f.
I've tested them on my Haswell platform.)

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs frame pointer also has a high
cost. Dwarf2 unwinding also does not always work and is extremely slow
(upto 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
call will be collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative to get
callgraph. It has some limitations too, but should work in most cases
and is significantly faster than dwarf. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]
     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]
     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start
If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8
     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back
The LBR call stack has following known limitations
 - Zero length calls are not filtered out by hardware
 - Exception handing such as setjmp/longjmp will have calls/returns not
   match
 - Pushing different return address onto the stack will have calls/returns
   not match
 - If callstack is deeper than the LBR, only the last entries are captured

Changes since v1
 - split change into more patches
 - introduce context switch callback and use it to flush LBR
 - use the context switch callback to save/restore LBR
 - dynamic allocate memory area for storing LBR stack, always switch the
   memory area during context switch
 - disable this feature by default
 - more description in change logs

Changes since v2
 - don't use xchg to switch PMU specific data
 - remove nr_branch_stack from struct perf_event_context
 - simplify the save/restore LBR stack logical
 - remove unnecessary 'has_branch_stack -> needs_branch_stack'
   conversion
 - more description in change logs

Changes since v3
 - remove sysfs attribute file that disable this feature

Changes since v4
 - re-organize code that save/resotre LBR stack
 - allocate pmu specific data when it's needed
 - update code comments

Yan, Zheng (16):
  perf, x86: Reduce lbr_sel_map size
  perf, core: introduce pmu context switch callback
  perf, x86: use context switch callback to flush LBR stack
  perf, x86: Basic Haswell LBR call stack support
  perf, core: pmu specific data for perf task context
  perf, core: always switch pmu specific data during context switch
  perf, x86: allocate space for storing LBR stack
  perf, x86: track number of events that use LBR callstack
  perf, x86: Save/resotre LBR stack during context switch
  perf, core: simplify need branch stack check
  perf, core: Pass perf_sample_data to perf_callchain()
  perf, x86: use LBR call stack to get user callchain
  perf, x86: re-organize code that implicitly enables LBR/PEBS
  perf, x86: enable LBR callstack when recording callchain
  perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack  
      mode
  perf, x86: Discard zero length call entries in LBR call stack

 arch/arm/kernel/perf_event.c               |   4 +-
 arch/powerpc/perf/callchain.c              |   4 +-
 arch/sparc/kernel/perf_event.c             |   4 +-
 arch/x86/kernel/cpu/perf_event.c           | 120 +++++++----
 arch/x86/kernel/cpu/perf_event.h           |  28 ++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  38 +---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 316 ++++++++++++++++++++++-------
 include/linux/perf_event.h                 |  26 ++-
 include/uapi/linux/perf_event.h            |  49 +++--
 kernel/events/callchain.c                  |   8 +-
 kernel/events/core.c                       | 182 +++++++++--------
 kernel/events/internal.h                   |   3 +-
 12 files changed, 528 insertions(+), 254 deletions(-)

-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 36+ messages in thread
* [PATCH v5 00/16] perf, x86: Haswell LBR call stack support
@ 2014-07-07  6:28 Yan, Zheng
  2014-07-07  6:28 ` [PATCH v5 09/16] perf, x86: Save/resotre LBR stack during context switch Yan, Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: Yan, Zheng @ 2014-07-07  6:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs frame pointer also has a high
cost. Dwarf2 unwinding also does not always work and is extremely slow
(upto 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
call will be collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative to get
callgraph. It has some limitations too, but should work in most cases
and is significantly faster than dwarf. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.

When profiling bc(1) on Fedora 19:
 echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd

If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start

    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start

     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]

     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]

     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start

If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide

    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult

     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8

     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub

     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back

The LBR call stack has following known limitations
 - Zero length calls are not filtered out by hardware
 - Exception handing such as setjmp/longjmp will have calls/returns not
   match
 - Pushing different return address onto the stack will have calls/returns
   not match
 - If callstack is deeper than the LBR, only the last entries are captured

Changes since v1
 - split change into more patches
 - introduce context switch callback and use it to flush LBR
 - use the context switch callback to save/restore LBR
 - dynamic allocate memory area for storing LBR stack, always switch the
   memory area during context switch
 - disable this feature by default
 - more description in change logs

Changes since v2
 - don't use xchg to switch PMU specific data
 - remove nr_branch_stack from struct perf_event_context
 - simplify the save/restore LBR stack logical
 - remove unnecessary 'has_branch_stack -> needs_branch_stack'
   conversion
 - more description in change logs

Changes since v3
 - remove sysfs attribute file that disable this feature

Changes since v4
 - re-organize code that save/resotre LBR stack
 - allocate pmu specific data when it's needed
 - update code comments

These patches are also available at:

These patches are also available at:
 https://github.com/ukernel/linux.git perf-lbr-callstack

^ permalink raw reply	[flat|nested] 36+ messages in thread
* [PATCH V5 00/16] perf, x86: Haswell LBR call stack support
@ 2001-01-08  2:29 kan.liang
  2001-01-08  2:29 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
  0 siblings, 1 reply; 36+ messages in thread
From: kan.liang @ 2001-01-08  2:29 UTC (permalink / raw)
  To: a.p.zijlstra, eranian; +Cc: linux-kernel, mingo, paulus, acme, ak, kan.liang

From: Kan Liang <kan.liang@intel.com>

(Re-post the haswell LBR call stack patch on behalf of Yan, Zheng.
The patch set is rebased on tip.git commit ID 8b5df2f.
I've tested them on my Haswell platform.)

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default on
64bit code (and on modern 32bit gccs), so there are many binaries around
that do not use frame pointers. Profiling unchanged production code is
very useful in practice. On some CPUs frame pointer also has a high
cost. Dwarf2 unwinding also does not always work and is extremely slow
(upto 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
call will be collected as normal, but as return instructions are
executed the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative to get
callgraph. It has some limitations too, but should work in most cases
and is significantly faster than dwarf. Frame pointer unwinding is still
the best default, but LBR call stack is a good alternative when nothing
else works.

When profiling bc(1) on Fedora 19:
echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start
     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]
     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]
     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start
If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8
     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back
The LBR call stack has following known limitations
 - Zero length calls are not filtered out by hardware
 - Exception handing such as setjmp/longjmp will have calls/returns not
   match
 - Pushing different return address onto the stack will have calls/returns
   not match
 - If callstack is deeper than the LBR, only the last entries are captured

Changes since v1
 - split change into more patches
 - introduce context switch callback and use it to flush LBR
 - use the context switch callback to save/restore LBR
 - dynamic allocate memory area for storing LBR stack, always switch the
   memory area during context switch
 - disable this feature by default
 - more description in change logs

Changes since v2
 - don't use xchg to switch PMU specific data
 - remove nr_branch_stack from struct perf_event_context
 - simplify the save/restore LBR stack logical
 - remove unnecessary 'has_branch_stack -> needs_branch_stack'
   conversion
 - more description in change logs

Changes since v3
 - remove sysfs attribute file that disable this feature

Changes since v4
 - re-organize code that save/resotre LBR stack
 - allocate pmu specific data when it's needed
 - update code comments


Yan, Zheng (16):
  perf, x86: Reduce lbr_sel_map size
  perf, core: introduce pmu context switch callback
  perf, x86: use context switch callback to flush LBR stack      stack
  perf, x86: Basic Haswell LBR call stack support
  perf, core: pmu specific data for perf task context
  perf, core: always switch pmu specific data during context switch
  perf, x86: allocate space for storing LBR stack
  perf, x86: track number of events that use LBR callstack
  perf, x86: Save/resotre LBR stack during context switch
  perf, core: simplify need branch stack check
  perf, core: Pass perf_sample_data to perf_callchain()
  perf, x86: use LBR call stack to get user callchain
  perf, x86: re-organize code that implicitly enables LBR/PEBS
  perf, x86: enable LBR callstack when recording callchain
  perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack  
      mode
  perf, x86: Discard zero length call entries in LBR call stack

 arch/arm/kernel/perf_event.c               |   4 +-
 arch/powerpc/perf/callchain.c              |   4 +-
 arch/sparc/kernel/perf_event.c             |   4 +-
 arch/x86/kernel/cpu/perf_event.c           | 120 +++++++----
 arch/x86/kernel/cpu/perf_event.h           |  28 ++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  38 +---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 316 ++++++++++++++++++++++-------
 include/linux/perf_event.h                 |  26 ++-
 include/uapi/linux/perf_event.h            |  49 +++--
 kernel/events/callchain.c                  |   8 +-
 kernel/events/core.c                       | 182 +++++++++--------
 kernel/events/internal.h                   |   3 +-
 12 files changed, 528 insertions(+), 254 deletions(-)

-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2014-10-07 16:30 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-10 14:08 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
2014-09-10 14:08 ` [PATCH V5 01/16] perf, x86: Reduce lbr_sel_map size kan.liang
2014-09-24 10:50   ` Peter Zijlstra
2014-09-10 14:08 ` [PATCH V5 02/16] perf, core: introduce pmu context switch callback kan.liang
2014-09-24 11:23   ` Peter Zijlstra
2014-09-24 13:13   ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 03/16] perf, x86: use context switch callback to flush LBR stack kan.liang
2014-09-10 14:09 ` [PATCH V5 04/16] perf, x86: Basic Haswell LBR call stack support kan.liang
2014-09-10 14:09 ` [PATCH V5 05/16] perf, core: pmu specific data for perf task context kan.liang
2014-09-10 14:09 ` [PATCH V5 06/16] perf, core: always switch pmu specific data during context switch kan.liang
2014-09-10 14:09 ` [PATCH V5 07/16] perf, x86: allocate space for storing LBR stack kan.liang
2014-09-10 14:09 ` [PATCH V5 08/16] perf, x86: track number of events that use LBR callstack kan.liang
2014-09-24 12:53   ` Peter Zijlstra
2014-10-07  2:59     ` Liang, Kan
2014-10-07 15:19       ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang
2014-09-24 13:33   ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 10/16] perf, core: simplify need branch stack check kan.liang
2014-09-24 13:55   ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 11/16] perf, core: Pass perf_sample_data to perf_callchain() kan.liang
2014-09-24 14:15   ` Peter Zijlstra
2014-10-07  3:00     ` Liang, Kan
2014-10-07 15:24       ` Peter Zijlstra
2014-10-07 15:50         ` Liang, Kan
2014-10-07 16:29           ` Peter Zijlstra
2014-09-10 14:09 ` [PATCH V5 12/16] perf, x86: use LBR call stack to get user callchain kan.liang
2014-09-10 14:09 ` [PATCH V5 13/16] perf, x86: re-organize code that implicitly enables LBR/PEBS kan.liang
2014-09-10 14:09 ` [PATCH V5 14/16] perf, x86: enable LBR callstack when recording callchain kan.liang
2014-09-24 14:21   ` Peter Zijlstra
2014-10-07  3:00     ` Liang, Kan
2014-10-07 15:25       ` Peter Zijlstra
2014-10-07 16:04         ` Liang, Kan
2014-09-10 14:09 ` [PATCH V5 15/16] perf, x86: disable FREEZE_LBRS_ON_PMI when LBR operates in callstack mode kan.liang
2014-09-10 14:09 ` [PATCH V5 16/16] perf, x86: Discard zero length call entries in LBR call stack kan.liang
  -- strict thread matches above, loose matches on Subject: below --
2014-07-07  6:28 [PATCH v5 00/16] perf, x86: Haswell LBR call stack support Yan, Zheng
2014-07-07  6:28 ` [PATCH v5 09/16] perf, x86: Save/resotre LBR stack during context switch Yan, Zheng
2001-01-08  2:29 [PATCH V5 00/16] perf, x86: Haswell LBR call stack support kan.liang
2001-01-08  2:29 ` [PATCH V5 09/16] perf, x86: Save/resotre LBR stack during context switch kan.liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox