Re: [PATCH V3 0/3] perf tool: Haswell LBR call stack support (user)

From: Jiri Olsa <jolsa@redhat.com>
To: kan.liang@intel.com
Cc: acme@kernel.org, a.p.zijlstra@chello.nl, eranian@google.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com, paulus@samba.org,
	ak@linux.intel.com
Subject: Re: [PATCH V3 0/3] perf tool: Haswell LBR call stack support (user)
Date: Mon, 17 Nov 2014 17:01:26 +0100
Message-ID: <20141117160125.GD21532@krava.brq.redhat.com>
In-Reply-To: <1415972652-17310-1-git-send-email-kan.liang@intel.com>

On Fri, Nov 14, 2014 at 08:44:09AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> This is the user space patch for Haswell LBR call stack support.
> For many profiling tasks we need the callgraph. For example we often
> need to see the caller of a lock or the caller of a memcpy or other
> library function to actually tune the program. Frame pointer unwinding
> is efficient and works well. But frame pointers are off by default on
> 64bit code (and on modern 32bit gccs), so there are many binaries around
> that do not use frame pointers. Profiling unchanged production code is
> very useful in practice. On some CPUs, frame pointers also have a high
> cost. DWARF2 unwinding does not always work either and is extremely slow
> (up to 20% overhead).
> 
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function
> calls are collected as normal, but as return instructions are
> executed the last captured branch record is popped from the on-chip LBR
> registers. The LBR call stack facility provides an alternative way to get
> the callgraph. It has some limitations too, but should work in most cases
> and is significantly faster than DWARF. Frame pointer unwinding is still
> the best default, but LBR call stack is a good alternative when nothing
> else works.
> 

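The push-on-call / pop-on-return behaviour described above can be sketched
as a toy model. This is only an illustration, not the hardware or the perf
implementation: the class name LbrCallStack, its methods, and the
(caller, callee) entry layout are hypothetical, and the default depth of 16
entries is an assumption about the LBR size.

```python
from collections import deque


class LbrCallStack:
    """Toy model of LBR call-stack mode (hypothetical, for illustration only)."""

    def __init__(self, depth=16):
        # Assumed LBR depth. A deque with maxlen drops the oldest entry
        # when a new one is appended to a full buffer.
        self.entries = deque(maxlen=depth)

    def call(self, caller, callee):
        # A call instruction records a new branch entry.
        self.entries.append((caller, callee))

    def ret(self):
        # A return instruction pops the last captured branch entry.
        if self.entries:
            self.entries.pop()

    def callchain(self):
        # Innermost function first, then each recorded caller outwards.
        if not self.entries:
            return []
        chain = [self.entries[-1][1]]
        chain.extend(caller for caller, _ in reversed(self.entries))
        return chain


# Replay the call path seen in the bc example.
lbr = LbrCallStack()
path = ["_start", "__libc_start_main", "main", "yyparse",
        "run_code", "execute", "bc_divide"]
for caller, callee in zip(path, path[1:]):
    lbr.call(caller, callee)

# A call that has already returned leaves no trace in the chain.
lbr.call("bc_divide", "bc_new_num")
lbr.ret()
chain = lbr.callchain()

# With a stack deeper than the buffer, only the last entries survive.
shallow = LbrCallStack(depth=2)
for caller, callee in [("a", "b"), ("b", "c"), ("c", "d")]:
    shallow.call(caller, callee)
truncated = shallow.callchain()
```

In this model, chain starts at bc_divide and walks out to _start, while
truncated has lost the outermost caller "a". A longjmp-style exit that
skips returns would leave stale entries behind, which is the call/return
mismatch that any such scheme has to live with.
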

---
> A new call chain recording option "lbr" is introduced into perf tool for
> LBR call stack. The user can use --call-graph lbr to get the call stack
> information from hardware.
> 
> When profiling bc(1) on Fedora 19:
> echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph lbr bc -l < cmd
> If enabling LBR, perf report output looks like:
>     50.36%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
>     33.66%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>                      bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
>      7.62%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                     |
>                     |--99.89%-- 0x2000186a8
>                      --0.11%-- [...]
>      6.83%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>                     |
>                     |--99.94%-- bc_add
>                     |          execute
>                     |          run_code
>                     |          yyparse
>                     |          main
>                     |          __libc_start_main
>                     |          _start
>                      --0.06%-- [...]
>      0.46%       bc  libc-2.17.so       [.] __memset_sse2
>                  |
>                  --- __memset_sse2
>                     |
>                     |--54.13%-- bc_new_num
>                     |          |
>                     |          |--51.00%-- bc_divide
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |          |--30.46%-- _bc_do_sub
>                     |          |          bc_add
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |           --18.55%-- _bc_do_add
>                     |                     bc_add
>                     |                     execute
>                     |                     run_code
>                     |                     yyparse
>                     |                     main
>                     |                     __libc_start_main
>                     |                     _start
>                     |
>                      --45.87%-- bc_divide
>                                execute
>                                run_code
>                                yyparse
>                                main
>                                __libc_start_main
>                                _start
> If using FP, perf report output looks like:
> echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
>     50.49%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>     33.57%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>      7.61%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                      0x2000186a8
>      6.88%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>      0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
>                  |
>                  --- __memcpy_ssse3_back
> 
> If using LBR, perf report -D output looks like:
> 11739295893248 0x4d0 [0xe0]: PERF_RECORD_SAMPLE(IP, 0x2): 10505/10505:
> 0x40054d period: 39255 addr: 0
> ... LBR call chain: nr:7
> .....  0: fffffffffffffe00
> .....  1: 0000000000400540
> .....  2: 0000000000400587
> .....  3: 00000000004005b3
> .....  4: 00000000004005ef
> .....  5: 0000003d1cc21b43
> .....  6: 0000000000400474
> ... FP chain: nr:6
> .....  0: fffffffffffffe00
> .....  1: 000000000040054d
> .....  2: 000000000040058c
> .....  3: 00000000004005b8
> .....  4: 00000000004005f4
> .....  5: 0000003d1cc21b45
>  ... thread: a.out:10505
>  ...... dso: /home/lk/a.out
> 
> 
> The LBR call stack has the following known limitations:
>  - Zero-length calls are not filtered out by hardware
>  - Exception handling such as setjmp/longjmp causes calls and returns
>    to mismatch
>  - Pushing a different return address onto the stack causes calls and
>    returns to mismatch
>  - If the call stack is deeper than the LBR, only the last entries are captured
---

also could you please add all of the above ^^^ as additional text
for the patch 3/3 changelog (perf tools: Construct LBR call chain)?

looks too nice to lose it ;-)

thanks,
jirka
