* [PATCH] perf top: Make -g refer to callchains
@ 2013-11-15 3:51 David Ahern
2013-11-15 5:28 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: David Ahern @ 2013-11-15 3:51 UTC (permalink / raw)
To: acme, linux-kernel
Cc: David Ahern, Ingo Molnar, Frederic Weisbecker, Jiri Olsa,
Namhyung Kim
In most commands -g is used for callchains. Make perf-top follow suit.
Move group to just --group with no shortcut, making it similar to
perf-record.
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
---
tools/perf/Documentation/perf-top.txt | 5 ++---
tools/perf/builtin-top.c | 4 ++--
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 7de01dd79688..cdd8d4946dba 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -50,7 +50,6 @@ Default is to monitor all CPUS.
--count-filter=<count>::
Only display functions with more events than this.
--g::
--group::
Put the counters into a counter group.
@@ -143,12 +142,12 @@ Default is to monitor all CPUS.
--asm-raw::
Show raw instruction encoding of assembly instructions.
--G::
+-g::
Enables call-graph (stack chain/backtrace) recording.
--call-graph::
Setup and enable call-graph (stack chain/backtrace) recording,
- implies -G.
+ implies -g.
--max-stack::
Set the stack depth limit when parsing the callchain, anything
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 71e6402729a8..531522d3d97b 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1084,7 +1084,7 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
"dump the symbol table used for profiling"),
OPT_INTEGER('f', "count-filter", &top.count_filter,
"only display functions with more events than this"),
- OPT_BOOLEAN('g', "group", &opts->group,
+ OPT_BOOLEAN(0, "group", &opts->group,
"put the counters into a counter group"),
OPT_BOOLEAN('i', "no-inherit", &opts->no_inherit,
"child tasks do not inherit counters"),
@@ -1105,7 +1105,7 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
" abort, in_tx, transaction"),
OPT_BOOLEAN('n', "show-nr-samples", &symbol_conf.show_nr_samples,
"Show a column with the number of samples"),
- OPT_CALLBACK_NOOPT('G', NULL, &top.record_opts,
+ OPT_CALLBACK_NOOPT('g', NULL, &top.record_opts,
NULL, "enables call-graph recording",
&callchain_opt),
OPT_CALLBACK(0, "call-graph", &top.record_opts,
--
1.8.3.4 (Apple Git-47)
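For reference, the flag layout after this patch matches perf-record. A toy argparse model of the remapped options (illustrative only; perf's real parser is the C option table in the diff above):

```python
import argparse

# Toy model of the perf-top flag change (not perf's real C option parser):
# -g now toggles call-graph recording (formerly -G), and --group has no
# short form anymore (formerly -g), matching perf-record.
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="perf-top-model", add_help=False)
    p.add_argument("-g", dest="call_graph", action="store_true",
                   help="enables call-graph recording")
    p.add_argument("--group", dest="group", action="store_true",
                   help="put the counters into a counter group")
    return p

opts = build_parser().parse_args(["-g", "--group"])
print(opts.call_graph, opts.group)
```

With only `-g` given, `group` stays False, which is exactly the behavioral change the patch makes to the C option table.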
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-15 3:51 [PATCH] perf top: Make -g refer to callchains David Ahern
@ 2013-11-15 5:28 ` Ingo Molnar
2013-11-15 5:46 ` Ingo Molnar
2013-11-30 12:49 ` [tip:perf/core] " tip-bot for David Ahern
2 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-11-15 5:28 UTC (permalink / raw)
To: David Ahern
Cc: acme, linux-kernel, Frederic Weisbecker, Jiri Olsa, Namhyung Kim
* David Ahern <dsahern@gmail.com> wrote:
> --- a/tools/perf/Documentation/perf-top.txt
> +++ b/tools/perf/Documentation/perf-top.txt
> @@ -143,12 +142,12 @@ Default is to monitor all CPUS.
> --asm-raw::
> Show raw instruction encoding of assembly instructions.
>
> --G::
> +-g::
> Enables call-graph (stack chain/backtrace) recording.
>
> --call-graph::
> Setup and enable call-graph (stack chain/backtrace) recording,
> - implies -G.
> + implies -g.
Acked-by: Ingo Molnar <mingo@kernel.org>
Thanks,
Ingo
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-15 3:51 [PATCH] perf top: Make -g refer to callchains David Ahern
2013-11-15 5:28 ` Ingo Molnar
@ 2013-11-15 5:46 ` Ingo Molnar
2013-11-18 12:59 ` Arnaldo Carvalho de Melo
2013-11-30 12:49 ` [tip:perf/core] " tip-bot for David Ahern
2 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2013-11-15 5:46 UTC (permalink / raw)
To: David Ahern
Cc: acme, linux-kernel, Frederic Weisbecker, Jiri Olsa, Namhyung Kim
btw., here's some 'perf top' call graph performance and profiling
quality feedback, with the latest perf code:
'perf top --call-graph fp' now works very well, using just 0.2%
of CPU time on a fast system:
4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
'perf top --call-graph dwarf' on the other hand is horrendously
slow, using 20% of CPU time on a 4 GHz CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
On another system with a 2.4GHz CPU it's taking up 100% of CPU
time (!):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
Profiling 'perf top' shows all sorts of very high dwarf
processing overhead:
#
# Overhead Command Shared Object Symbol
# ........ ....... ......................... .................................................
#
7.08% perf perf [.] access_mem
7.03% perf perf [.] dso__data_read_offset
5.83% perf perf [.] maps__find
5.64% perf libunwind-x86_64.so.8.0.1 [.] 0x000000000000ba25
4.75% perf perf [.] thread__find_addr_map
3.81% perf [kernel.kallsyms] [k] unmap_single_vma
2.57% perf perf [.] map__map_ip
2.48% perf libelf-0.156.so [.] 0x0000000000003a84
2.12% perf [kernel.kallsyms] [k] memset
2.12% perf perf [.] dso__data_read_addr
2.10% perf libc-2.17.so [.] __memcpy_sse2
1.72% perf libc-2.17.so [.] __memset_sse2
1.58% perf [kernel.kallsyms] [k] page_fault
1.56% perf libc-2.17.so [.] __memset_x86_64
1.44% perf perf [.] find_proc_info
1.25% perf libelf-0.156.so [.] elf_end
1.19% perf [kernel.kallsyms] [k] flush_tlb_mm_range
1.06% perf libc-2.17.so [.] vfprintf
1.04% perf libunwind-x86_64.so.8.0.1 [.] _Ux86_64_dwarf_search_unwind_table
1.00% perf [kernel.kallsyms] [k] __audit_syscall_exit
0.94% perf libc-2.17.so [.] _int_free
0.92% perf libc-2.17.so [.] _int_malloc
0.84% perf libc-2.17.so [.] __memcmp_sse2
0.81% perf [kernel.kallsyms] [k] unmapped_area_topdown
0.71% perf [kernel.kallsyms] [k] system_call
0.71% perf [kernel.kallsyms] [k] system_call_after_swapgs
0.65% perf [kernel.kallsyms] [k] sysret_check
0.63% perf perf [.] dso__find_symbol
0.58% perf [kernel.kallsyms] [k] clear_page_c
0.58% perf [kernel.kallsyms] [k] handle_mm_fault
0.56% perf libc-2.17.so [.] __sigprocmask
0.55% perf [kernel.kallsyms] [k] copy_user_generic_string
0.51% perf [kernel.kallsyms] [k] __do_fault
0.49% perf [kernel.kallsyms] [k] find_vma
0.47% perf libpthread-2.17.so [.] __libc_close
0.44% perf [kernel.kallsyms] [k] __audit_syscall_entry
0.44% perf [kernel.kallsyms] [k] mmap_region
0.42% perf [kernel.kallsyms] [k] _raw_spin_lock
0.41% perf [kernel.kallsyms] [k] kmem_cache_free
0.40% perf [kernel.kallsyms] [k] kmem_cache_alloc
0.40% perf libpthread-2.17.so [.] pthread_mutex_unlock
0.37% perf [kernel.kallsyms] [k] perf_event_aux_ctx
0.37% perf [kernel.kallsyms] [k] do_munmap
0.37% perf libc-2.17.so [.] free
[...]
Thanks,
Ingo
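The fp/dwarf gap reported above is roughly what raw data volumes alone would predict: fp mode records a short list of instruction pointers per sample, while dwarf mode copies a full user-stack snapshot that then has to be unwound in user space. A back-of-the-envelope sketch (the constants are assumed defaults, not measurements from these runs):

```python
# Illustrative estimate (assumed numbers, not measurements) of why
# --call-graph dwarf costs so much more than fp: dwarf mode copies a
# full user-stack snapshot per sample, fp mode just a list of IPs.
SAMPLE_RATE_HZ = 4000          # assumed default sampling frequency
FP_DEPTH = 32                  # assumed average call-chain depth
IP_BYTES = 8                   # one 64-bit instruction pointer
DWARF_STACK_BYTES = 8192       # assumed default user-stack dump size

fp_bytes_per_sec = SAMPLE_RATE_HZ * FP_DEPTH * IP_BYTES
dwarf_bytes_per_sec = SAMPLE_RATE_HZ * DWARF_STACK_BYTES

print(f"fp:    {fp_bytes_per_sec / 1e6:.1f} MB/s")
print(f"dwarf: {dwarf_bytes_per_sec / 1e6:.1f} MB/s")
```

Under these assumptions dwarf mode moves about 32x more data per second than fp mode, before any of the actual unwinding work even starts.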
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-15 5:46 ` Ingo Molnar
@ 2013-11-18 12:59 ` Arnaldo Carvalho de Melo
2013-11-18 13:25 ` Jiri Olsa
2013-11-19 9:24 ` Jean Pihet
0 siblings, 2 replies; 13+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-11-18 12:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: David Ahern, linux-kernel, Frederic Weisbecker, Jiri Olsa,
Namhyung Kim
Em Fri, Nov 15, 2013 at 06:46:09AM +0100, Ingo Molnar escreveu:
> btw., here's some 'perf top' call graph performance and profiling
> quality feedback, with the latest perf code:
>
> 'perf top --call-graph fp' now works very well, using just 0.2%
> of CPU time on a fast system:
>
> 4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
>
> 'perf top --call-graph dwarf' on the other hand is horrendously
> slow, using 20% of CPU time on a 4 GHz CPU:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
>
> On another system with a 2.4GHz CPU it's taking up 100% of CPU
> time (!):
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
>
> Profiling 'perf top' shows all sorts of very high dwarf
> processing overhead:
Yeah, top dwarf callchain has been so far a proof of concept; it
exacerbates problems that can be seen on 'report', but since it's live,
we can see them more clearly.
The work on improving callchain processing (rb_tree'ing, the new comm
infrastructure) alleviated the problem a bit.
Tuning the stack size requested from the kernel and using --max-stack
can help when it is really needed, but yes, work on it is *badly* needed.
- Arnaldo
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 12:59 ` Arnaldo Carvalho de Melo
@ 2013-11-18 13:25 ` Jiri Olsa
2013-11-18 14:26 ` Ingo Molnar
2013-11-19 9:24 ` Jean Pihet
1 sibling, 1 reply; 13+ messages in thread
From: Jiri Olsa @ 2013-11-18 13:25 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Ingo Molnar, David Ahern, linux-kernel, Frederic Weisbecker,
Namhyung Kim
On Mon, Nov 18, 2013 at 09:59:45AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Nov 15, 2013 at 06:46:09AM +0100, Ingo Molnar escreveu:
> > btw., here's some 'perf top' call graph performance and profiling
> > quality feedback, with the latest perf code:
> >
> > 'perf top --call-graph fp' now works very well, using just 0.2%
> > of CPU time on a fast system:
> >
> > 4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
> >
> > 'perf top --call-graph dwarf' on the other hand is horrendously
> > slow, using 20% of CPU time on a 4 GHz CPU:
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
> >
> > On another system with a 2.4GHz CPU it's taking up 100% of CPU
> > time (!):
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
> >
> > Profiling 'perf top' shows all sorts of very high dwarf
> > processing overhead:
>
> Yeah, top dwarf callchain has been so far a proof of concept, it
> exacerbates problems that can be seen on 'report', but since its live,
> we can see it more clearly.
>
> The work on improving callchain processing, (rb_tree'ing, new comm
> infrastructure) alleviated the problem a bit.
>
> Tuning the stack size requested from the kernel and using --max-stack
> can help when it is really needed, but yes, work on it is *badly* needed.
agreed ;-)
also there's a new remote unwind interface recently added
to libdw, which seems to be faster than libunwind.
I plan on adding this soon.
jirka
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 13:25 ` Jiri Olsa
@ 2013-11-18 14:26 ` Ingo Molnar
2013-11-18 17:49 ` Jiri Olsa
0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2013-11-18 14:26 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, David Ahern, linux-kernel,
Frederic Weisbecker, Namhyung Kim
* Jiri Olsa <jolsa@redhat.com> wrote:
> On Mon, Nov 18, 2013 at 09:59:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Nov 15, 2013 at 06:46:09AM +0100, Ingo Molnar escreveu:
> > > btw., here's some 'perf top' call graph performance and profiling
> > > quality feedback, with the latest perf code:
> > >
> > > 'perf top --call-graph fp' now works very well, using just 0.2%
> > > of CPU time on a fast system:
> > >
> > > 4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
> > >
> > > 'perf top --call-graph dwarf' on the other hand is horrendously
> > > slow, using 20% of CPU time on a 4 GHz CPU:
> > >
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > 4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
> > >
> > > On another system with a 2.4GHz CPU it's taking up 100% of CPU
> > > time (!):
> > >
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > 8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
> > >
> > > Profiling 'perf top' shows all sorts of very high dwarf
> > > processing overhead:
> >
> > Yeah, top dwarf callchain has been so far a proof of concept, it
> > exacerbates problems that can be seen on 'report', but since its
> > live, we can see it more clearly.
> >
> > The work on improving callchain processing, (rb_tree'ing, new comm
> > infrastructure) alleviated the problem a bit.
> >
> > Tuning the stack size requested from the kernel and using
> > --max-stack can help when it is really needed, but yes, work on it
> > is *badly* needed.
>
> agreed ;-)
>
> also there's new remote unwind interface recently added into libdw,
> which seems to be faster than libunwind.
>
> I plan on adding this soon.
If the main source of overhead is libunwind (which needs independent
confirmation) then would it make sense to implement dwarf stack unwind
support ourselves?
I think SysProf does that and it appears to be faster - its unwind.c
is only 400 lines long as it only implements the small subset needed
to walk the stack - AFAICS.
Thanks,
Ingo
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 14:26 ` Ingo Molnar
@ 2013-11-18 17:49 ` Jiri Olsa
2013-11-18 19:17 ` Ingo Molnar
2013-11-18 20:16 ` Jan Kratochvil
0 siblings, 2 replies; 13+ messages in thread
From: Jiri Olsa @ 2013-11-18 17:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, David Ahern, linux-kernel,
Frederic Weisbecker, Namhyung Kim, Jan Kratochvil
On Mon, Nov 18, 2013 at 03:26:53PM +0100, Ingo Molnar wrote:
>
> * Jiri Olsa <jolsa@redhat.com> wrote:
>
> > On Mon, Nov 18, 2013 at 09:59:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Fri, Nov 15, 2013 at 06:46:09AM +0100, Ingo Molnar escreveu:
> > > > btw., here's some 'perf top' call graph performance and profiling
> > > > quality feedback, with the latest perf code:
> > > >
> > > > 'perf top --call-graph fp' now works very well, using just 0.2%
> > > > of CPU time on a fast system:
> > > >
> > > > 4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
> > > >
> > > > 'perf top --call-graph dwarf' on the other hand is horrendously
> > > > slow, using 20% of CPU time on a 4 GHz CPU:
> > > >
> > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > > 4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
> > > >
> > > > On another system with a 2.4GHz CPU it's taking up 100% of CPU
> > > > time (!):
> > > >
> > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > > 8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
> > > >
> > > > Profiling 'perf top' shows all sorts of very high dwarf
> > > > processing overhead:
> > >
> > > Yeah, top dwarf callchain has been so far a proof of concept, it
> > > exacerbates problems that can be seen on 'report', but since its
> > > live, we can see it more clearly.
> > >
> > > The work on improving callchain processing, (rb_tree'ing, new comm
> > > infrastructure) alleviated the problem a bit.
> > >
> > > Tuning the stack size requested from the kernel and using
> > > --max-stack can help when it is really needed, but yes, work on it
> > > is *badly* needed.
> >
> > agreed ;-)
> >
> > also there's new remote unwind interface recently added into libdw,
> > which seems to be faster than libunwind.
> >
> > I plan on adding this soon.
>
> If the main source of overhead is libunwind (which needs independent
> confirmation) then would it make sense to implement dwarf stack unwind
> support ourselves?
>
> I think SysProf does that and it appears to be faster - its unwind.c
> is only 400 lines long as it only implements the small subset needed
> to walk the stack - AFAICS.
I think it's an option.. but it'll be simpler to try the libdw
interface first and see if it's good/fast enough..
also I recall discussing the speed with libdw developer
Jan Kratochvil (CC-ed) and AFAICS they're open to
suggestions/optimizations.
jirka
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 17:49 ` Jiri Olsa
@ 2013-11-18 19:17 ` Ingo Molnar
2013-11-18 20:16 ` Jan Kratochvil
1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-11-18 19:17 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, David Ahern, linux-kernel,
Frederic Weisbecker, Namhyung Kim, Jan Kratochvil
* Jiri Olsa <jolsa@redhat.com> wrote:
> On Mon, Nov 18, 2013 at 03:26:53PM +0100, Ingo Molnar wrote:
> >
> > * Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > > On Mon, Nov 18, 2013 at 09:59:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Em Fri, Nov 15, 2013 at 06:46:09AM +0100, Ingo Molnar escreveu:
> > > > > btw., here's some 'perf top' call graph performance and profiling
> > > > > quality feedback, with the latest perf code:
> > > > >
> > > > > 'perf top --call-graph fp' now works very well, using just 0.2%
> > > > > of CPU time on a fast system:
> > > > >
> > > > > 4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
> > > > >
> > > > > 'perf top --call-graph dwarf' on the other hand is horrendously
> > > > > slow, using 20% of CPU time on a 4 GHz CPU:
> > > > >
> > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > > > 4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
> > > > >
> > > > > On another system with a 2.4GHz CPU it's taking up 100% of CPU
> > > > > time (!):
> > > > >
> > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > > > 8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
> > > > >
> > > > > Profiling 'perf top' shows all sorts of very high dwarf
> > > > > processing overhead:
> > > >
> > > > Yeah, top dwarf callchain has been so far a proof of concept, it
> > > > exacerbates problems that can be seen on 'report', but since its
> > > > live, we can see it more clearly.
> > > >
> > > > The work on improving callchain processing, (rb_tree'ing, new comm
> > > > infrastructure) alleviated the problem a bit.
> > > >
> > > > Tuning the stack size requested from the kernel and using
> > > > --max-stack can help when it is really needed, but yes, work on it
> > > > is *badly* needed.
> > >
> > > agreed ;-)
> > >
> > > also there's new remote unwind interface recently added into libdw,
> > > which seems to be faster than libunwind.
> > >
> > > I plan on adding this soon.
> >
> > If the main source of overhead is libunwind (which needs
> > independent confirmation) then would it make sense to implement
> > dwarf stack unwind support ourselves?
> >
> > I think SysProf does that and it appears to be faster - its
> > unwind.c is only 400 lines long as it only implements the small
> > subset needed to walk the stack - AFAICS.
>
> I think it's an option.. but it'll simpler to try the libdw
> interface first and see if it's good/fast enough..
>
> also I recall discussing the speed with libdw developer Jan
> Kratochvil (CC-ed) and AFAICS they're open for
> suggestions/optimizations
So it's terribly difficult to measure the performance problems; just do
something like this on an idle system:
$ perf top --call-graph dwarf
and unless you have a very, very fast CPU this is going to use up 100%
of CPU time (20% on a very fast system). Both figures are anomalous and
show this kind of dwarf processing overhead:
#
# Overhead Command Shared Object Symbol
# ........ ....... ......................... .................................................
#
7.08% perf perf [.] access_mem
7.03% perf perf [.] dso__data_read_offset
5.83% perf perf [.] maps__find
5.64% perf libunwind-x86_64.so.8.0.1 [.] 0x000000000000ba25
4.75% perf perf [.] thread__find_addr_map
3.81% perf [kernel.kallsyms] [k] unmap_single_vma
2.57% perf perf [.] map__map_ip
2.48% perf libelf-0.156.so [.] 0x0000000000003a84
2.12% perf [kernel.kallsyms] [k] memset
2.12% perf perf [.] dso__data_read_addr
2.10% perf libc-2.17.so [.] __memcpy_sse2
1.72% perf libc-2.17.so [.] __memset_sse2
1.58% perf [kernel.kallsyms] [k] page_fault
1.56% perf libc-2.17.so [.] __memset_x86_64
1.44% perf perf [.] find_proc_info
1.25% perf libelf-0.156.so [.] elf_end
1.19% perf [kernel.kallsyms] [k] flush_tlb_mm_range
1.06% perf libc-2.17.so [.] vfprintf
1.04% perf libunwind-x86_64.so.8.0.1 [.] _Ux86_64_dwarf_search_unwind_table
1.00% perf [kernel.kallsyms] [k] __audit_syscall_exit
0.94% perf libc-2.17.so [.] _int_free
0.92% perf libc-2.17.so [.] _int_malloc
0.84% perf libc-2.17.so [.] __memcmp_sse2
0.81% perf [kernel.kallsyms] [k] unmapped_area_topdown
0.71% perf [kernel.kallsyms] [k] system_call
0.71% perf [kernel.kallsyms] [k] system_call_after_swapgs
0.65% perf [kernel.kallsyms] [k] sysret_check
0.63% perf perf [.] dso__find_symbol
0.58% perf [kernel.kallsyms] [k] clear_page_c
0.58% perf [kernel.kallsyms] [k] handle_mm_fault
0.56% perf libc-2.17.so [.] __sigprocmask
the libunwind and libelf entries didn't get resolved because I didn't
have a debug version of the libraries installed:
5.64% perf libunwind-x86_64.so.8.0.1 [.] 0x000000000000ba25
2.48% perf libelf-0.156.so [.] 0x0000000000003a84
Btw., tools like GDB are able to resolve symbols in such cases even
without debug packages installed:
(gdb) bt
#0 0x0000003e5908edf9 in __memcpy_sse2 () from /lib64/libc.so.6
#1 0x000000000046b61c in memcpy (__len=8, __src=<optimized out>, __dest=0x7fffc80b09b8) at /usr/include/bits/string3.h:51
#2 dso_cache__memcpy (size=8, data=0x7fffc80b09b8 "@\325\357\377\344\001", offset=1840096, cache=<optimized out>) at util/dso.c:259
#3 dso_cache_read (size=8, data=0x7fffc80b09b8 "@\325\357\377\344\001", offset=1840096, machine=0x9a2a48, dso=0x9b21a0) at util/dso.c:316
#4 dso__data_read_offset (dso=0x9b21a0, machine=0x9a2a48, offset=1840096, data=data@entry=0x7fffc80b09b8 "@\325\357\377\344\001", size=size@entry=8) at util/dso.c:330
#5 0x000000000046b7a5 in dso__data_read_addr (dso=<optimized out>, map=<optimized out>, machine=<optimized out>, addr=addr@entry=6034400,
data=data@entry=0x7fffc80b09b8 "@\325\357\377\344\001", size=size@entry=8) at util/dso.c:355
#6 0x00000000004bea3c in access_dso_mem (ui=0x7fffc80b18b0, ui=0x7fffc80b18b0, data=0x7fffc80b09b8, addr=6034400) at util/unwind.c:404
#7 access_mem (as=<optimized out>, addr=6034400, valp=0x7fffc80b09b8, __write=<optimized out>, arg=0x7fffc80b18b0) at util/unwind.c:455
#8 0x00007f885af02f2d in _Ux86_64_dwarf_read_encoded_pointer () from /lib64/libunwind-x86_64.so.8
#9 0x00007f885aeff992 in _Ux86_64_dwarf_extract_proc_info_from_fde () from /lib64/libunwind-x86_64.so.8
#10 0x00007f885af03e75 in _Ux86_64_dwarf_search_unwind_table () from /lib64/libunwind-x86_64.so.8
#11 0x00000000004bedbc in find_proc_info (as=0x1445560, ip=4975163, pi=0x7fffc80b15b0, need_unwind_info=1, arg=0x7fffc80b18b0) at util/unwind.c:335
#12 0x00007f885af00205 in fetch_proc_info () from /lib64/libunwind-x86_64.so.8
#13 0x00007f885af0246b in _Ux86_64_dwarf_find_save_locs () from /lib64/libunwind-x86_64.so.8
#14 0x00007f885af03769 in _Ux86_64_dwarf_step () from /lib64/libunwind-x86_64.so.8
#15 0x00007f885aefb3f1 in _Ux86_64_step () from /lib64/libunwind-x86_64.so.8
All those entries are within libunwind - and GDB was able to resolve
them.
How do they do it and shouldn't perf be able to do such magick?
Thanks,
Ingo
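As for the question above: GDB reads those names from the shared library's dynamic symbol table (.dynsym), which survives stripping because the runtime linker needs it; debuginfo packages only add .symtab and DWARF data. The lookup itself is a nearest-preceding-symbol search over a sorted address table, sketched here with invented addresses:

```python
import bisect

# Nearest-preceding-symbol lookup, the core of what GDB does with
# .dynsym when no debuginfo is installed. Addresses are made up for
# illustration; they are not the real libunwind symbol addresses.
SYMBOLS = sorted([
    (0x3f10, "_Ux86_64_step"),
    (0xba00, "_Ux86_64_dwarf_search_unwind_table"),
    (0xc200, "_Ux86_64_dwarf_find_save_locs"),
])
ADDRS = [addr for addr, _ in SYMBOLS]

def resolve(addr):
    """Map an address to 'symbol+offset' using the nearest symbol below it."""
    i = bisect.bisect_right(ADDRS, addr) - 1
    if i < 0:
        return None            # address precedes every known symbol
    start, name = SYMBOLS[i]
    return f"{name}+0x{addr - start:x}"

print(resolve(0xba25))
```

In principle perf could do the same fallback: when a DSO has no debuginfo, resolve against its dynamic symbols instead of printing raw offsets.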
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 17:49 ` Jiri Olsa
2013-11-18 19:17 ` Ingo Molnar
@ 2013-11-18 20:16 ` Jan Kratochvil
2013-11-19 9:26 ` Jean Pihet
1 sibling, 1 reply; 13+ messages in thread
From: Jan Kratochvil @ 2013-11-18 20:16 UTC (permalink / raw)
To: Jiri Olsa
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, David Ahern, linux-kernel,
Frederic Weisbecker, Namhyung Kim, Petr Machata
On Mon, 18 Nov 2013 18:49:45 +0100, Jiri Olsa wrote:
> I think it's an option.. but it'll simpler to try the libdw
> interface first and see if it's good/fast enough..
The elfutils libdw unwinder is being upstreamed these weeks; the x86*
unwinder itself is already upstream now.
> also I recall discussing the speed with libdw developer
My tests of perf using the elfutils unwinder were 10x faster than with
libunwind; this is thanks to some simple caching of ELF files. Sure, a
similar cache could also be implemented for libunwind. But a cache is
the wrong solution.
The problem is that currently perf loads the ELF files again and again
for every process, as the ELF file always gets automatically relocated
for the address where it was loaded. The right way is to load the ELF
file only once and always access the same copy, with only a
process-specific displacement.
I did not investigate how feasible that is with libunwind. For elfutils
there is the unfinished pmachata/sharing branch from 2008 that
implements it; I have not checked it further, pending the unwinder
being fully upstreamed.
Jan
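The single-parse scheme described above can be sketched as follows (an assumed design for illustration; not the actual elfutils or pmachata/sharing code):

```python
# Sketch of sharing one parsed ELF image across processes: the expensive
# parse happens once per file, and each process applies only its own
# load bias when translating runtime addresses to file offsets.
class ElfImage:
    """Stands in for a parsed, *unrelocated* ELF file."""
    def __init__(self, path):
        self.path = path
        self.parse_count = 0

class ElfCache:
    def __init__(self):
        self._images = {}

    def image(self, path):
        if path not in self._images:
            img = ElfImage(path)
            img.parse_count += 1       # expensive parse happens only here
            self._images[path] = img
        return self._images[path]

class ProcessMapping:
    """Per-process view: shared image plus a process-specific displacement."""
    def __init__(self, cache, path, load_bias):
        self.image = cache.image(path)
        self.load_bias = load_bias

    def file_offset(self, runtime_addr):
        return runtime_addr - self.load_bias

cache = ElfCache()
a = ProcessMapping(cache, "/usr/lib/libc.so", load_bias=0x7f0000000000)
b = ProcessMapping(cache, "/usr/lib/libc.so", load_bias=0x7f1000000000)
```

Both processes share one parsed image; only the subtraction differs per process, which is the property that makes the relocate-per-process approach unnecessary.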
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 12:59 ` Arnaldo Carvalho de Melo
2013-11-18 13:25 ` Jiri Olsa
@ 2013-11-19 9:24 ` Jean Pihet
1 sibling, 0 replies; 13+ messages in thread
From: Jean Pihet @ 2013-11-19 9:24 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Ingo Molnar, David Ahern, linux-kernel@vger.kernel.org,
Frederic Weisbecker, Jiri Olsa, Namhyung Kim
Hi,
On 18 November 2013 13:59, Arnaldo Carvalho de Melo
<acme@ghostprotocols.net> wrote:
> Em Fri, Nov 15, 2013 at 06:46:09AM +0100, Ingo Molnar escreveu:
>> btw., here's some 'perf top' call graph performance and profiling
>> quality feedback, with the latest perf code:
>>
>> 'perf top --call-graph fp' now works very well, using just 0.2%
>> of CPU time on a fast system:
>>
>> 4676 mingo 20 0 612m 56m 9948 S 1 0.2 0:00.68 perf
>>
>> 'perf top --call-graph dwarf' on the other hand is horrendously
>> slow, using 20% of CPU time on a 4 GHz CPU:
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 4646 mingo 20 0 658m 81m 12m R 19 0.3 0:18.17 perf
>>
>> On another system with a 2.4GHz CPU it's taking up 100% of CPU
>> time (!):
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 8018 mingo 20 0 290320 45220 8520 R 99.5 0.3 0:58.81 perf
>>
>> Profiling 'perf top' shows all sorts of very high dwarf
>> processing overhead:
>
> Yeah, top dwarf callchain has been so far a proof of concept, it
> exacerbates problems that can be seen on 'report', but since its live,
> we can see it more clearly.
Indeed. Because of the poor performance of the dwarf unwinding code,
the only practical use is to record the data (perf record) and then
later parse it (perf report). perf top does all that at once.
> The work on improving callchain processing, (rb_tree'ing, new comm
> infrastructure) alleviated the problem a bit.
>
> Tuning the stack size requested from the kernel and using --max-stack
> can help when it is really needed, but yes, work on it is *badly* needed.
The problem is that the whole user stack is dumped for every sample,
while frame-pointer unwinding records only the useful part of the
callchain.
Another important point is the robustness of libunwind wrt async
signals, cf. http://lists.nongnu.org/archive/html/libunwind-devel/2013-09/msg00005.html.
So yes, some work is *badly* needed on:
- the data size being dumped,
- the data parsing optimization,
- the choice of an implementation of dwarf unwinding (libunwind, libdw, etc.),
- the compatibility with 32-bit binaries on AARCH64, which I am now
busy with in libunwind.
Jean
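The first bullet above is the structural one: a frame-pointer walk reads only two words per frame, so only the call chain itself is ever touched. A toy model (invented stack layout and addresses, not a real ABI):

```python
# Minimal model of why fp unwinding is cheap: each frame stores
# (saved_fp, return_addr), so walking the chain reads two words per
# frame instead of snapshotting the whole stack. Layout is invented.
STACK = {                             # fake memory: address -> value
    0x1000: 0x1020, 0x1008: 0xAAAA,   # frame 0: saved fp, return addr
    0x1020: 0x1040, 0x1028: 0xBBBB,   # frame 1
    0x1040: 0x0,    0x1048: 0xCCCC,   # frame 2: saved fp 0 terminates
}

def walk(fp):
    chain = []
    while fp:
        chain.append(STACK[fp + 8])   # read the return-address slot
        fp = STACK[fp]                # follow the saved frame pointer
    return chain

print([hex(a) for a in walk(0x1000)])
```

Dwarf mode cannot assume this layout (frame pointers are often omitted by the compiler), which is why it falls back to dumping the stack and interpreting unwind tables instead.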
>
> - Arnaldo
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-18 20:16 ` Jan Kratochvil
@ 2013-11-19 9:26 ` Jean Pihet
2013-11-19 9:33 ` Jan Kratochvil
0 siblings, 1 reply; 13+ messages in thread
From: Jean Pihet @ 2013-11-19 9:26 UTC (permalink / raw)
To: Jan Kratochvil
Cc: Jiri Olsa, Ingo Molnar, Arnaldo Carvalho de Melo, David Ahern,
linux-kernel@vger.kernel.org, Frederic Weisbecker, Namhyung Kim,
Petr Machata
Hi,
On 18 November 2013 21:16, Jan Kratochvil <jan.kratochvil@redhat.com> wrote:
> On Mon, 18 Nov 2013 18:49:45 +0100, Jiri Olsa wrote:
>> I think it's an option.. but it'll simpler to try the libdw
>> interface first and see if it's good/fast enough..
>
> The elfutils libdw unwinder is being upstreamed these weeks, the x86* unwinder
> itself is already upstream now.
Do you know about the support of AARCH64, both in 64-bit and 32-bit
(compat) mode?
I would be glad to give it a try.
>> also I recall discussing the speed with libdw developer
>
> My tests with perf using elfutils unwinder were 10x faster than libunwind;
> this is by some simple caching of ELF files. Sure a similar cache could be
> implemented also for libunwind. But the cache is a wrong solution.
>
> The problem is that currently perf loads the ELF files again and again for
> every process as the ELF file always gets automatically relocated for the
> address where it was loaded. The right way is to load the ELF file only once
> and access always the same copy with process-specific displacement only.
>
> I did not investigate how much feasible it is with libunwind. For elfutils
> there is pmachata/sharing unfinished branch from 2008 to implement that.
> I have not checked it more before the unwinder gets fully upstreamed.
>
>
> Jan
Jean
* Re: [PATCH] perf top: Make -g refer to callchains
2013-11-19 9:26 ` Jean Pihet
@ 2013-11-19 9:33 ` Jan Kratochvil
0 siblings, 0 replies; 13+ messages in thread
From: Jan Kratochvil @ 2013-11-19 9:33 UTC (permalink / raw)
To: Jean Pihet
Cc: Jiri Olsa, Ingo Molnar, Arnaldo Carvalho de Melo, David Ahern,
linux-kernel@vger.kernel.org, Frederic Weisbecker, Namhyung Kim,
Petr Machata
On Tue, 19 Nov 2013 10:26:42 +0100, Jean Pihet wrote:
> Do you know about the support of AARCH64, both in 64-bit and 32-bit
> (compat) mode?
> I would be glad to give it a try.
Please move this topic to:
https://lists.fedorahosted.org/mailman/listinfo/elfutils-devel
The aarch64 elfutils port has now been posted by Petr Machata.
The elfutils unwinder part for aarch64 (or arm) has not been written yet.
Jan
* [tip:perf/core] perf top: Make -g refer to callchains
2013-11-15 3:51 [PATCH] perf top: Make -g refer to callchains David Ahern
2013-11-15 5:28 ` Ingo Molnar
2013-11-15 5:46 ` Ingo Molnar
@ 2013-11-30 12:49 ` tip-bot for David Ahern
2 siblings, 0 replies; 13+ messages in thread
From: tip-bot for David Ahern @ 2013-11-30 12:49 UTC (permalink / raw)
To: linux-tip-commits
Cc: acme, linux-kernel, hpa, mingo, namhyung, jolsa, fweisbec,
dsahern, tglx
Commit-ID: bf80669e4f689f181f23a54dfe2a0f264147ad67
Gitweb: http://git.kernel.org/tip/bf80669e4f689f181f23a54dfe2a0f264147ad67
Author: David Ahern <dsahern@gmail.com>
AuthorDate: Thu, 14 Nov 2013 20:51:30 -0700
Committer: Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 27 Nov 2013 14:58:35 -0300
perf top: Make -g refer to callchains
In most commands -g is used for callchains. Make perf-top follow suit.
Move group to just --group with no shortcut, making it similar to
perf-record.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1384487490-6865-1-git-send-email-dsahern@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/Documentation/perf-top.txt | 5 ++---
tools/perf/builtin-top.c | 4 ++--
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 7de01dd..cdd8d49 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -50,7 +50,6 @@ Default is to monitor all CPUS.
--count-filter=<count>::
Only display functions with more events than this.
--g::
--group::
Put the counters into a counter group.
@@ -143,12 +142,12 @@ Default is to monitor all CPUS.
--asm-raw::
Show raw instruction encoding of assembly instructions.
--G::
+-g::
Enables call-graph (stack chain/backtrace) recording.
--call-graph::
Setup and enable call-graph (stack chain/backtrace) recording,
- implies -G.
+ implies -g.
--max-stack::
Set the stack depth limit when parsing the callchain, anything
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 71e6402..531522d 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1084,7 +1084,7 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
"dump the symbol table used for profiling"),
OPT_INTEGER('f', "count-filter", &top.count_filter,
"only display functions with more events than this"),
- OPT_BOOLEAN('g', "group", &opts->group,
+ OPT_BOOLEAN(0, "group", &opts->group,
"put the counters into a counter group"),
OPT_BOOLEAN('i', "no-inherit", &opts->no_inherit,
"child tasks do not inherit counters"),
@@ -1105,7 +1105,7 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
" abort, in_tx, transaction"),
OPT_BOOLEAN('n', "show-nr-samples", &symbol_conf.show_nr_samples,
"Show a column with the number of samples"),
- OPT_CALLBACK_NOOPT('G', NULL, &top.record_opts,
+ OPT_CALLBACK_NOOPT('g', NULL, &top.record_opts,
NULL, "enables call-graph recording",
&callchain_opt),
OPT_CALLBACK(0, "call-graph", &top.record_opts,