* [RFC/PATCHSET 00/37] perf tools: Speed-up perf report by using multi thread (v1)
@ 2014-12-24  7:14 Namhyung Kim
  2014-12-24  7:14 ` [PATCH 01/37] perf tools: Set attr.task bit for a tracking event Namhyung Kim
                   ` (38 more replies)
  0 siblings, 39 replies; 91+ messages in thread
From: Namhyung Kim @ 2014-12-24  7:14 UTC
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, LKML, David Ahern,
	Stephane Eranian, Adrian Hunter, Andi Kleen, Frederic Weisbecker

Hello,

This patchset converts perf report to use multiple threads in order to
speed up processing of large data files.  I see a speedup of at least
40% with this change.  The code is still experimental, a little bit
outdated, and has many rough edges.  But I'd like to share it and get
some feedback.

perf report processes (sample) events as follows:

  1. preprocess sample to get matching thread/dso/symbol info
  2. insert it to hists rbtree (with callchain tree) based on the info
  3. optionally collapse hist entries that match given sort key(s)
  4. resort hist entries (by overhead) for output
  5. display the hist entries

Stage 1 is preprocessing and mostly acts like a read-only operation
during sample processing.  The exceptions are meta events like fork,
comm and mmap, which can change the machine/thread state, and symbols,
which can be loaded during the processing (stage 2).

Stage 2 consumes most of the time, especially when callchains are used
and the --children option is enabled.  This work can be easily
partitioned as each sample is independent of the others.  But the
resulting hists must be combined/collapsed into a single global hists
before going on to further steps.
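
The partition-and-merge idea for stage 2 can be sketched roughly as
below.  This is only an illustration and not the code in this series:
'struct toy_hists', toy_hists__add() and toy_hists__merge() are
hypothetical names, and a small array indexed by a symbol id stands in
for the hists rbtree.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_SYMS 4	/* hypothetical: a tiny symbol id space */

/* Stand-in for a hists tree: one period counter per symbol id. */
struct toy_hists {
	uint64_t period[TOY_NR_SYMS];
	uint64_t total;
};

/* Stage 2 for one sample; touches only the worker's private hists. */
void toy_hists__add(struct toy_hists *h, unsigned int sym, uint64_t period)
{
	if (sym >= TOY_NR_SYMS)
		return;		/* unknown symbol id, ignored in this sketch */
	h->period[sym] += period;
	h->total += period;
}

/* Collapse one worker's private hists into the single global hists. */
void toy_hists__merge(struct toy_hists *global, const struct toy_hists *part)
{
	size_t i;

	for (i = 0; i < TOY_NR_SYMS; i++)
		global->period[i] += part->period[i];
	global->total += part->total;
}
```

Each worker would fill its own 'struct toy_hists' without locking, and
the merge would run once per worker after all samples are consumed.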

Stage 3 is optional and only needed by certain sort keys - but
with stage 2 parallelized, it needs to be done anyway.

Stages 4 and 5 work on the whole hists, so they must be done serially.

So my approach is like this:

Partially do stage 1 first - but only for meta events that change
machine state.  To do this I add a dummy tracking event to perf record
and make it collect only such meta events.  They are saved in a
separate file (perf.header) and processed before sample events at perf
report time.

This also requires handling multiple files and finding the
corresponding machine state when processing samples.  In a large
profiling session, many tasks are created and exit, so a pid might be
recycled (even more than once!).  To deal with it, I keep thread,
map_groups and comm sorted by time.  The only remaining thing is
symbol loading, as it's done lazily when a sample requires it.
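
Finding the right thread for a recycled pid can be pictured as a search
over time-sorted entries.  The sketch below rests on assumptions:
'struct toy_thread' and toy_find_thread_time() are hypothetical and are
not the machine__find*_thread_time() helpers from this series.  It
returns the entry with the latest start time that is not after the
sample timestamp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one incarnation of a pid, valid from 'start'. */
struct toy_thread {
	int pid;
	uint64_t start;		/* timestamp of the fork event */
	const char *comm;
};

/*
 * 'threads' holds all incarnations of a single pid, sorted by
 * ascending start time.  Return the one that was alive at 'time',
 * i.e. the last entry with start <= time, or NULL if the sample
 * predates every known incarnation.
 */
const struct toy_thread *toy_find_thread_time(const struct toy_thread *threads,
					      size_t nr, uint64_t time)
{
	size_t lo = 0, hi = nr;

	/* Binary search for the first entry with start > time. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (threads[mid].start <= time)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo ? &threads[lo - 1] : NULL;
}
```

The same kind of lookup would apply to comm and map_groups entries once
they carry a start timestamp.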

With that done, stage 2 can be done by multiple threads.  I
also save each sample data (per-cpu or per-thread) in separate files
during record.  At perf report time, each file is processed by a
separate thread.  And symbol loading is protected by a mutex lock.
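
A minimal sketch of mutex-protected lazy loading, assuming a plain
pthread mutex per dso.  'struct toy_dso' and toy_dso__load() are
hypothetical stand-ins and not the perf dso code; the counter exists
only to show that the expensive load runs once.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dso whose symbols are loaded on demand. */
struct toy_dso {
	pthread_mutex_t lock;
	bool loaded;
	int nr_loads;	/* counts how often the expensive load ran */
};

#define TOY_DSO_INIT { PTHREAD_MUTEX_INITIALIZER, false, 0 }

/*
 * Load symbols at most once, no matter how many report threads ask.
 * The flag is checked and set under the lock, so a second caller
 * either waits for the first load to finish or sees 'loaded' set.
 */
int toy_dso__load(struct toy_dso *dso)
{
	pthread_mutex_lock(&dso->lock);
	if (!dso->loaded) {
		dso->nr_loads++;	/* the real symbol parsing would go here */
		dso->loaded = true;
	}
	pthread_mutex_unlock(&dso->lock);
	return 0;
}
```

Taking the lock on every call is the simplest correct form; it is also
where contention would show up when many threads hit the same dso.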

For DWARF post-unwinding, dso cache data also needs to be protected by
a lock and this causes huge contention.  I just added a front cache
that can be accessed without the lock, but this should be improved IMHO.
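
One way to picture a front cache is a caller-private copy of the most
recently used chunk, checked before the shared, locked cache.  The
layout used in this series is not described here, so everything in the
sketch is an assumption: the names, the 64-byte chunk size, and the
simplification that the backing data never changes.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define TOY_CHUNK 64	/* hypothetical cache chunk size in bytes */

/* Shared, lock-protected backing store standing in for the dso cache. */
struct toy_shared_cache {
	pthread_mutex_t lock;
	const uint8_t *data;	/* assumed immutable in this sketch */
	uint64_t size;
	int nr_locked;		/* how many lookups had to take the lock */
};

/* Caller-private front cache: the most recently used chunk only. */
struct toy_front_cache {
	uint64_t offset;	/* chunk-aligned start of buf[] */
	uint64_t len;		/* valid bytes in buf[], 0 when empty */
	uint8_t buf[TOY_CHUNK];
};

/* Read one byte at 'offset'; returns -1 when out of range. */
int toy_cache_read_byte(struct toy_shared_cache *sc,
			struct toy_front_cache *fc, uint64_t offset)
{
	uint64_t start = offset - offset % TOY_CHUNK;

	if (offset >= sc->size)
		return -1;

	/* Fast path: private data, so no lock is needed. */
	if (fc->len && fc->offset == start && offset - start < fc->len)
		return fc->buf[offset - start];

	/* Slow path: refill the private chunk under the shared lock. */
	pthread_mutex_lock(&sc->lock);
	fc->offset = start;
	fc->len = sc->size - start < TOY_CHUNK ? sc->size - start : TOY_CHUNK;
	memcpy(fc->buf, sc->data + start, fc->len);
	sc->nr_locked++;
	pthread_mutex_unlock(&sc->lock);

	return fc->buf[offset - start];
}
```

Sequential reads inside one chunk then take the lock once instead of
once per access, which is the effect a front cache is after.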

Patches 1-10 add support for multi-file data recording.  With the
 -M/--multi option, perf record will create a directory (named
'perf.data.dir' by default - but it may be renamed to 'perf.data' for
transparent conversion later) and save meta events to the perf.header
file and sample events to perf.data.<n> files.  It'd be better to
consider the file format change Jiri suggested [1].
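
The directory layout described above can be sketched as a small path
helper.  Only the file names (perf.header, perf.data.<n>) come from the
description; toy_data_path() and its convention of a negative index for
the header file are hypothetical.

```c
#include <stdio.h>

/*
 * Build the path of a file inside a multi-file data directory.
 * idx < 0 selects the meta event file, idx >= 0 the n-th sample file.
 * Returns 0 on success, -1 if the path did not fit in 'buf'.
 */
int toy_data_path(char *buf, size_t size, const char *dir, int idx)
{
	int n;

	if (idx < 0)
		n = snprintf(buf, size, "%s/perf.header", dir);
	else
		n = snprintf(buf, size, "%s/perf.data.%d", dir, idx);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, toy_data_path(buf, sizeof(buf), "perf.data.dir", 3) would
fill buf with "perf.data.dir/perf.data.3".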

Patches 11-20 manage machine and thread state using timestamps
so that it can be searched when processing samples.  Patches 21-35
implement parallel report.  And finally I implemented a 'perf
data split' command to convert a single data file into the multi-file
format.

This patchset doesn't change perf record to use multiple threads.  But
I think it can be easily done later if needed.

Note that the output differs slightly from the original version when
compared using a split data file.  But the differences are mostly
unresolved symbols in callchains.

Here is the result:

This is just elapsed (real) time measured by shell 'time' function.

The data file was recorded during a kernel build with fp callchains and
its size is 2.1GB.  The machine has 6 cores with hyper-threading enabled
and I got a similar result on my laptop too.

 time perf report  --children  --no-children  + --call-graph none
 		   ----------  -------------  -------------------
 current            4m43.260s      1m32.779s            0m35.866s            
 patched            4m43.710s      1m29.695s            0m33.995s
 --multi-thread     2m46.265s      0m45.486s             0m7.570s


This result is with a 7.7GB data file using libunwind for callchains.

 time perf report  --children  --no-children  + --call-graph none
 		   ----------  -------------  -------------------
 current            3m51.762s      3m10.451s             0m4.695s            
 patched            2m26.030s      1m49.846s             0m4.105s
 --multi-thread     0m49.217s      0m35.106s             0m1.457s

Note that the single thread performance improvement in the patched
version is due to changes in patches 33-35.
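
One of those changes, converting lseek + read to pread, can be shown
with plain POSIX calls.  The wrappers below are hypothetical and only
contrast the two forms; they are not taken from the patch.  lseek +
read needs two system calls and moves the file offset shared by every
user of the fd, while pread takes the offset as an argument and leaves
the file offset alone, so threads sharing an fd do not step on each
other.

```c
#define _XOPEN_SOURCE 700	/* for pread() under strict standards modes */
#include <sys/types.h>
#include <unistd.h>

/* Two system calls; moves the shared file offset of 'fd'. */
ssize_t toy_read_at_lseek(int fd, void *buf, size_t len, off_t offset)
{
	if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
		return -1;
	return read(fd, buf, len);
}

/* One system call; leaves the file offset of 'fd' untouched. */
ssize_t toy_read_at_pread(int fd, void *buf, size_t len, off_t offset)
{
	return pread(fd, buf, len, offset);
}
```

Both return the number of bytes read or -1 on error, so callers can be
switched from one to the other without other changes.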


This result is with the same file but using libdw for callchain unwind.

 time perf report  --children  --no-children  + --call-graph none
 		   ----------  -------------  -------------------
 current           10m22.472s     11m42.290s             0m4.758s            
 patched           10m10.625s     11m45.480s             0m4.162s
 --multi-thread     3m47.332s      3m35.235s             0m1.755s

On my archlinux system, callchain unwinding using libdw is much slower
than with libunwind.  I'm using elfutils version 0.160.  Also I don't
know why --children takes less time than --no-children.  Anyway we can
see that the --multi-thread performance is much better in each case.
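
How much better can be read off the tables above with a little
arithmetic.  The helper names below are hypothetical; the sketch just
parses the shell 'time' format and computes the fraction of elapsed
time saved.  For the 2.1GB file with --children, going from 4m43.260s
to 2m46.265s is about a 41% reduction (roughly 1.7x).

```c
#include <stdio.h>

/* Parse a shell 'time' value such as "4m43.260s" into seconds. */
double toy_parse_time(const char *s)
{
	int min;
	double sec;

	if (sscanf(s, "%dm%lfs", &min, &sec) != 2)
		return -1.0;	/* not in <min>m<sec>s form */
	return min * 60.0 + sec;
}

/* Fraction of elapsed time saved, e.g. 0.41 for a 41% reduction. */
double toy_time_saved(const char *before, const char *after)
{
	double b = toy_parse_time(before);
	double a = toy_parse_time(after);

	if (b <= 0.0 || a < 0.0)
		return -1.0;	/* unparsable or zero baseline */
	return (b - a) / b;
}
```

The same call with the libunwind numbers, "3m51.762s" and "0m49.217s",
gives a reduction of a bit under 79%.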


You can get it from 'perf/threaded-v1' branch on my tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Please take a look and play with it.  Any comments are welcome! :)

Thanks,
Namhyung


[1] https://lkml.org/lkml/2013/9/1/20


Jiri Olsa (1):
  perf tools: Add new perf data command

Namhyung Kim (36):
  perf tools: Set attr.task bit for a tracking event
  perf record: Use a software dummy event to track task/mmap events
  perf tools: Use perf_data_file__fd() consistently
  perf tools: Add multi file interface to perf_data_file
  perf tools: Create separate mmap for dummy tracking event
  perf tools: Introduce perf_evlist__mmap_multi()
  perf tools: Do not use __perf_session__process_events() directly
  perf tools: Handle multi-file session properly
  perf record: Add -M/--multi option for multi file recording
  perf report: Skip dummy tracking event
  perf tools: Introduce thread__comm_time() helpers
  perf tools: Add a test case for thread comm handling
  perf tools: Use thread__comm_time() when adding hist entries
  perf tools: Convert dead thread list into rbtree
  perf tools: Introduce machine__find*_thread_time()
  perf tools: Add a test case for timed thread handling
  perf tools: Maintain map groups list in a leader thread
  perf tools: Remove thread when map groups initialization failed
  perf tools: Introduce thread__find_addr_location_time() and friends
  perf tools: Add a test case for timed map groups handling
  perf tools: Protect dso symbol loading using a mutex
  perf tools: Protect dso cache tree using dso->lock
  perf tools: Protect dso cache fd with a mutex
  perf session: Pass struct events stats to event processing functions
  perf hists: Pass hists struct to hist_entry_iter functions
  perf tools: Move BUILD_ID_SIZE definition to perf.h
  perf report: Parallelize perf report using multi-thread
  perf tools: Add missing_threads rb tree
  perf top: Always creates thread in the current task tree.
  perf tools: Fix progress ui to support multi thread
  perf record: Show total size of multi file data
  perf report: Add --multi-thread option and config item
  perf tools: Add front cache for dso data access
  perf tools: Convert lseek + read to pread
  perf callchain: Save eh/debug frame offset for dwarf unwind
  perf data: Implement 'split' subcommand

 tools/perf/Documentation/perf-data.txt   |  43 ++++
 tools/perf/Documentation/perf-record.txt |   5 +
 tools/perf/Documentation/perf-report.txt |   3 +
 tools/perf/Makefile.perf                 |   4 +
 tools/perf/builtin-annotate.c            |   5 +-
 tools/perf/builtin-data.c                | 298 ++++++++++++++++++++++++++
 tools/perf/builtin-diff.c                |   8 +-
 tools/perf/builtin-inject.c              |   9 +-
 tools/perf/builtin-record.c              |  65 ++++--
 tools/perf/builtin-report.c              | 107 ++++++++--
 tools/perf/builtin-script.c              |   5 +-
 tools/perf/builtin-top.c                 |   9 +-
 tools/perf/builtin.h                     |   1 +
 tools/perf/command-list.txt              |   1 +
 tools/perf/perf.c                        |   1 +
 tools/perf/perf.h                        |   2 +
 tools/perf/tests/builtin-test.c          |  12 ++
 tools/perf/tests/dso-data.c              |   5 +
 tools/perf/tests/dwarf-unwind.c          |  10 +-
 tools/perf/tests/hists_common.c          |   3 +-
 tools/perf/tests/hists_cumulate.c        |   4 +-
 tools/perf/tests/hists_filter.c          |   3 +-
 tools/perf/tests/hists_link.c            |   6 +-
 tools/perf/tests/hists_output.c          |   4 +-
 tools/perf/tests/tests.h                 |   3 +
 tools/perf/tests/thread-comm.c           |  47 +++++
 tools/perf/tests/thread-lookup-time.c    | 180 ++++++++++++++++
 tools/perf/tests/thread-mg-share.c       |   7 +-
 tools/perf/tests/thread-mg-time.c        |  88 ++++++++
 tools/perf/ui/browsers/hists.c           |  10 +-
 tools/perf/ui/gtk/hists.c                |   3 +
 tools/perf/util/build-id.c               |   9 +-
 tools/perf/util/build-id.h               |   2 -
 tools/perf/util/data.c                   | 188 ++++++++++++++++-
 tools/perf/util/data.h                   |  17 ++
 tools/perf/util/dso.c                    | 192 ++++++++++++-----
 tools/perf/util/dso.h                    |   5 +
 tools/perf/util/event.c                  |  85 ++++++--
 tools/perf/util/event.h                  |   6 +-
 tools/perf/util/evlist.c                 | 151 ++++++++++++--
 tools/perf/util/evlist.h                 |  22 +-
 tools/perf/util/evsel.c                  |   1 +
 tools/perf/util/evsel.h                  |  15 ++
 tools/perf/util/hist.c                   | 121 +++++++----
 tools/perf/util/hist.h                   |  12 +-
 tools/perf/util/machine.c                | 251 +++++++++++++++++++---
 tools/perf/util/machine.h                |  12 +-
 tools/perf/util/map.c                    |   1 +
 tools/perf/util/map.h                    |   2 +
 tools/perf/util/ordered-events.c         |   4 +-
 tools/perf/util/session.c                | 347 ++++++++++++++++++++++++++-----
 tools/perf/util/session.h                |   8 +-
 tools/perf/util/symbol.c                 |  34 ++-
 tools/perf/util/thread.c                 | 140 ++++++++++++-
 tools/perf/util/thread.h                 |  28 ++-
 tools/perf/util/tool.h                   |  17 ++
 tools/perf/util/unwind-libdw.c           |  11 +-
 tools/perf/util/unwind-libunwind.c       |  49 +++--
 tools/perf/util/util.c                   |  43 ++++
 tools/perf/util/util.h                   |   1 +
 60 files changed, 2381 insertions(+), 344 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-data.txt
 create mode 100644 tools/perf/builtin-data.c
 create mode 100644 tools/perf/tests/thread-comm.c
 create mode 100644 tools/perf/tests/thread-lookup-time.c
 create mode 100644 tools/perf/tests/thread-mg-time.c

-- 
2.1.3



end of thread, other threads:[~2015-01-08 14:52 UTC | newest]

Thread overview: 91+ messages
2014-12-24  7:14 [RFC/PATCHSET 00/37] perf tools: Speed-up perf report by using multi thread (v1) Namhyung Kim
2014-12-24  7:14 ` [PATCH 01/37] perf tools: Set attr.task bit for a tracking event Namhyung Kim
2014-12-31 11:25   ` Jiri Olsa
2014-12-24  7:14 ` [PATCH 02/37] perf record: Use a software dummy event to track task/mmap events Namhyung Kim
2014-12-26 16:27   ` David Ahern
2014-12-27  5:28     ` Namhyung Kim
2014-12-29 12:58       ` Adrian Hunter
2014-12-30  5:51         ` Namhyung Kim
2014-12-30  9:04           ` Adrian Hunter
2014-12-24  7:14 ` [PATCH 03/37] perf tools: Use perf_data_file__fd() consistently Namhyung Kim
2014-12-26 16:30   ` David Ahern
2014-12-27  5:30     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 04/37] perf tools: Add multi file interface to perf_data_file Namhyung Kim
2014-12-25 22:08   ` Jiri Olsa
2014-12-26  1:19     ` Namhyung Kim
2014-12-31 11:26   ` Jiri Olsa
2014-12-31 14:55     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 05/37] perf tools: Create separate mmap for dummy tracking event Namhyung Kim
2014-12-25 22:08   ` Jiri Olsa
2014-12-26  1:45     ` Namhyung Kim
2014-12-25 22:09   ` Jiri Olsa
2014-12-26  1:55     ` Namhyung Kim
2014-12-26 16:51   ` David Ahern
2014-12-27  5:32     ` Namhyung Kim
2014-12-29 13:44   ` Adrian Hunter
2014-12-30  5:57     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 06/37] perf tools: Introduce perf_evlist__mmap_multi() Namhyung Kim
2014-12-24  7:15 ` [PATCH 07/37] perf tools: Do not use __perf_session__process_events() directly Namhyung Kim
2014-12-31 11:33   ` Jiri Olsa
2014-12-24  7:15 ` [PATCH 08/37] perf tools: Handle multi-file session properly Namhyung Kim
2014-12-31 12:01   ` Jiri Olsa
2014-12-31 14:53     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 09/37] perf record: Add -M/--multi option for multi file recording Namhyung Kim
2014-12-24  7:15 ` [PATCH 10/37] perf report: Skip dummy tracking event Namhyung Kim
2014-12-24  7:15 ` [PATCH 11/37] perf tools: Introduce thread__comm_time() helpers Namhyung Kim
2014-12-26 17:00   ` David Ahern
2014-12-27  5:36     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 12/37] perf tools: Add a test case for thread comm handling Namhyung Kim
2014-12-24  7:15 ` [PATCH 13/37] perf tools: Use thread__comm_time() when adding hist entries Namhyung Kim
2014-12-25 22:53   ` Jiri Olsa
2014-12-26  2:10     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 14/37] perf tools: Convert dead thread list into rbtree Namhyung Kim
2014-12-25 23:05   ` Jiri Olsa
2014-12-26  2:26     ` Namhyung Kim
2014-12-26 17:14       ` David Ahern
2014-12-27  5:42         ` Namhyung Kim
2014-12-27 15:31   ` David Ahern
2014-12-28 13:24     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 15/37] perf tools: Introduce machine__find*_thread_time() Namhyung Kim
2014-12-27 16:33   ` David Ahern
2014-12-28 14:50     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 16/37] perf tools: Add a test case for timed thread handling Namhyung Kim
2014-12-31 14:17   ` Jiri Olsa
2014-12-31 15:32     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 17/37] perf tools: Maintain map groups list in a leader thread Namhyung Kim
2014-12-24  7:15 ` [PATCH 18/37] perf tools: Remove thread when map groups initialization failed Namhyung Kim
2014-12-28  0:45   ` David Ahern
2014-12-29  7:08     ` Namhyung Kim
2014-12-24  7:15 ` [PATCH 19/37] perf tools: Introduce thread__find_addr_location_time() and friends Namhyung Kim
2014-12-24  7:15 ` [PATCH 20/37] perf tools: Add a test case for timed map groups handling Namhyung Kim
2014-12-24  7:15 ` [PATCH 21/37] perf tools: Protect dso symbol loading using a mutex Namhyung Kim
2014-12-24  7:15 ` [PATCH 22/37] perf tools: Protect dso cache tree using dso->lock Namhyung Kim
2014-12-24  7:15 ` [PATCH 23/37] perf tools: Protect dso cache fd with a mutex Namhyung Kim
2014-12-24  7:15 ` [PATCH 24/37] perf session: Pass struct events stats to event processing functions Namhyung Kim
2014-12-24  7:15 ` [PATCH 25/37] perf hists: Pass hists struct to hist_entry_iter functions Namhyung Kim
2014-12-24  7:15 ` [PATCH 26/37] perf tools: Move BUILD_ID_SIZE definition to perf.h Namhyung Kim
2014-12-24  7:15 ` [PATCH 27/37] perf report: Parallelize perf report using multi-thread Namhyung Kim
2014-12-24  7:15 ` [PATCH 28/37] perf tools: Add missing_threads rb tree Namhyung Kim
2014-12-24  7:15 ` [PATCH 29/37] perf top: Always creates thread in the current task tree Namhyung Kim
2014-12-24  7:15 ` [PATCH 30/37] perf tools: Fix progress ui to support multi thread Namhyung Kim
2014-12-24  7:15 ` [PATCH 31/37] perf record: Show total size of multi file data Namhyung Kim
2014-12-24  7:15 ` [PATCH 32/37] perf report: Add --multi-thread option and config item Namhyung Kim
2014-12-24  7:15 ` [PATCH 33/37] perf tools: Add front cache for dso data access Namhyung Kim
2014-12-24  7:15 ` [PATCH 34/37] perf tools: Convert lseek + read to pread Namhyung Kim
2014-12-24  7:15 ` [PATCH 35/37] perf callchain: Save eh/debug frame offset for dwarf unwind Namhyung Kim
2014-12-24  7:15 ` [PATCH 36/37] perf tools: Add new perf data command Namhyung Kim
2014-12-24  7:15 ` [PATCH 37/37] perf data: Implement 'split' subcommand Namhyung Kim
2014-12-24 13:51   ` Arnaldo Carvalho de Melo
2014-12-24 14:14     ` Namhyung Kim
2014-12-24 14:45       ` Arnaldo Carvalho de Melo
2014-12-26 13:59   ` Jiri Olsa
2014-12-27  5:21     ` Namhyung Kim
2014-12-26 14:02 ` [RFC/PATCHSET 00/37] perf tools: Speed-up perf report by using multi thread (v1) Jiri Olsa
2014-12-27  5:23   ` Namhyung Kim
2015-01-05 18:48 ` Andi Kleen
2015-01-06 15:50   ` Stephane Eranian
2015-01-07  7:13     ` Namhyung Kim
2015-01-07 15:14       ` Stephane Eranian
2015-01-08  5:19         ` Namhyung Kim
2015-01-07  6:58   ` Namhyung Kim
2015-01-08 14:52     ` Andi Kleen
