From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Ian Rogers <irogers@google.com>
Cc: Milian Wolff <milian.wolff@kdab.com>, linux-perf-users@vger.kernel.org
Subject: Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
Date: Thu, 9 Sep 2021 17:22:25 -0300 [thread overview]
Message-ID: <YTptAbQgae5HEgAU@kernel.org> (raw)
In-Reply-To: <CAP-5=fVM8oxmMzx4VhoCPp=t=vR3W-MT72MRL7pVsjJR9YtxXA@mail.gmail.com>
On Thu, Sep 09, 2021 at 12:28:10PM -0700, Ian Rogers wrote:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> >
> > Hey there!
> >
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
> > on the system. Sadly, I cannot make `perf record` work without dropping (a lot
> > of) chunks / events.
> >
> > Usually I'm using this command line:
> >
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```
> >
> > But this is not enough in this case. What other alternatives are there? Is
> > there a `perf record` mode which never drops events, at the cost of maybe
> > stalling the profiled application?
There was discussion about such a mode, for use in a 'perf trace' that
works like 'strace', i.e. one that doesn't lose records but stalls the
profiled/traced application, but I don't think that progressed.
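
In the meantime the usual knobs are to make the ring buffer bigger, shrink
the per-sample stack dump and/or lower the sampling frequency, so the buffer
fills more slowly than the tool drains it. A sketch of options that already
exist in 'perf record' (the sizes and the frequency below are just examples,
not tuned for your machine):

```
# Bigger ring buffer, smaller DWARF stack dumps, lower sampling frequency:
perf record -z -m 64M --call-graph dwarf,4096 -F 99 ./a.out

# Asynchronous trace writing can also help the tool keep up with the
# kernel side (needs perf built with libaio support):
perf record -z -m 64M --aio --call-graph dwarf,4096 ./a.out
```

None of these guarantee zero lost chunks, they just push out the point where
drops start.
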
> > Generally though, looking at the (partial) data I get from perf, it seems like
> > it does not cope well in the face of excessive calls to `yield`. Take this
> > example code here:
> >
> > ```
> > #include <thread>
> > #include <vector>
> > #include <algorithm>
> >
> > void yieldALot()
> > {
> >     for (int i = 0; i < 1000000; ++i) {
> >         std::this_thread::yield();
> >     }
> > }
> >
> > int main()
> > {
> >     std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
> >     std::generate(threads.begin(), threads.end(), []() {
> >         return std::thread(&yieldALot);
> >     });
> >
> >     for (auto &thread : threads)
> >         thread.join();
> >
> >     return 0;
> > }
> > ```
> >
> > When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> > system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe the
> > following behavior:
> >
> > ```
> > perf stat -r 5 ./a.out
> >
> > Performance counter stats for './a.out' (5 runs):
> >
> >          18,570.71 msec task-clock              #   19.598 CPUs utilized            ( +-  1.27% )
> >          2,717,136      context-switches        #  145.447 K/sec                    ( +-  3.26% )
> >                 55      cpu-migrations          #    2.944 /sec                     ( +-  3.79% )
> >                222      page-faults             #   11.884 /sec                     ( +-  0.49% )
> >     61,223,010,902      cycles                  #    3.277 GHz                      ( +-  0.54% )  (83.33%)
> >      1,259,821,397      stalled-cycles-frontend #    2.07% frontend cycles idle     ( +-  1.54% )  (83.16%)
> >      9,001,394,333      stalled-cycles-backend  #   14.81% backend cycles idle      ( +-  0.79% )  (83.25%)
> >     65,746,848,695      instructions            #    1.08  insn per cycle
> >                                                 #    0.14  stalled cycles per insn  ( +-  0.26% )  (83.40%)
> >     14,474,862,454      branches                #  774.830 M/sec                    ( +-  0.36% )  (83.49%)
> >        114,518,545      branch-misses           #    0.79% of all branches          ( +-  0.71% )  (83.67%)
> >
> > 0.94761 +- 0.00935 seconds time elapsed ( +- 0.99% )
> > ```
> >
> > Now compare this to what happens when I enable perf recording:
> >
> > ```
> > Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5 runs):
> >
> >          19,928.28 msec task-clock              #    9.405 CPUs utilized            ( +-  2.74% )
> >          3,026,081      context-switches        #  142.538 K/sec                    ( +-  7.48% )
> >                120      cpu-migrations          #    5.652 /sec                     ( +-  3.18% )
> >              5,509      page-faults             #  259.491 /sec                     ( +-  0.01% )
> >     65,393,592,529      cycles                  #    3.080 GHz                      ( +-  1.81% )  (71.81%)
> >      1,445,203,862      stalled-cycles-frontend #    2.13% frontend cycles idle     ( +-  2.01% )  (71.60%)
> >      9,701,321,165      stalled-cycles-backend  #   14.29% backend cycles idle      ( +-  1.28% )  (71.62%)
> >     69,967,062,730      instructions            #    1.03  insn per cycle
> >                                                 #    0.14  stalled cycles per insn  ( +-  0.74% )  (71.83%)
> >     15,376,981,968      branches                #  724.304 M/sec                    ( +-  0.83% )  (72.00%)
> >        127,934,194      branch-misses           #    0.81% of all branches          ( +-  1.29% )  (72.07%)
> >
> > 2.118881 +- 0.000496 seconds time elapsed ( +- 0.02% )
> > ```
> >
> > Note how the CPU utilization drops dramatically by roughly a factor of 2x,
> > which correlates with the increase in overall runtime.
> >
> > From the data obtained here I see:
> > ```
> >     98.49%  0.00%  a.out  libc-2.33.so  [.] __GI___sched_yield (inlined)
> >         |
> >         ---__GI___sched_yield (inlined)
> >            |
> >            |--89.93%--entry_SYSCALL_64_after_hwframe
> >            |    |
> >            |     --89.32%--do_syscall_64
> >            |         |
> >            |         |--71.70%--__ia32_sys_sched_yield
> >            |         |    |
> >            |         |    |--64.36%--schedule
> >            |         |    |    |
> >            |         |    |     --63.42%--__schedule
> >            |         |    |         |
> >            |         |    |         |--32.56%--__perf_event_task_sched_out
> >            |         |    |         |    |
> >            |         |    |         |     --32.07%--amd_pmu_disable_all
> >            |         |    |         |         |
> >            |         |    |         |         |--23.54%--x86_pmu_disable_all
> >            |         |    |         |         |    |
> >            |         |    |         |         |    |--12.81%--native_write_msr
> >            |         |    |         |         |    |--8.24%--native_read_msr
> >            |         |    |         |         |     --1.25%--amd_pmu_addr_offset
> >            |         |    |         |         |
> >            |         |    |         |          --8.19%--amd_pmu_wait_on_overflow
> >            |         |    |         |              |
> >            |         |    |         |              |--5.00%--native_read_msr
> >            |         |    |         |              |--2.20%--delay_halt
> >            |         |    |         |              |    |
> >            |         |    |         |              |     --1.99%--delay_halt_mwaitx
> >            |         |    |         |               --0.55%--amd_pmu_addr_offset
> >
> > ```
> >
> > Is this a deficiency in the AMD perf subsystem? Or is this a generic issue
> > with perf? I understand that it has to enable/disable the PMU when it's
> > switching tasks, but perhaps there are some ways to optimize this
> > behavior?
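
The amd_pmu_disable_all/native_write_msr cost in that callchain is paid on
every context switch because per-task counters have to be scheduled out
together with the task, and this workload does a couple of million context
switches per run. One thing that may be worth comparing, as an untested
suggestion, is system-wide mode, where the events stay on the CPUs across
task switches:

```
# CPU-wide events are not torn down on every sched_yield(); the perf.data
# can be filtered down to the workload at report time:
perf record -a -z --call-graph dwarf,4096 -- ./a.out
perf report --comms=a.out
```

The trade-off is that everything running on the machine gets recorded, so
the data volume (and thus the IO pressure) goes up.
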
>
> Hi Milian,
>
> By using dwarf call graphs, each sample writes a dump of the user stack
> into the perf event ring buffer, to be unwound later by the userspace
> perf command. The default stack dump size is 8kB and you can lower it -
> for example with "--call-graph dwarf,4096". I suspect that most of the
> overhead you are seeing is from these stack dumps. There is a more
> complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.html
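
For comparison, the stack dump size knob and the frame pointer alternative
look like this (an untested sketch, the sizes are just examples):

```
# 8kB of user stack copied per sample (the default) vs smaller dumps:
perf record --call-graph dwarf      ./a.out
perf record --call-graph dwarf,4096 ./a.out
perf record --call-graph dwarf,2048 ./a.out

# No stack copy at all, if the workload is built with frame pointers
# (e.g. compiled with -fno-omit-frame-pointer):
perf record --call-graph fp ./a.out
```

The frame pointer variant is by far the cheapest for something that samples
and context switches this much, at the cost of rebuilding the workload.
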
>
> The man pages are always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
>
> Thanks,
> Ian
> > Thanks
> >
> > --
> > Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> > KDAB (Deutschland) GmbH, a KDAB Group company
> > Tel: +49-30-521325470
> > KDAB - The Qt, C++ and OpenGL Experts
--
- Arnaldo