From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Ian Rogers <irogers@google.com>
Cc: Milian Wolff <milian.wolff@kdab.com>, linux-perf-users@vger.kernel.org
Subject: Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
Date: Thu, 9 Sep 2021 17:22:25 -0300 [thread overview]
Message-ID: <YTptAbQgae5HEgAU@kernel.org> (raw)
In-Reply-To: <CAP-5=fVM8oxmMzx4VhoCPp=t=vR3W-MT72MRL7pVsjJR9YtxXA@mail.gmail.com>
Em Thu, Sep 09, 2021 at 12:28:10PM -0700, Ian Rogers escreveu:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> >
> > Hey there!
> >
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
> > on the system. Sadly, I cannot make `perf record` work without dropping (a lot
> > of) chunks / events.
> >
> > Usually I'm using this command line:
> >
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```
> >
> > But this is not enough in this case. What other alternatives are there? Is
> > there a `perf record` mode which never drops events, at the cost of maybe
> > stalling the profiled application?
There was discussion about this mode, for using on a 'perf trace' that
works like 'strace', i.e. doesn't lose records but stalls the
profiled/traced application, but I don't think that progressed
> > Generally though, looking at the (partial) data I get from perf, it seems like
> > it does not cope well in the face of excessive calls to `yield`. Take this
> > example code here:
> >
> > ```
> > #include <thread>
> > #include <vector>
> > #include <algorithm>
> >
> > void yieldALot()
> > {
> > for (int i = 0; i < 1000000; ++i) {
> > std::this_thread::yield();
> > }
> > }
> >
> > int main()
> > {
> > std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
> > std::generate(threads.begin(), threads.end(), []() {
> > return std::thread(&yieldALot);
> > });
> >
> > for (auto &thread : threads)
> > thread.join();
> >
> > return 0;
> > }
> > ```
> >
> > When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> > system with a AMD Ryzen 9 3900X with 12 cores and 24 SMT cores, then I observe
> > the following behavior:
> >
> > ```
> > perf stat -r 5 ./a.out
> >
> > Performance counter stats for './a.out' (5 runs):
> >
> > 18,570.71 msec task-clock # 19.598 CPUs utilized
> > ( +- 1.27% )
> > 2,717,136 context-switches # 145.447 K/sec
> > ( +- 3.26% )
> > 55 cpu-migrations # 2.944 /sec
> > ( +- 3.79% )
> > 222 page-faults # 11.884 /sec
> > ( +- 0.49% )
> > 61,223,010,902 cycles # 3.277 GHz
> > ( +- 0.54% ) (83.33%)
> > 1,259,821,397 stalled-cycles-frontend # 2.07% frontend cycles
> > idle ( +- 1.54% ) (83.16%)
> > 9,001,394,333 stalled-cycles-backend # 14.81% backend cycles
> > idle ( +- 0.79% ) (83.25%)
> > 65,746,848,695 instructions # 1.08 insn per cycle
> > # 0.14 stalled cycles
> > per insn ( +- 0.26% ) (83.40%)
> > 14,474,862,454 branches # 774.830 M/sec
> > ( +- 0.36% ) (83.49%)
> > 114,518,545 branch-misses # 0.79% of all branches
> > ( +- 0.71% ) (83.67%)
> >
> > 0.94761 +- 0.00935 seconds time elapsed ( +- 0.99% )
> > ```
> >
> > Now compare this to what happens when I enable perf recording:
> >
> > ```
> > Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5
> > runs):
> >
> > 19,928.28 msec task-clock # 9.405 CPUs utilized
> > ( +- 2.74% )
> > 3,026,081 context-switches # 142.538 K/sec
> > ( +- 7.48% )
> > 120 cpu-migrations # 5.652 /sec
> > ( +- 3.18% )
> > 5,509 page-faults # 259.491 /sec
> > ( +- 0.01% )
> > 65,393,592,529 cycles # 3.080 GHz
> > ( +- 1.81% ) (71.81%)
> > 1,445,203,862 stalled-cycles-frontend # 2.13% frontend cycles
> > idle ( +- 2.01% ) (71.60%)
> > 9,701,321,165 stalled-cycles-backend # 14.29% backend cycles
> > idle ( +- 1.28% ) (71.62%)
> > 69,967,062,730 instructions # 1.03 insn per cycle
> > # 0.14 stalled cycles
> > per insn ( +- 0.74% ) (71.83%)
> > 15,376,981,968 branches # 724.304 M/sec
> > ( +- 0.83% ) (72.00%)
> > 127,934,194 branch-misses # 0.81% of all branches
> > ( +- 1.29% ) (72.07%)
> >
> > 2.118881 +- 0.000496 seconds time elapsed ( +- 0.02% )
> > ```
> >
> > Note how the CPU utilization drops dramatically by roughly a factor of 2x,
> > which correlates with the increase in overall runtime.
> >
> > From the data obtained here I see:
> > ```
> > 98.49% 0.00% a.out libc-2.33.so [.] __GI___sched_yield
> > (inlined)
> > |
> > ---__GI___sched_yield (inlined)
> > |
> > |--89.93%--entry_SYSCALL_64_after_hwframe
> > | |
> > | --89.32%--do_syscall_64
> > | |
> > | |--71.70%--__ia32_sys_sched_yield
> > | | |
> > | | |--64.36%--schedule
> > | | | |
> > | | | --63.42%--
> > __schedule
> > | | | |
> > | | |
> > |--32.56%--__perf_event_task_sched_out
> > | | | |
> > |
> > | | | |
> > --32.07%--amd_pmu_disable_all
> > | | | |
> > |
> > | | | |
> > |--23.54%--x86_pmu_disable_all
> > | | | |
> > | |
> > | | | |
> > | |--12.81%--native_write_msr
> > | | | |
> > | |
> > | | | |
> > | |--8.24%--native_read_msr
> > | | | |
> > | |
> > | | | |
> > | --1.25%--amd_pmu_addr_offset
> > | | | |
> > |
> > | | | |
> > --8.19%--amd_pmu_wait_on_overflow
> > | | | |
> > |
> > | | | |
> > |--5.00%--native_read_msr
> > | | | |
> > |
> > | | | |
> > |--2.20%--delay_halt
> > | | | |
> > | |
> > | | | |
> > | --1.99%--delay_halt_mwaitx
> > | | | |
> > |
> > | | | |
> > --0.55%--amd_pmu_addr_offset
> >
> > ```
> >
> > Is this a dificiency with the AMD perf subsystem? Or is this a generic issue
> > with perf? I understand that it has to enable/disable the PMU when it's
> > switching tasks, but potentially there are some ways to optimize this
> > behavior?
>
> Hi Milian,
>
> By using dwarf call graphs your samples are writing a dump of the
> stack into the perf event ring buffer that will be processed in the
> userspace perf command. The default stack dump size is 8kb and you can
> lower it - for example with "--call-graph dwarf,4096". I suspect that
> most of the overhead you are seeing is from these stack dumps. There
> is a more complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.htm
>
> Which are always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
>
> Thanks,
> Ian
> > Thanks
> >
> > --
> > Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> > KDAB (Deutschland) GmbH, a KDAB Group company
> > Tel: +49-30-521325470
> > KDAB - The Qt, C++ and OpenGL Experts
--
- Arnaldo
next prev parent reply other threads:[~2021-09-09 20:22 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-09-09 14:08 Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield? Milian Wolff
2021-09-09 19:28 ` Ian Rogers
2021-09-09 20:22 ` Arnaldo Carvalho de Melo [this message]
2021-09-09 20:38 ` Milian Wolff
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YTptAbQgae5HEgAU@kernel.org \
--to=acme@kernel.org \
--cc=irogers@google.com \
--cc=linux-perf-users@vger.kernel.org \
--cc=milian.wolff@kdab.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.