From: Milian Wolff <milian.wolff@kdab.com>
To: linux-perf-users@vger.kernel.org
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Subject: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
Date: Thu, 09 Sep 2021 16:08:27 +0200
Message-ID: <9270373.m2mTvpyXtE@milian-workstation>
Hey there!
I'm trying to profile an application that suffers from overcommit and
excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
on the system. Sadly, I cannot get `perf record` to work without dropping (a
lot of) chunks/events.
Usually I'm using this command line:
```
perf record --call-graph dwarf -z -m 16M ...
```
But this is not enough in this case. What other alternatives are there? Is
there a `perf record` mode that never drops events, even at the cost of
stalling the profiled application?
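For reference, the directions I have been experimenting with so far, going by
perf-record(1) (whether `--aio` and `--mmap-flush` are available depends on
how perf was built and on the kernel version, so take this as a sketch):
```
# larger ring buffer, compressed output, asynchronous trace writing,
# and fewer ring-buffer flushes; --mmap-flush is capped at a quarter
# of the mmap size given via -m
perf record --call-graph dwarf -z -m 16M --aio --mmap-flush 2M ...
```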
Generally though, looking at the (partial) data I do get from perf, it seems
that perf does not cope well in the face of excessive calls to `yield`. Take
this example code:
```
#include <algorithm>
#include <thread>
#include <vector>

// Hammer sched_yield from every thread to simulate an
// overcommitted workload with excessive yielding.
void yieldALot()
{
    for (int i = 0; i < 1000000; ++i) {
        std::this_thread::yield();
    }
}

int main()
{
    // Oversubscribe: twice as many threads as hardware threads.
    std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
    std::generate(threads.begin(), threads.end(), []() {
        return std::thread(&yieldALot);
    });
    for (auto &thread : threads)
        thread.join();
    return 0;
}
```
When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe the
following behavior:
```
perf stat -r 5 ./a.out
 Performance counter stats for './a.out' (5 runs):

        18,570.71 msec task-clock               # 19.598 CPUs utilized          ( +- 1.27% )
        2,717,136      context-switches         # 145.447 K/sec                 ( +- 3.26% )
               55      cpu-migrations           # 2.944 /sec                    ( +- 3.79% )
              222      page-faults              # 11.884 /sec                   ( +- 0.49% )
   61,223,010,902      cycles                   # 3.277 GHz                     ( +- 0.54% )  (83.33%)
    1,259,821,397      stalled-cycles-frontend  # 2.07% frontend cycles idle    ( +- 1.54% )  (83.16%)
    9,001,394,333      stalled-cycles-backend   # 14.81% backend cycles idle    ( +- 0.79% )  (83.25%)
   65,746,848,695      instructions             # 1.08 insn per cycle
                                                # 0.14 stalled cycles per insn  ( +- 0.26% )  (83.40%)
   14,474,862,454      branches                 # 774.830 M/sec                 ( +- 0.36% )  (83.49%)
      114,518,545      branch-misses            # 0.79% of all branches         ( +- 0.71% )  (83.67%)

          0.94761 +- 0.00935 seconds time elapsed  ( +- 0.99% )
```
Now compare this to what happens when I enable perf recording:
```
 Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5 runs):

        19,928.28 msec task-clock               # 9.405 CPUs utilized           ( +- 2.74% )
        3,026,081      context-switches         # 142.538 K/sec                 ( +- 7.48% )
              120      cpu-migrations           # 5.652 /sec                    ( +- 3.18% )
            5,509      page-faults              # 259.491 /sec                  ( +- 0.01% )
   65,393,592,529      cycles                   # 3.080 GHz                     ( +- 1.81% )  (71.81%)
    1,445,203,862      stalled-cycles-frontend  # 2.13% frontend cycles idle    ( +- 2.01% )  (71.60%)
    9,701,321,165      stalled-cycles-backend   # 14.29% backend cycles idle    ( +- 1.28% )  (71.62%)
   69,967,062,730      instructions             # 1.03 insn per cycle
                                                # 0.14 stalled cycles per insn  ( +- 0.74% )  (71.83%)
   15,376,981,968      branches                 # 724.304 M/sec                 ( +- 0.83% )  (72.00%)
      127,934,194      branch-misses            # 0.81% of all branches         ( +- 1.29% )  (72.07%)

         2.118881 +- 0.000496 seconds time elapsed  ( +- 0.02% )
```
Note how the CPU utilization drops by roughly a factor of two (from 19.6 to
9.4 CPUs utilized), which correlates with the more than 2x increase in overall
runtime (0.948 s to 2.119 s). From the data obtained here I see:
```
98.49%  0.00%  a.out  libc-2.33.so  [.] __GI___sched_yield (inlined)
        |
        ---__GI___sched_yield (inlined)
           --89.93%--entry_SYSCALL_64_after_hwframe
              --89.32%--do_syscall_64
                 --71.70%--__ia32_sys_sched_yield
                    --64.36%--schedule
                       --63.42%--__schedule
                          --32.56%--__perf_event_task_sched_out
                             --32.07%--amd_pmu_disable_all
                                |--23.54%--x86_pmu_disable_all
                                |   |--12.81%--native_write_msr
                                |   |--8.24%--native_read_msr
                                |   --1.25%--amd_pmu_addr_offset
                                --8.19%--amd_pmu_wait_on_overflow
                                    |--5.00%--native_read_msr
                                    |--2.20%--delay_halt
                                    |   --1.99%--delay_halt_mwaitx
                                    --0.55%--amd_pmu_addr_offset
Is this a deficiency in the AMD perf subsystem, or a generic issue with perf?
I understand that perf has to enable/disable the PMU when it is switching
tasks, but are there perhaps ways to optimize this behavior?
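One experiment I still plan to run, to check whether the PMU MSR writes on
every context switch are really the bottleneck: sample via a software event
such as `cpu-clock` instead of the default hardware `cycles` event (just a
sketch, same workload as above):
```
# cpu-clock is a software event, so sampling should not need to
# reprogram PMU MSRs on every context switch; if the slowdown
# disappears, the amd_pmu_disable_all path above is the likely culprit
perf record -e cpu-clock --call-graph dwarf -z ./a.out
```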
Thanks
--
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts