From: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
To: "Alex Bennée" <alex.bennee@linaro.org>
Cc: "Lukáš Doktor" <ldoktor@redhat.com>,
"Peter Maydell" <peter.maydell@linaro.org>,
"Stefan Hajnoczi" <stefanha@gmail.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
"QEMU Developers" <qemu-devel@nongnu.org>,
ahmedkhaledkaraman@gmail.com, "Emilio G . Cota" <cota@braap.org>,
"Gerd Hoffmann" <kraxel@redhat.com>
Subject: Re: [INFO] Some preliminary performance data
Date: Sat, 9 May 2020 12:16:09 +0200 [thread overview]
Message-ID: <CAHiYmc4otn_oGqQoVThEs6pmBqWG8u3KjQ+aAvgnZ2jso0-2NQ@mail.gmail.com> (raw)
In-Reply-To: <87imh95mof.fsf@linaro.org>
On Wed, 6 May 2020 at 13:26, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> writes:
>
> Some preliminary thoughts....
>
Alex, many thanks for all your thoughts and hints; they are truly helpful!
I will most likely respond to all of them in a future mail, but for now
I will comment on just one.
> >> Hi, all.
> >>
> >> I just want to share with you some bits and pieces of data that I got
> >> while doing some preliminary experimentation for the GSoC project "TCG
> >> Continuous Benchmarking", that Ahmed Karaman, a student in the fourth
> >> (final) year of the Electrical Engineering Faculty in Cairo, will execute.
> >>
> >> *User Mode*
> >>
> >> * As expected, for any program dealing with any substantial
> >> floating-point calculation, the softfloat library will be the heaviest
> >> CPU cycle consumer.
> >> * We plan to examine the performance behaviour of non-FP programs
> >> (integer arithmetic), or even non-numeric programs (sorting strings,
> >> for example).
>
> Emilio was the last person to do extensive bench-marking on TCG and he
> used a mild fork of the venerable nbench:
>
> https://github.com/cota/dbt-bench
>
> As the hot code is fairly small it offers a good way of testing the
> quality of the output. Larger programs will differ as they can involve
> more code generation.
>
> >>
> >> *System Mode*
> >>
> >> * I did profiling of booting several machines using a tool called
> >> callgrind (a part of valgrind). The tool offers a plethora of information;
> >> however, it seems to be a little confused by the usage of coroutines, which
> >> makes some of its reports look very illogical, or plain ugly.
>
> Doesn't running through valgrind inherently serialise execution anyway?
> If you are looking for latency caused by locks we have support for the
> QEMU sync profiler built into the code. See "help sync-profile" on the HMP.
>
> >> Still, it seems valid data can be extracted from it. Without going into
> >> details, here is what it says for one machine (bear in mind that results
> >> may vary to a great extent between machines):
>
> You can also use perf to use sampling to find hot points in the code.
> One of last year's GSoC students wrote some patches that included the
> ability to dump a jit info file for perf to consume. We never got it
> merged in the end but it might be worth having a go at pulling the
> relevant bits out from:
>
> Subject: [PATCH v9 00/13] TCG code quality tracking and perf integration
> Date: Mon, 7 Oct 2019 16:28:26 +0100
> Message-Id: <20191007152839.30804-1-alex.bennee@linaro.org>
>
> >> ** The booting involved six threads: one for display handling, one
> >> for emulation, and four more. The last four did almost nothing during
> >> boot, sitting idle almost the entire time, waiting for something. As far
> >> as "Total Instruction Fetch Count" (the main measure used in callgrind)
> >> goes, the counts were distributed in proportion 1:3 between the display
> >> thread and the emulation thread (the rest of the threads were negligible)
> >> (but, interestingly enough, for another machine that proportion was 1:20).
> >> ** The display thread is dominated by the vga_update_display()
> >> function (21.5% "self" time, and 51.6% "self + callees" time, called
> >> almost 40000 times). Other functions worth mentioning are
> >> cpu_physical_memory_snapshot_get_dirty() and
> >> memory_region_snapshot_get_dirty(), which are very small functions, but
> >> are both invoked over 26 000 000 times, and together contribute over 20%
> >> of the display thread's instruction fetch count.
>
> The memory region tracking code will end up forcing the slow path for a
> lot of memory accesses to video memory via softmmu. You may want to
> measure if there is a difference using one of the virtio based graphics
> displays.
>
> >> ** Focusing now on the emulation thread, "Total Instruction Fetch
> >> Counts" were roughly distributed this way:
> >> - 15.7% is execution of JIT-ed code from the translation block
> >> buffer
> >> - 39.9% is execution of helpers
> >> - 44.4% is the code translation stage, including some coroutine
> >> activities
> >> Top two among helpers:
> >> - helper_le_stl_memory()
>
> I assume that is the MMU slow-path being called from the generated code.
>
> >> - helper_lookup_tb_ptr() (this one is invoked a whopping
> >> 36 000 000 times)
>
> This is an optimisation to avoid exiting the run-loop to find the next
> block. From memory I think the two main cases you'll see are:
>
> - computed jumps (i.e. target not known at JIT time)
> - jumps outside of the current page
>
> >> Single largest instruction consumer of code translation:
> >> - liveness_pass_1(), which constitutes 21.5% of the entire
> >> "emulation thread" consumption, or, put another way, almost half of the
> >> code translation stage (which sits at 44.4%)
>
> This is very much driven by how much code generation vs running you see.
> In most of my personal benchmarks I never really notice code generation
> because I give my machines large amounts of RAM, so code tends to stay
> resident and does not need to be re-translated. When the optimiser shows up
> it's usually accompanied by high TB flush and invalidate counts in "info
> jit" because we are doing more translation than we usually do.
>
Yes, I think the machine was set up with only 128MB RAM.
That would actually be an interesting experiment for Ahmed - to
measure the impact of the amount of RAM on performance.
But it looks like, at least for machines with small RAM, the translation
phase will take a significant percentage.
I am attaching the call graph for the translation phase for a "Hello World"
built for mips and emulated by QEMU (tb_gen_code() and its callees).
(I am also attaching the pic in case it is not visible well inline.)
[image: tb_gen_code.png]
> I'll also mention my foray into tracking down the performance regression
> of DOSBox Doom:
>
> https://diasp.eu/posts/8659062
>
> it presented a very nice demonstration of the increasing complexity (and
> run time) of the optimiser which was completely wasted due to
> self-modifying code causing us to regenerate code all the time.
>
> >>
> >> Please take all this with a little grain of salt, since these results
> >> are just of a preliminary nature.
> >>
> >> I would like to use this opportunity to welcome Ahmed Karaman, a
> >> talented young man from Egypt, into the QEMU development community; he
> >> will work on the "TCG Continuous Benchmarking" project this summer.
> >> Please do help him in his first steps as our colleague. Best of luck
> >> to Ahmed!
>
> Welcome to the QEMU community Ahmed. Feel free to CC me on TCG
> performance related patches. I like to see things go faster ;-)
>
> --
> Alex Bennée
[-- Attachment #1.2: tb_gen_code.png --]
[-- Type: image/png, Size: 96587 bytes --]
Thread overview: 11+ messages
2020-05-02 23:20 [INFO] Some preliminary performance data Aleksandar Markovic
2020-05-02 23:24 ` Aleksandar Markovic
2020-05-03 6:47 ` Ahmed Karaman
2020-05-06 11:26 ` Alex Bennée
2020-05-09 10:16 ` Aleksandar Markovic [this message]
2020-05-09 10:26 ` Aleksandar Markovic
2020-05-09 11:36 ` Laurent Desnogues
2020-05-09 12:37 ` Aleksandar Markovic
2020-05-09 12:50 ` Laurent Desnogues
2020-05-09 12:55 ` Aleksandar Markovic
2020-05-09 16:49 ` Alex Bennée