From: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
To: "Alex Bennée" <alex.bennee@linaro.org>
Cc: "Lukáš Doktor" <ldoktor@redhat.com>,
"Peter Maydell" <peter.maydell@linaro.org>,
"Stefan Hajnoczi" <stefanha@gmail.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
"QEMU Developers" <qemu-devel@nongnu.org>,
ahmedkhaledkaraman@gmail.com, "Emilio G . Cota" <cota@braap.org>,
"Gerd Hoffmann" <kraxel@redhat.com>
Subject: Re: [INFO] Some preliminary performance data
Date: Sat, 9 May 2020 12:16:09 +0200 [thread overview]
Message-ID: <CAHiYmc4otn_oGqQoVThEs6pmBqWG8u3KjQ+aAvgnZ2jso0-2NQ@mail.gmail.com> (raw)
In-Reply-To: <87imh95mof.fsf@linaro.org>
On Wed, 6 May 2020 at 13:26, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> writes:
>
> Some preliminary thoughts....
>
Alex, many thanks for all your thoughts and hints; they are truly helpful!
I will most likely respond to all of them in a future mail, but for now
I will comment on just one.
> >> Hi, all.
> >>
> >> I just want to share with you some bits and pieces of data that I got
> >> while doing some preliminary experimentation for the GSoC project "TCG
> >> Continuous Benchmarking", that Ahmed Karaman, a student in the fourth
> >> (final) year of the Electrical Engineering Faculty in Cairo, will execute.
> >>
> >> *User Mode*
> >>
> >> * As expected, for any program dealing with any substantial
> >> floating-point calculation, the softfloat library will be the heaviest
> >> CPU cycle consumer.
> >> * We plan to examine the performance behaviour of non-FP programs
> >> (integer arithmetic), or even non-numeric programs (sorting strings,
> >> for example).
>
> Emilio was the last person to do extensive bench-marking on TCG and he
> used a mild fork of the venerable nbench:
>
> https://github.com/cota/dbt-bench
>
> As the hot code is fairly small it offers a good way of testing the
> quality of the output. Larger programs will differ as they can involve
> more code generation.
>
> >>
> >> *System Mode*
> >>
> >> * I did profiling of booting several machines using a tool called
> >> callgrind (a part of valgrind). The tool offers a plethora of information;
> >> however, it seems to be a little confused by the usage of coroutines, which
> >> makes some of its reports look very illogical, or plain ugly.
>
> Doesn't running through valgrind inherently serialise execution anyway?
> If you are looking for latency caused by locks we have support for the
> QEMU sync profiler built into the code. See "help sync-profile" on the HMP.
>
> >> Still, it seems valid data can be extracted from it. Without going into
> >> details, here is what it says for one machine (bear in mind that results
> >> may vary to a great extent between machines):
>
> You can also use perf to use sampling to find hot points in the code.
> One of last year's GSoC students wrote some patches that included the
> ability to dump a jit info file for perf to consume. We never got it
> merged in the end but it might be worth having a go at pulling the
> relevant bits out from:
>
> Subject: [PATCH v9 00/13] TCG code quality tracking and perf integration
> Date: Mon, 7 Oct 2019 16:28:26 +0100
> Message-Id: <20191007152839.30804-1-alex.bennee@linaro.org>
>
> >> ** The booting involved six threads: one for display handling, one
> >> for emulation, and four more. The last four did almost nothing during
> >> boot, sitting idle almost the entire time, waiting for something. As far
> >> as "Total Instruction Fetch Count" (the main measure used in callgrind)
> >> goes, the counts were distributed in proportion 1:3 between the display
> >> thread and the emulation thread (the rest of the threads were negligible)
> >> (but, interestingly enough, for another machine that proportion was 1:20).
> >> ** The display thread is dominated by the vga_update_display()
> >> function (21.5% "self" time, and 51.6% "self + callees" time, called
> >> almost 40000 times). Other functions worth mentioning are
> >> cpu_physical_memory_snapshot_get_dirty() and
> >> memory_region_snapshot_get_dirty(), which are very small functions, but
> >> are both invoked over 26 000 000 times, and together contribute over 20%
> >> of the display thread's instruction fetch count.
>
> The memory region tracking code will end up forcing the slow path for a
> lot of memory accesses to video memory via softmmu. You may want to
> measure if there is a difference using one of the virtio based graphics
> displays.
>
> >> ** Focusing now on the emulation thread, "Total Instruction Fetch
> >> Counts" were roughly distributed this way:
> >> - 15.7% is execution of JIT-ed code from the translation block
> >> buffer
> >> - 39.9% is execution of helpers
> >> - 44.4% is the code translation stage, including some coroutine
> >> activities
> >> Top two among helpers:
> >> - helper_le_stl_memory()
>
> I assume that is the MMU slow-path being called from the generated code.
>
> >> - helper_lookup_tb_ptr() (this one is invoked a whopping
> >> 36 000 000 times)
>
> This is an optimisation to avoid exiting the run-loop to find the next
> block. From memory I think the two main cases you'll see are:
>
> - computed jumps (i.e. target not known at JIT time)
> - jumps outside of the current page
>
> >> Single largest instruction consumer of code translation:
> >> - liveness_pass_1(), which constitutes 21.5% of the entire
> >> "emulation thread" consumption, or, put another way, almost half of the
> >> code translation stage (which sits at 44.4%)
>
> This is very much driven by how much code generation vs running you see.
> In most of my personal benchmarks I never really notice code generation
> because I give my machines large amounts of RAM, so code tends to stay
> resident and does not need to be re-translated. When the optimiser shows up
> it's usually accompanied by high TB flush and invalidate counts in "info
> jit" because we are doing more translation than we usually do.
>
Yes, I think the machine was set up with only 128MB RAM.
That would actually be an interesting experiment for Ahmed - to
measure the impact of the amount of RAM on performance.
But it looks like, at least for machines with small RAM, the translation
phase will take a significant percentage.
I am attaching the call graph for the translation phase for a "Hello World"
built for mips and emulated by QEMU (tb_gen_code() and its callees).
(I am also attaching the pic in case it is not visible well inline.)
[image: tb_gen_code.png]
> I'll also mention my foray into tracking down the performance regression
> of DOSBox Doom:
>
> https://diasp.eu/posts/8659062
>
> it presented a very nice demonstration of the increasing complexity (and
> run time) of the optimiser which was completely wasted due to
> self-modifying code causing us to regenerate code all the time.
>
> >>
> >> Please take all this with a little grain of salt, since these results
> >> are just of a preliminary nature.
> >>
> >> I would like to use this opportunity to welcome Ahmed Karaman, a
> >> talented young man from Egypt, into the QEMU development community; he
> >> will work on the "TCG Continuous Benchmarking" project this summer.
> >> Please do help him in his first steps as our colleague. Best of luck
> >> to Ahmed!
>
> Welcome to the QEMU community Ahmed. Feel free to CC me on TCG
> performance related patches. I like to see things go faster ;-)
>
> --
> Alex Bennée
[-- Attachment #1.2: tb_gen_code.png --]
[-- Type: image/png, Size: 96587 bytes --]
Thread overview: 11+ messages
2020-05-02 23:20 [INFO] Some preliminary performance data Aleksandar Markovic
2020-05-02 23:24 ` Aleksandar Markovic
2020-05-03 6:47 ` Ahmed Karaman
2020-05-06 11:26 ` Alex Bennée
2020-05-09 10:16 ` Aleksandar Markovic [this message]
2020-05-09 10:26 ` Aleksandar Markovic
2020-05-09 11:36 ` Laurent Desnogues
2020-05-09 12:37 ` Aleksandar Markovic
2020-05-09 12:50 ` Laurent Desnogues
2020-05-09 12:55 ` Aleksandar Markovic
2020-05-09 16:49 ` Alex Bennée