From: Aleksandar Markovic
Date: Sat, 9 May 2020 12:26:27 +0200
Subject: Re: [INFO] Some preliminary performance data
To: Alex Bennée
Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi, Richard Henderson,
 QEMU Developers, ahmedkhaledkaraman@gmail.com, "Emilio G . Cota",
 Gerd Hoffmann
References: <87imh95mof.fsf@linaro.org>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000e8687d05a5348cd9"

--000000000000e8687d05a5348cd9
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On Sat, 9 May 2020 at 12:16, Aleksandar Markovic
<aleksandar.qemu.devel@gmail.com> wrote:
>
> On Wed, 6 May 2020 at 13:26, Alex Bennée wrote:
> >
> > Aleksandar Markovic writes:
> >
> > Some preliminary thoughts....
> >
>
> Alex, many thanks for all your thoughts and hints; they are truly helpful!
>
> I will most likely respond to all of them in some future mail, but for now
> I will comment on just one.
>

It looks like right-click and "View Image" works for HTML mails with
embedded images: it displays the image in its original resolution.
So there is no need for attachments. Good to know for Ahmed's potential
reports with images.

Aleksandar

> > >> Hi, all.
> > >>
> > >> I just want to share with you some bits and pieces of data that I got
> > >> while doing some preliminary experimentation for the GSoC project "TCG
> > >> Continuous Benchmarking", which Ahmed Karaman, a fourth-year (final-year)
> > >> student at the Electrical Engineering Faculty in Cairo, will carry out.
> > >>
> > >> *User Mode*
> > >>
> > >>    * As expected, for any program doing substantial floating-point
> > >> calculation, the softfloat library will be the heaviest consumer of
> > >> CPU cycles.
> > >>    * We plan to examine the performance behaviour of non-FP programs
> > >> (integer arithmetic), or even non-numeric programs (sorting strings, for
> > >> example).
> >
> > Emilio was the last person to do extensive benchmarking on TCG, and he
> > used a mild fork of the venerable nbench:
> >
> >   https://github.com/cota/dbt-bench
> >
> > As the hot code is fairly small, it offers a good way of testing the
> > quality of the output. Larger programs will differ as they can involve
> > more code generation.
> >
> > >>
> > >> *System Mode*
> > >>
> > >>    * I profiled the boot of several machines using a tool called
> > >> callgrind (a part of valgrind). The tool offers a plethora of
> > >> information; however, it seems to be a little confused by the use of
> > >> coroutines, and that makes some of its reports look very illogical, or
> > >> plain ugly.
> >
> > Doesn't running through valgrind inherently serialise execution anyway?
> > If you are looking for latency caused by locks, we have support for the
> > QEMU sync profiler built into the code. See "help sync-profile" on the HMP.
> >
> > >> Still, it
> > >> seems valid data can be extracted from it. Without going into details,
> > >> here is what it says for one machine (bear in mind that results may
> > >> vary to a great extent between machines):
> >
> > You can also use perf to use sampling to find hot points in the code.
> > One of last year's GSoC students wrote some patches that included the
> > ability to dump a JIT info file for perf to consume. We never got it
> > merged in the end, but it might be worth having a go at pulling the
> > relevant bits out from:
> >
> >   Subject: [PATCH v9 00/13] TCG code quality tracking and perf integration
> >   Date: Mon, 7 Oct 2019 16:28:26 +0100
> >   Message-Id: <20191007152839.30804-1-alex.bennee@linaro.org>
> >
> > >>      ** The boot involved six threads: one for display handling, one
> > >> for emulation, and four more. The last four did almost nothing during
> > >> boot; they sat idle almost the entire time, waiting for something. As
> > >> far as "Total Instruction Fetch Count" (the main measure used in
> > >> callgrind) is concerned, the counts were distributed in proportion 1:3
> > >> between the display thread and the emulation thread (the remaining
> > >> threads were negligible). Interestingly enough, for another machine
> > >> that proportion was 1:20.
> > >>      ** The display thread is dominated by the vga_update_display()
> > >> function (21.5% "self" time and 51.6% "self + callees" time, called
> > >> almost 40,000 times). Other functions worth mentioning are
> > >> cpu_physical_memory_snapshot_get_dirty() and
> > >> memory_region_snapshot_get_dirty(), which are very small functions, but
> > >> are both invoked over 26,000,000 times and together contribute over 20%
> > >> of the display thread's instruction fetch count.
> >
> > The memory region tracking code will end up forcing the slow path for a
> > lot of memory accesses to video memory via softmmu. You may want to
> > measure if there is a difference using one of the virtio-based graphics
> > displays.
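
To put the display-thread figures quoted above in perspective, here is a small back-of-the-envelope sketch. It only re-derives numbers from the approximate values already given ("1:3", "1:20", "almost 40000", "over 26 000 000"); the function names are made up for illustration and nothing here is a new measurement.

```python
# Illustrative arithmetic only: inputs are the approximate figures quoted
# in the thread, and the helper names below are hypothetical.


def thread_shares(display_parts: float, emulation_parts: float) -> tuple:
    """Turn a display:emulation ratio into fractions of their combined count."""
    total = display_parts + emulation_parts
    return display_parts / total, emulation_parts / total


def calls_per_update(dirty_check_calls: int, display_updates: int) -> float:
    """Average number of dirty-check calls per vga_update_display() call."""
    return dirty_check_calls / display_updates


if __name__ == "__main__":
    # 1:3 on the first machine, 1:20 on another one.
    for ratio in ((1, 3), (1, 20)):
        display, emulation = thread_shares(*ratio)
        print(f"{ratio[0]}:{ratio[1]} -> display {display:.1%}, "
              f"emulation {emulation:.1%}")

    # Each snapshot_get_dirty function: over 26,000,000 calls, against
    # almost 40,000 calls of vga_update_display().
    per_update = calls_per_update(26_000_000, 40_000)
    print(f"roughly {per_update:.0f} dirty checks per display update")
```

So the display thread accounts for about a quarter of the two threads' combined count on the first machine and under 5% on the other, and each of the two small dirty-tracking functions runs on the order of 650 times per display update. Since the call counts were given only as "over" and "almost", this is a ballpark figure.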
> >
> > >>      ** Focusing now on the emulation thread, "Total Instruction Fetch
> > >> Counts" were roughly distributed this way:
> > >>            - 15.7% is execution of JIT-ed code from the translation
> > >> block buffer
> > >>            - 39.9% is execution of helpers
> > >>            - 44.4% is the code translation stage, including some
> > >> coroutine activities
> > >>         Top two among helpers:
> > >>           - helper_le_stl_memory()
> >
> > I assume that is the MMU slow-path being called from the generated code.
> >
> > >>           - helper_lookup_tb_ptr() (this one is invoked a whopping
> > >> 36,000,000 times)
> >
> > This is an optimisation to avoid exiting the run-loop to find the next
> > block. From memory I think the two main cases you'll see are:
> >
> >  - computed jumps (i.e. target not known at JIT time)
> >  - jumps outside of the current page
> >
> > >>         Single largest instruction consumer of code translation:
> > >>           - liveness_pass_1(), which constitutes 21.5% of the entire
> > >> "emulation thread" consumption, or, put another way, almost half of
> > >> the code translation stage (which sits at 44.4%)
> >
> > This is very much driven by how much code generation vs running you see.
> > In most of my personal benchmarks I never really notice code generation
> > because I give my machines large amounts of RAM, so code tends to stay
> > resident and does not need to be re-translated. When the optimiser shows
> > up it's usually accompanied by high TB flush and invalidate counts in
> > "info jit" because we are doing more translation than we usually do.
> >
>
> Yes, I think the machine was set up with only 128MB RAM.
>
> That would be an interesting experiment for Ahmed actually: to
> measure the impact of the amount of guest RAM on performance.
>
> But it looks like, at least for machines with small RAM, the translation
> phase will take a significant percentage.
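
As a quick sanity check of the emulation-thread percentages quoted above, the sketch below confirms that the three categories add up to the whole thread and works out what "almost half of code translation stage" means for liveness_pass_1(). The dictionary keys and the helper name are just illustrative labels, and the inputs are the preliminary figures from this thread.

```python
# Illustrative arithmetic only: values are the preliminary percentages
# quoted in the thread; names and labels are hypothetical.

EMULATION_THREAD_BREAKDOWN = {
    "JIT-ed code": 15.7,       # execution of code from the TB buffer
    "helpers": 39.9,           # execution of helpers
    "code translation": 44.4,  # translation stage, incl. some coroutine activity
}

# liveness_pass_1() as a percentage of the whole emulation thread.
LIVENESS_PASS_1_PCT = 21.5


def share_of_stage(part_pct: float, stage_pct: float) -> float:
    """Fraction of a stage taken by one of its parts, both given
    as percentages of the same total."""
    return part_pct / stage_pct


if __name__ == "__main__":
    total = sum(EMULATION_THREAD_BREAKDOWN.values())
    print(f"categories sum to {total:.1f}% of the emulation thread")

    share = share_of_stage(
        LIVENESS_PASS_1_PCT, EMULATION_THREAD_BREAKDOWN["code translation"]
    )
    print(f"liveness_pass_1() is {share:.1%} of the translation stage")
```

With these inputs the categories sum to 100.0% and liveness_pass_1() comes out at about 48.4% of the translation stage, which matches the "almost half" wording.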
>
> I am attaching the call graph for the translation phase for "Hello World"
> built for mips and emulated by QEMU (tb_gen_code() and its callees).
>
> (I am also attaching the picture in case it is not clearly visible inline.)
>
>
> > I'll also mention my foray into tracking down the performance regression
> > of DOSBox Doom:
> >
> >   https://diasp.eu/posts/8659062
> >
> > It presented a very nice demonstration of the increasing complexity (and
> > run time) of the optimiser, which was completely wasted due to
> > self-modifying code causing us to regenerate code all the time.
> >
> > >>
> > >> Please take all this with a grain of salt, since these results are
> > >> only preliminary.
> > >>
> > >> I would like to use this opportunity to welcome Ahmed Karaman, a
> > >> talented young man from Egypt, into the QEMU development community;
> > >> he'll work on the "TCG Continuous Benchmarking" project this summer.
> > >> Please do help him in his first steps as our colleague. Best of luck
> > >> to Ahmed!
> >
> > Welcome to the QEMU community Ahmed. Feel free to CC me on TCG
> > performance related patches. I like to see things go faster ;-)
> >
> > --
> > Alex Bennée

--000000000000e8687d05a5348cd9
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable



--000000000000e8687d05a5348cd9--