[Qemu-devel] [PATCH 0/3] RFC: TCG ARM optimizations

* [Qemu-devel] [PATCH 0/3] RFC: TCG ARM optimizations
@ 2009-06-29 17:50 Filip Navara
  2009-06-29 17:59 ` Laurent Desnogues
  0 siblings, 1 reply; 4+ messages in thread
From: Filip Navara @ 2009-06-29 17:50 UTC (permalink / raw)
  To: qemu-devel

Hello!

I have been playing with some optimizations of the code generated for
the ARM target on an x86 host. The result is the following three patches,
which improve performance by 10% on the Dhrystone benchmark compiled
for the ARM7TDMI target. The size of the output x86 code has also
shrunk by up to 40% in some cases. These patches are relatively small
and self-contained, so I would like to get them merged eventually
if others agree.

I've some ideas for further optimizations, but implementing them would
require substantial effort and the benefits are questionable.

Big thanks go to Laurent Desnogues, who actually suggested where
the bottlenecks are.

Best regards,
Filip Navara

* Re: [Qemu-devel] [PATCH 0/3] RFC: TCG ARM optimizations
  2009-06-29 17:50 [Qemu-devel] [PATCH 0/3] RFC: TCG ARM optimizations Filip Navara
@ 2009-06-29 17:59 ` Laurent Desnogues
  2009-06-29 18:26   ` Filip Navara
  0 siblings, 1 reply; 4+ messages in thread
From: Laurent Desnogues @ 2009-06-29 17:59 UTC (permalink / raw)
  To: Filip Navara; +Cc: qemu-devel

On Mon, Jun 29, 2009 at 7:50 PM, Filip Navara<filip.navara@gmail.com> wrote:
>
> Big thanks go to Laurent Desnogues, who actually suggested where
> the bottlenecks are.

IIRC it's Paul who suggested that idea first on IRC months
ago, as I was complaining about the stupidity of the generated
code :-)


Laurent

* Re: [Qemu-devel] [PATCH 0/3] RFC: TCG ARM optimizations
  2009-06-29 17:59 ` Laurent Desnogues
@ 2009-06-29 18:26   ` Filip Navara
  2009-06-29 18:58     ` Blue Swirl
  0 siblings, 1 reply; 4+ messages in thread
From: Filip Navara @ 2009-06-29 18:26 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: qemu-devel

On Mon, Jun 29, 2009 at 7:59 PM, Laurent
Desnogues<laurent.desnogues@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 7:50 PM, Filip Navara<filip.navara@gmail.com> wrote:
>>
>> Big thanks go to Laurent Desnogues, who actually suggested where
>> the bottlenecks are.
>
> IIRC it's Paul who suggested that idea first on IRC months
> ago,

Yeah, this Paul guy keeps coming up with good ideas lately. His work
helped me a lot in writing my bachelor thesis and saved me countless
hours. I owe him a beverage at the very least, but somehow I doubt he
will come to my little country any time soon.

> as I was complaining about the stupidity of the generated
> code :-)

Let's keep complaining, maybe someone will improve it over time.

With the patches applied, the op statistics now look like this:

mov_i32 1925
movi_i32 1556
add_i32 518
ld_i32 257
exit_tb 247
brcond_i32 225
qemu_ld32u 219
set_label 207
...

Some minor improvements could be made to the usage of TCG temporary
variables in target-arm/translate.c. That is something that could be
done gradually and without any substantial effort. It would probably
increase the speed by about 1 to 5 percent.

Another idea is to group blocks of conditional instructions to avoid
unnecessary jumps. That would help with code like this:

0x00200d28:  cmp        lr, #0  ; 0x0
0x00200d2c:  movle      r0, #1  ; 0x1
0x00200d30:  movle      r1, r5
0x00200d34:  movle      ip, r0
0x00200d38:  ble        0x200d64

I'm not sure how common this pattern is, and I haven't investigated it
any further yet.
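
The grouping idea can be illustrated with a minimal sketch in Python. It is
not code from the patches or from QEMU: the Insn class, its fields and both
function names are hypothetical, and it only models the decision of where
one condition check could cover several guest instructions. It assumes that
a run must end as soon as an instruction updates the flags, because the same
condition may then evaluate differently. A real translator would also have
to handle branches and writes to the PC.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    text: str         # disassembly, for display only
    cond: str         # ARM condition code; "al" means always executed
    sets_flags: bool  # True if the instruction updates the condition flags


def group_by_condition(insns):
    """Split insns into runs that could share a single condition check."""
    runs = []         # list of (cond, [Insn, ...])
    run_open = False  # False when the current run must not be extended
    for insn in insns:
        if run_open and runs[-1][0] == insn.cond:
            runs[-1][1].append(insn)
        else:
            runs.append((insn.cond, [insn]))
        # After a flag update the same condition may evaluate differently,
        # so the next instruction has to start a new run.
        run_open = not insn.sets_flags
    return runs


def conditional_skips(runs):
    """Number of skip-over branches needed: one per conditional run."""
    return sum(1 for cond, _ in runs if cond != "al")


# The sequence quoted above, encoded by hand.
block = [
    Insn("cmp lr, #0", "al", True),
    Insn("movle r0, #1", "le", False),
    Insn("movle r1, r5", "le", False),
    Insn("movle ip, r0", "le", False),
    Insn("ble 0x200d64", "le", False),
]
runs = group_by_condition(block)
per_insn = sum(1 for insn in block if insn.cond != "al")
print(per_insn, "checks per instruction vs", conditional_skips(runs), "grouped")
```

For the sequence above, the four "le" instructions fall into a single run, so
one condition check would replace four in this model.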

Lastly, the code generated for softmmu memory loads/stores could
probably be optimized in some cases. It uses hard-coded registers.
It's not optimized for multiple stores to adjacent locations (pushing
multiple registers to stack) and does all the calculations again and
again. This results not only in recomputing numbers we already have
(as long as the stack is still on the same guest page), but also in
huge TBs. I imagine that doesn't help the processor cache too much.
This would probably benefit all targets. In fact I believe the softmmu
code could be moved out of the TCG target-specific code and into the
main code (with the possibility to override it with an optimized
version).

Best regards,
Filip Navara

* Re: [Qemu-devel] [PATCH 0/3] RFC: TCG ARM optimizations
  2009-06-29 18:26   ` Filip Navara
@ 2009-06-29 18:58     ` Blue Swirl
  0 siblings, 0 replies; 4+ messages in thread
From: Blue Swirl @ 2009-06-29 18:58 UTC (permalink / raw)
  To: Filip Navara; +Cc: Laurent Desnogues, qemu-devel

On 6/29/09, Filip Navara <filip.navara@gmail.com> wrote:
>  Lastly, the code generated for softmmu memory loads/stores could
>  probably be optimized in some cases. It uses hard-coded registers.
>  It's not optimized for multiple stores to adjacent locations (pushing
>  multiple registers to stack) and does all the calculations again and
>  again. This results not only in recomputing numbers we already have
>  (as long as the stack is still on the same guest page), but also in
>  huge TBs. I imagine that doesn't help the processor cache too much.
>  This would probably benefit all targets. In fact I believe the softmmu
>  code could be moved out of the TCG target-specific code and into the
>  main code (with the possibility to override it with an optimized
>  version).

Interesting. We could add a new optional TCG instruction op_ld_g2h
(extracted from qemu_ld) that performs the TLB lookup and returns the
host address. When multiple accesses near the same guest address are
detected (how?), the translator can reuse the host address, perform
some math and check if the guest page is still the same. If true, ld_raw
can be used, otherwise recalculate the host address.
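
A rough model of that reuse, written in Python purely for illustration, is
sketched here. Nothing in it is existing TCG code: HostAddrCache, the
tlb_lookup callback and the 4 KiB page size are assumptions, and the sketch
ignores TLB flushes and read/write permission differences, which a real
implementation would have to handle.

```python
PAGE_SIZE = 1 << 12  # assumption: 4 KiB guest pages


class HostAddrCache:
    """Remembers the host address of the last translated guest page."""

    def __init__(self, tlb_lookup):
        # tlb_lookup is a hypothetical stand-in for the full TLB lookup:
        # it maps a guest page base address to a host page base address.
        self.tlb_lookup = tlb_lookup
        self.guest_page = None
        self.host_page = None
        self.slow_lookups = 0  # how often the full lookup was needed

    def g2h(self, guest_addr):
        offset = guest_addr % PAGE_SIZE
        page = guest_addr - offset
        if page != self.guest_page:
            # Different guest page: recalculate the host address.
            self.host_page = self.tlb_lookup(page)
            self.guest_page = page
            self.slow_lookups += 1
        # Same guest page: only "some math" on the cached host address.
        return self.host_page + offset


# Hypothetical flat mapping, used only to make the example runnable.
cache = HostAddrCache(lambda page: 0x7F000000 + page)

# Four stores to adjacent stack slots, as when pushing four registers.
host_addrs = [cache.g2h(a) for a in (0x1FFC, 0x1FF8, 0x1FF4, 0x1FF0)]
print([hex(a) for a in host_addrs], "full lookups:", cache.slow_lookups)
```

In this model the four adjacent stores share one full lookup, and a further
access at 0x2000 would trigger a second one because it lies on the next
guest page.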

On the performance side, qemu_ld on Sparc host uses 9 instructions in
the TLB hit case before the access. Maybe this would lower the number
a bit but not too much.
