qemu-devel.nongnu.org archive mirror
From: Ian Rogers <irogers@cs.man.ac.uk>
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Profiling Qemu for speed?
Date: Mon, 18 Apr 2005 10:51:02 +0100	[thread overview]
Message-ID: <42638306.4050200@cs.man.ac.uk> (raw)
In-Reply-To: <20050418083542.45378.qmail@web54110.mail.yahoo.com>

There are some code sequences that are quite common, for example compare 
followed by branch. A threaded decoder tends to look like:

... // do some work
load <instruction>
mask out opcode
address_of_decoder = load decoder_lookup<opcode>
goto *address_of_decoder

but if you know that compare followed by branch is common, then possibly

compare_instruction:
... // do some work
load <instruction>
mask out opcode
if opcode == branch then goto branch_decoder
address_of_decoder = load decoder_lookup<opcode>
goto *address_of_decoder

is faster, as in the branch case you only have one load instead of two.
So it seems you can use knowledge of common code sequences to remove one
memory access. The overall effect of this is also going to come down to
the branch prediction hardware, so a win isn't obvious. If you have
longer common code sequences then you can use specialization to create a
specialized and optimally laid out decoder.
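
In C the usual way to get that dispatch shape is GCC's computed-goto
("labels as values") extension. As a rough sketch only - the guest ISA,
the opcode values and the handler bodies below are all made up, and a
real decoder would cover the whole opcode space - it might look like:

#include <stdint.h>

enum { OP_COMPARE = 0, OP_BRANCH = 1, OP_HALT = 2 };  /* made-up opcodes */

void run(const uint32_t *pc)
{
    /* One handler address per opcode; the generic path always pays one
     * extra load from this table before the indirect jump. */
    static void *dispatch[] = { &&op_compare, &&op_branch, &&op_halt };
    uint32_t insn, opcode;

#define NEXT()                                                       \
    do {                                                             \
        insn = *pc++;                                                \
        opcode = insn >> 24;       /* mask out the opcode */         \
        if (opcode == OP_BRANCH)   /* common case: skip the table */ \
            goto op_branch;        /* load, only the compare above */\
        goto *dispatch[opcode];    /* generic case: one extra load */\
    } while (0)

    NEXT();

op_compare:
    /* ... set the guest condition flags ... */
    NEXT();

op_branch:
    /* ... read the target out of insn and update pc ... */
    NEXT();

op_halt:
    return;
}

Whether the extra compare actually wins still depends on how well the
predictor copes with the indirect jump, as above.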

I'm not sure if you can get GCC to generate code sequences like this, 
but you probably at least need to use the -fprofile-generate and 
-fprofile-use options
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
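
Roughly, that means building twice with a training run in between.
Glossing over QEMU's real configure/make setup, so treat these command
lines as illustrative only:

gcc -O2 -fprofile-generate -o qemu ...   (instrumented build)
./qemu ...                               (run a representative guest workload)
gcc -O2 -fprofile-use -o qemu ...        (rebuild using the recorded profile)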

I'm doing this in Java and it appears to be worth up to a 10% speedup in
an interpreter. So possibly this would save 10% in the translator.

Sorry if this is obvious. Regards,

Ian Rogers
-- http://www.binarytranslator.org/


Daniel J Guinan wrote:

>This conversation, below, is very interesting.  It is precisely this
>part of QEMU that fascinates me and potentially holds the most promise
>for performance gains.  I have even imagined using a genetic algorithm
>to discover optimal block-sizes and instruction re-ordering and
>whatnot.  This could be done in order to generate translation tables of
>guest instruction sequences and host translated instruction sequences.
>Even if  only a handful of very common sequences were translated in
>this fashion, the potential speedups are enormous.
>
>Before even discussing the exotic possibilities, however, we need to
>figure out what is possible within the framework of the current QEMU
>translation system.  Rewiring QEMU to support translating sequences
>(blocks of instructions) rather than single instructions may or may not
>be necessary.  It should be rather simple to instrument QEMU to keep
>track of the most common sequences in order to figure out if there are,
>in fact, sequences that show up with a high enough frequency to make
>this endeavor worthwhile (I would think the answer would be yes). 
>Then, someone skilled in machine code for the host and guest could take
>a stab at hand-coding the translation for the most common couple of
>sequences to see how the performance gains come out.
>
>I would love to see some work in this direction and would be willing to
>help, although my skills in x86 machine code are limited.
>
>-Daniel
>
>  
>
>>One thought would be to have a peephole optimizer that looks back
>>over the just-translated basic block (or a state machine that matches
>>such sequences as an on-line algorithm) and matches against common,
>>known primitive sequences, replacing them with optimized versions.
>>
>>The kind of profiling you would want to do here is to run, say,
>>Windows, take a snapshot of the dynamic code cache, and look for
>>common instruction sequences. Ideally, you could write some software
>>to do this automatically.
>>
>>Anyway, I'm sure there are lots of other ideas lying around.
>>
>>
>>-- John.
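
Getting those numbers should be cheap. A rough sketch of the counting
side in C - where exactly the hook would sit in QEMU's translator, and
the one-byte "opcode" abstraction, are assumptions of mine, not how the
decoder really works:

#include <stdint.h>
#include <stdio.h>

/* Counts indexed by [previous opcode][current opcode]. */
static uint64_t pair_count[256][256];
static int prev_opcode = -1;

/* Call once per decoded guest instruction, in decode order. */
void profile_opcode(uint8_t opcode)
{
    if (prev_opcode >= 0)
        pair_count[prev_opcode][opcode]++;
    prev_opcode = opcode;
}

/* Dump every pair seen more often than the threshold, e.g. at exit. */
void profile_dump(uint64_t threshold)
{
    int a, b;
    for (a = 0; a < 256; a++)
        for (b = 0; b < 256; b++)
            if (pair_count[a][b] > threshold)
                printf("%02x %02x : %llu\n", a, b,
                       (unsigned long long)pair_count[a][b]);
}

Sorting that output would give the candidate sequences worth
hand-optimizing.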
>>
>>Another thing I've thought about is checking what sequences of
>>instructions often appear in x86 programs (such as e.g. "push %ebp;
>>movl %esp, %ebp") and then creating C functions which emulate such an
>>entire block, so they can be optimized as a whole by gcc. That would
>>give a similar performance gain on all supported targets, and not just
>>on the one you created the peephole optimizer for (+ less work to
>>debug).
>>
>>The only possible downside is that you can't jump to a particular
>>instruction in such a block (the same goes for several kinds of
>>peephole optimizations though). I don't know yet how Qemu exactly
>>keeps track of the translations it has already performed, whether it
>>supports multiple existing translations of the same instruction and/or
>>whether it can already automatically invalidate the old block in case
>>it turns out it needs to be split and thus re-translated (I guess it
>>should do at least some of these things, since in theory an x86 could
>>jump into the middle of an instruction in order to reinterpret the
>>bytes as another instruction stream).
>>
>>
>>Jonas
>>
>>
>>Unfortunately it's not that simple. The push instruction may cause an
>>exception. Whatever optimizations you apply, you've got to make sure
>>that the guest state is still consistent when the exception occurs.
>>
>>Paul
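
For the "push %ebp; movl %esp, %ebp" example, one way to respect that
constraint is to do the memory write that can fault first, and only
commit the register side effects afterwards. A sketch - the CPUState
layout, the register indices and the guest_stl() store helper are
stand-ins, not QEMU's real names:

#include <stdint.h>

typedef struct CPUState {
    uint32_t regs[8];
} CPUState;

#define R_ESP 4
#define R_EBP 5

/* Guest memory store; assumed to raise the guest exception (e.g. by
 * longjmping back to the main loop) if the access faults. */
void guest_stl(CPUState *env, uint32_t addr, uint32_t val);

void helper_push_ebp_mov_esp_ebp(CPUState *env)
{
    uint32_t esp = env->regs[R_ESP] - 4;

    /* May fault; at that point env still holds the state from before
     * the first instruction of the pair, which is consistent. */
    guest_stl(env, esp, env->regs[R_EBP]);

    /* Only now commit the side effects of both instructions. */
    env->regs[R_ESP] = esp;
    env->regs[R_EBP] = esp;
}

Reporting the right guest EIP for the faulting instruction is a
separate problem, of course, and faults in the middle of longer fused
sequences get harder to unwind.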
>>
>>
>>If we just concatenate the C code of the two procedures, won't gcc
>>take care of that for us? Or could scheduling mess this up? Maybe
>>there's a switch to avoid having it reschedule instructions in a way
>>that side effects happen in a different order? (That would still give
>>us the advantage of CSE and peephole optimizations.)
>>
>>
>>Jonas
>>
>>    
>>
>
>
>
>
>
>_______________________________________________
>Qemu-devel mailing list
>Qemu-devel@nongnu.org
>http://lists.nongnu.org/mailman/listinfo/qemu-devel
>  
>


Thread overview: 18+ messages
2005-04-18  8:35 [Qemu-devel] Profiling Qemu for speed? Daniel J Guinan
2005-04-18  9:51 ` Ian Rogers [this message]
2005-04-18 13:44   ` Daniel Egger
2005-04-18 14:12     ` Christian MICHON
2005-04-18 14:29       ` Ian Rogers
2005-04-18 14:19     ` Ian Rogers
2005-04-18 14:40     ` Paul Brook
  -- strict thread matches above, loose matches on Subject: below --
2005-04-18 11:24 Daniel J Guinan
2005-04-17  5:58 Joe Luser
2005-04-17  8:21 ` John R. Hogerhuis
2005-04-17  8:59   ` Jonas Maebe
2005-04-17 10:27     ` Paul Brook
2005-04-17 10:46       ` Jonas Maebe
2005-04-18  1:36         ` Nathaniel G H
2005-04-18  2:11           ` John R. Hogerhuis
2005-04-18  2:39           ` André Braga
2005-04-18  4:31             ` Karl Magdsick
2005-04-17 10:36 ` Paul Brook
