* [Qemu-devel] Profiling Qemu for speed?
@ 2005-04-17 5:58 Joe Luser
2005-04-17 8:21 ` John R. Hogerhuis
2005-04-17 10:36 ` Paul Brook
0 siblings, 2 replies; 18+ messages in thread
From: Joe Luser @ 2005-04-17 5:58 UTC (permalink / raw)
To: qemu-devel
Hi folks,
I've been using Qemu on the Mac for a few days now, with several OSes
running (including Windoze), and I'm impressed. The source looks pretty
clean, too.
Has anyone done any profiling work to see where Qemu spends most of its
processing time? It is fast running x86 code on the PPC, but I'd like
to find ways to speed things up even more.
If you know of areas that will make a big difference in speed, please
let me know.
-Nathaniel G H
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-17 5:58 Joe Luser
@ 2005-04-17 8:21 ` John R. Hogerhuis
2005-04-17 8:59 ` Jonas Maebe
2005-04-17 10:36 ` Paul Brook
1 sibling, 1 reply; 18+ messages in thread
From: John R. Hogerhuis @ 2005-04-17 8:21 UTC (permalink / raw)
To: qemu-devel
On Sat, 2005-04-16 at 22:58 -0700, Joe Luser wrote:
> Hi folks,
>
> I've been using Qemu on the Mac for a few days now, with several OSes
> running (including Windoze), and I'm impressed. The source looks pretty
> clean, too.
>
> Has anyone done any profiling work to see where Qemu spends most of its
> processing time? It is fast running x86 code on the PPC, but I'd like
> to find ways to speed things up even more.
>
> If you know of areas that will make a big difference in speed, please
> let me know.
>
> -Nathaniel G H
>
>
From what I've read, the Cirrus SVGA emulation probably deserves some
attention. Read through the archives; there have been some recent
discussions.
Beyond that, what will always be left is continually tweaking the dynamic
code generator with whatever heuristics and host-platform-specific stuff
you can conjure up. There's pretty much no end to that sort of thing.
QEMU takes the executing machine instructions from one virtual computer,
dynamically translates them to a working equivalent on another (the
host), stores them in a cache to save reprocessing time (that's the *big*
time-saving heuristic over Bochs), and executes the dynamically generated
code on the host.
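In rough C terms the loop looks something like this (a sketch only; the
names here are placeholders, not Qemu's actual internals):

/* Sketch of a dynamic-translation main loop; all names are
   illustrative placeholders, not Qemu's real code. */
typedef struct TransBlock {
    unsigned long guest_pc;             /* guest address the block starts at */
    unsigned long (*host_code)(void);   /* translated host code; returns next guest PC */
} TransBlock;

/* hash lookup into the translation cache; NULL on a miss */
extern TransBlock *cache_lookup(unsigned long guest_pc);
/* slow path: decode guest instructions, emit host code, insert in cache */
extern TransBlock *translate_block(unsigned long guest_pc);

void cpu_exec_loop(unsigned long guest_pc)
{
    for (;;) {
        TransBlock *tb = cache_lookup(guest_pc);
        if (!tb)
            tb = translate_block(guest_pc); /* pay the translation cost once */
        guest_pc = tb->host_code();         /* run cached host code, get next PC */
    }
}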
Think of it like this: aside from speed, the best generator of dynamic
code would be an expert assembly language programmer on the platform you
are translating code from and to. As you can imagine, QEMU has tricks and
leverages GCC, but it will never attain the nirvana of being as good as
that assembly language programmer. That's an asymptote, which is why I
say there will always be lots of things you can do on the dynamic code
generator to speed it up.
One thought would be to have a peephole optimizer that looks back over
the just-translated basic block (or a state machine that matches such
sequences as an on-line algorithm), matches against common, known
primitive sequences, and replaces them with optimized versions.
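For one common x86 pair, such a pass could look roughly like this (a
sketch; the micro-op names are invented, not Qemu's):

/* Sketch of a post-translation peephole pass over a block's micro-ops.
   The micro-op names below are made up for illustration. */
enum uop {
    UOP_NOP,
    UOP_PUSH_EBP,
    UOP_MOV_EBP_ESP,
    UOP_FUSED_PROLOGUE   /* hand-optimized replacement for the pair above */
};

static void peephole_pass(enum uop *ops, int n)
{
    int i;
    for (i = 0; i + 1 < n; i++) {
        if (ops[i] == UOP_PUSH_EBP && ops[i + 1] == UOP_MOV_EBP_ESP) {
            ops[i]     = UOP_FUSED_PROLOGUE;  /* emit the optimized sequence */
            ops[i + 1] = UOP_NOP;             /* and drop the now-redundant op */
        }
    }
}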
The kind of profiling you would want to do here is to run, say, Windows,
take a snapshot of the dynamic code cache, and look for common
instruction sequences. Ideally, you could write some software to do this
automatically.
Anyway, I'm sure there are lots of other ideas lying around.
-- John.
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-17 8:21 ` John R. Hogerhuis
@ 2005-04-17 8:59 ` Jonas Maebe
2005-04-17 10:27 ` Paul Brook
0 siblings, 1 reply; 18+ messages in thread
From: Jonas Maebe @ 2005-04-17 8:59 UTC (permalink / raw)
To: qemu-devel
On 17 Apr 2005, at 10:21, John R. Hogerhuis wrote:
> One thought would be to have a peephole optimizer that looks back over
> the just-translated basic block (or a state machine that matches such
> sequences as an on-line algorithm), matches against common, known
> primitive sequences, and replaces them with optimized versions.
Another thing I've thought about is checking which sequences of
instructions often appear in x86 programs (e.g. "push %ebp;
movl %esp, %ebp") and then creating C functions which emulate such an
entire block, so they can be optimized as a whole by gcc. That would
give a similar performance gain on all supported targets, not just
on the one you created the peephole optimizer for (plus less work to
debug).
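Roughly, one such combined helper could look like this (just a sketch;
the CPUState here is deliberately simplified and is not Qemu's real
structure):

/* Sketch of a combined helper for "push %ebp; movl %esp, %ebp".
   The CPUState layout is simplified; mem_base stands in for whatever
   maps guest memory. */
typedef struct CPUState {
    unsigned int regs[8];       /* EAX..EDI by x86 register number */
    unsigned char *mem_base;    /* base of flat guest memory */
} CPUState;

#define R_ESP 4
#define R_EBP 5

static void op_push_ebp_mov_ebp_esp(CPUState *env)
{
    unsigned int esp = env->regs[R_ESP] - 4;
    /* push %ebp: store the old frame pointer at the new stack top */
    *(unsigned int *)(env->mem_base + esp) = env->regs[R_EBP];
    env->regs[R_ESP] = esp;
    /* movl %esp, %ebp */
    env->regs[R_EBP] = esp;
}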
The only possible downside is that you can't jump to a particular
instruction in such a block (though the same goes for several kinds of
peephole optimizations). I don't know yet exactly how Qemu keeps
track of the translations it has already performed, whether it supports
multiple existing translations of the same instruction, and/or whether
it can already automatically invalidate the old block in case it turns
out it needs to be split and thus re-translated (I guess it should do
at least some of these things, since in theory an x86 could jump into
the middle of an instruction in order to reinterpret the bytes as
another instruction stream).
Jonas
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-17 8:59 ` Jonas Maebe
@ 2005-04-17 10:27 ` Paul Brook
2005-04-17 10:46 ` Jonas Maebe
0 siblings, 1 reply; 18+ messages in thread
From: Paul Brook @ 2005-04-17 10:27 UTC (permalink / raw)
To: qemu-devel
On Sunday 17 April 2005 09:59, Jonas Maebe wrote:
> On 17 Apr 2005, at 10:21, John R. Hogerhuis wrote:
> > One thought would be to have a peephole optimizer that looks back over
> > the just-translated basic block (or a state machine that matches such
> > sequences as an on-line algorithm), matches against common, known
> > primitive sequences, and replaces them with optimized versions.
>
> Another thing I've thought about is checking which sequences of
> instructions often appear in x86 programs (e.g. "push %ebp;
> movl %esp, %ebp") and then creating C functions which emulate such an
> entire block, so they can be optimized as a whole by gcc. That would
> give a similar performance gain on all supported targets, not just
> on the one you created the peephole optimizer for (plus less work to
> debug).
Unfortunately it's not that simple. The push instruction may cause an
exception. Whatever optimizations you apply you've got to make sure that the
guest state is still consistent when the exception occurs.
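Concretely, reusing the simplified CPUState from the sketch above, a
fused helper has to perform the potentially-faulting store before it
commits any register update (guest_store32 is a stand-in for whatever
MMU/store routine raises the guest exception; it is not a real Qemu
call):

/* Exception-safe ordering for the fused helper sketched earlier.
   guest_store32() is a placeholder; it is assumed to raise the guest
   exception (e.g. by longjmp'ing back to the main loop) on a fault,
   so nothing after it runs in that case. */
extern void guest_store32(CPUState *env, unsigned int addr, unsigned int val);

static void op_push_ebp_mov_ebp_esp_safe(CPUState *env)
{
    unsigned int new_esp = env->regs[R_ESP] - 4;

    /* may fault: no guest register has been modified yet, so the
       exception handler still sees a consistent pre-push state */
    guest_store32(env, new_esp, env->regs[R_EBP]);

    /* only reached if the store succeeded */
    env->regs[R_ESP] = new_esp;
    env->regs[R_EBP] = new_esp;
}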
Paul
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-17 5:58 Joe Luser
2005-04-17 8:21 ` John R. Hogerhuis
@ 2005-04-17 10:36 ` Paul Brook
1 sibling, 0 replies; 18+ messages in thread
From: Paul Brook @ 2005-04-17 10:36 UTC (permalink / raw)
To: qemu-devel
On Sunday 17 April 2005 06:58, Joe Luser wrote:
> If you know of areas that will make a big difference in speed, please
> let me know.
IMHO the most effective speedup for most users is to write custom guest
device drivers for IDE/network/video that talk directly to qemu using some
low-overhead mechanism. Basically you want the guest to do as little as
possible. The problem is that to be really useful you need to write open
source drivers for the common operating systems (win9x, winnt/2k/xp,
Linux, *BSD).
Paul
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-17 10:27 ` Paul Brook
@ 2005-04-17 10:46 ` Jonas Maebe
2005-04-18 1:36 ` Nathaniel G H
0 siblings, 1 reply; 18+ messages in thread
From: Jonas Maebe @ 2005-04-17 10:46 UTC (permalink / raw)
To: qemu-devel
On 17 Apr 2005, at 12:27, Paul Brook wrote:
> Unfortunately it's not that simple. The push instruction may cause an
> exception. Whatever optimizations you apply you've got to make sure
> that the
> guest state is still consistent when the exception occurs.
If we just concatenate the C code of the two procedures, won't gcc take
care of that for us? Or could scheduling mess this up? Maybe there's a
switch to stop it rescheduling instructions in a way that makes side
effects happen in a different order? (That would still give us the
advantage of CSE and peephole optimizations.)
Jonas
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-17 10:46 ` Jonas Maebe
@ 2005-04-18 1:36 ` Nathaniel G H
2005-04-18 2:11 ` John R. Hogerhuis
2005-04-18 2:39 ` André Braga
0 siblings, 2 replies; 18+ messages in thread
From: Nathaniel G H @ 2005-04-18 1:36 UTC (permalink / raw)
To: qemu-devel
On 17 Apr 2005, at 12:27, Paul Brook wrote:
> Unfortunately it's not that simple. The push instruction may
> cause an exception. Whatever optimizations you apply you've
> got to make sure that the guest state is still consistent when
> the exception occurs.
I brought this up because I want to speed up the case of x86 code on
the PPC. The biggest trouble with optimizing the code generated by Qemu
is that the two architectures are as different as they are complex.
I was up until 3:00am studying Qemu, and I came to the conclusion that
it doesn't make sense to try speeding up the output code, at least not
yet. A peephole optimizer or hand-coded sequences made to handle common
combinations of instructions would lead to the problems discussed here:
exceptions happening at the right time, self-modifying code, etc.
Worse, the translator might have to spend so much time doing this that
the result would actually be slower execution.
I have another idea: The next-best thing to making faster output is to
make the same output, faster. In other words, speeding up the
translator. Given that the bulk of the translator is in disas_insn()
and all the gen_* functions it calls, this seems like a good place to
begin.
Does anybody know how GCC generates a "switch" statement? disas_insn()
and friends contain dozens of switch statements with hundreds and
hundreds of cases. I'm not familiar with GCC internals, but I do know
that compilers I've used in the past actually produce the equivalent of
"if... else if... else if" code for switch statements. Some compilers
produce table lookups, but only under certain circumstances, such as when
there aren't too many cases and the numbers being tested fit in one
byte. I couldn't find this information for GCC, and unfortunately the
code is so complicated that my initial look through it hasn't told me
much.
Unless someone can show me that GCC produces table-lookups for the
switches in disas_insn(), there's a good opportunity for increased
speed by doing this manually. Do you agree with this assessment?
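For concreteness, the manual version I have in mind looks roughly like
this (a sketch; the gen_* names and opcode slots are only illustrative,
and the real disas_insn() decode is far more involved, with prefixes,
ModRM, two-byte opcodes and so on):

/* Sketch of manual table dispatch on the primary opcode byte. */
typedef void (*gen_fn)(void);

static void gen_fallback(void) { /* fall through to the existing switch */ }

static gen_fn gen_table[256];   /* filled in at startup, e.g.
                                     gen_table[0x55] = gen_push_ebp;
                                   unfilled slots stay NULL */

static void dispatch(unsigned char opcode)
{
    gen_fn fn = gen_table[opcode];
    if (fn)
        fn();              /* O(1): one load plus an indirect call */
    else
        gen_fallback();    /* slow path: today's big switch statement */
}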
Please let me know if I'm on the right track. :-)
Kind regards,
Nathaniel G H
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 1:36 ` Nathaniel G H
@ 2005-04-18 2:11 ` John R. Hogerhuis
2005-04-18 2:39 ` André Braga
1 sibling, 0 replies; 18+ messages in thread
From: John R. Hogerhuis @ 2005-04-18 2:11 UTC (permalink / raw)
To: qemu-devel
On Sun, 2005-04-17 at 18:36 -0700, Nathaniel G H wrote:
> I was up until 3:00am studying Qemu, and I came to the conclusion that
> it doesn't make sense to try speeding up the output code, at least not
> yet. A peephole optimizer or hand-coded sequences made to handle common
> combinations of instructions would lead to the problems discussed here:
> exceptions happening at the right time, self-modifying code, etc.
Well yeah... I didn't say it would be easy :-)
> Worse, the translator might have to spend so much time doing this that
> the result would actually be slower execution.
Not if you do it right. Remember: by and large you only incur a one-time
hit, and after that you're going as fast as your dynamic translator is
smart.
Here's a heuristic for you: optimize for the general case, not corner
cases. You still have to handle the corner cases, but they are corner
cases for a reason: you don't run into them as often.
>
> I have another idea: The next-best thing to making faster output is to
> make the same output, faster. In other words, speeding up the
> translator. Given that the bulk of the translator is in disas_insn()
> and all the gen_* functions it calls, this seems like a good place to
> begin.
>
Why would that be faster? Most of the time you only dynamically
translate code once. Self-modifying code and exceptions are, well,
exceptions.
The lowest-hanging fruit right now is probably the Cirrus SVGA
emulation. That will probably make a huge, noticeable difference.
Then, like I said, I think there is always room to make the dynamic code
generator smarter.
-- John.
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 1:36 ` Nathaniel G H
2005-04-18 2:11 ` John R. Hogerhuis
@ 2005-04-18 2:39 ` André Braga
2005-04-18 4:31 ` Karl Magdsick
1 sibling, 1 reply; 18+ messages in thread
From: André Braga @ 2005-04-18 2:39 UTC (permalink / raw)
To: qemu-devel
The problem with table lookups (I'm assuming you're talking about
function pointer vectors) is that they *destroy* the spatial locality of
reference that you could otherwise attain by having a series of
if-then-else instructions and some clever instruction prefetching
mechanism on modern processors... Not to mention the function call
overhead.
However, for purely aesthetic reasons:
(Disclaimer: I'm not familiar with what disas_insn does apart from what
is obviously implied by its name (lack of time, sorry), so the following
may or may not make sense...)
If disas_insn decodes the instruction and then proceeds to the native
implementation of the opcode in question, you could use the table lookup
only to find the function entry point, paste its CONTENT into a code
buffer, and concatenate that with the rest of the translated code, so
that the whole basic block would run from there on (caveat:
self-modifying code). That would really clean up that piece of code,
but I suspect it wouldn't be any faster than the current approach.
--
"A year spent in artificial intelligence is enough to make one believe
in God"-Alan J. Perlis
2005/4/17, Nathaniel G H <rice_burners_suck@yahoo.com>:
> Unless someone can show me that GCC produces table-lookups for the
> switches in disas_insn(), there's a good opportunity for increased
> speed by doing this manually. Do you agree with this assessment?
>
> Please let me know if I'm on the right track. :-)
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 2:39 ` André Braga
@ 2005-04-18 4:31 ` Karl Magdsick
0 siblings, 0 replies; 18+ messages in thread
From: Karl Magdsick @ 2005-04-18 4:31 UTC (permalink / raw)
To: André Braga, qemu-devel
Ideally, we could force gcc to implement switch statements as indirect
jumps with jump tables inline with the code. However, this may not be
possible.
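One way to get close by hand is GCC's "labels as values" extension
(&&label and goto *ptr), which builds exactly such an inline jump table.
A sketch, with made-up opcode values and placeholder work at each label:

/* Hand-built inline jump table using GCC computed goto. */
static void decode_one(unsigned char opcode)
{
    static void *jump_table[256] = {
        [0 ... 255] = &&unhandled,   /* GCC range-designated initializer */
        [0x55]      = &&push_ebp,
        [0x90]      = &&nop,
    };

    goto *jump_table[opcode];        /* one indirect jump, table inline */

push_ebp:
    /* ... emit code for push %ebp ... */
    return;
nop:
    /* ... emit nothing ... */
    return;
unhandled:
    /* ... fall back to the generic decoder ... */
    return;
}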
I think Nathaniel was just saying that gcc is likely generating
several hundred sequential if-else blocks for large switch statements.
This gives you O(n) runtime, whereas a function pointer array gives
you O(1) runtime. (This assumes each case ends with a break... I
haven't looked at the code to which Nathaniel is referring.)
Therefore, for sufficiently large switch statements, sequential
if-else statements will be slower than a function pointer array. The
question is: is n ~= 200 sufficiently large? I think only empirical
testing has a chance of answering this question in everyone's mind.
If you smartly nest your if-else statements, you can at least get
O(log n) runtime, with negligible difference in the leading constant
as compared to the sequential if-else statements.
There's also the possibility of (at startup) dynamically generating an
indirect jump along with a jump table inline with the code (in order
to minimize page faults, and perhaps utilize prefetch to get the first
few jump addresses into the L2 cache).
Of course, as pointed out elsewhere, the dynamic code generator is
hopefully only invoked rarely, so speeding up the switch statements
may not have a noticeable effect on emulation speed.
-Karl
On 4/17/05, André Braga <meianoite@gmail.com> wrote:
> The problem with table lookups (I'm assuming you're talking about
> function pointer vectors) is that they *destroy* the spatial locality of
> reference that you could otherwise attain by having a series of
> if-then-else instructions and some clever instruction prefetching
> mechanism on modern processors... Not to mention the function call
> overhead.
* Re: [Qemu-devel] Profiling Qemu for speed?
@ 2005-04-18 8:35 Daniel J Guinan
2005-04-18 9:51 ` Ian Rogers
0 siblings, 1 reply; 18+ messages in thread
From: Daniel J Guinan @ 2005-04-18 8:35 UTC (permalink / raw)
To: qemu-devel
This conversation is very interesting. It is precisely this part of
QEMU that fascinates me and potentially holds the most promise for
performance gains. I have even imagined using a genetic algorithm to
discover optimal block sizes, instruction re-orderings and whatnot.
This could be used to generate translation tables mapping guest
instruction sequences to translated host instruction sequences. Even if
only a handful of very common sequences were translated in this
fashion, the potential speedups are enormous.
Before even discussing the exotic possibilities, however, we need to
figure out what is possible within the framework of the current QEMU
translation system. Rewiring QEMU to support translating sequences
(blocks of instructions) rather than single instructions may or may not
be necessary. It should be rather simple to instrument QEMU to keep
track of the most common sequences in order to figure out if there are,
in fact, sequences that show up with a high enough frequency to make
this endeavor worthwhile (I would think the answer would be yes).
Then, someone skilled in machine code for the host and guest could take
a stab at hand-coding the translations for a couple of the most common
sequences to see how the performance gains come out.
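As a sketch of that instrumentation step, counting consecutive opcode
pairs would already tell us a lot (record_opcode is a hypothetical hook
you would call from the translator's decode loop):

/* Count consecutive guest opcode pairs as they are decoded, then dump
   the hottest ones.  Purely illustrative instrumentation. */
#include <stdio.h>

static unsigned long pair_count[256][256];
static int prev_opcode = -1;

void record_opcode(unsigned char opcode)
{
    if (prev_opcode >= 0)
        pair_count[prev_opcode][opcode]++;
    prev_opcode = opcode;
}

void dump_hot_pairs(unsigned long threshold)
{
    int a, b;
    for (a = 0; a < 256; a++)
        for (b = 0; b < 256; b++)
            if (pair_count[a][b] >= threshold)
                printf("pair %02x %02x seen %lu times\n",
                       a, b, pair_count[a][b]);
}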
I would love to see some work in this direction and would be willing to
help, although my skills in x86 machine code are limited.
-Daniel
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 8:35 [Qemu-devel] Profiling Qemu for speed? Daniel J Guinan
@ 2005-04-18 9:51 ` Ian Rogers
2005-04-18 13:44 ` Daniel Egger
0 siblings, 1 reply; 18+ messages in thread
From: Ian Rogers @ 2005-04-18 9:51 UTC (permalink / raw)
To: qemu-devel
There are some code sequences that are quite common, for example compare
followed by branch. A threaded decoder tends to look like:
... // do some work
load <instruction>
mask out opcode
address_of_decoder = load decoder_lookup<opcode>
goto *address_of_decoder
but if you say compare and branch are common then possibly
compare_instruction:
... // do some work
load <instruction>
mask out opcode
if opcode == branch then goto branch_decoder
address_of_decoder = load decoder_lookup<opcode>
goto *address_of_decoder
is more optimal, as in the branch case you only have one load. So it
seems you can use knowledge of common code sequences to remove one
memory access. The overall effect is also going to come down to the
branch prediction hardware, so a win isn't obvious. If you have
longer code sequences then you can use specialization to create a
specialized and optimally laid out decoder.
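In plain C the fast path above comes out roughly as follows (a sketch;
the opcode value, the decoder table and decode_branch are all invented
names):

/* Check the hot opcode inline before the generic table dispatch,
   saving one memory access in the common case. */
typedef void (*decode_fn)(unsigned int insn);

extern decode_fn decoder_lookup[256];      /* generic per-opcode decoders */
extern void decode_branch(unsigned int insn);

#define OPCODE_BRANCH 0x70                 /* made-up opcode value */

void decode_after_compare(unsigned int insn)
{
    unsigned int opcode = insn & 0xff;     /* mask out opcode */
    if (opcode == OPCODE_BRANCH) {
        decode_branch(insn);               /* common case: no table load */
        return;
    }
    decoder_lookup[opcode](insn);          /* otherwise the usual dispatch */
}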
I'm not sure if you can get GCC to generate code sequences like this,
but you probably at least need to use the -fprofile-generate and
-fprofile-use options
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
I'm doing this in Java and it appears to be worth up to a 10% speedup
in an interpreter, so possibly this would save 10% in the translator.
Sorry if this is obvious. Regards,
Ian Rogers
-- http://www.binarytranslator.org/
* Re: [Qemu-devel] Profiling Qemu for speed?
@ 2005-04-18 11:24 Daniel J Guinan
0 siblings, 0 replies; 18+ messages in thread
From: Daniel J Guinan @ 2005-04-18 11:24 UTC (permalink / raw)
To: qemu-devel
> > I was up until 3:00am studying Qemu, and I came to the conclusion that
> > it doesn't make sense to try speeding up the output code, at least not
> > yet. A peephole optimizer or hand-coded sequences made to handle common
> > combinations of instructions would lead to the problems discussed here:
> > exceptions happening at the right time, self-modifying code, etc.
>
> Well yeah... I didn't say it would be easy :-)
I would like to suggest that perhaps it would be useful to experiment
anyway. Self-modifying code cannot be that prevalent (unless you are a
virus writer, which, yes, means that probably any standard Windows
installation of any maturity would stop working ;-), exceptions are
potentially more dangerous, and branching into a translated block would
need to be dealt with by (probably) retranslating on an
instruction-by-instruction level and then allowing the branch to occur
into the traditional translation.
At any rate, making this a compile-time (or even runtime) option would
allow for basic experimentation. Until we can go from instruction
translation to logic translation, this system will never fulfill its
potential. Instruction translation can maintain the property of system
correctness, which is indeed broken by logic translation (e.g. perhaps
exceptions don't occur when they normally would, or need to be handled
and translated, resulting in slightly different behavior). However, I
would like to assert that with the proper balance and approach, exact
system correctness may not be strictly necessary (I am thinking of
certain exception cases), and logic translation has the potential to be
wickedly faster (if the PPC can do something in 6 steps that takes an
x86 15 steps, but there is an outside chance of a division-by-zero
exception, why do it in 87 steps? Shouldn't we at least try to see if
we can make it work in 6? Or 12?).
> > Worse, the translator might have to spend so much time doing this that
> > the result would actually be slower execution.
>
> Not if you do it right. Remember: by and large you only incur a one-time
> hit, and after that you're going as fast as your dynamic translator is
> smart.
>
> Here's a heuristic for you: optimize for the general case, not corner
> cases. You still have to handle the corner cases, but they are corner
> cases for a reason: you don't run into them as often.
Exactly. As I hinted at in a previous post, it may turn out that some
sequence of (for example) 5 instructions is so common in all software
that hand-coding a replacement for those 5 instructions could result in
a noticeable overall performance improvement. And then you can start
scaling the performance up from there. I will guarantee that we can
find some candidate sequences that do not have problems with exceptions
(guaranteeing the same for branches and self-modifying code is not as
easy). But why not try?
-Daniel
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 9:51 ` Ian Rogers
@ 2005-04-18 13:44 ` Daniel Egger
2005-04-18 14:12 ` Christian MICHON
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Daniel Egger @ 2005-04-18 13:44 UTC (permalink / raw)
To: ian.rogers, qemu-devel
On 18.04.2005, at 11:51, Ian Rogers wrote:
> I'm not sure if you can get GCC to generate code sequences like this,
> but you probably at least need to use the -fprofile-generate and
> -fprofile-use options
> http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Feedback-directed optimisation (FDO) will not work, for two reasons:
a) qemu itself is something like a realtime compiler, so FDO
will only speed up the compiler but not the generated code
b) FDO will only provide speed boosts if the feedback phase
has a chance to analyse a representative work pattern that
is hopefully also repetitive
After all, FDO is mostly about making a size/speed tradeoff
and rearranging code (mostly branches) to avoid branch
mispredictions by the CPU.
Servus,
Daniel
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 13:44 ` Daniel Egger
@ 2005-04-18 14:12 ` Christian MICHON
2005-04-18 14:29 ` Ian Rogers
2005-04-18 14:19 ` Ian Rogers
2005-04-18 14:40 ` Paul Brook
2 siblings, 1 reply; 18+ messages in thread
From: Christian MICHON @ 2005-04-18 14:12 UTC (permalink / raw)
To: qemu-devel
I did gcc/FDO months ago with an xp/lite installation as the "repetitive task" :)
It did not improve the timings, after all the effort.
Christian
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 13:44 ` Daniel Egger
2005-04-18 14:12 ` Christian MICHON
@ 2005-04-18 14:19 ` Ian Rogers
2005-04-18 14:40 ` Paul Brook
2 siblings, 0 replies; 18+ messages in thread
From: Ian Rogers @ 2005-04-18 14:19 UTC (permalink / raw)
To: Daniel Egger, qemu-devel
Sorry, I was responding to Karl Magdsick's point about the cost of
switch statements, which related to Nathaniel G.H.'s point about the cost
of translation/generation. In my experience FDO works for interpreters
and translators, as code sequences are pretty predictable things (you
just don't tend to choose instructions at random). Improving the
dynamically generated code is the best way to improve performance, as
that's where you spend your time. However, the time spent compiling
means you only want to optimise hot spots. Well-thought-out, simple
just-in-time compilation is often very hard to beat with an optimising
compiler; for example, knowing your target architecture well means you
can get a really good static register mapping to the host architecture.
These lessons are well known in the literature. I'd still be interested
to know whether the translator's profiled performance improved using FDO.
I have results from writing things in Java, where FDO is par for the
course, but they hardly apply to QEMU/GCC.
Regards,
Ian Rogers
-- http://www.binarytranslator.org/
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 14:12 ` Christian MICHON
@ 2005-04-18 14:29 ` Ian Rogers
0 siblings, 0 replies; 18+ messages in thread
From: Ian Rogers @ 2005-04-18 14:29 UTC (permalink / raw)
To: Christian MICHON, qemu-devel
Christian MICHON wrote:
>I did gcc/FDO months ago with an xp/lite installation as the "repetitive task" :)
>It did not improve the timings, after all the effort.
>
>
Could this be down to the tables used to find the
translators/generators? Are they constant? Is it possible to make them
amenable to feedback-directed analysis? x86 is also an odd case: it
could be that load-store architectures with more registers notice FDO
more, as the x86 can access the L1 caches similarly to a large register
file. If the problem is gcc's feedback analysis then you could always
try Intel's compiler. Regards,
-- http://www.binarytranslator.org/
* Re: [Qemu-devel] Profiling Qemu for speed?
2005-04-18 13:44 ` Daniel Egger
2005-04-18 14:12 ` Christian MICHON
2005-04-18 14:19 ` Ian Rogers
@ 2005-04-18 14:40 ` Paul Brook
2 siblings, 0 replies; 18+ messages in thread
From: Paul Brook @ 2005-04-18 14:40 UTC (permalink / raw)
To: qemu-devel; +Cc: ian.rogers
On Monday 18 April 2005 14:44, Daniel Egger wrote:
> On 18.04.2005, at 11:51, Ian Rogers wrote:
> > I'm not sure if you can get GCC to generate code sequences like this,
> > but you probably at least need to use the -fprofile-generate and
> > -fprofile-use options
> > http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
>
> Feedback-directed optimisation (FDO) will not work, for two reasons:
> a) qemu itself is something like a realtime compiler, so FDO
> will only speed up the compiler but not the generated code
> b) FDO will only provide speed boosts if the feedback phase
> has a chance to analyse a representative work pattern that
> is hopefully also repetitive
>
> After all, FDO is mostly about making a size/speed tradeoff
> and rearranging code (mostly branches) to avoid branch
> mispredictions by the CPU.
IMHO profile feedback (in fact any compiler optimization) is unlikely to
provide any benefit for qemu itself.
However, profile feedback can be useful once we start making the dynamic
translator smarter. In particular, profile information can be used to
determine how hard to try optimizing the generated code. For code that is
executed rarely we want to generate it as quickly as possible; for code that
is executed many times we can afford the extra overhead to make the resulting
code faster. I've seen measurements that indicate most code is either executed
once, or is executed many, many times.
With current qemu it doesn't make all that much difference, as we only have
the quick-and-dumb code generator.
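A sketch of that policy, with placeholder names rather than current qemu
code: translate quickly the first time, and only re-translate with a
(hypothetical) slower, smarter generator once a block has proven hot:

/* Re-translate hot blocks only; everything here is illustrative. */
#define HOT_THRESHOLD 1000UL

typedef struct TransBlock {
    unsigned long guest_pc;
    unsigned long exec_count;
    int optimized;
    void *host_code;
} TransBlock;

extern void *translate_fast(unsigned long pc);       /* quick-and-dumb, used first */
extern void *translate_optimized(unsigned long pc);  /* slower, smarter */

void before_exec(TransBlock *tb)
{
    if (!tb->optimized && ++tb->exec_count >= HOT_THRESHOLD) {
        tb->host_code = translate_optimized(tb->guest_pc);
        tb->optimized = 1;    /* pay the extra cost only for hot blocks */
    }
}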
Paul