Re: [Qemu-devel] i386 emulation: improved flag handing

From: Fabrice Bellard <fabrice@bellard.org>
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] i386 emulation: improved flag handing
Date: Sun, 29 Aug 2004 14:58:58 +0200
Message-ID: <4131D312.3090008@bellard.org>
In-Reply-To: <1093734467.7506.131.camel@kubu.opensource.se>

Hi,

The current QEMU eflags handling is inefficient for inc/dec because it
must recompute the C flag, which inc/dec do not modify. I think this is
the most important slowdown due to eflags handling. A simple solution
would be to save CC_OP/CC_SRC/CC_DST instead of computing 'CF'. A test
is still needed when an inc/dec follows another inc/dec, to avoid
saving CC_OP/CC_SRC/CC_DST again.

So the eflags state would be:
CC_OP
CC_SRC
CC_DST
CC_OP_C
CC_DST_C

If CC_OP == CC_OP_INC/DEC, then all eflags except C are computed from
CC_SRC. 'CF' is computed from CC_OP_C, CC_DST_C and CC_SRC (CC_OP_C must
never be CC_OP_INC/DEC).
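
A minimal sketch of this state, to make the comparison concrete. It is not
QEMU code: the four-entry operation enum, the FlagState struct and the
helper names are hypothetical. The text above says the non-carry flags come
from CC_SRC; the sketch instead keeps the inc/dec result in CC_DST so that
CC_SRC survives for the carry computation, which is an assumption about the
intended layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of operation codes, for illustration only. */
enum { CC_OP_ADDL, CC_OP_SUBL, CC_OP_INCL, CC_OP_DECL };

typedef struct {
    int      cc_op;     /* CC_OP: last flag-setting operation */
    uint32_t cc_src;    /* CC_SRC: second operand of the last add/sub */
    uint32_t cc_dst;    /* CC_DST: result of the last operation */
    int      cc_op_c;   /* CC_OP_C: operation that still defines CF */
    uint32_t cc_dst_c;  /* CC_DST_C: result of that operation */
} FlagState;

static int is_incdec(int op)
{
    return op == CC_OP_INCL || op == CC_OP_DECL;
}

static void op_addl(FlagState *s, uint32_t a, uint32_t b)
{
    s->cc_op = CC_OP_ADDL;
    s->cc_src = b;
    s->cc_dst = a + b;
}

static void op_subl(FlagState *s, uint32_t a, uint32_t b)
{
    s->cc_op = CC_OP_SUBL;
    s->cc_src = b;
    s->cc_dst = a - b;
}

/* inc/dec: save the carry-defining state once and leave CC_SRC alone. */
static void op_incdec(FlagState *s, int op, uint32_t a)
{
    assert(is_incdec(op));
    if (!is_incdec(s->cc_op)) {   /* no second save after a previous inc/dec */
        s->cc_op_c = s->cc_op;
        s->cc_dst_c = s->cc_dst;
    }
    s->cc_op = op;
    s->cc_dst = (op == CC_OP_INCL) ? a + 1 : a - 1;
}

static int carry_of(int op, uint32_t src, uint32_t dst)
{
    assert(!is_incdec(op));       /* CC_OP_C must never be inc/dec */
    if (op == CC_OP_ADDL)
        return dst < src;         /* the 32-bit sum wrapped around */
    return (uint32_t)(dst + src) < src;   /* SUBL: borrow iff a < b */
}

static int get_cf(const FlagState *s)
{
    if (is_incdec(s->cc_op))
        return carry_of(s->cc_op_c, s->cc_src, s->cc_dst_c);
    return carry_of(s->cc_op, s->cc_src, s->cc_dst);
}

static int get_zf(const FlagState *s)
{
    return s->cc_dst == 0;
}
```

With this layout an inc/dec costs one compare plus at most two extra
stores, and 'CF' is only evaluated when something actually reads it.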

Your solution seems a little too complicated for the expected gain. Try 
to compare it with my proposal.

Just for your information, my next developments will focus on
improving QEMU performance in the x86 on x86 case to match (or exceed
:-)) the VMware or VirtualPC level of performance. The downside is that
some kernel support will be needed, although it will of course remain
optional. This mode of operation will replace 'qemu-fast'.

For the x86 on PowerPC case, better usage of the host registers would 
give a performance boost. In particular, CC_SRC and CC_DST should be 
saved in host registers too.

Fabrice.

Magnus Damm wrote:
> Hi all,
> 
> Here is something that I've been thinking about the last week. I hope it
> can lead to improved performance.
> 
> / magnus
> 
> 
> The flag emulation code today:
> -------------------------------
> 
> The implementation today is rather straightforward and simple:
> 
> 1. Each emulated instruction that modifies any flag will update up to
> three variables containing instruction type (CC_OP), source value
> (CC_SRC) and destination value (CC_DST). If the instruction does not
> modify all flags, the previous flags are calculated first - hopefully
> only the carry flag.
> 
> 2. When an instruction depends on a flag, all flags (or just the carry
> flag) are calculated from the stored information.
> 
> 3. During the opcode to micro operations translation, the last type of
> flag instruction (CC_OP) is kept track of and only written if necessary.
> 
> 4. After the translation between the i386 opcodes and the micro
> operations has taken place, an optimization step replaces redundant
> micro operations with NOPs.
> 
> 
> Improved flag handling - a more fine grained approach:
> ------------------------------------------------------
> 
> By looking at the "status flag summary" in my 486 book I understand that
> there are basically three groups of x86 instructions that modify flags.
> Note that this does not include rare single-flag modifying instructions.
> 
>    OF SF ZF AF PF CF
> A  x  x  x  x  x  x   
> B  x  x  x  x  x      
> C  x              x   
> 
> Say hello to group A, group B and group C. Group A contains the most
> common flag operations, group B is basically INC and DEC while group C
> contains various shift instructions.
> 
> Each group is kept track of with two variables, CC_SRC_<group> and
> CC_DST_<group>. The current value of the EFLAGS register is stored in a
> variable called CC_EFLAGS. A 32-bit variable, CC_CACHE, is used to store
> the state of each flag. Six tables, one for each flag (cc_table_<flag>),
> are used to look up flag calculating functions.
> 
> 
> CC_CACHE format:
> 
> 12 bits flag state      18 bits group info
> 
> OF SF ZF AF PF CF       A      B      C 
>                         
> NN NN NN NN NN NN       NNNNNN NNNNNN NNNNNN
> 
> Each flag has a two bit field indicating the state:
> 
> 0 -> flag is up to date, no need to flush cache.
> 1 -> flag was last modified by group A
> 2 -> flag was last modified by group B
> 3 -> flag was last modified by group C
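
A minimal sketch of accessors for such a word. The text gives only the
field widths and their left-to-right order, so the exact bit positions
below (flag states in bits 18..29 with CF lowest, group info A/B/C in bits
12..17, 6..11 and 0..5) are an assumption, as are all macro and function
names.

```c
#include <assert.h>
#include <stdint.h>

/* Flags numbered from the low end of the 12-bit state area (assumed). */
enum { FLAG_CF, FLAG_PF, FLAG_AF, FLAG_ZF, FLAG_SF, FLAG_OF };
enum { STATE_CLEAN, STATE_GROUP_A, STATE_GROUP_B, STATE_GROUP_C };
enum { GROUP_A, GROUP_B, GROUP_C };

#define FLAG_SHIFT(f)  (18 + 2 * (f))   /* 2-bit state fields, bits 18..29 */
#define GROUP_SHIFT(g) (12 - 6 * (g))   /* 6-bit info fields, bits 0..17 */

static unsigned get_flag_state(uint32_t cache, int flag)
{
    return (cache >> FLAG_SHIFT(flag)) & 3u;
}

static uint32_t set_flag_state(uint32_t cache, int flag, unsigned state)
{
    assert(state <= 3u);
    cache &= ~((uint32_t)3 << FLAG_SHIFT(flag));
    return cache | ((uint32_t)state << FLAG_SHIFT(flag));
}

static unsigned get_group_info(uint32_t cache, int group)
{
    return (cache >> GROUP_SHIFT(group)) & 0x3fu;
}

static uint32_t set_group_info(uint32_t cache, int group, unsigned op)
{
    assert(op < 64u);   /* a 6-bit field holds at most 64 instruction numbers */
    cache &= ~((uint32_t)0x3f << GROUP_SHIFT(group));
    return cache | ((uint32_t)op << GROUP_SHIFT(group));
}
```

The 6-bit group info fields cap each group at 64 distinct instruction
numbers, and the 30 bits used leave two bits of the 32-bit word spare.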
> 
> 
> When an instruction that belongs to group A is translated into micro
> operations, the last micro operation will perform up to three variable
> writes:
> 
> 1. CC_CACHE is written with all flags states set to 1 (indicating the
> flag belongs to group A) and group info A field is set to the
> instruction number (compare with CC_OP today). This is a single 32 bit
> write.
> 
> 2. CC_DST_A is set in the same way as CC_DST today.
> 
> 3. If required, CC_SRC_A is set too.
> 
> When a group B or C instruction is translated, the last micro operation
> will perform:
> 
> 1. CC_CACHE is modified (read-modify-write) to update the flags and
> group info field B or C. For group B, all flags except CF are set to 2
> (indicating group B). For group C, the OF and CF fields are set to 3
> indicating group C.
> 
> 2. For group B CC_DST_B is written, for group C CC_DST_C is written.
> 
> 3. If required CC_SRC_B or CC_SRC_C is written.
> 
> Because group A instructions are the most common ones, the group A
> implementation is faster (no read-modify-write) than group B and C.
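
The difference between the two update paths can be sketched as follows.
The bit layout (flag states in bits 18..29, group info A/B/C in bits
12..17, 6..11 and 0..5), the CCState struct and the function names are
assumptions made for the example, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

enum { F_CF, F_PF, F_AF, F_ZF, F_SF, F_OF };

/* Assumed layout: 2-bit flag states above three 6-bit group info fields. */
#define ST(f, st)     ((uint32_t)(st) << (18 + 2 * (f)))
#define ALL_FLAGS(st) (ST(F_CF, st) | ST(F_PF, st) | ST(F_AF, st) | \
                       ST(F_ZF, st) | ST(F_SF, st) | ST(F_OF, st))
#define INFO_A(op)    ((uint32_t)(op) << 12)
#define INFO_B(op)    ((uint32_t)(op) << 6)
#define INFO_C(op)    ((uint32_t)(op))

typedef struct {
    uint32_t cc_cache;
    uint32_t cc_src_a, cc_dst_a;
    uint32_t cc_src_b, cc_dst_b;
    uint32_t cc_src_c, cc_dst_c;
} CCState;

/* Group A owns every flag afterwards, so CC_CACHE is a plain store. */
static void update_group_a(CCState *s, unsigned op, uint32_t src, uint32_t dst)
{
    assert(op < 64u);
    s->cc_cache = ALL_FLAGS(1) | INFO_A(op);
    s->cc_src_a = src;
    s->cc_dst_a = dst;
}

/* Group B leaves the CF state and the A/C info fields untouched. */
static void update_group_b(CCState *s, unsigned op, uint32_t src, uint32_t dst)
{
    const uint32_t mask = (ALL_FLAGS(3) & ~ST(F_CF, 3)) | INFO_B(0x3f);

    assert(op < 64u);
    s->cc_cache = (s->cc_cache & ~mask)
                | (ALL_FLAGS(2) & ~ST(F_CF, 3)) | INFO_B(op);
    s->cc_src_b = src;
    s->cc_dst_b = dst;
}

/* Group C takes over only OF and CF. */
static void update_group_c(CCState *s, unsigned op, uint32_t src, uint32_t dst)
{
    const uint32_t mask = ST(F_OF, 3) | ST(F_CF, 3) | INFO_C(0x3f);

    assert(op < 64u);
    s->cc_cache = (s->cc_cache & ~mask)
                | ST(F_OF, 3) | ST(F_CF, 3) | INFO_C(op);
    s->cc_src_c = src;
    s->cc_dst_c = dst;
}
```

Because the masks are compile-time constants, the group B and C paths cost
one load, one AND, one OR and one store on CC_CACHE, against a single
store for group A.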
> 
> 
> Question: What happens when an instruction needs to test one or more
> flags? Answer: Before a flag can be used to calculate anything, micro
> operations that flush the state of each needed flag must be performed,
> one micro operation per flag. The post-translation optimization step
> could probably change more than N flag flush micro operations into one
> micro operation flushing all flags if that would be more efficient.
> 
> When the cache of one flag is flushed, the corresponding flag state
> field in CC_CACHE is read out and used as an index into cc_group_<flag>
> to select the function used to flush the flag.
> 
> Every cc_group_<flag>[0] points to a function that just returns;
> remember that a flag state of 0 means that the flag is up to date.
> The other functions will calculate the flag based on CC_DST_<group> and
> CC_SRC_<group>, store the result in CC_EFLAGS and then mark the flag
> state in CC_CACHE as 0 to indicate that the flag now is up to date.
> The actual implementation of the flag calculation code will of course
> vary. For some flags the code could be shared between all instruction
> types in one group. Example: ZF and PF are probably handled in the same
> way for all group A instructions. Other flags will probably need a
> second lookup based on the instruction type.
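
The table-driven flush described above could look roughly like this for
ZF alone. The position of the ZF state field (bits 24..25 of CC_CACHE)
and the function names are assumptions; ZF being bit 6 of EFLAGS is
architectural. Group C never modifies ZF, so state 3 is treated as
unreachable here.

```c
#include <assert.h>
#include <stdint.h>

#define ZF_MASK        0x40u   /* ZF is bit 6 of EFLAGS */
#define ZF_STATE_SHIFT 24      /* assumed position of the ZF state field */

static uint32_t CC_CACHE, CC_EFLAGS, CC_DST_A, CC_DST_B;

/* State 0: the flag in CC_EFLAGS is already up to date. */
static void flush_zf_none(void)
{
}

/* Store ZF into CC_EFLAGS and mark its state as up to date. */
static void store_zf(uint32_t dst)
{
    CC_EFLAGS = (CC_EFLAGS & ~ZF_MASK) | (dst == 0 ? ZF_MASK : 0u);
    CC_CACHE &= ~((uint32_t)3 << ZF_STATE_SHIFT);
}

static void flush_zf_a(void)
{
    store_zf(CC_DST_A);
}

static void flush_zf_b(void)
{
    store_zf(CC_DST_B);
}

/* State 3 cannot occur for ZF because group C only touches OF and CF. */
static void flush_zf_invalid(void)
{
    assert(0 && "group C never modifies ZF");
}

static void (*const cc_group_zf[4])(void) = {
    flush_zf_none, flush_zf_a, flush_zf_b, flush_zf_invalid
};

/* The flush micro operation: index the table with the 2-bit state. */
static void flush_zf(void)
{
    cc_group_zf[(CC_CACHE >> ZF_STATE_SHIFT) & 3u]();
}
```

ZF depends only on the result, so one function per group is enough here;
a flag such as OF would need the second lookup on the group info field
mentioned above.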
> 
> 
> So, what my improved flag handling scheme basically does is to divide
> the load of calculating the flags into several small pieces. Only the
> flags required by an instruction must be flushed. I hope that some
> cycles could be saved by not calculating all flags. The downside is of
> course that it will be less efficient to update all flags compared with
> the implementation today, and that it is less efficient to modify group
> B/C (read-modify-write) and store CC_DST_B/C + CC_SRC_B/C than to just
> store CC_OP, CC_DST and CC_SRC like today.
> 
> A good thing though is that it is always possible to set any flag in the
> EFLAGS register without recalculating any other flags. And, of course, I
> feel that it would be easier to add more advanced optimization code
> later on...
> 
> Should I start hacking on a patch? Or would it be a waste of time?
> Please let me know what you think. Thanks!