From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.33)
	id 1C1CHG-0003UO-UA
	for qemu-devel@nongnu.org; Sat, 28 Aug 2004 19:06:31 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.33)
	id 1C1CHF-0003Rn-1r
	for qemu-devel@nongnu.org; Sat, 28 Aug 2004 19:06:29 -0400
Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.33) id 1C1CHE-0003Rh-SV
	for qemu-devel@nongnu.org; Sat, 28 Aug 2004 19:06:28 -0400
Received: from [213.80.72.10] (helo=kubrik.opensource.se)
	by monty-python.gnu.org with esmtp (Exim 4.34) id 1C1CC5-00082p-JJ
	for qemu-devel@nongnu.org; Sat, 28 Aug 2004 19:01:10 -0400
Received: from 192.168.1.16 (unknown [213.80.72.14])
	by kubrik.opensource.se (Postfix) with ESMTP id 4A3A73752C
	for <qemu-devel@nongnu.org>; Sun, 29 Aug 2004 00:49:16 +0200 (CEST)
From: Magnus Damm <damm@opensource.se>
Content-Type: text/plain
Message-Id: <1093734467.7506.131.camel@kubu.opensource.se>
Mime-Version: 1.0
Date: Sun, 29 Aug 2004 01:07:47 +0200
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] i386 emulation: improved flag handling
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org

Hi all,

Here is something that I've been thinking about for the last week. I
hope it can lead to improved performance.

/ magnus


The flag emulation code today:
-------------------------------

The implementation today is rather straightforward and simple:

1. Each emulated instruction that modifies any flag will update up to
three variables containing instruction type (CC_OP), source value
(CC_SRC) and destination value (CC_DST). If the instruction does not
modify all flags, the previous flags are calculated first - hopefully
only the carry flag.

2. When an instruction depends on a flag, all flags (or just the carry
flag) are calculated from the stored information.

3. During the opcode to micro operations translation, the last type of
flag instruction (CC_OP) is kept track of and only written if necessary.

4. After the translation between the i386 opcodes and the micro
operations has taken place, an optimization step replaces micro
operations that are redundant with NOPs.
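A minimal sketch of steps 1 and 2, to make the lazy evaluation concrete.
This is not QEMU's actual code: the LazyFlags struct, the two operation
types and the helper names are made up for illustration, and only CF, ZF
and SF are computed.

```c
#include <stdint.h>

/* Hypothetical operation types; the real CC_OP list is much longer. */
enum { CC_OP_ADDL = 1, CC_OP_SUBL = 2 };

/* EFLAGS bit positions for the three flags computed here. */
#define EFLAGS_CF 0x0001u
#define EFLAGS_ZF 0x0040u
#define EFLAGS_SF 0x0080u

typedef struct {
    uint32_t cc_op;   /* which kind of instruction set the flags last */
    uint32_t cc_src;  /* its source operand */
    uint32_t cc_dst;  /* its result */
} LazyFlags;

/* Step 1: a flag-modifying instruction only records its operands. */
static void record_addl(LazyFlags *s, uint32_t a, uint32_t b)
{
    s->cc_op = CC_OP_ADDL;
    s->cc_src = b;
    s->cc_dst = a + b;
}

static void record_subl(LazyFlags *s, uint32_t a, uint32_t b)
{
    s->cc_op = CC_OP_SUBL;
    s->cc_src = b;
    s->cc_dst = a - b;
}

/* Step 2: flags are derived from the recorded values only on demand. */
static uint32_t compute_flags(const LazyFlags *s)
{
    uint32_t flags = 0;

    switch (s->cc_op) {
    case CC_OP_ADDL:
        /* Unsigned wraparound: the result is smaller than an operand. */
        if (s->cc_dst < s->cc_src)
            flags |= EFLAGS_CF;
        break;
    case CC_OP_SUBL:
        /* dst + src recovers the first operand; borrow if it was < src. */
        if ((uint32_t)(s->cc_dst + s->cc_src) < s->cc_src)
            flags |= EFLAGS_CF;
        break;
    default:
        return 0; /* unknown operation type in this sketch */
    }
    if (s->cc_dst == 0)
        flags |= EFLAGS_ZF;
    if (s->cc_dst & 0x80000000u)
        flags |= EFLAGS_SF;
    return flags;
}
```

Note that compute_flags() always derives every flag it knows about, even
if the caller only wants one of them.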


Improved flag handling - a more fine-grained approach:
------------------------------------------------------

By looking at the "status flag summary" in my 486 book I understand that
there are basically three groups of x86 instructions that modify flags.
Note that this does not include rare single-flag modifying instructions.

   OF SF ZF AF PF CF
A  x  x  x  x  x  x   
B  x  x  x  x  x      
C  x              x   

Say hello to group A, group B and group C. Group A contains the most
common flag operations, group B is basically INC and DEC while group C
contains various shift instructions.

Each group is kept track of with two variables, CC_SRC_<group> and
CC_DST_<group>. The current value of the EFLAGS register is stored in a
variable called CC_EFLAGS. A 32-bit variable, CC_CACHE, is used to store
the state of each flag. Six tables, one for each flag (cc_group_<flag>),
are used to look up flag calculating functions.


CC_CACHE format:

12 bits flag state      18 bits group info

OF SF ZF AF PF CF       A      B      C 
                        
NN NN NN NN NN NN       NNNNNN NNNNNN NNNNNN

Each flag has a two-bit field indicating the state:

0 -> flag is up to date, no need to flush cache.
1 -> flag was last modified by group A
2 -> flag was last modified by group B
3 -> flag was last modified by group C
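As a rough sketch, the layout above could be accessed like this. The
exact bit positions are my assumption (group info in the low 18 bits
with C lowest, flag states above it with CF lowest), and all names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flag indices, ordered from the lowest state field upwards (assumed). */
enum { FLAG_CF, FLAG_PF, FLAG_AF, FLAG_ZF, FLAG_SF, FLAG_OF, FLAG_COUNT };

/* The four flag states from the list above. */
enum { STATE_VALID = 0, STATE_GROUP_A = 1, STATE_GROUP_B = 2, STATE_GROUP_C = 3 };

#define GROUP_INFO_BITS 6   /* 18 bits of group info / 3 groups */
#define STATE_BASE      18  /* flag states start above the group info */

static unsigned state_shift(unsigned flag)
{
    assert(flag < FLAG_COUNT);
    return STATE_BASE + 2 * flag;
}

/* group is 1 (A), 2 (B) or 3 (C); A sits highest, C lowest. */
static unsigned info_shift(unsigned group)
{
    assert(group >= STATE_GROUP_A && group <= STATE_GROUP_C);
    return (3 - group) * GROUP_INFO_BITS;
}

static unsigned get_flag_state(uint32_t cache, unsigned flag)
{
    return (cache >> state_shift(flag)) & 3u;
}

static uint32_t set_flag_state(uint32_t cache, unsigned flag, unsigned state)
{
    assert(state <= STATE_GROUP_C);
    cache &= ~(3u << state_shift(flag));
    return cache | ((uint32_t)state << state_shift(flag));
}

static unsigned get_group_info(uint32_t cache, unsigned group)
{
    return (cache >> info_shift(group)) & 0x3Fu;
}

static uint32_t set_group_info(uint32_t cache, unsigned group, unsigned info)
{
    assert(info < 64);
    cache &= ~(0x3Fu << info_shift(group));
    return cache | ((uint32_t)info << info_shift(group));
}
```

With six bits per group, each group can distinguish up to 64 instruction
types, and the whole thing fits in 30 of the 32 bits.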


When an instruction that belongs to group A is translated into micro
operations, the last micro operation will perform up to three variable
writes:

1. CC_CACHE is written with all flag states set to 1 (indicating the
flag belongs to group A) and group info A field is set to the
instruction number (compare with CC_OP today). This is a single 32-bit
write.

2. CC_DST_A is set in the same way as CC_DST today.

3. If required, CC_SRC_A is set too.
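The three writes above could look roughly like this. It assumes a bit
layout where the flag states sit in bits 18..29 and group info A in bits
12..17; the FlagStateA struct and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t cc_cache;
    uint32_t cc_src_a;
    uint32_t cc_dst_a;
} FlagStateA;

/* Six two-bit fields, each holding 1 (= last modified by group A). */
#define ALL_FLAGS_GROUP_A  (0x555u << 18)
#define GROUP_A_INFO_SHIFT 12

static void set_flags_group_a(FlagStateA *s, unsigned op,
                              uint32_t src, uint32_t dst)
{
    assert(op < 64); /* group info A is six bits wide */

    /* 1. One plain store; the old contents of CC_CACHE are not needed,
     *    since no flag refers to group B or C info afterwards. */
    s->cc_cache = ALL_FLAGS_GROUP_A | ((uint32_t)op << GROUP_A_INFO_SHIFT);
    /* 2. Result, like CC_DST today. */
    s->cc_dst_a = dst;
    /* 3. Source operand; a real translator would skip this if unused. */
    s->cc_src_a = src;
}
```

Since the instruction number is known at translation time, the value
stored to CC_CACHE should end up being a constant in the generated code.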

When a group B or C instruction is translated, the last micro operation
will perform:

1. CC_CACHE is modified (read-modify-write) to update the flags and
group info field B or C. For group B, all flags except CF are set to 2
(indicating group B). For group C, the OF and CF fields are set to 3
indicating group C.

2. For group B, CC_DST_B is written; for group C, CC_DST_C is written.

3. If required, CC_SRC_B or CC_SRC_C is written.

Because group A instructions are the most common ones, the group A
implementation is faster (no read-modify-write) than group B and C.
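For comparison, a sketch of the read-modify-write needed for group B and
C. Same assumed layout as before (CF state in bits 18..19, OF state in
bits 28..29, group info B in bits 6..11, group info C in bits 0..5); the
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t cc_cache;
    uint32_t cc_src_b, cc_dst_b;
    uint32_t cc_src_c, cc_dst_c;
} FlagStateBC;

/* Group B touches every flag state except CF (bits 20..29). */
#define GROUP_B_STATE_MASK  (0x3FFu << 20)
#define GROUP_B_STATE_VALUE (0x2AAu << 20)  /* five fields, each 2 */
#define GROUP_B_INFO_SHIFT  6

/* Group C touches only OF and CF; both fields are set to 3. */
#define GROUP_C_STATE_MASK  ((3u << 28) | (3u << 18))
#define GROUP_C_INFO_SHIFT  0

#define GROUP_INFO_MASK     0x3Fu

static void set_flags_group_b(FlagStateBC *s, unsigned op,
                              uint32_t src, uint32_t dst)
{
    uint32_t cache = s->cc_cache;           /* read */

    assert(op < 64);
    cache &= ~(GROUP_B_STATE_MASK | (GROUP_INFO_MASK << GROUP_B_INFO_SHIFT));
    cache |= GROUP_B_STATE_VALUE | ((uint32_t)op << GROUP_B_INFO_SHIFT);
    s->cc_cache = cache;                    /* write back */
    s->cc_dst_b = dst;
    s->cc_src_b = src;
}

static void set_flags_group_c(FlagStateBC *s, unsigned op,
                              uint32_t src, uint32_t dst)
{
    uint32_t cache = s->cc_cache;           /* read */

    assert(op < 64);
    cache &= ~(GROUP_C_STATE_MASK | (GROUP_INFO_MASK << GROUP_C_INFO_SHIFT));
    cache |= GROUP_C_STATE_MASK | ((uint32_t)op << GROUP_C_INFO_SHIFT);
    s->cc_cache = cache;                    /* write back */
    s->cc_dst_c = dst;
    s->cc_src_c = src;
}
```

The point of the masking is that a pending CF from an earlier group A
instruction survives an INC or DEC without being calculated.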


Question: What happens when an instruction needs to test one or more
flags? Answer: Before a flag can be used to calculate anything, micro
operations that flush the state of each required flag must be performed,
one micro operation per flag. The post-translation optimization step
could probably change more than N flag flush micro operations into one
micro operation flushing all flags if that would be more efficient.

When the cache of one flag is flushed, the corresponding flag state
field in CC_CACHE is read out and used as an index into cc_group_<flag>
to point out the function used to flush the flag.

Every cc_group_<flag>[0] entry will point to a function that just
returns; remember that a flag state of 0 means that the flag is up to
date.
The other functions will calculate the flag based on CC_DST_<group> and
CC_SRC_<group>, store the result in CC_EFLAGS and then mark the flag
state in CC_CACHE as 0 to indicate that the flag now is up to date.
The actual implementation of the flag calculation code will of course
vary; for some flags the code could be shared between all instruction
types in one group. Example: ZF and PF are probably handled in the same
way for all group A instructions. Other flags will probably need a
second lookup dealing with the instruction type.
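Here is a sketch of what the flush could look like for ZF only, where no
second lookup is needed. The CPUFlags struct, the array-per-group
storage and the function names are assumptions for illustration, as is
the position of the ZF state field (bits 24..25).

```c
#include <stdint.h>

enum { GROUP_NONE = 0, GROUP_A = 1, GROUP_B = 2, GROUP_C = 3 };

typedef struct CPUFlags {
    uint32_t cc_cache;
    uint32_t cc_eflags;
    uint32_t cc_src[4];  /* indexed by group, slot 0 unused */
    uint32_t cc_dst[4];
} CPUFlags;

typedef void (*FlushFn)(CPUFlags *);

#define EFLAGS_ZF      0x0040u
#define ZF_STATE_SHIFT 24

/* State 0: the flag in CC_EFLAGS is already up to date. */
static void flush_nop(CPUFlags *s)
{
    (void)s;
}

/* Store ZF into CC_EFLAGS and mark its state as 0 (up to date). */
static void store_zf(CPUFlags *s, uint32_t result)
{
    if (result == 0)
        s->cc_eflags |= EFLAGS_ZF;
    else
        s->cc_eflags &= ~EFLAGS_ZF;
    s->cc_cache &= ~(3u << ZF_STATE_SHIFT);
}

static void flush_zf_group_a(CPUFlags *s)
{
    store_zf(s, s->cc_dst[GROUP_A]);
}

static void flush_zf_group_b(CPUFlags *s)
{
    store_zf(s, s->cc_dst[GROUP_B]);
}

/* Group C never modifies ZF, so its slot should never be selected. */
static const FlushFn cc_group_zf[4] = {
    flush_nop, flush_zf_group_a, flush_zf_group_b, flush_nop
};

/* The "flush ZF" micro operation: one table lookup, one indirect call. */
static void flush_zf(CPUFlags *s)
{
    cc_group_zf[(s->cc_cache >> ZF_STATE_SHIFT) & 3u](s);
}
```

A flag like CF would need the group info field as a second index, since
the carry calculation differs between, say, ADD and SUB.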


So, what my improved flag handling scheme basically does is to divide
the load of calculating the flags into several small pieces. Only the
flags required by an instruction must be flushed. I hope that some
cycles could be saved by not calculating all flags. The downside is of
course that it will be less efficient to update all flags compared with
the implementation today. It is also less efficient to modify group
B/C (read-modify-write) and store CC_DST_B/C + CC_SRC_B/C than to just
store CC_OP, CC_DST and CC_SRC like today.

A good thing though is that it is always possible to set any flag in the
EFLAGS register without recalculating any other flags. And, of course, I
feel that it would be easier to add more advanced optimization code
later on...
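To illustrate the first point, a sketch of how a single-flag instruction
such as STC or CLC could be handled: write the bit into CC_EFLAGS and
mark only that flag as up to date. The struct, the function name and the
position of the CF state field (bits 18..19) are assumptions.

```c
#include <stdint.h>

typedef struct {
    uint32_t cc_cache;
    uint32_t cc_eflags;
} FlagCache;

#define EFLAGS_CF      0x0001u
#define CF_STATE_SHIFT 18

/* Set or clear CF directly; the other five flag states are untouched,
 * so whatever is pending for them stays pending. */
static void set_carry(FlagCache *s, int value)
{
    if (value)
        s->cc_eflags |= EFLAGS_CF;
    else
        s->cc_eflags &= ~EFLAGS_CF;
    s->cc_cache &= ~(3u << CF_STATE_SHIFT); /* CF is now up to date */
}
```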

Should I start hacking on a patch? Or would it be a waste of time?
Please let me know what you think. Thanks!