From: Magnus Damm
Date: Sun, 29 Aug 2004 01:07:47 +0200
Subject: [Qemu-devel] i386 emulation: improved flag handing
To: qemu-devel@nongnu.org
Message-Id: <1093734467.7506.131.camel@kubu.opensource.se>

Hi all,

Here is something that I've been thinking about for the last week.
I hope it can lead to improved performance.

/ magnus

The flag emulation code today:
-------------------------------

The implementation today is rather straightforward and simple:

1. Each emulated instruction that modifies any flag will update up to
   three variables containing the instruction type (CC_OP), source
   value (CC_SRC) and destination value (CC_DST). If the instruction
   does not modify all flags, the previous flags are calculated first
   - hopefully only the carry flag.

2. When an instruction depends on a flag, all flags (or just the carry
   flag) are calculated from the stored information.

3.
   During the opcode to micro operations translation, the type of the
   last flag-setting instruction (CC_OP) is kept track of and only
   written if necessary.

4. After the translation from i386 opcodes to micro operations has
   taken place, an optimization step replaces redundant micro
   operations with NOPs.

Improved flag handling - a more fine grained approach:
------------------------------------------------------

By looking at the "status flag summary" in my 486 book I understand
that there are basically three groups of x86 instructions that modify
flags. Note that this does not include rare single-flag modifying
instructions.

      OF SF ZF AF PF CF
  A    x  x  x  x  x  x
  B    x  x  x  x  x
  C    x              x

Say hello to group A, group B and group C. Group A contains the most
common flag operations, group B is basically INC and DEC while group C
contains the rotate instructions (ROL, ROR, RCL and RCR).

Each group is kept track of with two variables, CC_SRC_ and CC_DST_
followed by the group letter (CC_SRC_A, CC_DST_A and so on). The
current value of the EFLAGS register is stored in a variable called
CC_EFLAGS. A 32 bit variable, CC_CACHE, is used to store the state of
each flag. Six tables, one for each flag (cc_table_), are used to look
up flag calculating functions.

CC_CACHE format:

  12 bits flag state    18 bits group info
  OF SF ZF AF PF CF     A      B      C
  NN NN NN NN NN NN     NNNNNN NNNNNN NNNNNN

Each flag has a two bit field indicating the state:

0 -> flag is up to date, no need to flush cache.
1 -> flag was last modified by group A
2 -> flag was last modified by group B
3 -> flag was last modified by group C

When an instruction that belongs to group A is translated into micro
operations, the last micro operation will perform up to three variable
writes:

1. CC_CACHE is written with all flag states set to 1 (indicating that
   the flag belongs to group A) and the group info A field set to the
   instruction number (compare with CC_OP today). This is a single 32
   bit write.

2. CC_DST_A is set in the same way as CC_DST today.

3. If required, CC_SRC_A is set too.
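
A rough sketch of the CC_CACHE layout and the group A write described
above, in C. The field widths and state values come from the text; the
bit positions (flag states in bits 0..11 with OF lowest, group info in
bits 12..29), the flag order and the helper names flag_state(),
group_info() and cc_set_group_a() are assumptions made only for this
illustration, not existing QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Flag indices. The order and the bit positions are assumptions; the
   text only fixes the widths (2 bits per flag, 6 bits per group). */
enum { FLAG_OF, FLAG_SF, FLAG_ZF, FLAG_AF, FLAG_PF, FLAG_CF, FLAG_COUNT };

/* Flag states as listed in the text. */
enum { ST_CLEAN = 0, ST_GROUP_A = 1, ST_GROUP_B = 2, ST_GROUP_C = 3 };

enum { GROUP_A, GROUP_B, GROUP_C, GROUP_COUNT };

#define STATE_SHIFT(flag)  ((flag) * 2)          /* bits 0..11  */
#define INFO_SHIFT(group)  (12 + (group) * 6)    /* bits 12..29 */

/* All six state fields set to 1: binary 01 repeated six times. */
#define ALL_FLAGS_GROUP_A  0x555u

static uint32_t CC_CACHE;
static uint32_t CC_SRC_A, CC_DST_A;

/* Read the 2 bit state field of one flag. */
static inline unsigned flag_state(uint32_t cache, int flag)
{
    assert(flag >= 0 && flag < FLAG_COUNT);
    return (cache >> STATE_SHIFT(flag)) & 3u;
}

/* Read the 6 bit instruction number stored for one group. */
static inline unsigned group_info(uint32_t cache, int group)
{
    assert(group >= 0 && group < GROUP_COUNT);
    return (cache >> INFO_SHIFT(group)) & 0x3fu;
}

/* Group A update: a plain 32 bit store to CC_CACHE, no read needed.
   The B and C info fields are overwritten with 0, which is harmless
   because no flag refers to those groups any more. */
static inline void cc_set_group_a(unsigned op, uint32_t src, uint32_t dst)
{
    assert(op < 64);                 /* group info is only 6 bits wide */
    CC_CACHE = ALL_FLAGS_GROUP_A | ((uint32_t)op << INFO_SHIFT(GROUP_A));
    CC_DST_A = dst;
    CC_SRC_A = src;
}
```

With this layout a group A instruction costs the same three stores as
CC_OP, CC_DST and CC_SRC do today.
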
When a group B or C instruction is translated, the last micro operation
will perform:

1. CC_CACHE is modified (read-modify-write) to update the flags and
   the group info field B or C. For group B, all flags except CF are
   set to 2 (indicating group B). For group C, the OF and CF fields
   are set to 3 (indicating group C).

2. For group B, CC_DST_B is written; for group C, CC_DST_C is written.

3. If required, CC_SRC_B or CC_SRC_C is written.

Because group A instructions are the most common ones, the group A
implementation is faster (no read-modify-write) than group B and C.

Question: What happens when an instruction needs to test one or more
flags?

Answer: Before a flag can be used to calculate anything, a micro
operation that flushes the state of that flag must be performed, one
micro operation per flag. The post-translation optimization step could
probably change more than N flag flush micro operations into one micro
operation flushing all flags if that would be more efficient.

When the cache of one flag is flushed, the corresponding flag state
field in CC_CACHE is read out and used as an index into cc_group_ to
pick the function used to flush the flag. The cc_group_[0] entries all
point to a function that just returns; remember that a flag state of 0
means that the flag is up to date. The other functions calculate the
flag based on CC_DST_ and CC_SRC_, store the result in CC_EFLAGS and
then mark the flag state in CC_CACHE as 0 to indicate that the flag is
now up to date.

The actual implementation of the flag calculation code will of course
vary; for some flags the code could be shared between all instruction
types in one group. Example: ZF and PF are probably handled in the
same way for all group A instructions. Other flags will probably need
a second lookup dealing with the instruction type.

So, what my improved flag handling scheme basically does is to divide
the load of calculating the flags into several small pieces.
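
A minimal sketch of the group B read-modify-write and of a single-flag
flush as described above, in C, using ZF as the example. It assumes
the same hypothetical bit layout as the CC_CACHE format (flag states
in bits 0..11 with OF lowest, group info B in bits 18..23) and 32 bit
operands only; zf_flush_table, cc_set_group_b() and flush_zf() are
names made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { FLAG_OF, FLAG_SF, FLAG_ZF, FLAG_AF, FLAG_PF, FLAG_CF };
enum { ST_CLEAN, ST_GROUP_A, ST_GROUP_B, ST_GROUP_C };

#define STATE_SHIFT(flag)   ((flag) * 2)
#define STATE_MASK(flag)    (3u << STATE_SHIFT(flag))
#define GROUP_B_STATE_MASK  0x3ffu   /* state fields of OF SF ZF AF PF */
#define GROUP_B_STATES      0x2aau   /* those five fields all set to 2 */
#define INFO_B_SHIFT        18       /* assumed position of info B     */
#define INFO_B_MASK         (0x3fu << INFO_B_SHIFT)
#define EFLAGS_ZF           0x40u    /* ZF is bit 6 of EFLAGS          */

static uint32_t CC_CACHE, CC_EFLAGS;
static uint32_t CC_DST_A, CC_DST_B;

/* Group B (INC/DEC) update: read-modify-write, so that the CF state
   and the info fields of groups A and C survive. */
static inline void cc_set_group_b(unsigned op, uint32_t dst)
{
    uint32_t cache = CC_CACHE & ~(GROUP_B_STATE_MASK | INFO_B_MASK);

    assert(op < 64);                 /* group info is only 6 bits wide */
    CC_CACHE = cache | GROUP_B_STATES | ((uint32_t)op << INFO_B_SHIFT);
    CC_DST_B = dst;
}

/* Store the computed ZF in CC_EFLAGS and mark it as up to date. */
static void set_zf(int zero)
{
    CC_EFLAGS = zero ? (CC_EFLAGS | EFLAGS_ZF) : (CC_EFLAGS & ~EFLAGS_ZF);
    CC_CACHE &= ~STATE_MASK(FLAG_ZF);          /* state 0: nothing left */
}

typedef void flush_fn(void);

static void zf_clean(void)  { /* state 0: already up to date */ }
static void zf_from_a(void) { set_zf(CC_DST_A == 0); }
static void zf_from_b(void) { set_zf(CC_DST_B == 0); }

/* Indexed by the 2 bit flag state. Group C never modifies ZF, so that
   slot reuses the no-op. */
static flush_fn *const zf_flush_table[4] = {
    zf_clean, zf_from_a, zf_from_b, zf_clean
};

/* Flush only ZF; every other flag stays lazily pending. */
static inline void flush_zf(void)
{
    zf_flush_table[(CC_CACHE >> STATE_SHIFT(FLAG_ZF)) & 3u]();
}
```

Here ZF needs no second lookup on the instruction number because it
only depends on the result value; a flag like OF or CF would have to
dispatch on the group info field as well.
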
Only the flags required by an instruction must be flushed. I hope that
some cycles could be saved by not calculating all flags.

The downside is of course that it will be less efficient to update all
flags compared with the implementation today, and that it is less
efficient to modify group B/C (read-modify-write) and store CC_DST_B/C
+ CC_SRC_B/C than to just store CC_OP, CC_DST and CC_SRC like today.

A good thing though is that it is always possible to set any flag in
the EFLAGS register without recalculating any other flags.

And, of course, I feel that it would be easier to add more advanced
optimization code later on...

Should I start hacking on a patch? Or would it be a waste of time?
Please let me know what you think.

Thanks!