From mboxrd@z Thu Jan 1 00:00:00 1970
From: Magnus Damm
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] instruction optimization thoughts
Date: Sun, 22 Aug 2004 17:01:54 +0200
Message-Id: <1093186895.1266.223.camel@kubu.opensource.se>
List-Id: qemu-devel.nongnu.org

Hi all,

Today I've been playing around with qemu, trying to understand how the emulation works. I've tried some debug flags and looked at the log files.

This is how I believe the translation between x86 opcodes and micro operations is performed today; please correct me if I am wrong: gen_intermediate_code_internal() in target-i386/translate.c builds the intermediate code. For each x86 opcode, disas_insn() emits several micro operations. When the block is finished, optimize_flags() removes some of the flag-related micro operations.
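As a toy model of that flow (everything below is invented for illustration; only the function names mirror target-i386/translate.c, the real code looks nothing like this):

```c
/* Toy model of the translation flow: disas_insn() expands each guest
 * instruction into micro ops, then optimize_flags() prunes redundant
 * flag updates from the finished block.  Invented for illustration. */
#include <assert.h>

#define MAX_OPS  64
#define OP_FLAGS 0xFF            /* stand-in "update EFLAGS" micro op */

typedef struct {
    int opc[MAX_OPS];            /* micro operation opcodes */
    int num_ops;
} Block;

/* Pretend every guest instruction expands into one "real" micro op
 * followed by one flag-update micro op. */
static void disas_insn(Block *b, int insn)
{
    b->opc[b->num_ops++] = insn;
    b->opc[b->num_ops++] = OP_FLAGS;
}

/* Keep only the last flag update: the earlier ones are overwritten
 * before anything can observe them. */
static void optimize_flags(Block *b)
{
    int i, out = 0, last = -1;
    for (i = 0; i < b->num_ops; i++)
        if (b->opc[i] == OP_FLAGS)
            last = i;
    for (i = 0; i < b->num_ops; i++)
        if (b->opc[i] != OP_FLAGS || i == last)
            b->opc[out++] = b->opc[i];
    b->num_ops = out;
}

static int translate_block(const int *code, int n, Block *b)
{
    int i;
    b->num_ops = 0;
    for (i = 0; i < n; i++)
        disas_insn(b, code[i]);
    optimize_flags(b);
    return b->num_ops;
}
```

In this toy model a three-instruction block first expands into six micro ops and shrinks to four after the flag pass.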
After looking at some log files, I wonder if it would be possible to reduce the number of micro operations (especially the ones involved in flag handling) by analyzing the resources used and set by each x86 instruction, and then feeding that information into the code that converts x86 opcodes into micro operations. Have a look at the following example:

----------------
IN:
0x300a8b99:  pop    ebx
0x300a8b9a:  add    ebx,0x11927
0x300a8ba0:  mov    DWORD PTR [ebp-684],eax
0x300a8ba6:  xor    edx,edx
0x300a8ba8:  lea    eax,[ebp-528]
0x300a8bae:  mov    esi,esi
0x300a8bb0:  inc    edx
0x300a8bb1:  mov    DWORD PTR [eax],0
0x300a8bb7:  add    eax,4
0x300a8bba:  cmp    edx,0x4a
0x300a8bbd:  jbe    0x300a8bb0

If we analyze the x86 instructions and keep track of resources first, instead of generating the micro operations directly, we would come up with a table containing resource information for each x86 instruction: the resources it requires and the resources it will set. The table could also quite easily be extended with flags that mark whether a resource is constant or not, which leads to further optimization possibilities later.

instruction         | resources required | resources set
--------------------+--------------------+----------------------
pop ebx             | ESP                | EBX
add ebx,0x11927     | EBX                | EBX OF SF ZF AF PF CF
mov ..ebp-684],eax  | EBP EAX            | IO
xor edx,edx         | EDX                | EDX OF SF ZF AF PF CF
lea eax,[ebp-528]   | EBP                | EAX
mov esi,esi         | ESI                | ESI
inc edx             | EDX                | EDX OF SF ZF AF PF
mov ..[eax], 0      | EAX                | IO
add eax, 4          | EAX                | EAX OF SF ZF AF PF CF
cmp edx, 0x4a       | EDX                | OF SF ZF AF PF CF
jbe ..              | EIP CF ZF          | EIP

Then we perform an optimization step. This step removes "resources set" entries that are redundant. Maybe the code for this step could be shared by many target processors; think of it as some kind of generic resource optimizer.
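Such a resource table could be encoded compactly as bitmasks, one bit per register or EFLAGS bit. A sketch (the encoding and all names are invented for illustration; this is not QEMU code):

```c
#include <assert.h>

/* One bit per trackable resource (invented encoding). */
enum {
    R_EAX = 1 << 0,  R_EBX = 1 << 1,  R_EDX = 1 << 2,
    R_ESI = 1 << 3,  R_EBP = 1 << 4,  R_ESP = 1 << 5,
    R_EIP = 1 << 6,  R_IO  = 1 << 7,   /* memory/IO side effect */
    F_OF  = 1 << 8,  F_SF  = 1 << 9,  F_ZF = 1 << 10,
    F_AF  = 1 << 11, F_PF  = 1 << 12, F_CF = 1 << 13
};
#define F_ALL (F_OF | F_SF | F_ZF | F_AF | F_PF | F_CF)

typedef struct {
    const char *name;
    unsigned required;   /* resources read by the instruction */
    unsigned set;        /* resources written by the instruction */
} InsnInfo;

/* The example block above, encoded as resource information. */
static const InsnInfo example[] = {
    { "pop ebx",           R_ESP,               R_EBX },
    { "add ebx,0x11927",   R_EBX,               R_EBX | F_ALL },
    { "mov [ebp-684],eax", R_EBP | R_EAX,       R_IO },
    { "xor edx,edx",       R_EDX,               R_EDX | F_ALL },
    { "lea eax,[ebp-528]", R_EBP,               R_EAX },
    { "mov esi,esi",       R_ESI,               R_ESI },
    { "inc edx",           R_EDX,               R_EDX | (F_ALL & ~F_CF) },
    { "mov [eax],0",       R_EAX,               R_IO },
    { "add eax,4",         R_EAX,               R_EAX | F_ALL },
    { "cmp edx,0x4a",      R_EDX,               F_ALL },
    { "jbe ..",            R_EIP | F_CF | F_ZF, R_EIP }
};
```

The "is this resource constant" extension mentioned above would just be one more bitmask per entry.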
After optimization:

instruction         | resources required | resources set
--------------------+--------------------+----------------------
pop ebx             | ESP                | EBX
add ebx,0x11927     | EBX                | EBX
mov ..ebp-684],eax  | EBP EAX            | IO
xor edx,edx         | EDX                | EDX
lea eax,[ebp-528]   | EBP                | EAX
mov esi,esi         | ESI                | ESI
inc edx             | EDX                | EDX
mov ..[eax], 0      | EAX                | IO
add eax, 4          | EAX                | EAX
cmp edx, 0x4a       | EDX                | OF SF ZF AF PF CF
jbe ..              | EIP CF ZF          | EIP

Several flag-related resources have been removed above. No registers have been removed, but that would also be possible. The information left in the table is then fed into the code that translates the x86 opcodes into micro operations, and it is up to that code to generate as few micro operations as possible.

I guess what I am trying to say is that it would be cool to add a generic optimization step before the opcode-to-micro-operation translation. But would it be useful? Or just slow? Any thoughts? Maybe the flag handling code is fast enough today?

/ magnus
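The pruning shown in the optimized table falls out of a standard backward liveness pass over the flag bits: a flag write is redundant if a later instruction overwrites that flag before anything reads it. A self-contained sketch (the bitmask encoding is repeated here so the snippet stands alone; again an invented illustration, not QEMU code, and flags are conservatively assumed live at the end of the block):

```c
#include <assert.h>
#include <stddef.h>

/* One bit per resource (invented encoding, illustration only). */
enum {
    R_EAX = 1 << 0,  R_EDX = 1 << 1,  R_EIP = 1 << 2,
    F_OF  = 1 << 8,  F_SF  = 1 << 9,  F_ZF  = 1 << 10,
    F_AF  = 1 << 11, F_PF  = 1 << 12, F_CF  = 1 << 13
};
#define F_ALL (F_OF | F_SF | F_ZF | F_AF | F_PF | F_CF)

typedef struct {
    unsigned required;   /* resources read */
    unsigned set;        /* resources written */
} InsnInfo;

/* Walk the block backwards, tracking which flags are still live.
 * A flag write that is dead (overwritten before any read) is dropped
 * from the "set" mask.  Register writes are left alone here; removing
 * them would need the same pass extended beyond the flag bits. */
static void prune_dead_flags(InsnInfo *insn, size_t n)
{
    unsigned live = F_ALL;   /* flags may be read after the block */
    size_t i = n;
    while (i-- > 0) {
        unsigned set = insn[i].set;
        insn[i].set &= ~(F_ALL & ~live);  /* drop dead flag writes */
        live = (live & ~set) | (insn[i].required & F_ALL);
    }
}
```

On the tail of the example block this reproduces the table above: only cmp keeps its flag writes, because jbe reads CF and ZF and the remaining flags may still be read after the block ends.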