* [Qemu-devel] instruction optimization thoughts
From: Magnus Damm @ 2004-08-22 15:01 UTC
To: qemu-devel
Hi all,
Today I've been playing around with qemu trying to understand how the
emulation works. I've tried some debug flags and looked at log files.
This is how I believe the translation between x86 opcodes and micro
operations is performed today, please correct me if I am wrong:
gen_intermediate_code_internal() in target-i386/translate.c is used to
build intermediate code. The function disas_insn() is used to convert
each opcode into several micro operations. When the block is finished,
the function optimize_flags() is used to optimize away some flag-related
micro operations.
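Roughly, I picture the flow like this (just a sketch; disas_one(),
block_ends() and optimize_flags_pass() are made-up stand-ins for the
real functions, not qemu's actual internals):

/* one guest basic block -> micro operations, then a cleanup pass */
typedef struct MicroOp MicroOp;

extern const unsigned char *disas_one(const unsigned char *pc,
                                      MicroOp *ops, int *n);
extern int block_ends(const unsigned char *pc);
extern int optimize_flags_pass(MicroOp *ops, int n);

static int translate_block(const unsigned char *pc, MicroOp *ops)
{
    int n = 0;

    while (!block_ends(pc))           /* stop at jumps, calls, ... */
        pc = disas_one(pc, ops, &n);  /* one x86 insn -> several micro ops */
    return optimize_flags_pass(ops, n); /* drop some dead flag micro ops */
}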
After looking at some log files I wonder if it would be possible to
reduce the number of micro operations (especially the ones involved in
flag handling) by analyzing the resources used and set by each x86
instruction, and then feeding that information into the code that
converts x86 opcodes into micro operations.
Have a look at the following example:
----------------
IN:
0x300a8b99: pop ebx
0x300a8b9a: add ebx,0x11927
0x300a8ba0: mov DWORD PTR [ebp-684],eax
0x300a8ba6: xor edx,edx
0x300a8ba8: lea eax,[ebp-528]
0x300a8bae: mov esi,esi
0x300a8bb0: inc edx
0x300a8bb1: mov DWORD PTR [eax],0x0
0x300a8bb7: add eax,0x4
0x300a8bba: cmp edx,0x4a
0x300a8bbd: jbe 0x300a8bb0
If we analyze the x86 instructions and keep track of resources first,
instead of generating the micro operations directly, we would come up
with a table containing resource information related to each x86
instruction. This table contains data about required resources and
resources that will be set by each instruction.
The table could also quite easily be extended with flags that mark
whether each resource is constant or not, which would open up further
optimization possibilities later.
instruction | resources required | resources set
pop ebx | ESP | EBX
add ebx,0x11927 | EBX | EBX OF SF ZF AF PF CF
mov ..[ebp-684],eax | EBP EAX | IO
xor edx,edx | EDX | EDX OF SF ZF AF PF CF
lea eax,[ebp-528] | EBP | EAX
mov esi,esi | ESI | ESI
inc edx | EDX | EDX OF SF ZF AF PF
mov ..[eax], 0 | EAX | IO
add eax, 4 | EAX | EAX OF SF ZF AF PF CF
cmp edx, 0x4a | EDX | OF SF ZF AF PF CF
jbe .. | EIP CF ZF | EIP
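In C, such a table could be as small as two bitmasks per instruction.
A minimal sketch (made-up names, nothing like this exists in qemu
today), encoding the block above:

#include <stdint.h>

enum {
    R_EAX = 1 << 0, R_EBX = 1 << 1, R_ECX = 1 << 2, R_EDX = 1 << 3,
    R_ESI = 1 << 4, R_EDI = 1 << 5, R_EBP = 1 << 6, R_ESP = 1 << 7,
    R_EIP = 1 << 8,
    R_CF  = 1 << 9, R_PF = 1 << 10, R_AF = 1 << 11,
    R_ZF  = 1 << 12, R_SF = 1 << 13, R_OF = 1 << 14,
    R_IO  = 1 << 15,          /* memory/device write, never removable */
};
#define R_FLAGS (R_CF | R_PF | R_AF | R_ZF | R_SF | R_OF)

typedef struct {
    uint32_t required;        /* resources read by the instruction */
    uint32_t set;             /* resources written by the instruction */
} InsnRes;

static InsnRes block[] = {
    { R_ESP,               R_EBX },                     /* pop ebx           */
    { R_EBX,               R_EBX | R_FLAGS },           /* add ebx,0x11927   */
    { R_EBP | R_EAX,       R_IO },                      /* mov [ebp-684],eax */
    { R_EDX,               R_EDX | R_FLAGS },           /* xor edx,edx       */
    { R_EBP,               R_EAX },                     /* lea eax,[ebp-528] */
    { R_ESI,               R_ESI },                     /* mov esi,esi       */
    { R_EDX,               R_EDX | (R_FLAGS & ~R_CF) }, /* inc edx (no CF)   */
    { R_EAX,               R_IO },                      /* mov [eax],0       */
    { R_EAX,               R_EAX | R_FLAGS },           /* add eax,4         */
    { R_EDX,               R_FLAGS },                   /* cmp edx,0x4a      */
    { R_EIP | R_CF | R_ZF, R_EIP },                     /* jbe ..            */
};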
Then we perform an optimization step. This step removes set resources
that are redundant. Maybe the code for this step could be shared by
many target processors; think of it as some kind of generic resource
optimizer.
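The pass itself would just be a backwards liveness walk over the
masks sketched above (live_out is whatever may still be needed after
the block; at a block boundary it has to be pessimistic):

static void optimize_resources(InsnRes *ins, int n, uint32_t live_out)
{
    uint32_t live = live_out | R_IO;  /* stores are side effects: keep */
    int i;

    for (i = n - 1; i >= 0; i--) {
        /* anything set here but never needed later is dead */
        ins[i].set &= live;
        /* what must be live before this instruction */
        live = (live & ~ins[i].set) | ins[i].required | R_IO;
    }
}

Running optimize_resources(block, 11, ~0u) on the table above gives
exactly the table below: only the last flag computation (the cmp
feeding jbe) survives.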
After optimization:
instruction | resources required | resources set
pop ebx | ESP | EBX
add ebx,0x11927 | EBX | EBX
mov ..[ebp-684],eax | EBP EAX | IO
xor edx,edx | EDX | EDX
lea eax,[ebp-528] | EBP | EAX
mov esi,esi | ESI | ESI
inc edx | EDX | EDX
mov ..[eax], 0 | EAX | IO
add eax, 4 | EAX | EAX
cmp edx, 0x4a | EDX | OF SF ZF AF PF CF
jbe .. | EIP CF ZF | EIP
Several flag-related resources have been removed above. No other
registers have been removed, but that would also be possible. The
information left in the table is fed into the code that translates the
x86 opcodes into micro operations and it is up to that code to generate
as few micro operations as possible.
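For example, the translation of add could then pick between two
variants depending on the table entry (hypothetical micro op names):

/* inside the x86 -> micro op translator, with res pointing at the
   InsnRes entry sketched earlier */
extern void gen_op_add_cc(void);  /* add + eflags computation */
extern void gen_op_add(void);     /* add only */

static void gen_add(const InsnRes *res)
{
    if (res->set & R_FLAGS)
        gen_op_add_cc();          /* someone still reads the flags */
    else
        gen_op_add();             /* flags are dead: skip them */
}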
I guess what I am trying to say is that it would be cool to add a
generic optimization step before the opcode-to-micro-operation
translation. But would it be useful? Or just slow?
Any thoughts? Maybe the flag handling code is fast enough today?
/ magnus
* Re: [Qemu-devel] instruction optimization thoughts
From: dguinan @ 2004-08-24 1:20 UTC
To: qemu-devel
I have been thinking along these lines and believe that it could be
taken even further. Let's assume that your pre-translation table
optimizer could be made to work. As a second step after eliminating
redundancy and removing effective NOPs, entries in the intermediate
instruction stream could be considered keys into a database of
translations. Depending on the amount of memory we want to spend,
these keys could be small or even quite large. I imagine that we
could use something along the lines of a genetic algorithm to build
highly optimized translation blocks for the most
commonly occurring streams of instructions. The rudimentary building
blocks that the genetic algorithm would work with would be all the
various translations that a particular instruction stream could result
in (e.g. different orderings of order-independent instructions).
The tables would, of course, be built off-line (a datafile could be
placed in CVS, allowing us to contribute a large amount of upfront
compute time to build highly optimized tables for a variety of
different platforms).
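For deriving the keys, something cheap would probably do. As a sketch
(my invention, nothing like this exists in qemu today): hash the raw
guest code bytes of a block, e.g. with 64-bit FNV-1a, and use the
result to address the database:

#include <stdint.h>
#include <stddef.h>

static uint64_t tb_key(const uint8_t *code, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= code[i];
        h *= 0x100000001b3ULL;           /* FNV-1a prime */
    }
    return h;
}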
Thinking along these lines is important. This project not only
presents a fantastic emulator, but also the "possibility" of
eventually reaching par or better performance (if we are smart about
it).
-Daniel
On Aug 22, 2004, at 5:01 AM, Magnus Damm wrote:
> [...]
* Re: [Qemu-devel] instruction optimization thoughts
From: Elefterios Stamatogiannakis @ 2004-08-24 11:45 UTC
To: qemu-devel
I don't think the database solution would work. First of all, the
database would have to be very big in order to be effective. That
means that looking up instruction streams in the database would
thrash the cache (lots and lots of memory reads to various places).
Magnus's idea is very interesting because it is essentially a very
simple peephole optimizer. That is why it would pay off for blocks
that get executed more than once: for a little more translation work,
qemu would have more efficient code to execute.
Nevertheless, I don't know how much of Magnus's idea qemu already
implements. Maybe "Condition code optimisations" + "CPU state
optimisations" from ' http://fabrice.bellard.free.fr/qemu/qemu-tech.html '
come very close to what Magnus suggests.
Fabrice or anyone else with qemu internals knowledge would be more
qualified to answer.
teris.
PS: All these code-optimizing ideas pale in effectiveness next to
what MMU optimization work would produce. There is a reason why there
is a qemu-fast and a qemu-softmmu. If these two could somehow be
consolidated, the performance gain would be considerable... I think.
dguinan@mac.com wrote:
> [...]
* Re: [Qemu-devel] instruction optimization thoughts
From: Piotr Krysik @ 2004-08-24 15:36 UTC
To: qemu-devel
--- Elefterios Stamatogiannakis
<estama@dblab.ece.ntua.gr> wrote:
> I don't think the database solution would work.
> First of all the database would have to be very
> big in order to be effective. That means that in
> order to lookup the streams of instructions in
> the database you would thrash the cache (lots
> and lots of memory reads to various places).
Hi!
The database lookup can be very fast -- use a good
hash function and a memory-mapped file. For best
performance, code fragments that are used together
should be laid out in sequence (a single page fault
and disk read would then bring a few useful code
fragments into RAM). For this reason the database
should be optimized for the specific programs
executed by a particular user.
But once you have a nice optimizer for building
the database, why not integrate it into qemu to
optimize the most frequently executed blocks
(HotSpot-style)? If the optimization then turns
out to be too CPU-expensive, we could add a small
persistent database.
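To illustrate what I mean by "very fast": the
lookup could be as simple as the sketch below
(file format, names and layout are all invented
for the example; it assumes a power-of-two number
of slots and at least one empty slot):

#include <stdint.h>
#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

typedef struct {
    uint64_t key;   /* hash of the instruction stream, 0 = empty */
    uint64_t off;   /* file offset of the optimized translation */
} DbSlot;

static DbSlot *db;
static size_t db_slots;

static int db_open(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    db = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping stays valid */
    if (db == MAP_FAILED)
        return -1;
    db_slots = st.st_size / sizeof(DbSlot);
    return 0;
}

/* open addressing with linear probing: entries stored next to each
   other share pages, so one page fault brings in several of them */
static int db_lookup(uint64_t key, uint64_t *off)
{
    size_t i;

    for (i = key & (db_slots - 1); db[i].key != 0;
         i = (i + 1) & (db_slots - 1)) {
        if (db[i].key == key) {
            *off = db[i].off;
            return 1;
        }
    }
    return 0;                     /* not in the database */
}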
[...]
> teris.
>
> ps All these code optimizing ideas pale in
> effectiveness with what a mmu optimization work
> would produce. There is a reason why there is a
> qemu-fast and a qemu-soft. If somehow these too
> could be consolidated then the performance gain
> would be considerable....
> I think.
The MMU optimization improves performance a lot,
but most of qemu-fast's speed comes from the
code-copy optimization (to compare, run benchmarks
with qemu-fast -no-code-copy). The MMU optimization
is important for code-copy as it allows running
blocks that read/write memory.
For the MMU optimization to work it is necessary
to have a contiguous region of virtual address
space, as big as possible, dedicated to the guest.
For this reason qemu-fast uses a special memory
layout, and that causes some problems (it requires
static compilation and hacking of libc). qemu-fast
will never be as portable as softmmu is, except,
maybe, when running a 32-bit guest on a 64-bit
host.
I did some experiments to check whether speeding
up softmmu with the techniques of the MMU
optimization is feasible. For this I tried to
remove the memory layout constraint of qemu-fast
by using a memory "mapping table".
In these experiments I redirect the memory
accesses of guest code via a mapping table to an
area of virtual address space where guest pages
are mapped (using mmap). This area can be much
smaller than the guest address space, so it
minimizes the problems of qemu-fast and improves
its portability. In the future this approach could
help the MMU optimization enter qemu-softmmu.
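Stripped of the i386 inline assembly the patch
actually uses, a guest byte load through the
mapping table boils down to this (types abridged
here for illustration):

#include <stdint.h>

#define MAP_BLOCK_BITS 24
#define MAP_ADDR_BITS  32

typedef struct {
    /* ...rest of CPUX86State... */
    char *map[1L << (MAP_ADDR_BITS - MAP_BLOCK_BITS)];  /* 256 entries */
} CPUX86StateSketch;

static inline int ldub_map_c(CPUX86StateSketch *env, uint32_t addr)
{
    /* map[] holds, per 16 MB guest block, the guest-to-host offset */
    char *host = env->map[addr >> MAP_BLOCK_BITS] + addr;
    return *(uint8_t *)host;
}

The asm versions additionally leave the map[]
entry in %eax, so when an unmapped page faults,
the SIGSEGV handler can subtract %eax from the
faulting host address to recover the guest address
(see the cpu-exec.c hunk below).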
As each memory access is much simpler than under
softmmu, the benchmarks give better results (but
it is still 20% slower than qemu-fast with
-no-code-copy). When running a real OS, Linux
seems faster than under softmmu, but Windows 98 is
much slower. The problem with Windows is that it
"likes" to modify pages where code is executing,
and this causes lots of page faults.
I'm attaching a patch, so other developers can see
what I'm doing. Before using the patch, please
make sure you can build a working qemu-fast with
your setup. To see it working, run qemu-fast with
the -no-code-copy option.
Fabrice, does it make sense to modify code-copy
to be compatible with this patch (I know it's a lot
of work)?
Regards,
Piotrek
[-- Attachment #2: fast-map-env-2.patch --]
[-- Type: text/x-patch; name="fast-map-env-2.patch", Size: 23069 bytes --]
diff -ru qemu-snapshot-2004-08-04_23/cpu-all.h qemu-snapshot-2004-08-04_23-fast-map/cpu-all.h
--- qemu-snapshot-2004-08-04_23/cpu-all.h 2004-07-05 23:25:09.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/cpu-all.h 2004-08-19 00:28:30.000000000 +0200
@@ -24,6 +24,17 @@
#define WORDS_ALIGNED
#endif
+/* keep in sync with exec-all.h
+ */
+#ifndef offsetof
+#define offsetof(type, field) ((size_t) &((type *)0)->field)
+#endif
+
+/* XXX: assume sizeof(long) >= sizeof(void*)
+ */
+#define map_target2host(env, ptr) \
+ (env->map[((unsigned long) (ptr)) >> MAP_BLOCK_BITS] + ((unsigned long) (ptr)))
+
/* some important defines:
*
* WORDS_ALIGNED : if defined, the host cpu can only make word aligned
@@ -181,6 +192,67 @@
*(uint8_t *)ptr = v;
}
+static inline int ldub_map(void *ptr)
+{
+#if defined(__i386__)
+ int val;
+ asm volatile (
+ "mov %3, %%eax\n"
+ "shr %2, %%eax\n"
+ "mov %1(%%ebp,%%eax,4), %%eax\n"
+ "movzbl (%3,%%eax,1), %0\n"
+ : "=r" (val)
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr)
+ : "%eax");
+ return (val);
+#else
+#error unsupported target CPU
+#endif
+}
+
+static inline int ldsb_map(void *ptr)
+{
+#if defined(__i386__)
+ int val;
+ asm volatile (
+ "mov %3, %%eax\n"
+ "shr %2, %%eax\n"
+ "mov %1(%%ebp,%%eax,4), %%eax\n"
+ "movsbl (%3,%%eax,1), %0\n"
+ : "=r" (val)
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr)
+ : "%eax");
+ return (val);
+#else
+#error unsupported target CPU
+#endif
+}
+
+static inline void stb_map(void *ptr, int v)
+{
+#if defined(__i386__)
+ asm volatile (
+ "mov %2, %%eax\n"
+ "shr %1, %%eax\n"
+ "mov %0(%%ebp,%%eax,4), %%eax\n"
+ "movb %b3, (%2,%%eax)\n"
+ :
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr),
+ "r" (v)
+ : "%eax");
+#else
+#error unsupported target CPU
+#endif
+}
+
+
+
/* NOTE: on arm, putting 2 in /proc/sys/debug/alignment so that the
kernel handles unaligned load/stores may give better results, but
it is a system wide setting : bad */
@@ -467,6 +539,105 @@
*(uint64_t *)ptr = v;
}
+static inline int lduw_map(void *ptr)
+{
+#if defined(__i386__)
+ int val;
+ asm volatile (
+ "mov %3, %%eax\n"
+ "shr %2, %%eax\n"
+ "mov %1(%%ebp,%%eax,4), %%eax\n"
+ "movzwl (%3,%%eax,1), %0\n"
+ : "=r" (val)
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr)
+ : "%eax");
+ return (val);
+#else
+#error unsupported target CPU
+#endif
+}
+
+static inline int ldsw_map(void *ptr)
+{
+#if defined(__i386__)
+ int val;
+ asm volatile (
+ "mov %3, %%eax\n"
+ "shr %2, %%eax\n"
+ "mov %1(%%ebp,%%eax,4), %%eax\n"
+ "movswl (%3,%%eax,1), %0\n"
+ : "=r" (val)
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr)
+ : "%eax");
+ return (val);
+#else
+#error unsupported target CPU
+#endif
+}
+
+static inline int ldl_map(void *ptr)
+{
+#if defined(__i386__)
+ int val;
+ asm volatile (
+ "mov %3, %%eax\n"
+ "shr %2, %%eax\n"
+ "mov %1(%%ebp,%%eax,4), %%eax\n"
+ "movl (%3,%%eax,1), %0\n"
+ : "=r" (val)
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr)
+ : "%eax");
+ return (val);
+#else
+#error unsupported target CPU
+#endif
+}
+
+static inline void stw_map(void *ptr, int v)
+{
+#if defined(__i386__)
+ asm volatile (
+ "mov %2, %%eax\n"
+ "shr %1, %%eax\n"
+ "mov %0(%%ebp,%%eax,4), %%eax\n"
+ "movw %w3, (%2,%%eax)\n"
+ :
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr),
+ "r" (v)
+ : "%eax");
+ /* XXX PK: clobber memory? */
+#else
+#error unsupported target CPU
+#endif
+}
+
+static inline void stl_map(void *ptr, int v)
+{
+#if defined(__i386__)
+ asm volatile (
+ "mov %2, %%eax\n"
+ "shr %1, %%eax\n"
+ "mov %0(%%ebp,%%eax,4), %%eax\n"
+ "movl %3, (%2,%%eax)\n"
+ :
+ : "m" (*(uint8_t *)offsetof(CPUX86State, map[0])),
+ "I" (MAP_BLOCK_BITS),
+ "r" (ptr),
+ "r" (v)
+ : "%eax");
+#else
+#error unsupported target CPU
+#endif
+}
+
/* float access */
static inline float ldfl_raw(void *ptr)
diff -ru qemu-snapshot-2004-08-04_23/cpu-exec.c qemu-snapshot-2004-08-04_23-fast-map/cpu-exec.c
--- qemu-snapshot-2004-08-04_23/cpu-exec.c 2004-07-14 19:20:55.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/cpu-exec.c 2004-08-19 00:05:30.000000000 +0200
@@ -814,10 +814,13 @@
{
struct ucontext *uc = puc;
unsigned long pc;
+ unsigned long addr;
int trapno;
+ int res;
#ifndef REG_EIP
/* for glibc 2.1 */
+#define REG_EAX EAX
#define REG_EIP EIP
#define REG_ERR ERR
#define REG_TRAPNO TRAPNO
@@ -831,10 +834,20 @@
return 1;
} else
#endif
- return handle_cpu_signal(pc, (unsigned long)info->si_addr,
- trapno == 0xe ?
- (uc->uc_mcontext.gregs[REG_ERR] >> 1) & 1 : 0,
- &uc->uc_sigmask, puc);
+ {
+ /* EAX == env->map[addr >> MAP_BLOCK_BITS]
+ * see *_map functions in cpu-all.h
+ */
+ /* XXX: check opcode at pc to detect possible inconsistency?
+ */
+ addr = (unsigned long)info->si_addr - uc->uc_mcontext.gregs[REG_EAX];
+ res = handle_cpu_signal(pc, addr,
+ trapno == 0xe ?
+ (uc->uc_mcontext.gregs[REG_ERR] >> 1) & 1 : 0,
+ &uc->uc_sigmask, puc);
+ uc->uc_mcontext.gregs[REG_EAX] = (uint32_t)cpu_single_env->map[addr >> MAP_BLOCK_BITS];
+ return (res);
+ }
}
#elif defined(__x86_64__)
diff -ru qemu-snapshot-2004-08-04_23/exec.c qemu-snapshot-2004-08-04_23-fast-map/exec.c
--- qemu-snapshot-2004-08-04_23/exec.c 2004-07-05 23:25:10.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/exec.c 2004-08-19 00:22:03.000000000 +0200
@@ -45,7 +45,7 @@
#define SMC_BITMAP_USE_THRESHOLD 10
-#define MMAP_AREA_START 0x00000000
+#define MMAP_AREA_START (MAP_PAGE_SIZE + MAP_BLOCK_SIZE + MAP_PAGE_SIZE)
#define MMAP_AREA_END 0xa8000000
TranslationBlock tbs[CODE_GEN_MAX_BLOCKS];
@@ -125,6 +125,73 @@
FILE *logfile;
int loglevel;
+/* XXX: NOT TESTED
+ */
+static void *map_mmap(CPUState *env, target_ulong begin, target_ulong length, int prot, int flags,
+ int fd, off_t offset)
+{
+ char *addr;
+ void *res;
+
+ addr = map_target2host(env, begin);
+ if ((addr < (char *)MMAP_AREA_START) || ((char *)MMAP_AREA_END <= addr))
+ abort();
+ res = mmap((void *)addr, length, prot, flags, fd, offset);
+ if (res == MAP_FAILED)
+ return (res);
+ if (!(begin & (MAP_BLOCK_SIZE - 1)) && ((char *)MMAP_AREA_START <= map_target2host(env, begin - MAP_PAGE_SIZE))) {
+ addr = map_target2host(env, begin - MAP_PAGE_SIZE) + MAP_PAGE_SIZE;
+ if ((addr < (char *)MMAP_AREA_START) || ((char *)MMAP_AREA_END <= addr))
+ abort();
+ res = mmap((void *)addr, length, prot, flags, fd, offset);
+ }
+ return (res);
+}
+
+/* XXX: NOT TESTED
+ */
+static int map_munmap(CPUState *env, target_ulong begin, target_ulong length)
+{
+ char *addr;
+ int res;
+
+ addr = map_target2host(env, begin);
+ if ((addr < (char *)MMAP_AREA_START) || ((char *)MMAP_AREA_END <= addr))
+ abort();
+ res = munmap((void *)addr, length);
+ if (res == - 1)
+ return (res);
+ if (!(begin & (MAP_BLOCK_SIZE - 1)) && ((char *)MMAP_AREA_START <= map_target2host(env, begin - MAP_PAGE_SIZE))) {
+ addr = map_target2host(env, begin - MAP_PAGE_SIZE) + MAP_PAGE_SIZE;
+ if ((addr < (char *)MMAP_AREA_START) || ((char *)MMAP_AREA_END <= addr))
+ abort();
+ res = munmap((void *)addr, length);
+ }
+ return (res);
+}
+
+/* XXX: NOT TESTED
+ */
+static int map_mprotect(CPUState *env, target_ulong begin, target_ulong length, int prot)
+{
+ char *addr;
+ int res;
+
+ addr = map_target2host(env, begin);
+ if ((addr < (char *)MMAP_AREA_START) || ((char *)MMAP_AREA_END <= addr))
+ abort();
+ res = mprotect((void *)addr, length, prot);
+ if (res == - 1)
+ return (res);
+ if (!(begin & (MAP_BLOCK_SIZE - 1)) && ((char *)MMAP_AREA_START <= map_target2host(env, begin - MAP_PAGE_SIZE))) {
+ addr = map_target2host(env, begin - MAP_PAGE_SIZE) + MAP_PAGE_SIZE;
+ if ((addr < (char *)MMAP_AREA_START) || ((char *)MMAP_AREA_END <= addr))
+ abort();
+ res = mprotect((void *)addr, length, prot);
+ }
+ return (res);
+}
+
static void page_init(void)
{
/* NOTE: we can always suppose that qemu_host_page_size >=
@@ -836,8 +903,8 @@
prot = 0;
for(addr = host_start; addr < host_end; addr += TARGET_PAGE_SIZE)
prot |= page_get_flags(addr);
- mprotect((void *)host_start, qemu_host_page_size,
- (prot & PAGE_BITS) & ~PAGE_WRITE);
+ map_mprotect(cpu_single_env, host_start, qemu_host_page_size,
+ (prot & PAGE_BITS) & ~PAGE_WRITE);
#ifdef DEBUG_TB_INVALIDATE
printf("protecting code page: 0x%08lx\n",
host_start);
@@ -1313,8 +1380,9 @@
}
#if !defined(CONFIG_SOFTMMU)
- if (addr < MMAP_AREA_END)
- munmap((void *)addr, TARGET_PAGE_SIZE);
+ if (((char *)MMAP_AREA_START <= map_target2host(env, addr))
+ && (map_target2host(env, addr) < (char *)MMAP_AREA_END))
+ map_munmap(env, addr, TARGET_PAGE_SIZE);
#endif
}
@@ -1341,8 +1409,9 @@
#if !defined(CONFIG_SOFTMMU)
/* NOTE: as we generated the code for this page, it is already at
least readable */
- if (addr < MMAP_AREA_END)
- mprotect((void *)addr, TARGET_PAGE_SIZE, PROT_READ);
+ if (((char *)MMAP_AREA_START <= map_target2host(env, addr))
+ && (map_target2host(env, addr) < (char *)MMAP_AREA_END))
+ map_mprotect(env, addr, TARGET_PAGE_SIZE, PROT_READ);
#endif
}
@@ -1418,9 +1487,10 @@
if (p->valid_tag == virt_valid_tag &&
p->phys_addr >= start && p->phys_addr < end &&
(p->prot & PROT_WRITE)) {
- if (addr < MMAP_AREA_END) {
- mprotect((void *)addr, TARGET_PAGE_SIZE,
- p->prot & ~PROT_WRITE);
+ if (((char *)MMAP_AREA_START <= map_target2host(env, addr))
+ && (map_target2host(env, addr) < (char *)MMAP_AREA_END)) {
+ map_mprotect(env, addr, TARGET_PAGE_SIZE,
+ p->prot & ~PROT_WRITE);
}
}
addr += TARGET_PAGE_SIZE;
@@ -1556,34 +1626,58 @@
} else {
void *map_addr;
- if (vaddr >= MMAP_AREA_END) {
- ret = 2;
- } else {
- if (prot & PROT_WRITE) {
- if ((pd & ~TARGET_PAGE_MASK) == IO_MEM_ROM ||
+ if (prot & PROT_WRITE) {
+ if ((pd & ~TARGET_PAGE_MASK) == IO_MEM_ROM ||
#if defined(TARGET_HAS_SMC) || 1
- first_tb ||
+ first_tb ||
#endif
- ((pd & ~TARGET_PAGE_MASK) == IO_MEM_RAM &&
- !cpu_physical_memory_is_dirty(pd))) {
- /* ROM: we do as if code was inside */
- /* if code is present, we only map as read only and save the
- original mapping */
- VirtPageDesc *vp;
-
- vp = virt_page_find_alloc(vaddr >> TARGET_PAGE_BITS);
- vp->phys_addr = pd;
- vp->prot = prot;
- vp->valid_tag = virt_valid_tag;
- prot &= ~PAGE_WRITE;
- }
+ ((pd & ~TARGET_PAGE_MASK) == IO_MEM_RAM &&
+ !cpu_physical_memory_is_dirty(pd))) {
+ /* ROM: we do as if code was inside */
+ /* if code is present, we only map as read only and save the
+ original mapping */
+ VirtPageDesc *vp;
+
+ vp = virt_page_find_alloc(vaddr >> TARGET_PAGE_BITS);
+ vp->phys_addr = pd;
+ vp->prot = prot;
+ vp->valid_tag = virt_valid_tag;
+ prot &= ~PAGE_WRITE;
}
- map_addr = mmap((void *)vaddr, TARGET_PAGE_SIZE, prot,
- MAP_SHARED | MAP_FIXED, phys_ram_fd, (pd & TARGET_PAGE_MASK));
- if (map_addr == MAP_FAILED) {
- cpu_abort(env, "mmap failed when mapped physical address 0x%08x to virtual address 0x%08x\n",
- paddr, vaddr);
+ }
+ /* if null block [MAP_PAGE_SIZE ... MAP_PAGE_SIZE + MAP_BLOCK_SIZE + MAP_PAGE_SIZE), alloc new block
+ */
+ /* XXX: handle unaligned access on block bounduary (need to allocate block for address vaddr - MAP_PAGE_SIZE)
+ */
+ if ((map_target2host(env, vaddr) < (char *)MMAP_AREA_START)
+ || ((char *)MMAP_AREA_END <= map_target2host(env, vaddr))) {
+ static uint32_t block_next = MMAP_AREA_START;
+ uint32_t block;
+ int i;
+
+ block = block_next;
+ block_next = block + MAP_BLOCK_SIZE + MAP_PAGE_SIZE;
+ if (block_next > MMAP_AREA_END) {
+ block = MMAP_AREA_START;
+ block_next = block + MAP_BLOCK_SIZE + MAP_PAGE_SIZE;
}
+ /* invalidate pointers to chosen block
+ */
+ /* XXX: NOT TESTED
+ */
+ for (i = 0; i < (1L << (MAP_ADDR_BITS - MAP_BLOCK_BITS)); ++ i)
+ if (env->map[i] == (char *)(block - (i << MAP_BLOCK_BITS))) {
+ env->map[i] = (char *)(MAP_PAGE_SIZE - (i << MAP_BLOCK_BITS));
+ munmap((void *)block, MAP_BLOCK_SIZE + MAP_PAGE_SIZE);
+ }
+ i = vaddr >> MAP_BLOCK_BITS;
+ env->map[i] = (char *)(block - (i << MAP_BLOCK_BITS));
+ }
+ map_addr = map_mmap(env, vaddr, TARGET_PAGE_SIZE, prot,
+ MAP_SHARED | MAP_FIXED, phys_ram_fd, (pd & TARGET_PAGE_MASK));
+ if (map_addr == MAP_FAILED) {
+ cpu_abort(env, "mmap failed when mapped physical address 0x%08x to virtual address 0x%08x\n",
+ paddr, vaddr);
}
}
}
@@ -1604,7 +1698,8 @@
addr &= TARGET_PAGE_MASK;
/* if it is not mapped, no need to worry here */
- if (addr >= MMAP_AREA_END)
+ if ((map_target2host(cpu_single_env, addr) < (char *)MMAP_AREA_START)
+ || (map_target2host(cpu_single_env, addr) >= (char *)MMAP_AREA_END))
return 0;
vp = virt_page_find(addr >> TARGET_PAGE_BITS);
if (!vp)
@@ -1619,7 +1714,7 @@
printf("page_unprotect: addr=0x%08x phys_addr=0x%08x prot=%x\n",
addr, vp->phys_addr, vp->prot);
#endif
- if (mprotect((void *)addr, TARGET_PAGE_SIZE, vp->prot) < 0)
+ if (map_mprotect(cpu_single_env, addr, TARGET_PAGE_SIZE, vp->prot) < 0)
cpu_abort(cpu_single_env, "error mprotect addr=0x%lx prot=%d\n",
(unsigned long)addr, vp->prot);
/* set the dirty bit */
@@ -1754,8 +1849,8 @@
if (prot & PAGE_WRITE_ORG) {
pindex = (address - host_start) >> TARGET_PAGE_BITS;
if (!(p1[pindex].flags & PAGE_WRITE)) {
- mprotect((void *)host_start, qemu_host_page_size,
- (prot & PAGE_BITS) | PAGE_WRITE);
+ map_mprotect(cpu_single_env, host_start, qemu_host_page_size,
+ (prot & PAGE_BITS) | PAGE_WRITE);
p1[pindex].flags |= PAGE_WRITE;
/* and since the content will be modified, we must invalidate
the corresponding translated code. */
diff -ru qemu-snapshot-2004-08-04_23/target-i386/cpu.h qemu-snapshot-2004-08-04_23-fast-map/target-i386/cpu.h
--- qemu-snapshot-2004-08-04_23/target-i386/cpu.h 2004-07-12 22:33:47.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/target-i386/cpu.h 2004-08-19 00:28:38.000000000 +0200
@@ -20,6 +20,12 @@
#ifndef CPU_I386_H
#define CPU_I386_H
+#define MAP_PAGE_BITS 12
+#define MAP_BLOCK_BITS 24
+#define MAP_ADDR_BITS 32
+#define MAP_PAGE_SIZE (1L << MAP_PAGE_BITS)
+#define MAP_BLOCK_SIZE (1L << MAP_BLOCK_BITS)
+
#define TARGET_LONG_BITS 32
/* target supports implicit self modifying code */
@@ -291,6 +297,9 @@
int32_t df; /* D flag : 1 if D = 0, -1 if D = 1 */
uint32_t hflags; /* hidden flags, see HF_xxx constants */
+ /* offset <= 127 to enable assembly optimization */
+ void *map[1L << (MAP_ADDR_BITS - MAP_BLOCK_BITS)];
+
/* FPU state */
unsigned int fpstt; /* top of stack index */
unsigned int fpus;
diff -ru qemu-snapshot-2004-08-04_23/target-i386/op.c qemu-snapshot-2004-08-04_23-fast-map/target-i386/op.c
--- qemu-snapshot-2004-08-04_23/target-i386/op.c 2004-08-03 23:37:41.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/target-i386/op.c 2004-08-19 00:04:06.000000000 +0200
@@ -390,7 +390,7 @@
/* memory access */
-#define MEMSUFFIX _raw
+#define MEMSUFFIX _map
#include "ops_mem.h"
#if !defined(CONFIG_USER_ONLY)
diff -ru qemu-snapshot-2004-08-04_23/target-i386/ops_template_mem.h qemu-snapshot-2004-08-04_23-fast-map/target-i386/ops_template_mem.h
--- qemu-snapshot-2004-08-04_23/target-i386/ops_template_mem.h 2004-01-18 22:44:40.000000000 +0100
+++ qemu-snapshot-2004-08-04_23-fast-map/target-i386/ops_template_mem.h 2004-08-19 00:04:06.000000000 +0200
@@ -23,11 +23,11 @@
#if MEM_WRITE == 0
#if DATA_BITS == 8
-#define MEM_SUFFIX b_raw
+#define MEM_SUFFIX b_map
#elif DATA_BITS == 16
-#define MEM_SUFFIX w_raw
+#define MEM_SUFFIX w_map
#elif DATA_BITS == 32
-#define MEM_SUFFIX l_raw
+#define MEM_SUFFIX l_map
#endif
#elif MEM_WRITE == 1
diff -ru qemu-snapshot-2004-08-04_23/target-i386/translate.c qemu-snapshot-2004-08-04_23-fast-map/target-i386/translate.c
--- qemu-snapshot-2004-08-04_23/target-i386/translate.c 2004-06-13 15:26:14.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/target-i386/translate.c 2004-08-19 00:04:06.000000000 +0200
@@ -394,7 +394,7 @@
};
static GenOpFunc *gen_op_arithc_mem_T0_T1_cc[9][2] = {
- DEF_ARITHC(_raw)
+ DEF_ARITHC(_map)
#ifndef CONFIG_USER_ONLY
DEF_ARITHC(_kernel)
DEF_ARITHC(_user)
@@ -423,7 +423,7 @@
};
static GenOpFunc *gen_op_cmpxchg_mem_T0_T1_EAX_cc[9] = {
- DEF_CMPXCHG(_raw)
+ DEF_CMPXCHG(_map)
#ifndef CONFIG_USER_ONLY
DEF_CMPXCHG(_kernel)
DEF_CMPXCHG(_user)
@@ -467,7 +467,7 @@
};
static GenOpFunc *gen_op_shift_mem_T0_T1_cc[9][8] = {
- DEF_SHIFT(_raw)
+ DEF_SHIFT(_map)
#ifndef CONFIG_USER_ONLY
DEF_SHIFT(_kernel)
DEF_SHIFT(_user)
@@ -498,7 +498,7 @@
};
static GenOpFunc1 *gen_op_shiftd_mem_T0_T1_im_cc[9][2] = {
- DEF_SHIFTD(_raw, im)
+ DEF_SHIFTD(_map, im)
#ifndef CONFIG_USER_ONLY
DEF_SHIFTD(_kernel, im)
DEF_SHIFTD(_user, im)
@@ -506,7 +506,7 @@
};
static GenOpFunc *gen_op_shiftd_mem_T0_T1_ECX_cc[9][2] = {
- DEF_SHIFTD(_raw, ECX)
+ DEF_SHIFTD(_map, ECX)
#ifndef CONFIG_USER_ONLY
DEF_SHIFTD(_kernel, ECX)
DEF_SHIFTD(_user, ECX)
@@ -540,8 +540,8 @@
};
static GenOpFunc *gen_op_lds_T0_A0[3 * 3] = {
- gen_op_ldsb_raw_T0_A0,
- gen_op_ldsw_raw_T0_A0,
+ gen_op_ldsb_map_T0_A0,
+ gen_op_ldsw_map_T0_A0,
NULL,
#ifndef CONFIG_USER_ONLY
gen_op_ldsb_kernel_T0_A0,
@@ -555,8 +555,8 @@
};
static GenOpFunc *gen_op_ldu_T0_A0[3 * 3] = {
- gen_op_ldub_raw_T0_A0,
- gen_op_lduw_raw_T0_A0,
+ gen_op_ldub_map_T0_A0,
+ gen_op_lduw_map_T0_A0,
NULL,
#ifndef CONFIG_USER_ONLY
@@ -572,9 +572,9 @@
/* sign does not matter, except for lidt/lgdt call (TODO: fix it) */
static GenOpFunc *gen_op_ld_T0_A0[3 * 3] = {
- gen_op_ldub_raw_T0_A0,
- gen_op_lduw_raw_T0_A0,
- gen_op_ldl_raw_T0_A0,
+ gen_op_ldub_map_T0_A0,
+ gen_op_lduw_map_T0_A0,
+ gen_op_ldl_map_T0_A0,
#ifndef CONFIG_USER_ONLY
gen_op_ldub_kernel_T0_A0,
@@ -588,9 +588,9 @@
};
static GenOpFunc *gen_op_ld_T1_A0[3 * 3] = {
- gen_op_ldub_raw_T1_A0,
- gen_op_lduw_raw_T1_A0,
- gen_op_ldl_raw_T1_A0,
+ gen_op_ldub_map_T1_A0,
+ gen_op_lduw_map_T1_A0,
+ gen_op_ldl_map_T1_A0,
#ifndef CONFIG_USER_ONLY
gen_op_ldub_kernel_T1_A0,
@@ -604,9 +604,9 @@
};
static GenOpFunc *gen_op_st_T0_A0[3 * 3] = {
- gen_op_stb_raw_T0_A0,
- gen_op_stw_raw_T0_A0,
- gen_op_stl_raw_T0_A0,
+ gen_op_stb_map_T0_A0,
+ gen_op_stw_map_T0_A0,
+ gen_op_stl_map_T0_A0,
#ifndef CONFIG_USER_ONLY
gen_op_stb_kernel_T0_A0,
@@ -621,8 +621,8 @@
static GenOpFunc *gen_op_st_T1_A0[3 * 3] = {
NULL,
- gen_op_stw_raw_T1_A0,
- gen_op_stl_raw_T1_A0,
+ gen_op_stw_map_T1_A0,
+ gen_op_stl_map_T1_A0,
#ifndef CONFIG_USER_ONLY
NULL,
@@ -4321,7 +4321,7 @@
DEF_READF( )
- DEF_READF(_raw)
+ DEF_READF(_map)
#ifndef CONFIG_USER_ONLY
DEF_READF(_kernel)
DEF_READF(_user)
@@ -4440,7 +4440,7 @@
DEF_WRITEF( )
- DEF_WRITEF(_raw)
+ DEF_WRITEF(_map)
#ifndef CONFIG_USER_ONLY
DEF_WRITEF(_kernel)
DEF_WRITEF(_user)
@@ -4479,7 +4479,7 @@
[INDEX_op_rorl ## SUFFIX ## _T0_T1_cc] = INDEX_op_rorl ## SUFFIX ## _T0_T1,
DEF_SIMPLER( )
- DEF_SIMPLER(_raw)
+ DEF_SIMPLER(_map)
#ifndef CONFIG_USER_ONLY
DEF_SIMPLER(_kernel)
DEF_SIMPLER(_user)
diff -ru qemu-snapshot-2004-08-04_23/vl.c qemu-snapshot-2004-08-04_23-fast-map/vl.c
--- qemu-snapshot-2004-08-04_23/vl.c 2004-08-04 00:09:30.000000000 +0200
+++ qemu-snapshot-2004-08-04_23-fast-map/vl.c 2004-08-19 00:35:26.000000000 +0200
@@ -3035,6 +3035,8 @@
/* init CPU state */
env = cpu_init();
+ for (i = 0; i < (1L << (MAP_ADDR_BITS - MAP_BLOCK_BITS)); ++ i)
+ env->map[i] = (char *)(MAP_PAGE_SIZE - i * MAP_BLOCK_SIZE);
global_env = env;
cpu_single_env = env;