* [Qemu-devel] Get only TCG code without execution @ 2012-01-15 23:09 Rajat Goyal 2012-01-16 5:32 ` Mulyadi Santosa 2012-01-16 8:41 ` Stefan Hajnoczi 0 siblings, 2 replies; 19+ messages in thread From: Rajat Goyal @ 2012-01-15 23:09 UTC (permalink / raw) To: qemu-devel [-- Attachment #1: Type: text/plain, Size: 787 bytes --] I am doing a project to build a daemonic ARM emulator using QEMU. One of the requirements is to get the complete TCG code for any multi-threaded ARM program that I run on QEMU. I do not need QEMU to execute the program and show me the output. Just the entire TCG code. The latest version of qemu-arm seems to break while running pthread parallel ARM binaries, ie, qemu-arm terminates without completing execution and hence, the entire TCG code cannot be captured in the log. Is there a way by which I can get the complete TCG code for pthread parallel binaries in exchange for not making QEMU execute the binary? Any help would be appreciated. -- Rajat Goyal 5th year undergraduate student Integrated Master of Technology Mathematics and Computing Department of Mathematics IIT Delhi [-- Attachment #2: Type: text/html, Size: 840 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-15 23:09 [Qemu-devel] Get only TCG code without execution Rajat Goyal @ 2012-01-16 5:32 ` Mulyadi Santosa 2012-01-16 8:41 ` Stefan Hajnoczi 1 sibling, 0 replies; 19+ messages in thread
From: Mulyadi Santosa @ 2012-01-16 5:32 UTC (permalink / raw)
To: Rajat Goyal; +Cc: qemu-devel

Hi,

On Mon, Jan 16, 2012 at 06:09, Rajat Goyal <rajat.goyal.90@gmail.com> wrote:
> Is there a way by which I can get the
> complete TCG code for pthread parallel binaries in exchange for not making
> QEMU execute the binary?

The thing is, the way I see it, TCG is meant to work like a JIT compiler, whereas what you want to do is what a static compiler does. Assuming your program has no interactive part (no user input, no waiting for a keypress, etc.), maybe you can just comment out the part of the QEMU code that jumps into the translated block.

NB: You were referring to QEMU user mode emulation, right?

--
regards,
Mulyadi Santosa
Freelance Linux trainer and consultant
blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-15 23:09 [Qemu-devel] Get only TCG code without execution Rajat Goyal 2012-01-16 5:32 ` Mulyadi Santosa @ 2012-01-16 8:41 ` Stefan Hajnoczi 2012-01-16 12:23 ` Rajat Goyal 1 sibling, 1 reply; 19+ messages in thread From: Stefan Hajnoczi @ 2012-01-16 8:41 UTC (permalink / raw) To: Rajat Goyal; +Cc: qemu-devel On Sun, Jan 15, 2012 at 11:09:18PM +0000, Rajat Goyal wrote: > I am doing a project to build a daemonic ARM emulator using QEMU. One of > the requirements is to get the complete TCG code for any multi-threaded ARM > program that I run on QEMU. I do not need QEMU to execute the program and > show me the output. Just the entire TCG code. The latest version of > qemu-arm seems to break while running pthread parallel ARM binaries, ie, > qemu-arm terminates without completing execution and hence, the entire TCG > code cannot be captured in the log. Is there a way by which I can get the > complete TCG code for pthread parallel binaries in exchange for not making > QEMU execute the binary? QEMU is a dynamic binary translator. You don't know the next block without executing the current block. It's not possible to translate a whole program without executing it - remember it can load shared libraries, use self-modifying code, or just employ indirect jumps which you cannot analyze statically. In the general case it's not possible. Can you explain why you're trying to do this? Stefan ^ permalink raw reply [flat|nested] 19+ messages in thread
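A minimal sketch of the point about indirect jumps, using a made-up toy instruction set rather than anything from QEMU (the names `PROGRAM`, `translate_block` and `run` are hypothetical). Blocks are translated only when execution reaches them, and the target of the indirect jump depends on a run-time input, so the set of translated blocks differs from run to run:

```python
# Toy guest program: address -> instruction name.
# "jmp_reg" is an indirect jump whose target is only known at run time.
PROGRAM = {
    0: "load_input",   # r0 = value supplied at run time
    1: "jmp_reg",      # pc = r0 (ends the block)
    10: "halt",
    20: "halt",
}

BLOCK_ENDERS = ("jmp_reg", "halt")


def translate_block(pc):
    """Collect straight-line instructions from pc up to a block-ending one."""
    block = []
    while True:
        op = PROGRAM[pc]
        block.append(op)
        if op in BLOCK_ENDERS:
            return block
        pc += 1


def run(input_value):
    """Translate-and-execute loop; returns start addresses of translated blocks."""
    cache = {}          # block start address -> translated block
    pc, r0 = 0, 0
    while pc is not None:
        if pc not in cache:
            cache[pc] = translate_block(pc)   # translated on first visit only
        for op in cache[pc]:
            if op == "load_input":
                r0 = input_value
            elif op == "jmp_reg":
                pc = r0                       # target known only now
            elif op == "halt":
                pc = None
    return sorted(cache)


print(run(10))  # blocks reached when the input selects address 10
print(run(20))  # a different input reaches a different block
```

In this toy, the block at address 20 is never translated when the input is 10, because nothing ever branches to it.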
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-16 8:41 ` Stefan Hajnoczi @ 2012-01-16 12:23 ` Rajat Goyal 2012-01-16 12:29 ` Peter Maydell 0 siblings, 1 reply; 19+ messages in thread From: Rajat Goyal @ 2012-01-16 12:23 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 2422 bytes --] Thanks for your text, Stefan. The situation is like this. The most basic multi-threaded program (using pthreads) which just prints something like "I am Thread 1" and "I am Thread 2" does not work over the QEMU user emulator. There are no output messages saying "I am thread 1" etc. when the program binary is run over qemu-arm or qemu-i386. For qemu-i386, the reason is alright - there is no implementation for the futex syscall. But for qemu-arm, the syscall trace shows *" *** longjmp causes uninitialized stack frame ***: qemu-arm terminated"*. And hence, the entire TCG code for the binary is not obtained in the QEMU log since QEMU does not complete execution of the binary. What is the way out of this? The reason I need TCG code is because my project work is to write a semantics for TCG micro-operations and then compare my semantics with a semantics for ARM instructions being written by someone else. To test my semantics, I need the corresponding TCG code for several different multi-threaded ARM binaries. Many thanks in anticipation, Rajat. On Mon, Jan 16, 2012 at 8:41 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Sun, Jan 15, 2012 at 11:09:18PM +0000, Rajat Goyal wrote: > > I am doing a project to build a daemonic ARM emulator using QEMU. One of > > the requirements is to get the complete TCG code for any multi-threaded > ARM > > program that I run on QEMU. I do not need QEMU to execute the program and > > show me the output. Just the entire TCG code. 
The latest version of > > qemu-arm seems to break while running pthread parallel ARM binaries, ie, > > qemu-arm terminates without completing execution and hence, the entire > TCG > > code cannot be captured in the log. Is there a way by which I can get the > > complete TCG code for pthread parallel binaries in exchange for not > making > > QEMU execute the binary? > > QEMU is a dynamic binary translator. You don't know the next block > without executing the current block. It's not possible to translate a > whole program without executing it - remember it can load shared > libraries, use self-modifying code, or just employ indirect jumps which > you cannot analyze statically. > > In the general case it's not possible. Can you explain why you're > trying to do this? > > Stefan > -- Rajat Goyal 5th year undergraduate student Integrated Master of Technology Mathematics and Computing Department of Mathematics IIT Delhi [-- Attachment #2: Type: text/html, Size: 2967 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-16 12:23 ` Rajat Goyal @ 2012-01-16 12:29 ` Peter Maydell 2012-01-17 1:04 ` 陳韋任 0 siblings, 1 reply; 19+ messages in thread From: Peter Maydell @ 2012-01-16 12:29 UTC (permalink / raw) To: Rajat Goyal; +Cc: Stefan Hajnoczi, qemu-devel On 16 January 2012 12:23, Rajat Goyal <rajat.goyal.90@gmail.com> wrote: > The situation is like this. The most basic multi-threaded program (using > pthreads) which just prints something like "I am Thread 1" and "I am Thread > 2" does not work over the QEMU user emulator. There are no output messages > saying "I am thread 1" etc. when the program binary is run over qemu-arm or > qemu-i386. For qemu-i386, the reason is alright - there is no implementation > for the futex syscall. But for qemu-arm, the syscall trace shows " *** > longjmp causes uninitialized stack frame ***: qemu-arm terminated". And > hence, the entire TCG code for the binary is not obtained in the QEMU log > since QEMU does not complete execution of the binary. Which version of QEMU are you using? The "uninitialized stack frame" bug should be fixed in 1.0: https://bugs.launchpad.net/qemu/+bug/823902 > What is the way out of this? The reason I need TCG code is because my > project work is to write a semantics for TCG micro-operations and then > compare my semantics with a semantics for ARM instructions being written by > someone else. To test my semantics, I need the corresponding TCG code for > several different multi-threaded ARM binaries. Why does this have to be a multi-threaded binary? In the multithreaded case, the instructions executed by QEMU won't be deterministic (it will depend on how the host OS schedules the multiple threads) so it's going to be hard to compare a long trace output to something else. -- PMM ^ permalink raw reply [flat|nested] 19+ messages in thread
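To give a rough feel for the non-determinism mentioned above, here is a small sketch that enumerates every order in which the steps of two threads can be interleaved while each thread keeps its own order. The thread contents are placeholders, and the model (one global sequence of steps) is an assumption made for illustration, not a description of how QEMU schedules anything:

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of a and b that preserves each sequence's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)          # slots taken by elements of a
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in chosen else next(ib) for i in range(n))


thread1 = ["T1:a", "T1:b", "T1:c"]       # placeholder steps of thread 1
thread2 = ["T2:x", "T2:y", "T2:z"]       # placeholder steps of thread 2

traces = set(interleavings(thread1, thread2))
print(len(traces))                       # number of distinct possible traces
```

Even two threads of three steps each give 20 distinct traces, which is why comparing one long logged trace against another is fragile.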
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-16 12:29 ` Peter Maydell @ 2012-01-17 1:04 ` 陳韋任 2012-01-17 8:33 ` Peter Maydell 0 siblings, 1 reply; 19+ messages in thread
From: 陳韋任 @ 2012-01-17 1:04 UTC (permalink / raw)
To: Peter Maydell; +Cc: Stefan Hajnoczi, Rajat Goyal, qemu-devel

> > What is the way out of this? The reason I need TCG code is because my
> > project work is to write a semantics for TCG micro-operations and then
> > compare my semantics with a semantics for ARM instructions being written by
> > someone else. To test my semantics, I need the corresponding TCG code for
> > several different multi-threaded ARM binaries.
>
> Why does this have to be a multi-threaded binary? In the multithreaded
> case, the instructions executed by QEMU won't be deterministic (it will
> depend on how the host OS schedules the multiple threads) so it's going
> to be hard to compare a long trace output to something else.

I guess Rajat's goal is to compare the "semantics" of the TCG ops with those of the ARM binary, so the non-determinism might not be an issue. Or he wants to use the "semantics" to solve the non-determinism problem.

Regards,
chenwj

--
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-17 1:04 ` 陳韋任 @ 2012-01-17 8:33 ` Peter Maydell 2012-01-19 16:00 ` Rajat Goyal 0 siblings, 1 reply; 19+ messages in thread From: Peter Maydell @ 2012-01-17 8:33 UTC (permalink / raw) To: 陳韋任; +Cc: Stefan Hajnoczi, Rajat Goyal, qemu-devel On 17 January 2012 01:04, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote: >> > What is the way out of this? The reason I need TCG code is because my >> > project work is to write a semantics for TCG micro-operations and then >> > compare my semantics with a semantics for ARM instructions being written by >> > someone else. To test my semantics, I need the corresponding TCG code for >> > several different multi-threaded ARM binaries. >> >> Why does this have to be a multi-threaded binary? In the multithreaded >> case, the instructions executed by QEMU won't be deterministic (it will >> depend on how the host OS schedules the multiple threads) so it's going >> to be hard to compare a long trace output to something else. > > I guess Rajat's goal is to compare the "semantics" of TCG ops and ARM binary, > therefore the non-deterministic might not be the issue. Or he want to use > "semantics" to solve the non-deterministic problem. But if you're looking at the semantics at a level where you don't care about the non-determinism of the threading, you might just as well look at them at an individual instruction or TB level, in which case a single threaded program is just as good and less confusing, surely? -- PMM ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-17 8:33 ` Peter Maydell @ 2012-01-19 16:00 ` Rajat Goyal 2012-01-19 16:15 ` Peter Maydell 2012-01-20 6:12 ` 陳韋任 0 siblings, 2 replies; 19+ messages in thread From: Rajat Goyal @ 2012-01-19 16:00 UTC (permalink / raw) To: Peter Maydell; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 2397 bytes --] Thank you so much for your help Peter. I was using version 0.15.1. On 1.0, it works like a dream! I was not talking about semantics of individual instructions but semantics of the whole multi-threaded program. Multi-threaded programs can lead to several different (most of which are unintended) states of the CPU. What states are possible is described in a mathematically rigorous definition of the ARM memory model. My task is to implement this memory model over TCG ops and then compare the results on several different (multi-threaded) litmus tests with the implementation of the memory model over ARM instructions. For the same task, I need QEMU to give me the TCG translation for code which it never branches into and hence, never needs to translate and execute (because ARM multiprocessors can perform speculative execution). Rajat. On Tue, Jan 17, 2012 at 8:33 AM, Peter Maydell <peter.maydell@linaro.org>wrote: > On 17 January 2012 01:04, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote: > >> > What is the way out of this? The reason I need TCG code is because my > >> > project work is to write a semantics for TCG micro-operations and then > >> > compare my semantics with a semantics for ARM instructions being > written by > >> > someone else. To test my semantics, I need the corresponding TCG code > for > >> > several different multi-threaded ARM binaries. > >> > >> Why does this have to be a multi-threaded binary? 
In the multithreaded > >> case, the instructions executed by QEMU won't be deterministic (it will > >> depend on how the host OS schedules the multiple threads) so it's going > >> to be hard to compare a long trace output to something else. > > > > I guess Rajat's goal is to compare the "semantics" of TCG ops and ARM > binary, > > therefore the non-deterministic might not be the issue. Or he want to use > > "semantics" to solve the non-deterministic problem. > > But if you're looking at the semantics at a level where you don't > care about the non-determinism of the threading, you might just > as well look at them at an individual instruction or TB level, > in which case a single threaded program is just as good and less > confusing, surely? > > -- PMM > -- Rajat Goyal 5th year undergraduate student Integrated Master of Technology Mathematics and Computing Department of Mathematics IIT Delhi [-- Attachment #2: Type: text/html, Size: 2926 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-19 16:00 ` Rajat Goyal @ 2012-01-19 16:15 ` Peter Maydell 2012-01-20 6:38 ` 陳韋任 2012-01-20 6:12 ` 陳韋任 1 sibling, 1 reply; 19+ messages in thread From: Peter Maydell @ 2012-01-19 16:15 UTC (permalink / raw) To: Rajat Goyal; +Cc: qemu-devel On 19 January 2012 16:00, Rajat Goyal <rajat.goyal.90@gmail.com> wrote: > Thank you so much for your help Peter. I was using version 0.15.1. On 1.0, > it works like a dream! Good. > I was not talking about semantics of individual instructions but semantics > of the whole multi-threaded program. Multi-threaded programs can lead to > several different (most of which are unintended) states of the CPU. What > states are possible is described in a mathematically rigorous definition of > the ARM memory model. My task is to implement this memory model over TCG ops > and then compare the results on several different (multi-threaded) litmus > tests with the implementation of the memory model over ARM instructions. 
Some points to note: * The current QEMU code has some known race conditions which can cause crashes/hangs in heavily threaded programs in linux-user mode; see eg https://bugs.launchpad.net/qemu/+bug/668799 * We don't really make a serious attempt at implementing the ARM memory model in QEMU; our load/store exclusive implementation is pretty hopeless, for instance * In linux-user mode we basically just pass loads/stores/etc through as host-cpu loads/stores, so you get whatever the host's memory model semantics are, not what the guest CPU is supposed to do * a combination of the above plus the fact we don't implement caches in system emulation mode means that our implementation of all the barrier insns is a simple no-op; you'll never see barriers at the TCG op level > For > the same task, I need QEMU to give me the TCG translation for code which it > never branches into and hence, never needs to translate and execute (because > ARM multiprocessors can perform speculative execution). QEMU does not do TCG translation for code which it doesn't branch into. Indeed, it's not actually possible to tell whether it is code and not data until you've branched into it... -- PMM ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-19 16:15 ` Peter Maydell @ 2012-01-20 6:38 ` 陳韋任 2012-01-21 0:21 ` Jamie Lokier 0 siblings, 1 reply; 19+ messages in thread
From: 陳韋任 @ 2012-01-20 6:38 UTC (permalink / raw)
To: Peter Maydell; +Cc: Rajat Goyal, qemu-devel

> > I was not talking about semantics of individual instructions but semantics
> > of the whole multi-threaded program. Multi-threaded programs can lead to
> > several different (most of which are unintended) states of the CPU. What
> > states are possible is described in a mathematically rigorous definition of
> > the ARM memory model. My task is to implement this memory model over TCG ops
> > and then compare the results on several different (multi-threaded) litmus
> > tests with the implementation of the memory model over ARM instructions.
>
> Some points to note:
> * The current QEMU code has some known race conditions which can cause
> crashes/hangs in heavily threaded programs in linux-user mode; see eg
> https://bugs.launchpad.net/qemu/+bug/668799
> * We don't really make a serious attempt at implementing the ARM memory
> model in QEMU; our load/store exclusive implementation is pretty hopeless,
> for instance
> * In linux-user mode we basically just pass loads/stores/etc through as
> host-cpu loads/stores, so you get whatever the host's memory model semantics
> are, not what the guest CPU is supposed to do
> * a combination of the above plus the fact we don't implement caches in
> system emulation mode means that our implementation of all the barrier
> insns is a simple no-op; you'll never see barriers at the TCG op level

What is the load/store exclusive implementation? And as a general emulator, QEMU shouldn't implement any architecture-specific memory model, right? What comes to my mind is that QEMU only needs to follow the guest memory operations when it translates the guest binary to TCG ops. When it translates TCG ops to the host binary, it also has to be careful not to mess up the memory ordering.
Regards, chenwj -- Wei-Ren Chen (陳韋任) Computer Systems Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667 Homepage: http://people.cs.nctu.edu.tw/~chenwj ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-20 6:38 ` 陳韋任 @ 2012-01-21 0:21 ` Jamie Lokier 2012-02-02 19:35 ` Rajat Goyal 0 siblings, 1 reply; 19+ messages in thread From: Jamie Lokier @ 2012-01-21 0:21 UTC (permalink / raw) To: 陳韋任; +Cc: Peter Maydell, Rajat Goyal, qemu-devel 陳韋任 wrote: > What's load/store exclusive implementation? It's how some architectures do atomic operations, instead of having atomic instructions like x86 does. > And as a general emulator, QEMU shouldn't implement any > architecture-specific memory model, right? What comes into my mind > is QEMU only need to follow guest memory operations when translates > guest binary to TCG ops. When translate TCG ops to host binary, it > also has to be careful not to mess up the memory ordering. The error occurs when emulating two or more guest CPUs in parallel using two or more host CPUs for speed. Then "not mess up the memory ordering" may require barrier instructions in the host binary code, depending on the guest and host architectures. Without barrier instructions, the CPUs reorder memory accesses even if the instruction order is kept the same. This reordering done by the CPU is called the memory model. TCG cannot currently produce these barrier instructions, and it's not clear if it will ever be able to do so efficiently. -- Jamie ^ permalink raw reply [flat|nested] 19+ messages in thread
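A toy sketch of the load-exclusive / store-exclusive idea described above: a CPU takes a reservation when it loads, and its later store only succeeds if nobody else wrote to that address in between, so an atomic update becomes a retry loop. The class and function names here are hypothetical, and the rules are deliberately simplified; the real architectural semantics are considerably more involved than this.

```python
class ExclusiveMemory:
    """Toy model: one reservation per CPU, tracked per exact address."""

    def __init__(self):
        self.mem = {}        # address -> value
        self.monitor = {}    # cpu id -> address it holds a reservation on

    def load_exclusive(self, cpu, addr):
        self.monitor[cpu] = addr             # take a reservation
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        """Plain store: also clears other CPUs' reservations on addr."""
        self.mem[addr] = value
        for other, reserved in list(self.monitor.items()):
            if other != cpu and reserved == addr:
                del self.monitor[other]

    def store_exclusive(self, cpu, addr, value):
        """Store only if this CPU's reservation on addr is still intact."""
        if self.monitor.get(cpu) != addr:
            return False                     # reservation lost, caller retries
        self.store(cpu, addr, value)
        del self.monitor[cpu]
        return True


def atomic_increment(memory, cpu, addr, interfere=None):
    """Retry loop; returns how many attempts were needed."""
    attempts = 0
    while True:
        attempts += 1
        old = memory.load_exclusive(cpu, addr)
        if interfere is not None and attempts == 1:
            interfere()                      # simulate another CPU writing
        if memory.store_exclusive(cpu, addr, old + 1):
            return attempts


m = ExclusiveMemory()
first = atomic_increment(m, cpu=0, addr=0x100)
second = atomic_increment(m, cpu=0, addr=0x100,
                          interfere=lambda: m.store(1, 0x100, 41))
print(first, second, m.mem[0x100])
```

In the second call another CPU writes 41 between the load and the store, so the first store-exclusive fails and the loop retries against the new value.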
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-21 0:21 ` Jamie Lokier @ 2012-02-02 19:35 ` Rajat Goyal 0 siblings, 0 replies; 19+ messages in thread From: Rajat Goyal @ 2012-02-02 19:35 UTC (permalink / raw) To: qemu-devel [-- Attachment #1: Type: text/plain, Size: 1798 bytes --] Hi, I have modified QEMU to act as a TCG compiler and give me the TCG code for the whole binary. However, I cannot find a way to obtain the last address in the binary. The symbol table loaded into syminfos contains only the address of the last symbol. Not the address of the last machine instruction. I can obtain this if I can obtain the length of the last section in the ELF. How can I do that in QEMU? Thanks, Rajat. On Sat, Jan 21, 2012 at 12:21 AM, Jamie Lokier <jamie@shareable.org> wrote: > 陳韋任 wrote: > > What's load/store exclusive implementation? > > It's how some architectures do atomic operations, instead of having > atomic instructions like x86 does. > > > And as a general emulator, QEMU shouldn't implement any > > architecture-specific memory model, right? What comes into my mind > > is QEMU only need to follow guest memory operations when translates > > guest binary to TCG ops. When translate TCG ops to host binary, it > > also has to be careful not to mess up the memory ordering. > > The error occurs when emulating two or more guest CPUs in parallel > using two or more host CPUs for speed. Then "not mess up the memory > ordering" may require barrier instructions in the host binary code, > depending on the guest and host architectures. Without barrier > instructions, the CPUs reorder memory accesses even if the instruction > order is kept the same. This reordering done by the CPU is called the > memory model. TCG cannot currently produce these barrier instructions, > and it's not clear if it will ever be able to do so efficiently. 
> > -- Jamie > -- Rajat Goyal 5th year undergraduate student Master of Technology in Mathematics and Computing - Integrated Program Department of Mathematics IIT Delhi
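One way to approach the "address of the last machine instruction" question outside QEMU is to read the ELF section header table directly: each section header carries its load address and size, so the end of the code is the largest `sh_addr + sh_size` among sections flagged as executable. The sketch below assumes a little-endian ELF32 file (as for typical ARM user binaries) and is not QEMU's loader code; `end_of_code` and `make_demo_image` are hypothetical names, and the demo image is synthetic.

```python
import struct

SHF_EXECINSTR = 0x4   # ELF section flag: contains executable instructions


def end_of_code(elf: bytes) -> int:
    """Return the first address past the last executable section."""
    if elf[:4] != b"\x7fELF" or elf[4] != 1 or elf[5] != 1:
        raise ValueError("expected a little-endian ELF32 image")
    # ELF32 header: e_shoff at byte 32, e_shentsize at 46, e_shnum at 48.
    (e_shoff,) = struct.unpack_from("<I", elf, 32)
    e_shentsize, e_shnum = struct.unpack_from("<HH", elf, 46)
    end = 0
    for i in range(e_shnum):
        # First six words of Elf32_Shdr:
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size
        _, _, sh_flags, sh_addr, _, sh_size = struct.unpack_from(
            "<6I", elf, e_shoff + i * e_shentsize)
        if sh_flags & SHF_EXECINSTR:
            end = max(end, sh_addr + sh_size)
    return end


def make_demo_image() -> bytes:
    """Build a synthetic header-only ELF32 image for demonstration."""
    ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)      # 32-bit, LE
    header = ident + struct.pack(
        "<HHIIIIIHHHHHH",
        2, 40, 1, 0x8000, 0, 52, 0, 52, 0, 0, 40, 3, 0)   # e_shoff=52, 3 sections
    null = struct.pack("<10I", *([0] * 10))
    text = struct.pack("<10I", 0, 1, 0x6, 0x8000, 0, 0x120, 0, 0, 4, 0)
    data = struct.pack("<10I", 0, 1, 0x3, 0x9000, 0, 0x40, 0, 0, 4, 0)
    return header + null + text + data


print(hex(end_of_code(make_demo_image())))
```

For a real binary the same function can be applied to the file contents read in binary mode. Note that an executable section may also contain literal pools and padding, so its end is an upper bound on the last instruction rather than its exact address.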
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-19 16:00 ` Rajat Goyal 2012-01-19 16:15 ` Peter Maydell @ 2012-01-20 6:12 ` 陳韋任 2012-01-20 9:09 ` Peter Maydell 1 sibling, 1 reply; 19+ messages in thread
From: 陳韋任 @ 2012-01-20 6:12 UTC (permalink / raw)
To: Rajat Goyal; +Cc: Peter Maydell, qemu-devel

> I was not talking about semantics of individual instructions but semantics
> of the whole multi-threaded program. Multi-threaded programs can lead to
> several different (most of which are unintended) states of the CPU. What
> states are possible is described in a mathematically rigorous definition of
> the ARM memory model. My task is to implement this memory model over TCG
> ops and then compare the results on several different (multi-threaded)
> litmus tests with the implementation of the memory model over ARM
> instructions. For the same task, I need QEMU to give me the TCG translation
> for code which it never branches into and hence, never needs to translate
> and execute (because ARM multiprocessors can perform speculative execution).

Out of curiosity, what is the ARM memory model? According to Wikipedia [1], ARMv7 seems to have the same memory model as IA64.

Regards,
chenwj

[1] http://en.wikipedia.org/wiki/Memory_ordering

--
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-20 6:12 ` 陳韋任 @ 2012-01-20 9:09 ` Peter Maydell 2012-01-20 9:44 ` 陳韋任 0 siblings, 1 reply; 19+ messages in thread From: Peter Maydell @ 2012-01-20 9:09 UTC (permalink / raw) To: 陳韋任; +Cc: Rajat Goyal, qemu-devel On 20 January 2012 06:12, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote: > Out of curiosity. What's ARM memory model? From the Wikipedia [1], it seems > ARMv7 has the same memory model as IA64. The ARM memory model is the set of semantics for memory accesses as defined in the ARM Architecture Reference Manual (covering not just reordering but also exclusive accesses, alignment, barriers, etc). The manual devotes 50 pages to it so I'm not about to try to summarise it here :-) > What's load/store exclusive implementation? How we implement the ARM instructions LDREX/STREX/LDREXD/STREXD/etc. These have documented (complicated!) semantics which our implementation doesn't provide. > And as a general emulator, QEMU shouldn't implement any > architecture-specific memory model, right? Wrong, at least in theory. Ideally QEMU should implement exactly the semantics required by the guest architecture memory model (it's allowed to be stricter than the architecture requires, of course), in the same way it should implement the semantics required by the guest architecture instruction set. A guest binary for ARM can rely on the memory ordering constraints imposed by the memory model just as much as it can rely on the fact that the ADD instruction adds two registers together. In practice, of course (a) this is an enormous amount of work and also slows the emulator down drastically and (b) guest binaries don't actually rely that much on the memory model. And the fairly strict memory model provided by x86 means that for x86 hosts we actually get most of the important bits of the guest memory model right anyway. 
> What comes into my mind is QEMU only need to follow guest memory
> operations when translates guest binary to TCG ops. When translate
> TCG ops to host binary, it also has to be careful not to mess up
> the memory ordering.

This might be doable if TCG provided a set of ops which allowed you to implement the guest memory model; it doesn't. If we ever move to emulating guest SMP in multiple host threads this will become more important, I suspect.

From a pragmatic "we just want to run guests" point of view, what QEMU does now is entirely sufficient; I'm just saying that for a strictly correct emulation of the guest architecture we're a bit lacking.

-- PMM
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-20 9:09 ` Peter Maydell @ 2012-01-20 9:44 ` 陳韋任 2012-01-20 10:46 ` Peter Maydell 0 siblings, 1 reply; 19+ messages in thread
From: 陳韋任 @ 2012-01-20 9:44 UTC (permalink / raw)
To: Peter Maydell; +Cc: Rajat Goyal, qemu-devel, 陳韋任

On Fri, Jan 20, 2012 at 09:09:46AM +0000, Peter Maydell wrote:
> On 20 January 2012 06:12, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote:
> > Out of curiosity. What's ARM memory model? From the Wikipedia [1], it seems
> > ARMv7 has the same memory model as IA64.
>
> The ARM memory model is the set of semantics for memory
> accesses as defined in the ARM Architecture Reference
> Manual (covering not just reordering but also exclusive
> accesses, alignment, barriers, etc). The manual devotes
> 50 pages to it so I'm not about to try to summarise it here :-)

It seems Wikipedia only lists the memory ordering part. ;)

> > And as a general emulator, QEMU shouldn't implement any
> > architecture-specific memory model, right?
>
> Wrong, at least in theory. Ideally QEMU should implement exactly
> the semantics required by the guest architecture memory model
> (it's allowed to be stricter than the architecture requires, of
> course), in the same way it should implement the semantics required
> by the guest architecture instruction set. A guest binary for ARM
> can rely on the memory ordering constraints imposed by the memory
> model just as much as it can rely on the fact that the ADD instruction
> adds two registers together. In practice, of course (a) this is an
> enormous amount of work and also slows the emulator down drastically
> and (b) guest binaries don't actually rely that much on the memory
> model. And the fairly strict memory model provided by x86 means that
> for x86 hosts we actually get most of the important bits of the guest
> memory model right anyway.

AFAIK, LLVM defines its own memory model [1], which is inspired by the C++11 memory model. That's why I think that, instead of implementing an architecture-specific memory model, QEMU should define a more general (strict) one. You said,

"guest binaries don't actually rely that much on the memory model."

I think the reason is that those guest binaries are single-threaded. The memory model is important in the multi-threaded case. BTW, our binary translator can now translate x86 binaries to ARM binaries, and ARM has a weaker memory model than x86.

[1] http://llvm.org/docs/LangRef.html#memmodel

Regards,
chenwj

P.S. Happy Chinese New Year. :)

--
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-20 9:44 ` 陳韋任 @ 2012-01-20 10:46 ` Peter Maydell 2012-01-20 19:40 ` Jamie Lokier 0 siblings, 1 reply; 19+ messages in thread From: Peter Maydell @ 2012-01-20 10:46 UTC (permalink / raw) To: 陳韋任; +Cc: Rajat Goyal, qemu-devel On 20 January 2012 09:44, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote: > On Fri, Jan 20, 2012 at 09:09:46AM +0000, Peter Maydell wrote: > AFAIK, LLVM defines it's own memory model [1] which is inspired by the C++11 > memory model. That's why I think instead of implementing architecture-specific > memory model, QEMU should define a more general (strict) one. LLVM has the advantage that it can require all its incoming code to adhere to a common memory model (ie something like the C++ one). > You said, > > "guest binaries don't actually rely that much on the memory model." > > I think the reason is those guest binaries are single thread. Memory model is > important in multi-threaded case. BTW, our binary translator now can translate > x86 binary to ARM binary, and ARM has weaker memory model than x86. Yes. At the moment this works for QEMU on ARM hosts because in system mode QEMU itself is single-threaded so the nastier interactions between multiple guest CPUs don't occur (just about every memory model defines that memory interactions within a single thread of execution behave in the obvious manner). I also had in mind that guest binaries tend to make fairly stereotypical use of things like LDREX/STREX rather than relying on obscure details like their interaction with plain load/stores. > P.S. Happy Chinese New Year. :) You too! -- PMM ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Get only TCG code without execution 2012-01-20 10:46 ` Peter Maydell @ 2012-01-20 19:40 ` Jamie Lokier 2012-02-06 7:25 ` 陳韋任 0 siblings, 1 reply; 19+ messages in thread From: Jamie Lokier @ 2012-01-20 19:40 UTC (permalink / raw) To: Peter Maydell; +Cc: Rajat Goyal, qemu-devel, 陳韋任 Peter Maydell wrote: > > "guest binaries don't actually rely that much on the memory model." > > > > I think the reason is those guest binaries are single thread. Memory model is > > important in multi-threaded case. BTW, our binary translator now can translate > > x86 binary to ARM binary, and ARM has weaker memory model than x86. > > Yes. At the moment this works for QEMU on ARM hosts because in > system mode QEMU itself is single-threaded so the nastier interactions > between multiple guest CPUs don't occur (just about every memory model > defines that memory interactions within a single thread of execution > behave in the obvious manner). > I also had in mind that guest binaries > tend to make fairly stereotypical use of things like LDREX/STREX > rather than relying on obscure details like their interaction with > plain load/stores. As x86 doesn't use or need barrier instructions, when translating x86 to (say) run on ARM host, multi-threaded code that needs barriers isn't easy to detect, so barriers may be required between every memory access in the generated ARM code. -- Jamie ^ permalink raw reply [flat|nested] 19+ messages in thread
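The concern above can be illustrated with the classic "message passing" litmus test: one thread stores data and then a flag, the other loads the flag and then the data. The sketch below enumerates outcomes under two toy models, both of which are assumptions made for illustration and not the actual x86 or ARM definitions: a "strong" model that keeps each thread's accesses in program order (as x86 does for this particular pattern), and a "weak" model in which a thread's accesses to different addresses may be reordered freely, standing in for a weaker host with no barriers emitted.

```python
from itertools import combinations, permutations


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def thread_orders(ops, strong):
    """Strong: program order only. Weak (toy): any order of the accesses."""
    return [list(ops)] if strong else [list(p) for p in permutations(ops)]


def outcomes(strong):
    """Return the set of (r1, r2) results reachable under the chosen model."""
    writer = [("store", "data", 1), ("store", "flag", 1)]
    reader = [("load", "flag", "r1"), ("load", "data", "r2")]
    seen = set()
    for w in thread_orders(writer, strong):
        for r in thread_orders(reader, strong):
            for trace in interleavings(w, r):
                mem = {"data": 0, "flag": 0}
                regs = {}
                for kind, addr, arg in trace:
                    if kind == "store":
                        mem[addr] = arg
                    else:
                        regs[arg] = mem[addr]
                seen.add((regs["r1"], regs["r2"]))
    return seen


print(sorted(outcomes(strong=True)))    # flag seen implies data seen
print(sorted(outcomes(strong=False)))   # (1, 0) becomes possible
```

The outcome `(1, 0)`, where the reader sees the flag but stale data, appears only in the weak model. A translator cannot easily tell which guest accesses form such a pattern, which is the reason barriers might be needed between every pair of accesses.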
* Re: [Qemu-devel] Get only TCG code without execution
From: 陳韋任 @ 2012-02-06 7:25 UTC
To: Jamie Lokier; +Cc: Peter Maydell, Rajat Goyal, qemu-devel, 陳韋任

> As x86 doesn't use or need barrier instructions, when translating x86
> to (say) run on ARM host, multi-threaded code that needs barriers
> isn't easy to detect, so barriers may be required between every memory
> access in the generated ARM code.

Sounds awful to me. Setting aside QEMU's current support for
multi-threaded applications, is it possible to emulate an architecture
with a stronger memory model on one with a weaker model?

Regards,
chenwj

--
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj
* Re: [Qemu-devel] Get only TCG code without execution
From: Jamie Lokier @ 2012-02-10 3:08 UTC
To: 陳韋任; +Cc: Peter Maydell, Rajat Goyal, qemu-devel

陳韋任 wrote:
> > As x86 doesn't use or need barrier instructions, when translating x86
> > to (say) run on ARM host, multi-threaded code that needs barriers
> > isn't easy to detect, so barriers may be required between every memory
> > access in the generated ARM code.
>
> Sounds awful to me. Regardless current QEMU's support for multi-threaded
> application, it's possible to emulate a architecture with stronger memory
> model on a weaker one?

It's possible. Unfortunately, those barriers tend to be quite expensive
and they are needed often, so it would run slowly. Probably a lot slower
than using a single host thread with preemption to simulate multiple
guest CPUs. But someone should try it and find out.

It might be possible to do some deep analysis of the guest to work out
which memory accesses don't need barriers, but it's a hard research
problem with no guarantee of a good solution.

One strategy which comes to mind is simulated MESI or MOESI (cache
coherency protocols) at the page level, so independent guest threads
never have unsynchronised access to the same page. Or at finer
granularity, with more emulation overhead (but still maybe less than
barriers).

Another is software transactional memory techniques.

Neither will run system software at great speed, but certain kinds of
mostly-independent processing, for example a guest running mainly
userspace number crunching in independent processes, might work alright.

-- Jamie
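The page-level MESI idea can be sketched as a small state tracker: each guest thread holds each page in Modified, Exclusive, Shared or Invalid state, and only a state change (a "sync") would need a host barrier or write-back. This is a rough sketch under assumptions, not an existing QEMU mechanism; the class PageCoherence, the 4096-byte PAGE_SIZE, and the syncs counter are all hypothetical, and how a real emulator would trap accesses (for example via page protection) is left out.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


class PageCoherence:
    """Toy per-page MESI tracker for guest threads (hypothetical sketch)."""

    def __init__(self):
        self.state = {}  # (page, thread) -> "M", "E" or "S"; absent means "I"
        self.syncs = 0   # transitions that would need host-side synchronisation

    def _holders(self, page):
        # Threads that currently hold the page in any valid state.
        return [t for (p, t) in self.state if p == page]

    def read(self, thread, addr):
        page = addr // PAGE_SIZE
        if (page, thread) in self.state:
            return  # already M, E or S: read proceeds with no synchronisation
        self.syncs += 1
        others = self._holders(page)
        for t in others:
            # Downgrade M/E holders to S (a write-back would happen here).
            self.state[(page, t)] = "S"
        self.state[(page, thread)] = "S" if others else "E"

    def write(self, thread, addr):
        page = addr // PAGE_SIZE
        current = self.state.get((page, thread), "I")
        if current == "M":
            return  # already the sole writer
        if current == "E":
            self.state[(page, thread)] = "M"  # silent upgrade, nobody to notify
            return
        self.syncs += 1
        for t in self._holders(page):
            if t != thread:
                del self.state[(page, t)]  # invalidate every other copy
        self.state[(page, thread)] = "M"
```

Threads that stay on disjoint pages pay for one sync per page and then run without further synchronisation, which matches the observation that mostly-independent workloads are the favourable case; threads that ping-pong on one page pay a sync on nearly every access.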