[Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques

* [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
@ 2011-11-29  7:03 陳韋任
  2011-11-30 12:37 ` Alexander Graf
  2011-11-30 12:51 ` Peter Maydell
  0 siblings, 2 replies; 22+ messages in thread
From: 陳韋任 @ 2011-11-29  7:03 UTC (permalink / raw)
  To: qemu-devel

Hi all,

  Our team is working on a project similar to llvm-qemu [1], which is also
based on QEMU and LLVM. Currently, process mode works fine [2], and we are
moving on to system mode.

Let me briefly introduce our framework and describe the problems we have
encountered. We translate TCG IR into LLVM IR and let the LLVM JIT do the
codegen. Our framework can generate code with both TCG and LLVM. For
short-running applications, TCG's code quality is good enough; LLVM codegen is
meant for long-running applications. We have two code caches: the original QEMU
code cache (for basic blocks) and an LLVM code cache (for traces). Our notion of
a trace is the same as the "super-blocks" mentioned in the discussion thread
[3]: a trace is composed of a set of basic blocks. The goal is to enlarge the
optimization scope, in the hope that the code quality of a trace is better than
that of the individual basic blocks. Here is an overview of our framework.

    QEMU code cache    LLVM code cache
        (block)            (trace)

          bb1 ------------> trace1  

In our framework, if we find that a basic block (bb1) is hot enough (i.e., it
has been executed many times), we start building a trace beginning with bb1 and
let LLVM do the codegen. We place the optimized code in the LLVM code cache and
patch the head of bb1 so that anyone executing bb1 jumps directly to trace1.
Since we are moving toward system mode, we have to consider situations where
unlinking is needed. Block linking is done by QEMU itself, and we leave block
unlinking to it as well. The problem is when and where to break the link between
a block and its trace. So far I have found only two places where the
block -> trace link should be broken [4]. I don't know whether I have found them
all or missed something.

  1. cpu_unlink_tb (exec.c)

  2. tb_phys_invalidate (exec.c)
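
The block -> trace bookkeeping described above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
framework's actual code and not QEMU's API: HotBlock and the hot_block_*
helpers are hypothetical names, and the link is modeled as a plain pointer
instead of a patched jump at the head of the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block bookkeeping; this is not QEMU's TranslationBlock. */
typedef struct HotBlock {
    uint64_t exec_count;  /* bumped by the profiling stub at block entry */
    const void *trace;    /* non-NULL while the block is linked to a trace */
} HotBlock;

/* Count one execution. Returns true when the count reaches the threshold
 * and the block is not linked yet, i.e. when a trace should be built. */
bool hot_block_enter(HotBlock *bb, uint64_t threshold)
{
    assert(bb != NULL);
    bb->exec_count++;
    return bb->trace == NULL && bb->exec_count == threshold;
}

/* Record the block -> trace link once the trace code is in the cache. */
void hot_block_link(HotBlock *bb, const void *trace_code)
{
    assert(bb != NULL && trace_code != NULL);
    bb->trace = trace_code;
}

/* Break the link. This would have to be called from every path that
 * invalidates or unlinks the block, e.g. the two functions listed above. */
void hot_block_unlink(HotBlock *bb)
{
    assert(bb != NULL);
    bb->trace = NULL;
    bb->exec_count = 0;  /* start profiling from scratch */
}
```

Keeping the unlink in one helper means every invalidation path can call the
same function, which makes it easier to audit whether a path was missed.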

The big problem is debugging. We test our system using the images downloaded
from the QEMU website [5]. Basically, we want to see an operating system boot
successfully, then log in and run some benchmarks on it. As a first step, we set
a very high threshold for trace building. In other words, a basic block must be
executed *many* times to trigger the trace-building process. Then we lower the
threshold a bit at a time to see how things work. When something goes wrong, we
might get a kernel panic, or the system hangs at some point during boot. I have
no idea how to debug this kind of problem, so I'd like to ask the mailing list
for help, experience, and suggestions. I hope I have made the situation clear.

  Thanks!

[1] http://code.google.com/p/llvm-qemu/
[2] I have to admit we have only tested our framework with the SPEC2006
    benchmarks, not with _real_ applications.
[3] http://lists.cs.uiuc.edu/pipermail/llvmdev/2008-April/013689.html
[4] http://lists.gnu.org/archive/html/qemu-devel/2011-09/msg03643.html
[5] http://wiki.qemu.org/Download

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-11-29  7:03 [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques 陳韋任
@ 2011-11-30 12:37 ` Alexander Graf
  2011-12-01  3:50   ` 陳韋任
  2011-12-06  7:39   ` 陳韋任
  2011-11-30 12:51 ` Peter Maydell
  1 sibling, 2 replies; 22+ messages in thread
From: Alexander Graf @ 2011-11-30 12:37 UTC (permalink / raw)
  To: 陳韋任; +Cc: qemu-devel


On 29.11.2011, at 08:03, 陳韋任 wrote:

> Hi all,
> 
>  Our team are working on a project similar to llvm-qemu [1], which is also
> based on QEMU and LLVM. Current status is the process mode works fine [2], and
> we're moving forward to system mode.
> 
> Let me briefly introduce our framework here and state what problem we encounter.
> What we do is translating TCG IR into LLVM IR and let LLVM JIT do the codegen.
> In our framework, we have both TCG and LLVM codegen capacity. For short-running
> application, TCG's code quality is good enough; LLVM codegen is for long-running
> application on the other hand. We have two code cache in our framework, one is
> the original QEMU code cache (for basic block) and the other is LLVM code cache
> (for trace). The concept of trace is the same as the "super-blocks" as mentioned
> in the discussion thread [3], which is composed of a set of basic blocks. The
> goal is to enlarge the optimization scope and hope the code quality of trace is 
> better than the basic block's. Here is the overview of our framework.
> 
> 
>    QEMU code cache    LLVM code cache
>        (block)            (trace)
> 
>          bb1 ------------> trace1  
> 
> 
> In our framework, if we find a basic block (bb1) is hot enough (i.e., being
> executed many times), we start building a trace (beginning with bb1) and let
> LLVM do the codegen. We place the optimized code in the LLVM code cache, and
> patch the head of bb1 so that anyone executing bb1 will jump to trace1 directly.
> Since we're moving toward system mode, we have to consider situations where
> unlinking is needed. Block linking done by QEMU itself and we leave block
> unlinking to it. The problem is when/where to break the link between block and
> trace. I can only spot two places we should break the block -> trace link so
> far [4]. I don't know if I spot them all or I miss something else.
> 
>  1. cpu_unlink_tb (exec.c)
> 
>  2. tb_phys_invalidate (exec.c)

Very cool! I was thinking about this for a while myself now. It's especially appealing these days since you can do the hotspot optimization in a separate thread :).

Especially in system mode, you also need to flush when tb_flush() is called though. And you have to make sure to match hflags and segment descriptors for the links - otherwise you might end up connecting TBs from different processes :).

> 
> The big problem is debugging. We test our system by using images downloaded from
> the website [5]. Basically, we want to see an operating system being booted

For Linux, I can recommend these images:

  http://people.debian.org/~aurel32/qemu/

If you want to be more exotic (minix found a lot of bugs for me back in the day!) you can try the os zoo:

  http://www.oszoo.org/

> successfully, then login and run some benchmark on it. As a very first step, we
> make a very high threshold on trace building. In other words, a basic block must
> be executed *many* time to trigger the trace building process. Then we lower the
> threshold a bit at a time to see how things work. When something goes wrong, we
> might get kernel panic or the system hangs at some point on the booting process.
> I have no idea on how to solve this kind of problem. So I'd like to seek for
> help/experience/suggestion on the mailing list. I just hope I make the whole
> situation clear to you. 

I don't see any better approach to debugging this than the one you're already taking. Try to run as many workloads as you can and see if they break :). Oh and always make the optimization optional, so that you can narrow it down to it and know you didn't hit a generic QEMU bug.


Alex

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-11-30 12:37 ` Alexander Graf
@ 2011-12-01  3:50   ` 陳韋任
  2011-12-01  7:46     ` Stefan Hajnoczi
                       ` (2 more replies)
  2011-12-06  7:39   ` 陳韋任
  1 sibling, 3 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-01  3:50 UTC (permalink / raw)
  To: Alexander Graf; +Cc: qemu-devel, 陳韋任

Hi Alex,

> Very cool! I was thinking about this for a while myself now. It's especially appealing these days since you can do the hotspot optimization in a separate thread :).
> 
> Especially in system mode, you also need to flush when tb_flush() is called though. And you have to make sure to match hflags and segment descriptors for the links - otherwise you might end up connecting TBs from different processes :).

  I'll check tb_flush again. IIRC, we make the code cache big enough that there
is no need to flush it. But I think we will still need to deal with it in the
end.

  The block linking is done by QEMU and we leave it alone. But I am not aware
that QEMU ever checks hflags and segment descriptors before linking blocks.
Could you point out where it does? Anyway, here is how we form a trace from a
set of basic blocks.

1. We insert instrumentation code at the beginning of each TCG block to count
   how many times the block is executed.

2. When the execution count of a block, say block A, reaches a predefined
   threshold, we follow the run-time execution path to collect block B, which
   follows A, and so on, to form a trace. This approach is called NET
   (Next-Executing Tail) [1].

3. The trace, composed of TCG blocks, is then sent to an LLVM translator. The
   translator generates host binary code for the trace into the LLVM code
   cache and patches the beginning of block A (in the QEMU code cache) so that
   anyone executing block A jumps to the corresponding trace and executes it.

The above is the block-to-trace link. I think there is no need to check hflags
and segment descriptors there, right? I currently limit the trace length to one
basic block (to keep things simple), but I think we still don't have to check
whether the hflags and segment descriptors of the blocks in a trace match.
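
Step 2 above (following the execution path from the hot block) can be sketched
as a small recorder. This is a simplified illustration, not the framework's
code: Trace, TRACE_MAX_BLOCKS, and the trace_* helpers are hypothetical, and
the stop rules used here (the path returns to a block already in the trace, or
a length cap is hit) are assumptions rather than the exact NET criteria.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_MAX_BLOCKS 8  /* hypothetical cap on trace length */

typedef struct Trace {
    uint64_t pcs[TRACE_MAX_BLOCKS];  /* guest PCs of member blocks, in order */
    size_t len;
} Trace;

/* Start recording at the hot head block (block A). */
void trace_begin(Trace *t, uint64_t head_pc)
{
    assert(t != NULL);
    t->pcs[0] = head_pc;
    t->len = 1;
}

/* Feed the next executed block. Returns 1 while recording continues and
 * 0 when the trace is complete (cycle detected or trace full). */
int trace_append(Trace *t, uint64_t next_pc)
{
    assert(t != NULL && t->len >= 1);
    for (size_t i = 0; i < t->len; i++) {
        if (t->pcs[i] == next_pc) {
            return 0;  /* execution came back to a member block: stop */
        }
    }
    if (t->len == TRACE_MAX_BLOCKS) {
        return 0;      /* length cap reached: stop */
    }
    t->pcs[t->len++] = next_pc;
    return 1;
}
```

With TRACE_MAX_BLOCKS set to 1 this degenerates to the one-block traces
mentioned above, which is a convenient starting point for debugging.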

> > successfully, then login and run some benchmark on it. As a very first step, we
> > make a very high threshold on trace building. In other words, a basic block must
> > be executed *many* time to trigger the trace building process. Then we lower the
> > threshold a bit at a time to see how things work. When something goes wrong, we
> > might get kernel panic or the system hangs at some point on the booting process.
> > I have no idea on how to solve this kind of problem. So I'd like to seek for
> > help/experience/suggestion on the mailing list. I just hope I make the whole
> > situation clear to you. 
> 
> I don't see any better approach to debugging this than the one you're already taking. Try to run as many workloads as you can and see if they break :). Oh and always make the optimization optional, so that you can narrow it down to it and know you didn't hit a generic QEMU bug.

  You mean make the trace optimization optional? We have tested our framework in
LLVM-only mode, which means we replace TCG with LLVM entirely. It's _very_ slow
but it works. What generic QEMU bug do you mean? We use QEMU 0.13 and just rely
on its emulation part right now. Do recent versions fix major bugs in the
emulation engine?

  Thanks for your advice. :-)

[1] http://www.cs.virginia.edu/kim/docs/micro05.pdf

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  3:50   ` 陳韋任
@ 2011-12-01  7:46     ` Stefan Hajnoczi
  2011-12-01  9:16       ` Alex Bradbury
  2011-12-01  9:30       ` 陳韋任
  2011-12-01 10:23     ` Alexander Graf
  2011-12-01 10:59     ` Peter Maydell
  2 siblings, 2 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2011-12-01  7:46 UTC (permalink / raw)
  To: 陳韋任; +Cc: Alexander Graf, qemu-devel

On Thu, Dec 01, 2011 at 11:50:24AM +0800, 陳韋任 wrote:
> > I don't see any better approach to debugging this than the one you're already taking. Try to run as many workloads as you can and see if they break :). Oh and always make the optimization optional, so that you can narrow it down to it and know you didn't hit a generic QEMU bug.
> 
>   You mean make the trace optimization optional? We have tested our framework in
> LLVM-only mode. which means we replace TCG with LLVM entirely. It's _very_ slow
> but works.

It would be interesting to use an optimized interpreter instead of TCG,
then go to LLVM for hot traces.  This is more HotSpot-like with the idea
being that the interpreter runs through initialization and rarely
executed code without a translation overhead.  For the hot paths LLVM
kicks in and high-quality translated code is executed.

Stefan

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  7:46     ` Stefan Hajnoczi
@ 2011-12-01  9:16       ` Alex Bradbury
  2011-12-01  9:30       ` 陳韋任
  1 sibling, 0 replies; 22+ messages in thread
From: Alex Bradbury @ 2011-12-01  9:16 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: sw, Alexander Graf, 陳韋任, qemu-devel

On 1 December 2011 07:46, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> It would be interesting to use an optimized interpreter instead of TCG,
> then go to LLVM for hot traces.  This is more HotSpot-like with the idea
> being that the interpreter runs through initialization and rarely
> executed code without a translation overhead.  For the hot paths LLVM
> kicks in and high-quality translated code is executed.

Might the recently-added TCI be a suitable starting point for this?

Alex

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  7:46     ` Stefan Hajnoczi
  2011-12-01  9:16       ` Alex Bradbury
@ 2011-12-01  9:30       ` 陳韋任
  1 sibling, 0 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-01  9:30 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Alexander Graf, 陳韋任, qemu-devel

Hi, Stefan

> It would be interesting to use an optimized interpreter instead of TCG,
> then go to LLVM for hot traces.  This is more HotSpot-like with the idea
> being that the interpreter runs through initialization and rarely
> executed code without a translation overhead.  For the hot paths LLVM
> kicks in and high-quality translated code is executed.

  Not sure if it's doable. I can only tell you that we rely on the QEMU frontend
to disassemble the guest binary into TCG IR, and then translate the TCG IR into
LLVM IR. As for the translation overhead, the time spent in the QEMU frontend is
negligible.

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  3:50   ` 陳韋任
  2011-12-01  7:46     ` Stefan Hajnoczi
@ 2011-12-01 10:23     ` Alexander Graf
  2011-12-04  6:14       ` 陳韋任
  2011-12-01 10:59     ` Peter Maydell
  2 siblings, 1 reply; 22+ messages in thread
From: Alexander Graf @ 2011-12-01 10:23 UTC (permalink / raw)
  To: 陳韋任; +Cc: qemu-devel


On 01.12.2011, at 04:50, 陳韋任 wrote:

> Hi Alex,
> 
>> Very cool! I was thinking about this for a while myself now. It's especially appealing these days since you can do the hotspot optimization in a separate thread :).
>> 
>> Especially in system mode, you also need to flush when tb_flush() is called though. And you have to make sure to match hflags and segment descriptors for the links - otherwise you might end up connecting TBs from different processes :).
> 
>  I'll check the tb_flush again. IIRC, we make the code cache big enough so that
> there is no need to flush the code cache. But I think we still need to deal with
> it in the end.

It is never big enough :). In fact, even a normal system mode guest boot triggers tb_flush usually because the cache is full. And target code can also trigger it manually.

> The block linking is done by QEMU and we leave it alone. But I don't know QEMU
> ever does hflags and segment descriptors check before doing block linking. Could
> you point it out? Anyway, here is how we form trace from a set of basic blocks.

Sure. Just check for every piece of code that executes cpu_get_tb_cpu_state() :).
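
As a rough illustration of the kind of check meant here: a translated block is
only reusable when the (pc, cs_base, flags) triple it was translated under
matches the current CPU state. The sketch below assumes a hypothetical
BlockState struct and helper; it is not QEMU's actual data structure, nor the
signature of cpu_get_tb_cpu_state().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the state a block was translated under. */
typedef struct BlockState {
    uint64_t pc;       /* guest program counter of the block */
    uint64_t cs_base;  /* code segment base (x86) */
    uint64_t flags;    /* translation-relevant CPU flags, e.g. hflags */
} BlockState;

/* A block may only be entered (or chained to) when the state it was
 * translated under equals the state the CPU is actually in. */
bool block_state_matches(const BlockState *translated,
                         const BlockState *current)
{
    assert(translated != NULL && current != NULL);
    return translated->pc == current->pc &&
           translated->cs_base == current->cs_base &&
           translated->flags == current->flags;
}
```

Two processes can map different code at the same guest PC, so comparing the PC
alone is not enough; the extra fields are what keeps their blocks apart.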

> 1. We insert instrumented code at the beginning of each TCG block to collect how
>   many times this block being executed.
> 
> 2. When a block's execution time, say block A, reaches a pre-defined threshold,
>   we follow the run time execution path to collect block B followed A and so on
>   to form a trace. This approach is called NET (Next-Executing Tail) [1].
> 
> 3. Then a trace composed of TCG blocks is sent to a LLVM translator. The translator
>   generates the host binary for the trace into a LLVM code cache, and patch the

I don't fully understand this part. Do you disassemble the x86 blob that TCG emitted?

>   beginning of block A (in QEMU code cache) so that anyone executing block A will 
>   jump to the corresponding trace and execute.
> 
> Above is block to trace link. I think there is no need to do hflags and segment
> descriptors check, right? Although I set the trace length to one basic block at

If you only take the choices that QEMU has already patched into the TB for you then no, you don't need to check it yourself, because QEMU already checked it :)

> the moment (make the situation simpler), I think we still don't have to check
> the blocks' hflags and segment descriptors in the trace to see if they match.

Yeah. You only need to be sync'ed with the invalidation then. And make sure you patch the TB atomically, so you don't have a separate thread accidentally run half your code and half the old code.
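
A minimal sketch of what patching atomically could look like, assuming the
patch can be reduced to a single pointer-sized store. PatchSlot and the
patch_slot_* helpers are hypothetical; a real implementation would rewrite a
jump in the code cache and would also need the host's alignment and
instruction-cache rules to hold, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the patched jump at the head of a block. */
typedef struct PatchSlot {
    _Atomic(const void *) target;  /* code to run when the block is entered */
} PatchSlot;

/* Initially the block runs its own TCG-generated code. */
void patch_slot_init(PatchSlot *s, const void *block_code)
{
    assert(s != NULL);
    atomic_init(&s->target, block_code);
}

/* Redirect the block with one atomic store: a concurrent reader sees either
 * the old target or the new one, never a mix of both. */
void patch_slot_redirect(PatchSlot *s, const void *code)
{
    assert(s != NULL && code != NULL);
    atomic_store_explicit(&s->target, code, memory_order_release);
}

/* What an executing thread would read before jumping. */
const void *patch_slot_read(PatchSlot *s)
{
    assert(s != NULL);
    return atomic_load_explicit(&s->target, memory_order_acquire);
}
```

Unlinking is the same operation in reverse: redirect the slot back to the
block's own code before the trace is freed.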

> 
>>> successfully, then login and run some benchmark on it. As a very first step, we
>>> make a very high threshold on trace building. In other words, a basic block must
>>> be executed *many* time to trigger the trace building process. Then we lower the
>>> threshold a bit at a time to see how things work. When something goes wrong, we
>>> might get kernel panic or the system hangs at some point on the booting process.
>>> I have no idea on how to solve this kind of problem. So I'd like to seek for
>>> help/experience/suggestion on the mailing list. I just hope I make the whole
>>> situation clear to you. 
>> 
>> I don't see any better approach to debugging this than the one you're already taking. Try to run as many workloads as you can and see if they break :). Oh and always make the optimization optional, so that you can narrow it down to it and know you didn't hit a generic QEMU bug.
> 
>  You mean make the trace optimization optional? We have tested our framework in
> LLVM-only mode. which means we replace TCG with LLVM entirely. It's _very_ slow

I was more thinking of making the trace optimization optional as in not optimize but do only TCG like it's done today :).

> but works. What the generic QEMU bug is? We use QEMU 0.13 and just rely on its
> emulation part right now. Does recent version fix major bugs in the emulation
> engine?

I don't know - there are always bug fixes in areas all over the code base. But I guess the parts you've been touching have been pretty stable. Either way, I was really more trying to point out that there could always be bugs in any layer, so having the ability to turn off a layer is in general a good idea :).


Alex

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01 10:23     ` Alexander Graf
@ 2011-12-04  6:14       ` 陳韋任
  2011-12-04 11:29         ` Alexander Graf
  0 siblings, 1 reply; 22+ messages in thread
From: 陳韋任 @ 2011-12-04  6:14 UTC (permalink / raw)
  To: Alexander Graf; +Cc: qemu-devel, 陳韋任

> > 3. Then a trace composed of TCG blocks is sent to a LLVM translator. The translator
> >   generates the host binary for the trace into a LLVM code cache, and patch the
> 
> I don't fully understand this part. Do you disassemble the x86 blob that TCG emitted?

  We ask TCG to disassemble, _again_, the guest binary at the start of the
trace to get a set of TCG blocks, and then send them to the LLVM translator.
 
> > the moment (make the situation simpler), I think we still don't have to check
> > the blocks' hflags and segment descriptors in the trace to see if they match.
> 
> Yeah. You only need to be sync'ed with the invalidation then. And make sure you patch the TB atomically, so you don't have a separate thread accidentally run half your code and half the old code.

  Does being sync'ed with the invalidation mean tb_flush, cpu_unlink_tb, and
tb_phys_invalidate?
 
Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-04  6:14       ` 陳韋任
@ 2011-12-04 11:29         ` Alexander Graf
  2011-12-05  6:05           ` 陳韋任
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Graf @ 2011-12-04 11:29 UTC (permalink / raw)
  To: 陳韋任; +Cc: qemu-devel


On 04.12.2011, at 07:14, 陳韋任 wrote:

>>> 3. Then a trace composed of TCG blocks is sent to a LLVM translator. The translator
>>>  generates the host binary for the trace into a LLVM code cache, and patch the
>> 
>> I don't fully understand this part. Do you disassemble the x86 blob that TCG emitted?
> 
>  We ask TCG to disassemble the guest binary where the trace beginning with
> _again_ to get a set of TCG blocks, then sent them to the LLVM translator.

So you have two TCG backends? One to generate real host code and one that goes into your LLVM generator?

> 
>>> the moment (make the situation simpler), I think we still don't have to check
>>> the blocks' hflags and segment descriptors in the trace to see if they match.
>> 
>> Yeah. You only need to be sync'ed with the invalidation then. And make sure you patch the TB atomically, so you don't have a separate thread accidentally run half your code and half the old code.
> 
>  Sync'ed with the invalidation means tb_flush, cpu_unlink and tb_phys_invalidate?

Yup :)


Alex

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-04 11:29         ` Alexander Graf
@ 2011-12-05  6:05           ` 陳韋任
  0 siblings, 0 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-05  6:05 UTC (permalink / raw)
  To: Alexander Graf; +Cc: qemu-devel, 陳韋任

> >  We ask TCG to disassemble the guest binary where the trace beginning with
> > _again_ to get a set of TCG blocks, then sent them to the LLVM translator.
> 
> So you have two TCG backends? One to generate real host code and one that goes into your LLVM generator?

  Ah, I should have said that we ask the QEMU frontend to disassemble the guest
binary into TCG IR again.

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  3:50   ` 陳韋任
  2011-12-01  7:46     ` Stefan Hajnoczi
  2011-12-01 10:23     ` Alexander Graf
@ 2011-12-01 10:59     ` Peter Maydell
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Maydell @ 2011-12-01 10:59 UTC (permalink / raw)
  To: 陳韋任; +Cc: Alexander Graf, qemu-devel

On 1 December 2011 03:50, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote:
> We use QEMU 0.13

Oops, I missed this. 0.13 is over a year old now. There is zero point
in doing any kind of engineering work of this scale on such an old
codebase. You need to be tracking the head of git master, generally,
if you want (a) any hope of getting your changes back into qemu or
(b) any kind of useful support for problems you encounter during
development.

-- PMM

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-11-30 12:37 ` Alexander Graf
  2011-12-01  3:50   ` 陳韋任
@ 2011-12-06  7:39   ` 陳韋任
  2011-12-19 17:44     ` Alexander Graf
  1 sibling, 1 reply; 22+ messages in thread
From: 陳韋任 @ 2011-12-06  7:39 UTC (permalink / raw)
  To: Alexander Graf; +Cc: qemu-devel, 陳韋任

> If you want to be more exotic (minix found a lot of bugs for me back in the day!) you can try the os zoo:
> 
>   http://www.oszoo.org/

  The website seems to be down?

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-06  7:39   ` 陳韋任
@ 2011-12-19 17:44     ` Alexander Graf
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Graf @ 2011-12-19 17:44 UTC (permalink / raw)
  To: 陳韋任; +Cc: qemu-devel


On 06.12.2011, at 08:39, 陳韋任 wrote:

>> If you want to be more exotic (minix found a lot of bugs for me back in the day!) you can try the os zoo:
>> 
>>  http://www.oszoo.org/
> 
>  The website seems down?

Yeah, looks like it's down :(. Too bad.

Alex

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-11-29  7:03 [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques 陳韋任
  2011-11-30 12:37 ` Alexander Graf
@ 2011-11-30 12:51 ` Peter Maydell
  2011-12-01  9:03   ` 陳韋任
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2011-11-30 12:51 UTC (permalink / raw)
  To: 陳韋任; +Cc: qemu-devel

On 29 November 2011 07:03, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote:
>
>  1. cpu_unlink_tb (exec.c)

This function is broken even for pure TCG -- we know it has a race condition.
As I said on IRC, I think that the right thing to do is to start
by overhauling the current TCG code so that it is:
 (a) properly multithreaded (b) race condition free (c) well documented
 (d) clean code
Then you have a firm foundation you can use as a basis for the LLVM
integration (and in the course of doing this overhaul you'll have
figured out enough of how the current code works to be clear about
where hooks for invalidating your traces need to go).

> The big problem is debugging.

Yes. In this sort of hotspot based design it's very easy to end up
with bugs that are intermittent or painful to reproduce and where
you have very little clue about which version of the code for which
address ended up misgenerated (since timing issues mean that what
code is recompiled and when it is inserted will vary from run to
run). Being able to conveniently get rid of some of this nondeterminism
is vital for tracking down what actually goes wrong.
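
One way to remove part of that nondeterminism is to decide which blocks may
become traces from the guest address alone, so that the set of candidate
traces is the same on every run and can be bisected. The sketch below is only
an illustration: TraceGate and trace_gate_allows are hypothetical names, and
when a block actually becomes hot can still vary with timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical debug switch for trace building. */
typedef struct TraceGate {
    bool enabled;  /* false: no traces at all, pure TCG */
    uint64_t lo;   /* inclusive lower bound on the trace head's guest PC */
    uint64_t hi;   /* exclusive upper bound */
} TraceGate;

/* A trace may be built only if its head block lies inside the window. */
bool trace_gate_allows(const TraceGate *g, uint64_t head_pc)
{
    assert(g != NULL);
    return g->enabled && head_pc >= g->lo && head_pc < g->hi;
}
```

Halving the [lo, hi) window after each failing boot narrows the failure down to
a handful of head addresses, and setting enabled to false gives the pure-TCG
baseline for comparison.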

-- PMM

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-11-30 12:51 ` Peter Maydell
@ 2011-12-01  9:03   ` 陳韋任
  2011-12-01  9:13     ` 陳韋任
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-01  9:03 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-devel, 陳韋任

Hi Peter,

> >  1. cpu_unlink_tb (exec.c)
> 
> This function is broken even for pure TCG -- we know it has a race condition.
> As I said on IRC, I think that the right thing to do is to start
> by overhauling the current TCG code so that it is:
>  (a) properly multithreaded (b) race condition free (c) well documented
>  (d) clean code
> Then you have a firm foundation you can use as a basis for the LLVM
> integration (and in the course of doing this overhaul you'll have
> figured out enough of how the current code works to be clear about
> where hooks for invalidating your traces need to go).

  I must say I totally agree with you on overhauling the current TCG code. But
my boss might not have the patience for this. ;) If there is a plan out there,
I'll be very happy to join in.

  I read the thread about the broken tb_unlink [1], and I'm surprised that
tb_unlink is broken even in single-threaded mode and system mode. You mentioned
in [1] that (b) could be the IO thread. I think we don't enable the IO thread in
system mode right now. My concern is whether I have found _all_ the places and
situations where I need to break the link between block and trace.

> > The big problem is debugging.
> 
> Yes. In this sort of hotspot based design it's very easy to end up
> with bugs that are intermittent or painful to reproduce and where
> you have very little clue about which version of the code for which
> address ended up misgenerated (since timing issues mean that what
> code is recompiled and when it is inserted will vary from run to
> run). Being able to conveniently get rid of some of this nondeterminism
> is vital for tracking down what actually goes wrong.

  Misgenerated code might not be an issue now since we have tested our framework
in LLVM-only mode. I think the problem is still about the link/unlink stuff.
The first problem I hit while lowering the threshold is that the broken
configuration generates a few traces (2, actually) that the working one doesn't.
When booting the Linux image downloaded from the QEMU website, the system hangs
during boot (see the attachment if you're interested). Simply put, the system
hangs after printing

  ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1

which turns out to come from the function check_timer
(arch/i386/kernel/io_apic.c). I am not a Linux kernel expert and have no idea
how to solve this. The culprit traces begin at 0xc01111b8 and 0xc01111d7. Here
is the corresponding guest code.

----------------
IN:
0xc01111b8:  add    0xc04fa798,%eax
0xc01111be:  mov    (%eax),%eax
0xc01111c0:  ret

----------------
IN:
0xc01111d7:  mov    $0x108,%eax
0xc01111dc:  call   0xc01111b8

I compiled the Linux kernel with debug info and without function inlining, then
ran objdump on vmlinux to see what the source code might be. I guess that
because linux-0.2.img contains other stuff besides vmlinux (the kernel image),
the addresses above can only be used as an approximation, or may even be
useless. I found only one spot with (I believe) the same code sequence as
0xc01111b8, but I haven't found the other one so far. See below:

static inline unsigned int readl(const volatile void __iomem *addr)
{
        return *(volatile unsigned int __force *) addr;
c0214a90:       03 05 44 56 4f c0       add    0xc04f5644,%eax
c0214a96:       8b 00                   mov    (%eax),%eax
#define FSEC_TO_USEC (1000000000UL)

int hpet_readl(unsigned long a)
{
        return readl(hpet_virt_address + a);
}
c0214a98:       c3                      ret

  This is the whole story so far. :-) Any comments are welcome!

[1] http://lists.gnu.org/archive/html/qemu-devel/2011-11/msg02447.html

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  9:03   ` 陳韋任
@ 2011-12-01  9:13     ` 陳韋任
  2011-12-01  9:15     ` Max Filippov
  2011-12-01 10:12     ` Peter Maydell
  2 siblings, 0 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-01  9:13 UTC (permalink / raw)
  To: qemu-devel

  Forgot the attachment.

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

[-- Attachment #2: system_hang.png --]
[-- Type: image/png, Size: 138996 bytes --]

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  9:03   ` 陳韋任
  2011-12-01  9:13     ` 陳韋任
@ 2011-12-01  9:15     ` Max Filippov
  2011-12-01  9:25       ` 陳韋任
  2011-12-01 10:12     ` Peter Maydell
  2 siblings, 1 reply; 22+ messages in thread
From: Max Filippov @ 2011-12-01  9:15 UTC (permalink / raw)
  To: 陳韋任; +Cc: Peter Maydell, qemu-devel

>  Misgenerated code might not be an issue now since we have tested our framework
> in LLVM-only mode. I think the problem still is about the link/unlink stuff.
> The first problem I have while lowering the threshold is the broken one generate
> a few traces (2, actually) that a work one doesn't. When boot the linux image
> downloaded from the QEMU website, the system hangs on the booting process (see
> attach if you're interested). Simply put, the system hangs after printing

There's no attachment in this mail. I can try to help you resolve it
if you provide more information.

>  ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
>
> which turns out should be function check_timer (arch/i386/kernel/io_apic.c). I

If it hangs inside QEMU itself then you may try to backport commit
4f61927a41a098d06e642ffdea5fc285dc3a0e6b, which fixes an
infinite loop caused by HPET interrupt probing.

-- 
Thanks.
-- Max

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  9:15     ` Max Filippov
@ 2011-12-01  9:25       ` 陳韋任
  2011-12-01  9:41         ` Max Filippov
  0 siblings, 1 reply; 22+ messages in thread
From: 陳韋任 @ 2011-12-01  9:25 UTC (permalink / raw)
  To: Max Filippov; +Cc: Peter Maydell, qemu-devel, 陳韋任

Hi, Max

On Thu, Dec 01, 2011 at 12:15:06PM +0300, Max Filippov wrote:
> >  Misgenerated code might not be an issue now since we have tested our framework
> > in LLVM-only mode. I think the problem still is about the link/unlink stuff.
> > The first problem I have while lowering the threshold is the broken one generate
> > a few traces (2, actually) that a work one doesn't. When boot the linux image
> > downloaded from the QEMU website, the system hangs on the booting process (see
> > attach if you're interested). Simply put, the system hangs after printing
> 
> There's no attachment in this mail. I can try to help you resolving it
> if you provide more information.

  Sorry about that; please see the attachment. What kind of information do you
want to know?
 
> >  ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> >
> > which turns out should be function check_timer (arch/i386/kernel/io_apic.c). I
> 
> If it hangs inside QEMU itself then you may try to backport commit
> 4f61927a41a098d06e642ffdea5fc285dc3a0e6b that fixes
> infinite loop caused by hpet interrupt probing.

  I don't understand. What is "it hangs inside QEMU itself" supposed to mean?

  Thanks!

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

[-- Attachment #2: system_hang.png --]
[-- Type: image/png, Size: 138996 bytes --]

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  9:25       ` 陳韋任
@ 2011-12-01  9:41         ` Max Filippov
  2011-12-06  6:37           ` 陳韋任
  0 siblings, 1 reply; 22+ messages in thread
From: Max Filippov @ 2011-12-01  9:41 UTC (permalink / raw)
  To: 陳韋任; +Cc: Peter Maydell, qemu-devel

>> There's no attachment in this mail. I can try to help you resolving it
>> if you provide more information.
>
>  Sorry about that, see the attachment please. What kind of information you want
> to know?

If your code is available online I can try it myself; the question is
then where it is hosted.
If not, a link to the kernel binary and a QEMU exec trace would help me get
started.

>> >  ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
>> >
>> > which turns out should be function check_timer (arch/i386/kernel/io_apic.c). I
>>
>> If it hangs inside QEMU itself then you may try to backport commit
>> 4f61927a41a098d06e642ffdea5fc285dc3a0e6b that fixes
>> infinite loop caused by hpet interrupt probing.
>
>  I don't understand. What "it hangs inside QEMU itself" supposed to mean?

There are two cases: either QEMU isn't executing guest code at all because
it is busy doing something for itself, or QEMU keeps executing guest code
that loops, checking for something that never happens.

I'm talking about the first case. The two can be distinguished from, e.g.,
a guest debugger connected to QEMU's gdbstub -- in the former case it
cannot break guest execution with ^C.

-- 
Thanks.
-- Max

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  9:41         ` Max Filippov
@ 2011-12-06  6:37           ` 陳韋任
  0 siblings, 0 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-06  6:37 UTC (permalink / raw)
  To: Max Filippov; +Cc: Peter Maydell, qemu-devel, 陳韋任

Hi Max,

> If your code is available online I can try it myself, the question is
> where is it hosted then.
> If not, then link to kernel binary and qemu exec trace would help me to start.

  Personally, I really want to make our work public, but I am not the decision
maker. I'll keep pushing for it to be open-sourced, however.
 
> >> >  ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> >> >
> >> > which turns out should be function check_timer (arch/i386/kernel/io_apic.c). I
> >>
> >> If it hangs inside QEMU itself then you may try to backport commit
> >> 4f61927a41a098d06e642ffdea5fc285dc3a0e6b that fixes
> >> infinite loop caused by hpet interrupt probing.
> >
> > >  I don't understand. What "it hangs inside QEMU itself" supposed to mean?
> 
> QEMU doesn't execute guest code doing something for itself vs. QEMU
> executes guest code in loop checking for something that doesn't
> happen.
> 
> I'm talking about the first case. They may be distinguished from e.g.
> guest debugger connected to QEMU's gdbstub -- in the former case it
> cannot break guest execution by ^C.

  It turns out this is a problem in our IBTC optimization [1]. The IBTC should
take the cross-page-boundary constraint into consideration, as block linking
does (at least in QEMU's current design) [2].

  As I said before, we have two code caches in our framework: one for basic
blocks, the other for traces. I forgot to turn off the trace's IBTC
optimization, which doesn't consider the cross-page-boundary constraint right
now. As a workaround, we return to QEMU (the dispatcher) when doing an IBTC
lookup, and the problem I mentioned disappeared. Sometimes I feel I am chasing
a ghost when debugging our system. ;-)
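
  For readers unfamiliar with the constraint, a minimal sketch of a page-aware
lookup follows. It is an illustration only, not the framework's code and not
QEMU's: every name in it (ibtc_entry, ibtc_insert, ibtc_lookup,
GUEST_PAGE_BITS, IBTC_SIZE) is hypothetical, a 4 KiB guest page is assumed, and
it takes one reading of the constraint, namely that a cached target may be used
only when it lies on the same guest page as the branching code. Anything else
falls back to the dispatcher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants: 4 KiB guest pages, 256-entry direct-mapped cache. */
#define GUEST_PAGE_BITS 12
#define GUEST_PAGE_MASK (~(((uint32_t)1 << GUEST_PAGE_BITS) - 1))
#define IBTC_SIZE 256

/* One cache entry: guest branch target -> translated host code. */
struct ibtc_entry {
    uint32_t guest_pc;   /* guest address of the branch target */
    void *host_code;     /* translated code, NULL while the slot is empty */
};

static struct ibtc_entry ibtc[IBTC_SIZE];

/* Direct-mapped index taken from the low bits of the guest address. */
static unsigned ibtc_hash(uint32_t guest_pc)
{
    return guest_pc & (IBTC_SIZE - 1);
}

/* Record a translation for guest_pc, replacing whatever used the slot. */
void ibtc_insert(uint32_t guest_pc, void *host_code)
{
    struct ibtc_entry *e = &ibtc[ibtc_hash(guest_pc)];

    assert(host_code != NULL);   /* NULL is reserved for empty slots */
    e->guest_pc = guest_pc;
    e->host_code = host_code;
}

/*
 * Look up the target of an indirect branch located at branch_pc.
 * Returns the translated code, or NULL to tell the caller to return to
 * the dispatcher (cache miss, or target on a different guest page).
 */
void *ibtc_lookup(uint32_t branch_pc, uint32_t target_pc)
{
    const struct ibtc_entry *e = &ibtc[ibtc_hash(target_pc)];

    if (e->host_code == NULL || e->guest_pc != target_pc) {
        return NULL;             /* plain cache miss */
    }
    if ((branch_pc & GUEST_PAGE_MASK) != (target_pc & GUEST_PAGE_MASK)) {
        return NULL;             /* branch and target are on different pages */
    }
    return e->host_code;
}
```

  In this sketch, returning NULL plays the role of the workaround described
above: the caller gives up on the cached target and goes back to the
dispatcher, which performs the full lookup.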

[1] http://lists.gnu.org/archive/html/qemu-devel/2011-08/msg01424.html
[2] http://lists.nongnu.org/archive/html/qemu-devel/2011-08/msg02249.html

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01  9:03   ` 陳韋任
  2011-12-01  9:13     ` 陳韋任
  2011-12-01  9:15     ` Max Filippov
@ 2011-12-01 10:12     ` Peter Maydell
  2011-12-01 10:40       ` 陳韋任
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2011-12-01 10:12 UTC (permalink / raw)
  To: 陳韋任; +Cc: qemu-devel

On 1 December 2011 09:03, 陳韋任 <chenwj@iis.sinica.edu.tw> wrote:
>  I read the thread talking about the broken tb_unlink [1], and I'm surprised
> that tb_unlink is broken even under single-threaded mode and system mode. You
> mentioned (b) could be the IO thread in [1]. I think we don't enable IO thread
> in system mode right now. My concern is if I spot _all_ place/situation that I
> need to break the link between block and trace.

The IO thread is always enabled in QEMU these days.

-- PMM

* Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques
  2011-12-01 10:12     ` Peter Maydell
@ 2011-12-01 10:40       ` 陳韋任
  0 siblings, 0 replies; 22+ messages in thread
From: 陳韋任 @ 2011-12-01 10:40 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-devel, 陳韋任

> The IO thread is always enabled in QEMU these days.

  We use QEMU 0.13. I think the IO thread is not enabled by default there.

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

end of thread, other threads:[~2011-12-19 17:44 UTC | newest]

Thread overview: 22+ messages
2011-11-29  7:03 [Qemu-devel] Improve QEMU performance with LLVM codegen and other techniques 陳韋任
2011-11-30 12:37 ` Alexander Graf
2011-12-01  3:50   ` 陳韋任
2011-12-01  7:46     ` Stefan Hajnoczi
2011-12-01  9:16       ` Alex Bradbury
2011-12-01  9:30       ` 陳韋任
2011-12-01 10:23     ` Alexander Graf
2011-12-04  6:14       ` 陳韋任
2011-12-04 11:29         ` Alexander Graf
2011-12-05  6:05           ` 陳韋任
2011-12-01 10:59     ` Peter Maydell
2011-12-06  7:39   ` 陳韋任
2011-12-19 17:44     ` Alexander Graf
2011-11-30 12:51 ` Peter Maydell
2011-12-01  9:03   ` 陳韋任
2011-12-01  9:13     ` 陳韋任
2011-12-01  9:15     ` Max Filippov
2011-12-01  9:25       ` 陳韋任
2011-12-01  9:41         ` Max Filippov
2011-12-06  6:37           ` 陳韋任
2011-12-01 10:12     ` Peter Maydell
2011-12-01 10:40       ` 陳韋任
