From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:44776)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chenwj@cs.nctu.edu.tw>) id 1RVxfY-0008MQ-9y
	for qemu-devel@nongnu.org; Wed, 30 Nov 2011 22:50:45 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <chenwj@cs.nctu.edu.tw>) id 1RVxfW-0007Cj-SH
	for qemu-devel@nongnu.org; Wed, 30 Nov 2011 22:50:44 -0500
Received: from csmailer.cs.nctu.edu.tw ([140.113.235.130]:57273)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chenwj@cs.nctu.edu.tw>) id 1RVxfV-0007CY-T4
	for qemu-devel@nongnu.org; Wed, 30 Nov 2011 22:50:42 -0500
Date: Thu, 1 Dec 2011 11:50:24 +0800
From: =?utf-8?B?6Zmz6Z+L5Lu7?= <chenwj@iis.sinica.edu.tw>
Message-ID: <20111201035024.GA88545@cs.nctu.edu.tw>
References: <20111129070343.GA3585@cs.nctu.edu.tw>
	<BD39A319-B197-4763-B335-5B23CAC1FE02@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <BD39A319-B197-4763-B335-5B23CAC1FE02@suse.de>
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] Improve QEMU performance with LLVM codegen and
 other techniques
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alexander Graf <agraf@suse.de>
Cc: qemu-devel@nongnu.org, =?utf-8?B?6Zmz6Z+L5Lu7?= <chenwj@iis.sinica.edu.tw>

Hi Alex,

> Very cool! I was thinking about this for a while myself now. It's
> especially appealing these days since you can do the hotspot
> optimization in a separate thread :).
>
> Especially in system mode, you also need to flush when tb_flush() is
> called though. And you have to make sure to match hflags and segment
> descriptors for the links - otherwise you might end up connecting TBs
> from different processes :).

  I'll check tb_flush() again. IIRC, we make the code cache big enough
that it never needs to be flushed, but I think we will still have to
handle flushing eventually.

  Block linking is done by QEMU and we leave it alone. I wasn't aware
that QEMU checks hflags and segment descriptors before linking blocks,
though. Could you point me to where it does that? Anyway, here is how we
form a trace from a set of basic blocks.

1. We insert instrumentation code at the beginning of each TCG block to
   count how many times the block has been executed.

2. When the execution count of a block, say block A, reaches a
   predefined threshold, we follow the run-time execution path to
   collect block B, which follows A, and so on, to form a trace. This
   approach is called NET (Next-Executing Tail) [1].

3. The trace, composed of TCG blocks, is then sent to an LLVM
   translator. The translator generates host binary code for the trace
   into an LLVM code cache and patches the beginning of block A (in the
   QEMU code cache) so that anyone executing block A jumps to the
   corresponding trace and executes it.
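
A minimal sketch of steps 1-3, written in Python rather than QEMU's C so
that it runs on its own. Every name here (Block, TraceBuilder,
HOT_THRESHOLD, MAX_TRACE_LEN) is hypothetical and is not taken from QEMU
or from our framework. The rule for ending a trace (the path returns to
the head block, or a length cap is hit) is an assumption made for the
example, and "patching" is modelled by simply storing the trace on the
head block instead of generating host code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

HOT_THRESHOLD = 3    # hypothetical value; the text only says "pre-defined"
MAX_TRACE_LEN = 8    # hypothetical cap on the number of blocks per trace


@dataclass
class Block:
    pc: int                  # guest address of the block
    next_pc: int             # successor taken at run time (fixed, for simplicity)
    count: int = 0           # step 1: execution counter
    trace: Optional[Tuple[int, ...]] = None  # step 3: set once "patched"


class TraceBuilder:
    def __init__(self, blocks: List[Block]) -> None:
        self.blocks: Dict[int, Block] = {b.pc: b for b in blocks}
        self.head: Optional[Block] = None  # block whose trace is being recorded
        self.tail: List[int] = []          # pcs recorded so far
        self.trace_runs = 0                # how often a finished trace was entered

    def _finish(self) -> None:
        # Stand-in for step 3: a real system would hand the blocks to the
        # LLVM translator and patch the entry of the head block.
        self.head.trace = tuple(self.tail)
        self.head, self.tail = None, []

    def execute(self, pc: int) -> int:
        """Run the block at pc and return the pc of the next block."""
        block = self.blocks[pc]
        if self.head is not None:
            if block is self.head:
                self._finish()             # the path looped back to the head
            else:
                self.tail.append(pc)       # step 2: extend the tail
                if len(self.tail) >= MAX_TRACE_LEN:
                    self._finish()
        if block.trace is not None:
            # "Patched" block: the whole trace runs as one unit.
            self.trace_runs += 1
            return self.blocks[block.trace[-1]].next_pc
        if self.head is None:
            block.count += 1               # step 1: count this execution
            if block.count >= HOT_THRESHOLD:
                self.head, self.tail = block, [pc]  # start recording here
        return block.next_pc


# A three-block loop A -> B -> C -> A.
builder = TraceBuilder([Block(0x10, 0x20), Block(0x20, 0x30), Block(0x30, 0x10)])
pc = 0x10
for _ in range(12):
    pc = builder.execute(pc)
```

In this toy loop, block A becomes hot on its third execution, the next
pass through B and C is recorded as its tail, and from then on entering
A runs the whole trace as a unit.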

Steps 1-3 above describe the block-to-trace link. I think there is no
need to check hflags and segment descriptors there, right? Although I
have set the trace length to one basic block for the moment (to keep
things simple), I think we still would not have to check whether the
hflags and segment descriptors of the blocks in a trace match.

> > successfully, then login and run some benchmark on it. As a very first step, we
> > make a very high threshold on trace building. In other words, a basic block must
> > be executed *many* time to trigger the trace building process. Then we lower the
> > threshold a bit at a time to see how things work. When something goes wrong, we
> > might get kernel panic or the system hangs at some point on the booting process.
> > I have no idea on how to solve this kind of problem. So I'd like to seek for
> > help/experience/suggestion on the mailing list. I just hope I make the whole
> > situation clear to you.
>
> I don't see any better approach to debugging this than the one you're
> already taking. Try to run as many workloads as you can and see if they
> break :). Oh and always make the optimization optional, so that you can
> narrow it down to it and know you didn't hit a generic QEMU bug.

  You mean make the trace optimization optional? We have tested our
framework in LLVM-only mode, which means we replace TCG with LLVM
entirely. It's _very_ slow but works. Which generic QEMU bug do you have
in mind? We use QEMU 0.13 and rely only on its emulation part right now.
Do recent versions fix major bugs in the emulation engine?

  Thanks for your advice. :-)

[1] http://www.cs.virginia.edu/kim/docs/micro05.pdf

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj