From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1Gf3z4-0001n5-Gv
	for qemu-devel@nongnu.org; Tue, 31 Oct 2006 19:29:34 -0500
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1Gf3z3-0001ma-N8
	for qemu-devel@nongnu.org; Tue, 31 Oct 2006 19:29:34 -0500
Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Gf3z3-0001mV-K4
	for qemu-devel@nongnu.org; Tue, 31 Oct 2006 19:29:33 -0500
Received: from [65.74.133.4] (helo=mail.codesourcery.com)
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA:32) (Exim 4.52) id 1Gf3z3-0002LC-8q
	for qemu-devel@nongnu.org; Tue, 31 Oct 2006 19:29:33 -0500
From: Paul Brook <paul@codesourcery.com>
Subject: Re: [Qemu-devel] qemu vs gcc4
Date: Wed, 1 Nov 2006 00:29:28 +0000
References: <45391B22.1050608@palmsource.com>
	<200610312208.20278.paul@codesourcery.com>
	<200610311900.15886.rob@landley.net>
In-Reply-To: <200610311900.15886.rob@landley.net>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200611010029.29406.paul@codesourcery.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org

> Actually it sounds additive rather than multiplicative.  Does each target
> have an entirely unrelated set of ops, or is there a shared set of
> primitive ops plus some oddballs?

The shared set of primitive ops is basically qops :-)
You probably could figure out a single common set of qops, then write assembly
and glue them together like we do with dyngen. However, once you've done that
you've implemented most of what's needed for fully dynamic qops, so it
doesn't really seem worth it.

> But backing up and just accepting that for a moment, in theory what you
> need is some way to compile a C function to machine code, and then unwrap
> that function into a .raw file containing just the machine code.  So the
> only per-compiler thing would be this unwrapper thingy.

Right.

> But I already know
> that doesn't work because it doesn't explain the "unable to find spill
> register" problem.

That's a separate gcc bug. It gets stuck when you tell it not to use half the
registers, then ask it to do 64-bit math. This is one of the reasons
eliminating the fixed registers is a good idea.

> > It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> > dyngen uses 3 fixed registers as scratch workspace. For qop purposes
> > these are part of the guest CPU state. They're only there to aid
> > conversion of the translation code, they'll go away eventually.
>
> Presumably the m68k target is pure qop, and hasn't got this sort of thing?

Correct.
There is one use of T0 left for communicating with the TB chaining code, but
that's it and will probably go away eventually.

> > > Or the value currently in a qreg has a type associated with it, but
> > > the next value stored in that qreg may have a different type?
> >
> > A qreg has a fixed type. The value stored in that qreg has that type. To
> > convert it to a different type you need to use an explicit conversion
> > qop.
>
> So values don't have types, the qregs the values are _in_ have types.  But
> I thought there were an unlimited number of them (well, 1024 or so), and
> they're dynamically allocated (at least some of the time).  How does it
> keep track of the type of a given qreg?  (When you convert, you copy values
> from one qreg into another?)

Yes. Conversion is just like any other qop. It reads one qreg, and writes the
result to a different qreg which happens to have a different type.
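
A minimal sketch of that rule, assuming a made-up representation: the
names qmode, qreg, new_qreg and op_ext_i32_i64 are hypothetical and are
not the actual qemu API. The type is fixed when the qreg is created, and
the conversion is an ordinary op that reads one qreg and writes another.

```c
#include <stdint.h>

/* Hypothetical names throughout; this is not the real qemu interface. */
typedef enum { MODE_I32, MODE_I64 } qmode;

typedef struct {
    qmode mode;      /* fixed when the qreg is created */
    int64_t value;   /* wide enough to hold a value of either mode */
} qreg;

/* Create a qreg of the given type, initialised to zero. */
static qreg new_qreg(qmode mode)
{
    qreg r = { mode, 0 };
    return r;
}

/* Explicit conversion op: reads an I32 qreg and writes the sign-extended
   result to a different qreg of type I64. Returns 0 on success, or -1 if
   either operand has the wrong type for this op. */
static int op_ext_i32_i64(qreg *dst, const qreg *src)
{
    if (src->mode != MODE_I32 || dst->mode != MODE_I64)
        return -1;
    dst->value = (int64_t)(int32_t)src->value;
    return 0;
}
```

Rejecting mismatched types up front is one way to catch the mixing
problem early, rather than letting the wrong kind of value flow through.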

> > > Possible translation: you can feed a qreg containing an I64 value to a
> > > qop taking an i32 argument, and it'll typecast the sucker down
> > > intelligently, but if you produce an I32 result and expect to use that
> > > qreg's value as an I64 argument later, you have to call a
> > > sign-extending qop on it first?
> >
> > Exactly.
> > If you mix I32,F32 and/or F64 in this way Bad Things will happen.
>
> Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?

Or qemu will get confused and crash.

> > > seeing end with _im which I presume means "immediate".  The alternative
> > > is _cc, but what does that mean?  (Presumably not "closed captioned".)
> >
> > _cc are variants that set the condition codes. I may have got T0 and T1
> > backwards in the first 3 lines.
>
> Ah!
>
> Is this written down anywhere?  I've read Fabrice's paper and the design
> documentation, and I'm not remembering this.  It's quite possible I missed
> it when my brain filled up, though.

Dunno.

> > > Um, is my earlier characterization of "unwrapping stuff" at all close?
> >
> > Not entirely. I'm also replacing fixed locations (T2) with dynamically
> > allocated qregs.
>
> The dynamic allocation buys you what?  (Less spilling?)

More-or-less. It makes it easier to optimize. The code generator can pick what
to put in registers, or even not put them there at all, instead of having to
do things exactly how you told it.

It also means you don't need to reserve that register, avoiding the gcc
"unable to find spill register" bug you mentioned above.

> > Most x86 instructions set the condition code flags. However most of the
> > time these flags are ignored. e.g. if you have two consecutive add
> > instructions the first will set the flags, and the second will
> > immediately overwrite them.
> >
> > qemu contains a back-propagation pass that will remove the code to set
> > the flags after the first instruction. Currently this is implemented by
> > changing an addl_cc op into a plain addl op.
>
> I actually understood that.  Yay!
>
> > The flag-setting code would most likely require several qops to
> > implement, so it would be much harder to prove it is not needed and
> > remove it. So there is a mechanism for adding extra target qops, doing
> > the flag elimination pass, then expanding those to generic qops.
>
> Um, wouldn't the flag setting code be fairly straightforward as a qop that
> comes right _before_ the other op, as in "set the flags for doing this with
> these registers", that does nothing but set the flags (I.E. it wouldn't
> modify the contents of any of the registers, so it could be immediately
> followed by the appropriate add or shift or so on), and then the flag
> setting pass could just turn all the ones that weren't needed into
> QOP_NULL?

Theoretically possible, but not so easy in practice, especially when you get
things like partial flag clobbers and lazy flag evaluation. Doing it as a
target-specific hack is much simpler and quicker.
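
To make the simple case concrete, here is a rough sketch of a backward
flag-elimination pass over a list of target ops. The op names and the
function eliminate_dead_flags are hypothetical, and the sketch assumes
every _cc op overwrites all the flags, so it deliberately ignores the
partial clobbers and lazy evaluation that make the real thing harder.

```c
#include <stddef.h>

/* Hypothetical op kinds; not qemu's real opcode list. */
typedef enum {
    OP_ADDL,      /* add, leaves the condition codes alone */
    OP_ADDL_CC,   /* add, overwrites all the condition codes */
    OP_JCC        /* conditional jump, reads the condition codes */
} op_kind;

/* Walk the ops backwards, tracking whether a later op still needs the
   flags. An OP_ADDL_CC whose flags are overwritten before anything reads
   them is downgraded to a plain OP_ADDL. The flags are conservatively
   assumed to be needed at the end of the block. Returns the number of
   ops that were changed. */
static size_t eliminate_dead_flags(op_kind *ops, size_t n)
{
    int flags_live = 1;
    size_t changed = 0;

    for (size_t i = n; i-- > 0; ) {
        switch (ops[i]) {
        case OP_JCC:
            flags_live = 1;           /* reads the flags */
            break;
        case OP_ADDL_CC:
            if (!flags_live) {
                ops[i] = OP_ADDL;     /* nobody reads these flags */
                changed++;
            }
            flags_live = 0;           /* earlier flag values are dead */
            break;
        case OP_ADDL:
            break;                    /* no effect on the flags */
        }
    }
    return changed;
}
```

With the sequence addl_cc, addl_cc, jcc, addl_cc only the first op is
downgraded: the second feeds the jump, and the last is kept because the
flags are assumed live at the end of the block.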

> Or is that what's happening now?  (Do QOPs ever modify their input
> registers, or only the output one?)

The generic qops never modify inputs, and never read outputs. Inputs and
outputs can be the same qreg.
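
As an illustration only (qop_add and run_qop_add are hypothetical names,
and qregs are modelled as a plain array), a generic add can read both
inputs before it writes the output, which is why the output is allowed to
alias an input. The fixed-operand op next to it is how a dyngen-style op
looks by comparison.

```c
#include <stdint.h>

/* dyngen-style op: the operands are baked into the function itself. */
static uint32_t T0, T1;

static void op_addl_T0_T1(void)
{
    T0 += T1;
}

/* qop-style op: one generic add, parameterized by qreg indices.
   Hypothetical layout, for illustration only. */
typedef struct {
    int dst;
    int src1;
    int src2;
} qop_add;

static void run_qop_add(uint32_t *qregs, qop_add op)
{
    /* Read both inputs first, then write the output, so dst may be the
       same qreg as either input and the inputs are otherwise untouched. */
    uint32_t a = qregs[op.src1];
    uint32_t b = qregs[op.src2];
    qregs[op.dst] = a + b;
}
```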

> > > Ah, hang on.  There's target_reginfo in translate-all.c, that's using
> > > some of the other values.  So what the heck does translate-all.c do?
> > > (Shared code called by all the platform-dependent translate functions?)
> >
> > There are three fairly independent stages:
> > 1) target-*/translate.c converts guest code into qops.
> > 2) translate-all.c messes about with those qops a bit (allocates host
> > registers, etc).
> > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > host code.
>
> Is pass 2 where the flag elimination pass goes (and presumably any other
> optimizations that might get added)?  No, that can't be the case or the
> m68k code wouldn't need its own implementation of the flag elimination
> pass...

Flag elimination is at the end of step 1.

> > > > For converting targets you can probably ignore most of the
> > > > translate-all and host-*/ changes. These implement generating code
> > > > from the qops.
> > >
> > > Ok, this implies that qops are a new thing.  Which looking at the code
> > > sort
> > > of supports.  Which means I don't understand what's going on at all.
> >
> > qops and dyngen ops are both small "functions" that are represented in a
> > similar way. The difference is that dyngen ops are target specific fixed
> > functions, whereas qops are generic parameterized functions.
>
> So the 11x11 exponential complexity of qemu producing its own assembly
> output might not be as much of a problem after switching to qops?

Right. The exponential complexity only arises if you write the assembly by
hand instead of using gcc to generate it.

> Possibly some of the common qops can have an asm block for 'em, and the
> rest can go through the contortions target-*/op.c is currently doing with
> (glue(glue(blah))) and so on.

Currently we know how to generate code directly for all qops. Anything more
complicated must be either put in a helper function or split into multiple
qops.

> > While they are really separate things, the details have been chosen so it
> > should be possible to adapt the existing translate.c code rather than
> > having to rewrite it from scratch. Decoding x86 instruction semantics is
> > complicated :-)
>
> Yay iterative transformation with regression testing.  (And nothing says
> regression testing like booting a Linux distro under the sucker.)

Exactly.

> > Many of the simpler dyngen ops can be replaced with a single qop. Others
> > can be replaced with a sequence of a few qops. Some of the more
> > complicated ones may need to be moved into helper functions.
>
> At some point, I hope to understand helper functions.  But I'm not there
> yet.
>
> > > I need to re-read this later.  My brain's full and I'm deeply confused.
> >
> > I started off by saying qops were effectively instructions for an
> > imaginary machine. translate-all.c rearranges them so they match up very
> > closely with the instructions available on the host. Once this has been
> > done turning them into binary code is relatively simple.
>
> I sort of thought this is what it was already doing, but apparently not...

We're getting confused with tenses. I mean that once translate-all.c has
rearranged the qops, we *do* generate host instructions from them without too
much effort.

> > If native host FP is not available qemu will include appropriate bits so
> > that after macro expansion and inlining you end up with:
> >
> >   tmp = gen_new_qreg(QMODE_I32);
> >   gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR).
> >
> > and the addf32 helper does the floating point addition using the
> > "softfloat" library. The qemu softfloat library implementation may
> > actually use hardware floating point rather than doing everything
> > manually.
>
> No reason (except speed) the code output into a translation block can't do
> function calls.  I think.

That's exactly what a helper function is. Calling functions is complicated, so
I've restricted the functions that can be called to explicitly declared
helper functions.
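
Roughly, and only as a sketch: the table layout, the HELPER_* ids and
call_helper below are hypothetical names, not the real declarations, and
plain host float arithmetic stands in for the softfloat routines. The
point is that generated code can only reach functions that were declared
up front.

```c
/* Hypothetical helper table; the identifiers are illustrative only. */
typedef float (*helper_f32)(float, float);

static float helper_addf32(float a, float b) { return a + b; }
static float helper_subf32(float a, float b) { return a - b; }

enum { HELPER_ADDF32, HELPER_SUBF32, HELPER_COUNT };

/* Every callable helper must have an entry here. */
static const helper_f32 helper_table[HELPER_COUNT] = {
    [HELPER_ADDF32] = helper_addf32,
    [HELPER_SUBF32] = helper_subf32,
};

/* Call a declared helper by id and store its result in *dst.
   Returns 0 on success, or -1 if the id is not in the table. */
static int call_helper(int id, float *dst, float a, float b)
{
    if (id < 0 || id >= HELPER_COUNT)
        return -1;
    *dst = helper_table[id](a, b);
    return 0;
}
```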

Paul