* [Qemu-devel] qemu vs gcc4
@ 2006-10-20 18:53 K. Richard Pixley
2006-10-22 22:06 ` Johannes Schindelin
2006-10-23 1:27 ` Rob Landley
0 siblings, 2 replies; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-20 18:53 UTC (permalink / raw)
To: qemu-devel
Could someone please explain the issue with gcc4, or point me
to an existing explanation?
I mean, I understand that qemu is believed to be building incorrectly
with gcc4. But what is the failure mode folks have been seeing? And
what's being done about it or what needs to be done about it? Is this
an issue for all targets or is it x86 specific?
Thanks,
--rich
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-20 18:53 [Qemu-devel] qemu vs gcc4 K. Richard Pixley
@ 2006-10-22 22:06 ` Johannes Schindelin
2006-10-23 8:16 ` Martin Guy
2006-10-23 1:27 ` Rob Landley
1 sibling, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2006-10-22 22:06 UTC (permalink / raw)
To: K. Richard Pixley; +Cc: qemu-devel
Hi K. Richard,
On Fri, 20 Oct 2006, K. Richard Pixley wrote:
> Could someone please explain the issue with gcc4, please? Or point me
> to an existing explanation?
The issue is that gcc4 optimizes better, but this breaks assumptions of
QEmu.
Example: The basic idea (simplified!) of QEmu is writing C functions which
implement the instructions of the target CPU. Then, code to be emulated is
translated by chaining the _compiled_ functions (corresponding to the
target code) together, but _leaving_ out the return instruction at the end
of the function (otherwise, the resulting code would return already after
the first emulated instruction).
Now, gcc4 can produce code with several return instructions (with no
option to turn that off, as far as I understand). You cannot cut them out,
and therefore you cannot chain the simple functions.
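A minimal sketch of the chaining idea described above, using plain byte
arrays rather than real compiled ops. The op names, the helper names
(emit_op, build_block) and the hand-written x86 byte sequences are
assumptions made for illustration only; they are not taken from the qemu
sources. The point is just that each op body is copied without its last
byte, which only works if that byte is the one and only return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define X86_RET 0xC3  /* one-byte near return on x86 */

/* Hypothetical compiled op bodies (32-bit x86), each ending in one ret. */
static const unsigned char op_inc_eax[]     = { 0x40, X86_RET };       /* inc eax      */
static const unsigned char op_add_eax_ebx[] = { 0x01, 0xD8, X86_RET }; /* add eax, ebx */

/* Append one op body to the buffer, dropping its final ret.
   Returns the new write offset. */
static size_t emit_op(unsigned char *buf, size_t pos,
                      const unsigned char *op, size_t len)
{
    /* The scheme relies on the body ending in a ret that can be cut off. */
    assert(len > 0 && op[len - 1] == X86_RET);
    memcpy(buf + pos, op, len - 1);
    return pos + len - 1;
}

/* Chain two ops and terminate the whole block with a single ret. */
static size_t build_block(unsigned char *buf)
{
    size_t pos = 0;
    pos = emit_op(buf, pos, op_inc_eax, sizeof op_inc_eax);
    pos = emit_op(buf, pos, op_add_eax_ebx, sizeof op_add_eax_ebx);
    buf[pos++] = X86_RET;
    return pos;
}
```

If the compiler places a second ret somewhere in the middle of an op body,
cutting the last byte no longer removes every exit, and the chained block
returns early.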
There seem to be other issues, too, like not being able to correctly link
the user emulation code, but I am not that sure about it.
> And what's being done about it or what needs to be done about it?
Paul started to implement a hand-written translator, which does not depend
on gcc, but I guess that project is stalled.
Ciao,
Dscho
* Re: [Qemu-devel] qemu vs gcc4
2006-10-20 18:53 [Qemu-devel] qemu vs gcc4 K. Richard Pixley
2006-10-22 22:06 ` Johannes Schindelin
@ 2006-10-23 1:27 ` Rob Landley
2006-10-23 1:44 ` Paul Brook
2006-10-23 1:45 ` Johannes Schindelin
1 sibling, 2 replies; 43+ messages in thread
From: Rob Landley @ 2006-10-23 1:27 UTC (permalink / raw)
To: qemu-devel
On Friday 20 October 2006 2:53 pm, K. Richard Pixley wrote:
> Could someone please explain the issue with gcc4, please? Or point me
> to an existing explanation?
>
> I mean, I understand that qemu is believed to be building incorrectly
> with gcc4. But what is the failure mode folks have been seeing? And
> what's being done about it or what needs to be done about it? Is this
> an issue for all targets or is it x86 specific?
There's a patch to fix it in http://busybox.net/downloads/qemu (which works
for 0.8.0 through 0.8.2, dunno about current cvs). This is a collection of
four different patches I got from a gentoo web page via google.
Basically, gcc changed in a way that broke qemu. There's been an open bug
report in gcc ever since, but the GCC developers really aren't interested in
backwards compatibility. (Heck, gcc 4.0 breaks building bash 2.05b). The
qemu developers aren't interested in applying ugly patches to support gcc 4.x
until gcc 3.x becomes so obsolete nobody ships it anymore. (And considering
that there are still some niche embedded boards that have hacked up versions
of gcc 2.95 targeting them and nothing else, I wouldn't be surprised if in
five years we have your main compiler and the compiler to build qemu, ala
kgcc under Red Hat 7. *shrug*)
I was pondering trying to get tcc to build qemu, and even made a mercurial copy
of CVS and started collecting old patches from the list (since CVS hadn't had
a single patch checked into it in eight months and there were other old
patches from a full year ago which I needed to apply before the thing would
work on Ubuntu 6.06). But Fabrice showed back up on Tuesday and checked in a
patch, and now I've got a fork that's out of sync with mainline. Since I
have no desire to be in Fabrice's way if he still has any interest in the
project, I've mothballed my fork and moved on to other things...
The current state of TCC trying to build qemu-0.8.2 is that it blows up on the
very first file. Just getting it to compile the full source (let alone
actually work) seems like a significant undertaking, but I was trying it as a
learning experience so who knows how much work it actually is. Might be
simple if you know what you're doing...
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 1:27 ` Rob Landley
@ 2006-10-23 1:44 ` Paul Brook
2006-10-23 1:45 ` Johannes Schindelin
1 sibling, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-23 1:44 UTC (permalink / raw)
To: qemu-devel
> Basically, gcc changed in a way that broke qemu. There's been an open bug
> report in gcc ever since, but the GCC developers really aren't interested
> in backwards compatibility.
That's not entirely true. There are two problems:
- qemu makes assumptions about the layout of the code gcc generates. This
works by chance on older gcc. This affects all hosts, and is not a gcc bug.
- qemu reserves several registers for its own use. On architecturally crippled
hosts (i.e. x86) this means we hit really obscure gcc bugs because gcc runs
out of registers. This is a gcc bug, but is also relatively easy to work
around.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 1:27 ` Rob Landley
2006-10-23 1:44 ` Paul Brook
@ 2006-10-23 1:45 ` Johannes Schindelin
2006-10-23 17:53 ` K. Richard Pixley
2006-10-23 18:08 ` Rob Landley
1 sibling, 2 replies; 43+ messages in thread
From: Johannes Schindelin @ 2006-10-23 1:45 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
Hi Rob,
On Sun, 22 Oct 2006, Rob Landley wrote:
> Basically, gcc changed in a way that broke qemu.
Yes, they did. But even if I understand your frustration (which I share),
I also understand the gcc people. After all, using gcc to create the
blocks for dynamic translation is a _hack_. The result of a compiler run,
though, should work and run -- as fast as possible. So basically, the gcc
people want to achieve a different goal from what we misuse their program
for.
> I was pondering trying to get tcc to build qemu,
(since tcc only supports x86 targets, this is not really a solution.)
> and even made a mercurial copy [...] But Fabrice showed back up on
> Tuesday and checked in a patch, and now I've got a fork that's out of
> sync with mainline.
I do not really know Mercurial, but it should make it really easy to merge
two branches (as far as I have been told).
Ciao,
Dscho
* Re: [Qemu-devel] qemu vs gcc4
2006-10-22 22:06 ` Johannes Schindelin
@ 2006-10-23 8:16 ` Martin Guy
2006-10-23 12:20 ` Paul Brook
2006-10-23 17:41 ` K. Richard Pixley
0 siblings, 2 replies; 43+ messages in thread
From: Martin Guy @ 2006-10-23 8:16 UTC (permalink / raw)
To: qemu-devel
> Now, gcc4 can produce code with several return instructions (with no
> option to turn that off, as far as I understand). You cannot cut them out,
> and therefore you cannot chain the simple functions.
...unless you also map return instructions within the generated
functions into branches to the soon-to-be-dropped final "return"? Not
that I know anything about qemu internals mind u...
M
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 8:16 ` Martin Guy
@ 2006-10-23 12:20 ` Paul Brook
2006-10-23 13:59 ` Avi Kivity
2006-10-23 17:41 ` K. Richard Pixley
1 sibling, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-23 12:20 UTC (permalink / raw)
To: qemu-devel
On Monday 23 October 2006 09:16, Martin Guy wrote:
> > Now, gcc4 can produce code with several return instructions (with no
> > option to turn that off, as far as I understand). You cannot cut them out,
> > and therefore you cannot chain the simple functions.
>
> ...unless you also map return instructions within the generated
> functions into branches to the soon-to-be-dropped final "return"? Not
> that I know anything about qemu internals mind u...
That's exactly what my gcc4 hacks do.
It gets complicated because x86 uses variable length insn encodings so you
don't know where insn boundaries are, and a jmp instruction is larger than a
ret instruction so it's not always possible to do a straight replacement.
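A rough sketch of the two difficulties mentioned above, assuming 32-bit x86
encodings. The function name count_ret_bytes and the sample byte sequence
are made up for illustration. A near ret is the single byte C3, while a
short jmp (EB rel8) takes 2 bytes and a near jmp (E9 rel32) takes 5, so a
ret cannot simply be overwritten in place. And since C3 can also appear
inside another instruction, a plain byte scan cannot tell a real ret from
an operand byte.

```c
#include <assert.h>
#include <stddef.h>

enum {
    X86_RET       = 0xC3, /* near return opcode         */
    RET_LEN       = 1,    /* C3                         */
    JMP_REL8_LEN  = 2,    /* EB xx                      */
    JMP_REL32_LEN = 5     /* E9 xx xx xx xx             */
};

/* mov ebx, eax ; ret  encodes as 89 C3 C3.
   The first C3 is the ModRM byte of the mov, not a return. */
static const unsigned char sample[] = { 0x89, 0xC3, 0xC3 };

/* Naive scan: count every byte equal to the ret opcode. */
static size_t count_ret_bytes(const unsigned char *code, size_t len)
{
    size_t n = 0;
    assert(code != NULL);
    for (size_t i = 0; i < len; i++)
        if (code[i] == X86_RET)
            n++;
    return n;
}
```

On the sample above the naive scan reports two candidate returns although
the code contains only one, which is why a correct rewrite needs to know
where the instruction boundaries are.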
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 12:20 ` Paul Brook
@ 2006-10-23 13:59 ` Avi Kivity
2006-10-23 14:10 ` Paul Brook
0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2006-10-23 13:59 UTC (permalink / raw)
To: paul; +Cc: qemu-devel
Paul Brook wrote:
> On Monday 23 October 2006 09:16, Martin Guy wrote:
>
>>> Now, gcc4 can produce code with several return instructions (with no
>>> option to turn that off, as far as I understand). You cannot cut them out,
>>> and therefore you cannot chain the simple functions.
>>>
>> ...unless you also map return instructions within the generated
>> functions into branches to the soon-to-be-dropped final "return"? Not
>> that I know anything about qemu internals mind u...
>>
>
> That's exactly what my gcc4 hacks do.
>
> It gets complicated because x86 uses variable length insn encodings so you
> don't know where insn boundaries are, and a jmp instruction is larger than a
> ret instruction so it's not always possible to do a straight replacement.
>
How about

    void some_generated_instruction(u32 a1, u32 s2)
    {
        // code
        asm volatile ( "" );
    }

That will force the code to fall through to the null asm code, avoiding
premature returns.
If the code uses 'return' explicitly, turn it into a goto to a label just
before the 'asm volatile'.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 13:59 ` Avi Kivity
@ 2006-10-23 14:10 ` Paul Brook
2006-10-23 14:28 ` Avi Kivity
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-23 14:10 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
> > That's exactly what my gcc4 hacks do.
> >
> > It gets complicated because x86 uses variable length insn encodings so
> > you don't know where insn boundaries are, and a jmp instruction is larger
> > than a ret instruction so it's not always possible to do a straight
> > replacement.
>
> how about
>
> void some_generated_instruction(u32 a1, u32 s2)
> {
> // code
> asm volatile ( "" );
> }
>
>
> that will force the code to fall through to the null asm code, avoiding
> premature returns.
>
> if the code uses 'return' explicitly, turn it to a goto just before the
> 'asm volatile'.
We already do that. It doesn't stop gcc putting the return in the middle of
the function.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 14:10 ` Paul Brook
@ 2006-10-23 14:28 ` Avi Kivity
2006-10-23 14:31 ` Paul Brook
0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2006-10-23 14:28 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
Paul Brook wrote:
>>> That's exactly what my gcc4 hacks do.
>>>
>>> It gets complicated because x86 uses variable length insn encodings so
>>> you don't know where insn boundaries are, and a jmp instruction is larger
>>> than a ret instruction so it's not always possible to do a straight
>>> replacement.
>>>
>> how about
>>
>> void some_generated_instruction(u32 a1, u32 s2)
>> {
>> // code
>> asm volatile ( "" );
>> }
>>
>>
>> that will force the code to fall through to the null asm code, avoiding
>> premature returns.
>>
>> if the code uses 'return' explicitly, turn it to a goto just before the
>> 'asm volatile'.
>>
>
> We already do that. It doesn't stop gcc putting the return in the middle of
> the function.
>
> Paul
>
    void f1();
    void f2();

    void f(int *z, int x, int y)
    {
        if (x) {
            *z = x;
            f1();
        } else {
            *z = y;
            f2();
        }
        asm volatile ("");
    }

This works with gcc -O2 -fno-reorder-blocks. Removing either the asm or the
-f flag breaks it. No idea if it's consistent across architectures.
(The function calls are there to prevent cmov optimizations.)
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 14:28 ` Avi Kivity
@ 2006-10-23 14:31 ` Paul Brook
2006-10-23 14:35 ` Avi Kivity
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-23 14:31 UTC (permalink / raw)
To: qemu-devel
> > We already do that. It doesn't stop gcc putting the return in the middle
> > of the function.
> >
> > Paul
>
> void f1();
> void f2();
>
> void f(int *z, int x, int y)
> {
> if (x) {
> *z = x;
> f1();
> } else {
> *z = y;
> f2();
> }
> asm volatile ("");
> }
>
> works, with gcc -O2 -fno-reorder-blocks. removing either the asm or the
> -f flag doesn't. No idea if it's consistent across architectures.
It doesn't work reliably though. We already do everything you mention above.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 14:31 ` Paul Brook
@ 2006-10-23 14:35 ` Avi Kivity
0 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2006-10-23 14:35 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
Paul Brook wrote:
>>> We already do that. It doesn't stop gcc putting the return in the middle
>>> of the function.
>>>
>>> Paul
>>>
>> void f1();
>> void f2();
>>
>> void f(int *z, int x, int y)
>> {
>> if (x) {
>> *z = x;
>> f1();
>> } else {
>> *z = y;
>> f2();
>> }
>> asm volatile ("");
>> }
>>
>> works, with gcc -O2 -fno-reorder-blocks. removing either the asm or the
>> -f flag doesn't. No idea if it's consistent across architectures.
>>
>
> It doesn't work reliably though. We already do everything you mention above.
>
Okay. Sorry for pestering :)
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 8:16 ` Martin Guy
2006-10-23 12:20 ` Paul Brook
@ 2006-10-23 17:41 ` K. Richard Pixley
2006-10-23 17:58 ` Paul Brook
1 sibling, 1 reply; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-23 17:41 UTC (permalink / raw)
To: Martin Guy; +Cc: qemu-devel
Martin Guy wrote:
>> Now, gcc4 can produce code with several return instructions (with no
>> option to turn that off, as far as I understand). You cannot cut them
>> out,
>> and therefore you cannot chain the simple functions.
>
> ...unless you also map return instructions within the generated
> functions into branches to the soon-to-be-dropped final "return"? Not
> that I know anything about qemu internals mind u...
Seems to me one could also map them into jumps to a null function.
Although, all told, it would seem to me that what might be called for
here is a new gcc target. A gcc target specifically for generating qemu
code. That would just simply generate whatever qemu wanted for function
postamble.
It would probably mean separating out the code intended to run as native
code from the code intended to run on behalf of the emulated target, and
it would mean that you'd need a "gcc-qemu" to build the latter, but it
would solve the problem permanently. It could also then be done in a
cpu independent fashion such that any gcc target port might be converted
trivially into a gcc target-for-qemu port. This should also make the
chaining task much simpler and since that would seem to need to be done
at run time, this could easily be a performance enhancement as well.
I see two real downsides to this approach. The first is that qemu
becomes wed to gcc. That seems to be a de facto requirement now, but
using a custom gcc target would make that marriage pretty permanent.
Creating qemu targets for other compilers would be near impossible,
although if the code were properly separated, you could still use a
non-gcc target for the intended-for-host instructions.
The second downside is that some of the qemu support stuff would no
longer be in the qemu code distribution. Instead, it would be in gcc.
This opens the possibility of version skew problems and of disputes over
who has authority over maintenance in the long term. Administratively, it'd
be an additional load.
--rich
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 1:45 ` Johannes Schindelin
@ 2006-10-23 17:53 ` K. Richard Pixley
2006-10-23 18:08 ` Rob Landley
1 sibling, 0 replies; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-23 17:53 UTC (permalink / raw)
To: qemu-devel
Johannes Schindelin wrote:
> On Sun, 22 Oct 2006, Rob Landley wrote:
>
>> Basically, gcc changed in a way that broke qemu.
>>
> Yes, they did. But even if I understand your frustration (which I share),
> I also understand the gcc people. After all, using gcc to create the
> blocks for dynamic translation is a _hack_.
Yes, it is a hack. And short of some guarantees from gcc (which we
don't have), it is destined to be an ongoing issue.
> The result of a compiler run,
> though, should work and run -- as fast as possible. So basically, the gcc
> people want to achieve a different goal from what we misuse their program
> for.
Creating a qemu variant target for gcc would address both of these
concerns. It would introduce new ones, of course, but it would address
these two.
--rich
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 17:41 ` K. Richard Pixley
@ 2006-10-23 17:58 ` Paul Brook
2006-10-23 18:04 ` K. Richard Pixley
2006-10-30 4:35 ` Rob Landley
0 siblings, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-23 17:58 UTC (permalink / raw)
To: qemu-devel
On Monday 23 October 2006 18:41, K. Richard Pixley wrote:
> Martin Guy wrote:
> >> Now, gcc4 can produce code with several return instructions (with no
> >> option to turn that off, as far as I understand). You cannot cut them
> >> out,
> >> and therefore you cannot chain the simple functions.
> >
> > ...unless you also map return instructions within the generated
> > functions into branches to the soon-to-be-dropped final "return"? Not
> > that I know anything about qemu internals mind u...
>
> Seems to me one could also map them into jumps to a null function.
That doesn't work because you need to free the stack frame.
> Although, all told, it would seem to me that what might be called for
> here is a new gcc target. A gcc target specifically for generating qemu
> code. That would just simply generate whatever qemu wanted for function
> postamble.
Better to just teach qemu how to generate code.
In fact I've already done most of the infrastructure (and a fair amount of the
legwork) for this. The only major missing function is code to do softmmu
load/store ops.
https://nowt.dyndns.org/
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 17:58 ` Paul Brook
@ 2006-10-23 18:04 ` K. Richard Pixley
2006-10-23 18:20 ` Laurent Desnogues
2006-10-23 18:37 ` Paul Brook
2006-10-30 4:35 ` Rob Landley
1 sibling, 2 replies; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-23 18:04 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
Paul Brook wrote:
> Better to just teach qemu how to generate code.
> In fact I've already done most of the infrastructure (and a fair amount of the
> legwork) for this. The only major missing function is code to do softmmu
> load/store ops.
> https://nowt.dyndns.org/
Well, perhaps. Except that with gcc, we get to leverage the ongoing gcc
optimizations, bug fixes, new cpu support, debugger support, etc.
Granted, not all of these are going to be relevant to the qemu
environment, but in a contest between gcc generated code and qemu
generated code, I'll bet on gcc most days.
No doubt there are times when a gcc optimization takes so long that it
costs more time to optimize than would be won back by the running code.
Presumably, qemu generated code would be able to make better decisions
here. Except that we're not talking about using gcc in real time, are
we? So essentially we have near infinite time for optimizations.
--rich
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 1:45 ` Johannes Schindelin
2006-10-23 17:53 ` K. Richard Pixley
@ 2006-10-23 18:08 ` Rob Landley
1 sibling, 0 replies; 43+ messages in thread
From: Rob Landley @ 2006-10-23 18:08 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: qemu-devel
On Sunday 22 October 2006 9:45 pm, Johannes Schindelin wrote:
> > I was pondering trying to get tcc to build qemu,
>
> (since tcc only supports x86 targets, this is not really a solution.)
No, it supports arm as well. (And I merged a recent patch to support arm
EABI.) I remember hearing about a PPC patch (although I never tracked that
down), and I was looking into what I needed to do to make it support x86-64.
> > and even made a mercurial copy [...] But Fabrice showed back up on
> > Tuesday and checked in a patch, and now I've got a fork that's out of
> > sync with mainline.
>
> I do not really know Mercurial, but it should make it really easy to merge
> two branches (as far as I have been told).
That's the general idea, yes. (In this case what was merged is a reworking of
a patch I already merged, which I could essentially ignore for now.) The
problem is at a higher level: I'd created a fork based off a project that
looked abandoned, but it turned out not to be abandoned, so the fork looks
like a bad idea in retrospect. *shrug* No shortage of other projects to
work on. (Like QEMU: I still haven't managed to install the x86_64 version
of ubuntu. An older version hung when it got to the desktop, in last week's
version I couldn't even get the bios to bring up grub. Need to thump on it
again, but I'm not quite sure how to debug this.)
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 18:04 ` K. Richard Pixley
@ 2006-10-23 18:20 ` Laurent Desnogues
2006-10-23 18:37 ` Paul Brook
1 sibling, 0 replies; 43+ messages in thread
From: Laurent Desnogues @ 2006-10-23 18:20 UTC (permalink / raw)
To: qemu-devel
K. Richard Pixley wrote:
> Well, perhaps. Except that with gcc, we get to leverage the ongoing gcc
> optimizations, bug fixes, new cpu support, debugger support, etc.
> Granted, not all of these are going to be relevant to the qemu
> environment, but in a contest between gcc generated code and qemu
> generated code, I'll bet on gcc most days.
>
> No doubt there are times when a gcc optimization takes so long that it
> costs more time to optimize than would be won back by the running code.
> Presumably, qemu generated code would be able to make better decisions
> here. Except that we're not talking about using gcc in real time, are
> we? So essentially we have near infinite time for optimizations.
One emulated instruction is a small C function with very little
opportunity for optimization.
On top of that, for instance, calculating flags can be done much
more efficiently at assembly level by using host flags.
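To make the flags point concrete, here is a hedged sketch of what an op
written in portable C has to do to recover x86-style condition flags for a
32-bit add. The struct and function names are hypothetical and this is not
how qemu's op.c is written; it only shows the extra arithmetic that the
host CPU would otherwise provide for free in its own flags register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four flags derived below. */
struct flags { unsigned cf, zf, sf, of; };

/* 32-bit add that also derives carry, zero, sign and overflow in C. */
static uint32_t add32_with_flags(uint32_t a, uint32_t b, struct flags *f)
{
    uint32_t res = a + b;                   /* wraps modulo 2^32 */
    assert(f != 0);
    f->cf = res < a;                        /* unsigned carry out          */
    f->zf = res == 0;                       /* result is zero              */
    f->sf = res >> 31;                      /* sign bit of the result      */
    f->of = ((a ^ res) & (b ^ res)) >> 31;  /* operands agree, result differs */
    return res;
}
```

On an x86 host a single add instruction leaves all four of these in the
host flags, which is the efficiency argument being made here.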
All that gcc brings here is portability.
Laurent
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 18:04 ` K. Richard Pixley
2006-10-23 18:20 ` Laurent Desnogues
@ 2006-10-23 18:37 ` Paul Brook
2006-10-24 23:39 ` Rob Landley
2006-10-31 16:53 ` Rob Landley
1 sibling, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-23 18:37 UTC (permalink / raw)
To: qemu-devel
On Monday 23 October 2006 19:04, K. Richard Pixley wrote:
> Paul Brook wrote:
> > Better to just teach qemu how to generate code.
> > In fact I've already done most of the infrastructure (and a fair amount
> > of the legwork) for this. The only major missing function is code to do
> > softmmu load/store ops.
> > https://nowt.dyndns.org/
>
> Well, perhaps. Except that with gcc, we get to leverage the ongoing gcc
> optimizations, bug fixes, new cpu support, debugger support, etc.
> Granted, not all of these are going to be relevant to the qemu
> environment, but in a contest between gcc generated code and qemu
> generated code, I'll bet on gcc most days.
>
> No doubt there are times when a gcc optimization takes so long that it
> costs more time to optimize than would be won back by the running code.
> Presumably, qemu generated code would be able to make better decisions
> here. Except that we're not talking about using gcc in real time, are
> we? So essentially we have near infinite time for optimizations.
The code we're talking about (op.c) is sufficiently small and simple that
there's nothing the compiler can do with it. In fact many of the ops map
directly onto a single assembly instruction.
To get better translated code we need to do inter-op optimization as code is
translated (even if it's only simple things like register allocation). This
requires qemu be able to generate code at runtime.
Using the gcc backends for dynamic code generation isn't a realistic option.
They're simply too heavyweight to be used "in real time". qemu needs to be
able to efficiently generate short, simple code blocks. Most of the gcc
infrastructure is for optimizations that take longer to run than we're ever
going to get back in improved performance.
I did look at integrating an existing JIT compiler into qemu, but couldn't
find one that fitted nicely, and allowed an incremental conversion.
It turns out that qemu already does most of the hard work, and a code
generation backend is fairly simple. The diff for my current implementation
is <2k lines of common code, plus <1k lines for each of x86, amd64 and ppc32
hosts.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 18:37 ` Paul Brook
@ 2006-10-24 23:39 ` Rob Landley
2006-10-25 0:24 ` Paul Brook
2006-10-31 16:53 ` Rob Landley
1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-24 23:39 UTC (permalink / raw)
To: qemu-devel; +Cc: Paul Brook
On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> It turns out that qemu already does most of the hard work, and a code
> generation backend is fairly simple. The diff for my current implementation
> is <2k lines of common code, plus <1k lines for each of x86, amd64 and ppc32
> hosts.
My understanding is that the version you linked to with your new backend
currently _only_ supports coldfire/m68k?
Do you have a quick "here's how you try it out" thing? (For example, when I
first show people qemu I boot a knoppix cd image under it. Fast and
shiny. :)
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-10-24 23:39 ` Rob Landley
@ 2006-10-25 0:24 ` Paul Brook
2006-10-25 19:39 ` Rob Landley
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-25 0:24 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
On Wednesday 25 October 2006 00:39, Rob Landley wrote:
> On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> > It turns out that qemu already does most of the hard work, and a code
> > generation backend is fairly simple. The diff for my current
> > implementation is <2k lines of common code, plus <1k lines for each of
> > x86, amd64 and ppc32 hosts.
>
> My understanding is that the version you linked to with your new backend
> currently _only_ supports coldfire/m68k?
ColdFire is the only target that uses it exclusively. Arm is currently a
hybrid of dyngen and the new backend. So is i386, to a lesser extent. Other
targets have minimal changes necessary to make them work.
> Do you have a quick "here's how you try it out" thing? (For example, when
> I first show people qemu I boot a knoppix cd image under it. Fast and
> shiny. :)
One of my goals when writing it was to be able to reuse most of the existing
qemu code. There should be no user-visible impact. Unless you already
understand how qemu/dyngen works it's not going to mean a lot to you. The end
result is very similar, just a slightly different strategy for getting there.
In theory it should allow better performance, but that's still a way off.
https://nowt.dyndns.org/ has patches against cvs (though they may be slightly
out of date), and a complete svn repository you can checkout. Build it just
like normal qemu.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-25 0:24 ` Paul Brook
@ 2006-10-25 19:39 ` Rob Landley
2006-10-26 18:09 ` Daniel Jacobowitz
0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-25 19:39 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
On Tuesday 24 October 2006 8:24 pm, Paul Brook wrote:
> ColdFire is the only target that uses it exclusively. Arm is currently a
> hybrid of dyngen and the new backend. So is i386, to a lesser extent.
> Other targets have minimal changes necessary to make them work.
Ok.
> > Do you have a quick "here's how you try it out" thing? (For example, when
> > I first show people qemu I boot a knoppix cd image under it. Fast and
> > shiny. :)
>
> One of my goals when writing it was to be able to reuse most of the existing
> qemu code. There should be no user-visible impact. Unless you already
> understand how qemu/dyngen works it's not going to mean a lot to you.
I read Fabrice's presentation, and looked through the code a bit,
but "understand" is _way_ too strong a word. :)
> The end result is very similar, just a slightly different strategy for
> getting there.
A strategy that might work with gcc 4.x? :)
> In theory it should allow better performance, but that's still a way off.
I was poking at the tcc code to generate stuff and optimize a couple weeks
ago. I don't suppose there's any possible re-use between the two?
> https://nowt.dyndns.org/ has patches against cvs (though they may be
> slightly out of date), and a complete svn repository you can checkout. Build
> it just like normal qemu.
Which in my case means applying the patch to get it to build with gcc 4.x,
which does indeed apply without rejects to your svn repository.
Unfortunately, the result doesn't build:
gcc -Wall -O2 -g -fno-strict-aliasing -I. -I.. -I/home/landley/qemu/nowt.dyndns.org/qemu/target-sparc -I/home/landley/qemu/nowt.dyndns.org/qemu -I/home/landley/qemu/nowt.dyndns.org/qemu/host-i386 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -I/home/landley/qemu/nowt.dyndns.org/qemu/fpu -I/home/landley/qemu/nowt.dyndns.org/qemu/slirp -c -o tcx.o /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In function ‘tcx_draw_line32’:
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:94: error: invalid lvalue in increment
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In function ‘tcx_draw_line16’:
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:106: error: invalid lvalue in increment
make[1]: *** [tcx.o] Error 1
make[1]: Leaving directory `/home/landley/qemu/nowt.dyndns.org/qemu/sparc-softmmu'
make: *** [subdir-sparc-softmmu] Error 2
I don't have gcc 3.x installed on my laptop.
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-25 19:39 ` Rob Landley
@ 2006-10-26 18:09 ` Daniel Jacobowitz
0 siblings, 0 replies; 43+ messages in thread
From: Daniel Jacobowitz @ 2006-10-26 18:09 UTC (permalink / raw)
To: qemu-devel
On Wed, Oct 25, 2006 at 03:39:18PM -0400, Rob Landley wrote:
> gcc -Wall -O2 -g -fno-strict-aliasing -I. -I.. -I/home/landley/qemu/nowt.dyndns.org/qemu/target-sparc -I/home/landley/qemu/nowt.dyndns.org/qemu -I/home/landley/qemu/nowt.dyndns.org/qemu/host-i386 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -I/home/landley/qemu/nowt.dyndns.org/qemu/fpu -I/home/landley/qemu/nowt.dyndns.org/qemu/slirp -c -o
> tcx.o /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In
> function ‘tcx_draw_line32’:
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:94: error: invalid lvalue in
> increment
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In
> function ‘tcx_draw_line16’:
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:106: error: invalid lvalue in
> increment
This is an unrelated problem, and much easier to fix. Don't increment
casts.
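
A minimal sketch of the fix, assuming the offending lines in tcx.c have the
common shape "((uint32_t *)d)++" (the actual source lines are not shown in
this thread). gcc 3.x accepted a cast as an lvalue as an extension; gcc 4
rejects it. The function name fill_line32 and its parameters are made up for
illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Old style, rejected by gcc 4 with "invalid lvalue in increment":
 *
 *     *((uint32_t *)d)++ = pixel;
 *
 * The result of a cast is a value, not an object, so it cannot be
 * incremented. Cast once into a typed pointer variable and increment that. */

/* Store `count` copies of `pixel` starting at the untyped buffer `dst`
 * and return the address just past the last pixel written. */
static uint8_t *fill_line32(uint8_t *dst, uint32_t pixel, size_t count)
{
    uint32_t *p = (uint32_t *)dst;   /* cast once, outside the loop */
    for (size_t i = 0; i < count; i++) {
        *p++ = pixel;                /* increments the variable p, not a cast */
    }
    return (uint8_t *)p;
}
```

The typed local pointer does the same work as the old idiom and compiles
with both compiler generations.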
--
Daniel Jacobowitz
CodeSourcery
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 17:58 ` Paul Brook
2006-10-23 18:04 ` K. Richard Pixley
@ 2006-10-30 4:35 ` Rob Landley
2006-10-30 14:56 ` Paul Brook
1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-30 4:35 UTC (permalink / raw)
To: qemu-devel; +Cc: Paul Brook
On Monday 23 October 2006 1:58 pm, Paul Brook wrote:
> > Although, all told, it would seem to me that what might be called for
> > here is a new gcc target. A gcc target specifically for generating qemu
> > code. That would just simply generate whatever qemu wanted for function
> > postamble.
>
> Better to just teach qemu how to generate code.
> In fact I've already done most of the infrastructure (and a fair amount of
> the legwork) for this. The only major missing function is code to do softmmu
> load/store ops.
> https://nowt.dyndns.org/
So given that one of the reasons for doing this would be getting away from
depending on specific and increasingly out of date versions of gcc to build the
thing, what would be involved in getting this version to build under gcc-4.x?
(I tried, and it didn't...)
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-30 4:35 ` Rob Landley
@ 2006-10-30 14:56 ` Paul Brook
2006-10-30 16:31 ` Rob Landley
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-30 14:56 UTC (permalink / raw)
To: qemu-devel
On Monday 30 October 2006 04:35, Rob Landley wrote:
> On Monday 23 October 2006 1:58 pm, Paul Brook wrote:
> > > Although, all told, it would seem to me that what might be called for
> > > here is a new gcc target. A gcc target specifically for generating
> > > qemu code. That would just simply generate whatever qemu wanted for
> > > function postamble.
> >
> > Better to just teach qemu how to generate code.
> > In fact I've already done most of the infrastructure (and a fair amount
> > of the legwork) for this. The only major missing function is code to do
> > softmmu load/store ops.
> > https://nowt.dyndns.org/
>
> So given that one of the reasons for doing this would be getting away from
> depending on specific and increasingly out of date versions of gcc to build
> the thing, what would be involved in getting this version to build under
> gcc-4.x?
Should work pretty much out the box. Obviously if you build anything other
than m68k then all bets are off.
Paul
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-30 14:56 ` Paul Brook
@ 2006-10-30 16:31 ` Rob Landley
2006-10-30 16:50 ` Paul Brook
0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-30 16:31 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
On Monday 30 October 2006 9:56 am, Paul Brook wrote:
> On Monday 30 October 2006 04:35, Rob Landley wrote:
> > On Monday 23 October 2006 1:58 pm, Paul Brook wrote:
> > > > Although, all told, it would seem to me that what might be called for
> > > > here is a new gcc target. A gcc target specifically for generating
> > > > qemu code. That would just simply generate whatever qemu wanted for
> > > > function postamble.
> > >
> > > Better to just teach qemu how to generate code.
> > > In fact I've already done most of the infrastructure (and a fair amount
> > > of the legwork) for this. The only major missing function is code to do
> > > softmmu load/store ops.
> > > https://nowt.dyndns.org/
> >
> > So given that one of the reasons for doing this would be getting away from
> > depending on specific and increasingly out of date versions of gcc to build
> > the thing, what would be involved in getting this version to build under
> > gcc-4.x?
>
> Should work pretty much out the box. Obviously if you build anything other
> than m68k then all bets are off.
It didn't get to "work", it broke building. (The frightening part is that my
patch at http://busybox.net/downloads/qemu applied to your version without
rejects, but although that helped it get farther, it didn't finish.)
I just did a standard "./configure --shutupaboutthecompilerversion; make; make
install". (x86 is the first target I'm interested in, as it's the easiest to
test and you said it's using at least some of the new code...)
> Paul
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-30 16:31 ` Rob Landley
@ 2006-10-30 16:50 ` Paul Brook
2006-10-30 22:54 ` Stephen Torri
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-30 16:50 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
> > > So given that one of the reasons for doing this would be getting away
> > > from depending on specific and increasingly out of date versions of gcc
> > > to build the thing, what would be involved in getting this version to
> > > build under gcc-4.x?
> >
> > Should work pretty much out the box. Obviously if you build anything
> > other than m68k then all bets are off.
>
> It didn't get to "work", it broke building. (The frightening part is that my
> patch at http://busybox.net/downloads/qemu applied to your version without
> rejects, but although that helped it get farther, it didn't finish.)
>
> I just did a standard "./configure --shutupaboutthecompilerversion; make;
> make install". (x86 is the first target I'm interested in, as it's the
> easiest to test and you said it's using at least some of the new code...)
As I said before, the x86 target is a hybrid of the new and old code, i.e. if
it didn't work before it probably won't work after. Configure
with --target-list=m68k-user and it should work fine with gcc4.
Paul
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-30 16:50 ` Paul Brook
@ 2006-10-30 22:54 ` Stephen Torri
2006-10-30 23:13 ` Paul Brook
0 siblings, 1 reply; 43+ messages in thread
From: Stephen Torri @ 2006-10-30 22:54 UTC (permalink / raw)
To: qemu-devel
> As I said before, the x86 target is a hybrid of the new and old code. ie. if
> it didn't work before it probably won't work after. configure
> with --target-list=m68k-user and it should work fine with gcc4.
>
> Paul
I need an x86 instruction set simulator that can step-by-step execute
code and allow me access to the internals. This is why I have looked at
qemu because of a recommendation from a developer of Ptlsim. It was
suggested that qemu would be lighter weight for what I need. So what do
you suggest I use for an x86 instruction set simulator?
Stephen
--
PhD. Student
Auburn University
Department of Computer Science and Software Engineering
107 Dunstan Hall
Auburn, AL 36849-5347 U.S.A.
(334) 844-4330 (O)
torrisa@auburn.edu
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-30 22:54 ` Stephen Torri
@ 2006-10-30 23:13 ` Paul Brook
0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-30 23:13 UTC (permalink / raw)
To: qemu-devel
On Monday 30 October 2006 22:54, Stephen Torri wrote:
> > As I said before, the x86 target is a hybrid of the new and old code. ie.
> > if it didn't work before it probably won't work after. configure
> > with --target-list=m68k-user and it should work fine with gcc4.
> >
> > Paul
>
> I need an x86 instruction set simulator that can step-by-step execute
> code and allow me access to the internals. This is why I have looked at
> qemu because of a recommendation from a developer of Ptlsim. It was
> suggested that qemu would be lighter weight for what I need. So what do
> you suggest I use for an x86 instruction set simulator?
Use qemu, and build it with gcc3.
Paul
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-23 18:37 ` Paul Brook
2006-10-24 23:39 ` Rob Landley
@ 2006-10-31 16:53 ` Rob Landley
2006-10-31 19:02 ` Paul Brook
1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-31 16:53 UTC (permalink / raw)
To: qemu-devel; +Cc: Paul Brook
On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> > > Better to just teach qemu how to generate code.
> > > In fact I've already done most of the infrastructure (and a fair amount
> > > of the legwork) for this. The only major missing function is code to do
> > > softmmu load/store ops.
> > > https://nowt.dyndns.org/
I looked at the big diff between that and mainline, and couldn't make heads
nor tails of it in the half-hour I spent on it. I also looked at the svn
history, but there's apparently a year and change of it.
I don't suppose there's a design document somewhere? Or could you quickly
explain "old one did this, new one does this, the code path diverges here,
start reading at this point and expect this and this to happen, and if you go
read this unrelated documentation to get up to speed it might help..."
I'd like to add enough of the new code generation stuff to the existing
targets so it doesn't break when built with gcc4, but so far my interest here
greatly outstrips my ability. I don't even know where to start...
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 16:53 ` Rob Landley
@ 2006-10-31 19:02 ` Paul Brook
2006-10-31 20:41 ` Rob Landley
2006-10-31 23:17 ` Rob Landley
0 siblings, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-31 19:02 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
On Tuesday 31 October 2006 16:53, Rob Landley wrote:
> On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> > > > Better to just teach qemu how to generate code.
> > > > In fact I've already done most of the infrastructure (and a fair
> > > > amount of the legwork) for this. The only major missing function is
> > > > code to do softmmu load/store ops.
> > > > https://nowt.dyndns.org/
>
> I looked at the big diff between that and mainline, and couldn't make heads
> nor tails of it in the half-hour I spent on it. I also looked at the svn
> history, but there's apparently a year and change of it.
>
> I don't suppose there's a design document somewhere? Or could you quickly
> explain "old one did this, new one does this, the code path diverges here,
> start reading at this point and expect this and this to happen, and if you
> go read this unrelated documentation to get up to speed it might help..."
Not really.
The basic principle is very similar. Guest code is decomposed into an
intermediate form consisting of simple operations, then native code is
generated from those operations.
In the existing dyngen implementation most operands to ops are implicit, with
only a few ops taking explicit arguments. The principle with the new system
is that all operands are explicit.
The intermediate representation used by the code generator resembles an
imaginary machine. This machine has various different instructions (qops),
and a nominally infinite register file (qregs). Each qop takes zero or more
arguments, each of which may be an input or output.
In addition to dynamically allocated qregs there are a fixed set of qregs that
map onto the guest CPU state. This is to simplify code generation.
Each qreg has a particular type (32/64 bit, integer or float). It's up to you
to make sure the argument types match those expected by the qop. It's
generally fairly obvious from the name. eg. add32 adds I32 values, addf64
adds F64 values, etc. The exception is that I64 values can be used in place
of I32. The upper 32 bits of outputs are undefined in this case, and the value
must be explicitly extended before the full 64 bits are used.
The old dyngen ops are actually implemented as a special case of qops.
As an example take the arm instruction
add r0, r1, r2, lsl #2
This is equivalent to the C expression
r0 = r1 + (r2 << 2)
The old dyngen translate.c would do:
gen_op_movl_T1_r2();
gen_op_shll_T1_im(2);
gen_op_movl_T0_r1();
gen_op_addl(); /* does T0 = T0 + T1 */
gen_op_movl_r0_T0();
When fully converted to the new system this would become:
int tmp = gen_new_qreg(); /* Allocate a temporary reg. */
/* gen_im32 is a helper that allocates a new qreg and
initializes it to an immediate value. */
gen_op_add32(tmp, QREG_R2, gen_im32(2));
gen_op_add32(QREG_R0, QREG_R1, tmp);
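
To make the "all operands are explicit" idea concrete, here is a toy model of
such an intermediate form. Every name in it (toy_qop, TOY_SHL32, toy_im32 and
so on) is invented for this sketch and is not the real qop API; the sketch
uses a shift op for the "r2 << 2" step because that is what the C expression
above calls for. It builds the op list for "add r0, r1, r2, lsl #2" and then
interprets it, where the real system would generate host code instead:

```c
#include <stdint.h>

/* Toy model only: none of these names come from the real qop code. */
enum toy_op { TOY_MOVI32, TOY_SHL32, TOY_ADD32 };

/* Fixed qregs for guest state come first; temporaries are allocated after. */
enum { TOY_R0, TOY_R1, TOY_R2, TOY_FIRST_TEMP, TOY_NUM_QREGS = 16 };
enum { TOY_MAX_OPS = 8 };

struct toy_qop {
    enum toy_op op;
    int dst;        /* output qreg index */
    int a, b;       /* input qreg indexes (unused by TOY_MOVI32) */
    uint32_t imm;   /* immediate value, used by TOY_MOVI32 only */
};

struct toy_block {
    struct toy_qop ops[TOY_MAX_OPS];
    int nops;
    int next_temp;
};

static void toy_block_init(struct toy_block *tb)
{
    tb->nops = 0;
    tb->next_temp = TOY_FIRST_TEMP;
}

/* Allocate a fresh temporary qreg; a qreg is just an integer index. */
static int toy_new_qreg(struct toy_block *tb)
{
    return tb->next_temp++;
}

static void toy_emit(struct toy_block *tb, enum toy_op op,
                     int dst, int a, int b, uint32_t imm)
{
    if (tb->nops >= TOY_MAX_OPS)
        return;                      /* sketch: silently drop on overflow */
    struct toy_qop *q = &tb->ops[tb->nops++];
    q->op = op; q->dst = dst; q->a = a; q->b = b; q->imm = imm;
}

/* Allocate a new qreg and initialize it to an immediate value. */
static int toy_im32(struct toy_block *tb, uint32_t value)
{
    int r = toy_new_qreg(tb);
    toy_emit(tb, TOY_MOVI32, r, 0, 0, value);
    return r;
}

/* Translate "add r0, r1, r2, lsl #2", i.e. r0 = r1 + (r2 << 2). */
static void toy_translate_add_lsl2(struct toy_block *tb)
{
    int tmp = toy_new_qreg(tb);
    int two = toy_im32(tb, 2);
    toy_emit(tb, TOY_SHL32, tmp, TOY_R2, two, 0);
    toy_emit(tb, TOY_ADD32, TOY_R0, TOY_R1, tmp, 0);
}

/* Interpret the op list over a register file (stand-in for code generation). */
static void toy_run(const struct toy_block *tb, uint32_t *qregs)
{
    for (int i = 0; i < tb->nops; i++) {
        const struct toy_qop *q = &tb->ops[i];
        switch (q->op) {
        case TOY_MOVI32: qregs[q->dst] = q->imm; break;
        case TOY_SHL32:  qregs[q->dst] = qregs[q->a] << (qregs[q->b] & 31); break;
        case TOY_ADD32:  qregs[q->dst] = qregs[q->a] + qregs[q->b]; break;
        }
    }
}
```

Note that no op touches a register it was not handed as an argument, which
is the difference from the implicit T0/T1 style of the dyngen sequence.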
One of the changes I've made to target-arm/translate.c is to replace all uses
of T2 with new pseudo-regs. In many cases I've left the code structure as it
was (using the global T0/T1 temporaries), but replaced the dyngen ops with
the equivalent qops. eg. movl and andl now generate mov32 and and32 qops.
The standard qops are defined in qops.def. A target can also define additional
qops in qop-target.def. The target-specific qops are there to simplify
implementation of the i386 static flag propagation pass; see the expand_op_*
routines.
For operations that are too complicated to be expressed as qops there is a
mechanism for calling helper functions. The m68k target uses this for
division and a couple of other things.
The implementation makes fairly heavy use of the C preprocessor to generate
code from .def files. There's also a small shell script that pulls the
definitions of the helper routines out of qop-helper.c.
The debug dumps can be quite useful. In particular -d in_asm,op will dump the
input asm and the resulting OPs.
For converting targets you can probably ignore most of the translate-all and
host-*/ changes. These implement generating code from the qops. This works by
the host defining a set of "hard" qregs that correspond to host CPU
registers, and constraints for the operands of each qop. Then we do register
allocation and spilling to satisfy those constraints. The qops can then be
assembled directly into binary code.
There are also mechanisms for implementing floating point and 64-bit arithmetic
even if the target doesn't support this natively. The target code doesn't
need to worry about this, it just generates 64-bit/fp qops and they will be
decomposed as necessary.
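
As a rough illustration of what such a decomposition could look like, here is
a 64-bit add expressed purely in 32-bit operations on a lo/hi pair. This is a
sketch under assumptions, not the code from the repository: the struct
toy_pair64 and the function toy_add64_via_32 are invented names.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as a 32-bit host would need. */
struct toy_pair64 {
    uint32_t lo;
    uint32_t hi;
};

/* 64-bit add built from 32-bit adds: add the low halves, detect the carry
 * by checking for unsigned wrap-around, then add the high halves plus carry. */
static struct toy_pair64 toy_add64_via_32(struct toy_pair64 a,
                                          struct toy_pair64 b)
{
    struct toy_pair64 r;
    r.lo = a.lo + b.lo;
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  /* wrapped means carry out */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

The translator for the guest only ever asks for the 64-bit operation; a
lowering step of this kind would run later, per host.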
Paul
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 19:02 ` Paul Brook
@ 2006-10-31 20:41 ` Rob Landley
2006-10-31 22:08 ` Paul Brook
2006-10-31 23:17 ` Rob Landley
1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-31 20:41 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
Welcome to Stupid Question Theatre! With your host, Paul Brook. Today's
contestant is: Rob Landley. How dumb will it get?
On Tuesday 31 October 2006 2:02 pm, Paul Brook wrote:
> The basic principle is very similar. Host code is decomposed into an
> intermediate form consisting of simple operations, then native code is
> generated from those operations.
I got that part. It's the how I'm still head-scratching over.
The disassembly routines seem relatively compiler-independent, but I'm under
the impression that turning the intermediate result (the string of qops) into
large blocks of translated code involves gluing together a bunch of smaller
blocks of pregenerated code. These pregenerated blocks were spit out by gcc
and are where all the compiler dependencies that aren't clear bugs come
from.
I thought what you were doing was replacing the pregenerated blocks with
hand-coded assembly statements, but your description here seems to be about
changing the disassembly routines that figure out which qops to string
together in part 2.
> In the existing dyngen implementation most operands to ops are implicit,
> with only a few ops taking explicit arguments. The principle with the new
> system is that all operands are explicit.
Having looked ahead to your example before replying to this, I think I
understand that part now. (Just barely.)
> The intermediate representation used by the code generator resembles an
> imaginary machine. This machine has various different instructions (qops),
> and a nominally infinite register file (qregs).
Each qreg is represented as an integer index?
> Each qop takes zero or more arguments, each of which may be an input or
> output.
The input or output is always one of these qreg indexes? (Some of the
existing ones seem to take immediate values...)
> In addition to dynamically allocated qregs there are a fixed set of qregs
> that map onto the guest CPU state. This is to simplify code generation.
These are indexes 0, 1, and 2?
Ok, looking at target-arm/translate.c, we have:
static inline void gen_op_addl_T0_T1(void)
{
gen_op_add32(QREG_T0, QREG_T0, QREG_T1);
}
So what is QREG_T0 anyway? This is hard to grep for. 'find . | grep -v svn |
xargs grep "QREG_T0"' doesn't produce anything useful, so there's got to be
preprocessor concatenation stuff with ## going on, let's try just QREG on the
*.h files, and yup at the start of qop.h there's this:
enum target_qregs {
QREG_NULL,
#define DEFO32(name, offset) QREG_ ## name,
#define DEFO64(name, offset) DEFO32(name, offset)
#define DEFF32(name, reg) DEFO32(name, reg)
#define DEFF64(name, reg) DEFO32(name, reg)
#define DEFR(name, reg, mode) DEFO32(name, reg)
#include "qregs.def"
And that has "DEFR(T0, AREG1, QMODE_I32)" which... Ok, DEFR() discards the
third argument ("mode") completely, and then DEFO32() discards the second
argument (offset), and what's left is just the name, so it's position
dependent (so why have the darn macros at ALL?)
My brain hurts a lot now. I'm just letting you know. What is all this
complication actually trying to accomplish?
> Each qreg has a particular type (32/64 bit, integer or float).
You mean each qop's arguments have a particular type, and the arguments are
always in qregs? Or each qreg has a type permanently associated with that
qreg? Or the value currently in a qreg has a type associated with it, but
the next value stored in that qreg may have a different type?
> It's up to
> you to make sure the argument types match those expected by the qop. It's
> generally fairly obvious from the name. eg. add32 adds I32 values, addf64
> adds F64 values, etc. The exception is that I64 values can be used in place
> of I32. The upper 64-bit of outputs are undefined in this case, and the
> value must be explicitly extended before the full 64 bits are used.
Possible translation: you can feed a qreg containing an I64 value to a qop
taking an i32 argument, and it'll typecast the sucker down intelligently, but
if you produce an I32 result and expect to use that qreg's value as an I64
argument later, you have to call a sign-extending qop on it first?
> The old dyngen ops are actually implemented as a special case qops.
You mean each dyngen op produces multiple qops? (And/or is a bundle of qops?)
> As an example take the arm instruction
>
> add, r0, r1, r2, lsl #2
>
> This is equivalent to the C expression
>
> r0 = r1 + (r2 << 2)
>
> The old dyngen translate.c would do:
>
> gen_op_movl_T1_r2()
> gen_op_shll_T1_im(2)
> gen_op_movl_T0_r1();
> gen_op_addl(); /* does T0 = T0 + T1 */
> gen_op_movl_r0_T0
Digging down into target-arm/translate.c, function disas_arm_insn(), I'm...
still having to take your word for it. All the gen_op_movl_T1 variants I'm
seeing end with _im which I presume means "immediate". The alternative is
_cc, but what does that mean? (Presumably not "closed captioned".)
> When fully converted to the new system this would become:
>
> int tmp = gen_new_qreg(); /* Allocate a temporary reg. */
> /* gen_im32 is a helper that allocates a new qreg and
> initializes it to an immediate value. */
> gen_op_add32(tmp, QREG_R2, gen_im32(2));
> gen_op_add32(QREG_R0, QREG_R1, tmp);
Ok (still looking at target-arm/translate.c), I think you're not defining
anything new here, you're just removing wrappers like gen_op_add_T1_im()
which just wrap a single call to gen_op_add32(), and untangling the result?
What the heck does gen_intermediate_code() do? It's a wrapper for a function
that returns the same value and takes the exact same arguments in the same
order. All that's different is the name. Why does that exist?
> One of the changes I've made to target-arm/translate.c is to replace all
> uses of T2 with new pseudo-regs. In many cases I've left the code structure as it
> was (using the global T0/T1 temporaries), but replaced the dyngen ops with
> the equivalent qops. eg. movl and andl now generate mov32 and and32 qops.
Um, is my earlier characterization of "unwrapping stuff" at all close?
> The standard qops are defined in qops.def. A target can also define
> additional qops in qop-target.def. The target specific qops are to simplify
> implementation the i386 static flag propagation pass. the expand_op_*
> routines.
Yeah, I looked at that and the macros that generate it in qops.h. There seem
to be exactly two states (QREG_BLAH and QREGHI_BLAH) which can be reached
from five different macros. The "offset", "reg", and "mode" entries are
universally ignored, and all you actually _get_ is a big enum of identifiers
in a certain order. I have no idea what's going on.
> For operations that are too complicated to be expressed as qops there is a
> mechanism for calling helper functions. The m68k target uses this for
> division and a couple of other things.
Ok, now I'm really lost.
> The implementation makes fairly heavy use of the C preprocessor to generate
> code from .def files. There's also a small shell script that pulls the
> definitions of the helper routines out of qop-helper.c
Ah, hang on. There's target_reginfo in translate-all.c, that's using some of
the other values. So what the heck does translate-all.c do? (Shared code
called by all the platform-dependent translate functions?)
> The debug dumps can be quite useful. In particular -d in_asm,op will dump
> the input asm and the resulting OPs.
I'll have to find a system with gcc3 installed on it so I can actually try
this out. (Hmmm, I have a Red Hat 9 image I run under qemu, maybe it would
build under that?)
> For converting targets you can probably ignore most of the translate-all and
> host-*/ changes. These implement generating code from the qops.
Ok, this implies that qops are a new thing. Which looking at the code sort of
supports. Which means I don't understand what's going on at all.
> This works
> by the host defining a set of "hard" qregs that correspond to host CPU
> registers, and constraints for the operands of each qop. Then we do register
> allocation and spilling to satisfy those constraints. The qops can then be
> assembled directly into binary code.
I need to re-read this later. My brain's full and I'm deeply confused.
> There are also mechanisms for implementing floating point and 64-bit
> arithmetic even if the target doesn't support this natively. The target code
> doesn't need to worry about this, it just generates 64-bit/fp qops and they
> will be decomposed as necessary.
The implementation calls the appropriate host functions to handle the floating
point, using soft-float if necessary? (Under the old dyngen thing outputting
blocks of gcc-produced code, I could understand how that works. But if
you're outputting assembly directly... I'm back in the "totally lost" area
again, I think.)
> Paul
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 20:41 ` Rob Landley
@ 2006-10-31 22:08 ` Paul Brook
2006-10-31 22:31 ` Laurent Desnogues
2006-11-01 0:00 ` Rob Landley
0 siblings, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-31 22:08 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
On Tuesday 31 October 2006 20:41, Rob Landley wrote:
> Welcome to Stupid Question Theatre! With your host, Paul Brook. Today's
> contestant is: Rob Landley. How dumb will it get?
>
> On Tuesday 31 October 2006 2:02 pm, Paul Brook wrote:
> > The basic principle is very similar. Host code is decomposed into an
> > intermediate form consisting of simple operations, then native code is
> > generated from those operations.
>
> I got that part. It's the how I'm still head-scratching over.
>
> The disassembly routines seem relatively compiler-independent, but I'm
> under the impression that turning the intermediate result (the string of
> qops) into large blocks of translated code involves gluing together a bunch
> of smaller blocks of pregenerated code. These pregenerated blocks were
> spit out by gcc and are where the all the compiler dependencies that aren't
> clear bugs come from.
Correct.
> I thought what you were doing was replacing the pregenerated blocks with
> hand-coded assembly statements, but your description here seems to be about
> changing the disassembly routines that figure out which qops to string
> together in part 2.
Replacing the pregenerated blocks with hand written assembly isn't feasible.
Each target has its own set of ops, and each host would need its own assembly
implementation of those ops. Multiply 11 targets by 11 hosts and you get an
unmaintainable mess :-)
> > In the existing dyngen implementation most operands to ops are implicit,
> > with only a few ops taking explicit arguments. The principle with the new
> > system is that all operands are explicit.
>
> Having looked ahead to your example before replying to this, I think I
> understand that part now. (Just barely.)
>
> > The intermediate representation used by the code generator resembles an
> > imaginary machine. This machine has various different instructions
> > (qops), and a nominally infinite register file (qregs).
>
> Each qreg is represented as an integer index?
Yes.
> > Each qop takes zero or more arguments, each of which may be an input or
> > output.
>
> The input or output is always one of these qreg indexes? (Some of the
> existing ones seem to take immediate values...)
It is always a qreg.
Potentially we could decide that some qregs are constants rather than
variables, and use that information for code generation, but that's a
slightly different issue.
> > In addition to dynamically allocated qregs there are a fixed set of qregs
> > that map onto the guest CPU state. This is to simplify code generation.
>
> These are indexes 0, 1, and 2?
They are defined by the code you quote below. However this is an implementation
detail, and could change. You should use the named constants.
> Ok, looking at target-arm/translate.c, we have:
>
> static inline void gen_op_addl_T0_T1(void)
> {
> gen_op_add32(QREG_T0, QREG_T0, QREG_T1);
> }
>
> So what is QREG_T0 anyway? This is hard to grep for. 'find . | grep -v svn
> | xargs grep "QREG_T0"' doesn't produce anything useful, so there's got to
> be preprocessor concatenation stuff with ## going on, let's try just QREG
> on the *.h files, and yup at the start of qop.h there's this:
It corresponds to "T0" in dyngen. In addition to the actual CPU state, dyngen
uses 3 fixed registers as scratch workspace. For qop purposes these are part
of the guest CPU state. They're only there to aid conversion of the
translation code, they'll go away eventually.
> enum target_qregs {
> QREG_NULL,
> #define DEFO32(name, offset) QREG_ ## name,
> #define DEFO64(name, offset) DEFO32(name, offset)
> #define DEFF32(name, reg) DEFO32(name, reg)
> #define DEFF64(name, reg) DEFO32(name, reg)
> #define DEFR(name, reg, mode) DEFO32(name, reg)
> #include "qregs.def"
>
> And that has "DEFR(T0, AREG1, QMODE_I32)" which... Ok, DEFR() discards the
> third argument ("mode") completely, and then DEFO32() discards the second
> argument (offset), and what's left is just the name, so it's position
> dependent (so why have the darn macros at ALL?)
Because qregs.def is included in at least two other places. This is the C
preprocessor trickery I mentioned :-)
> My brain hurts a lot now. I'm just letting you know. What is all this
> complication actually trying to accomplish?
Generation of 3 different things (QREG_* constants, the target_reginfo
structure, and qreg_names) from a single source. This avoids having to keep 3
big hairy arrays in sync with each other.
It's also used to implement 64-bit qregs as a pair of 32-bit qregs on 32-bit
hosts.
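
The general technique here is often called an "X-macro": one list of entries,
expanded several times with different macro definitions. The sketch below
shows the idea in a single file. It is an assumption-laden stand-in, not the
real qregs.def: the list macro TOY_QREGS, the register entries, and the
toy_* names are all invented, and the real code re-includes a .def file
rather than passing a macro name to a list macro.

```c
#include <string.h>

/* Stand-in for the .def file: one entry per register, listed exactly once. */
#define TOY_QREGS(DEFR) \
    DEFR(T0,  1, 32)    \
    DEFR(T1,  2, 32)    \
    DEFR(ACC, 3, 64)

/* Expansion 1: an enum of indexes. Only the name is used here, which is
 * why the other arguments look "discarded" when reading this one place. */
enum toy_qreg {
    TOY_QREG_NULL,
#define AS_ENUM(name, hostreg, bits) TOY_QREG_##name,
    TOY_QREGS(AS_ENUM)
#undef AS_ENUM
    TOY_QREG_COUNT
};

/* Expansion 2: printable names, e.g. for debug dumps. */
static const char *const toy_qreg_names[] = {
    "NULL",
#define AS_NAME(name, hostreg, bits) #name,
    TOY_QREGS(AS_NAME)
#undef AS_NAME
};

/* Expansion 3: a per-register info table using the remaining arguments. */
struct toy_reginfo {
    int hostreg;
    int bits;
};

static const struct toy_reginfo toy_reginfo_table[] = {
    { 0, 0 },
#define AS_INFO(name, hostreg, bits) { hostreg, bits },
    TOY_QREGS(AS_INFO)
#undef AS_INFO
};
```

Adding a register means adding one line to the list; the enum, the name
array, and the info table all stay in step automatically.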
> > Each qreg has a particular type (32/64 bit, integer or float).
>
> You mean each qop's arguments have a particular type, and the arguments are
> always in qregs? Or each qreg has a type permanently associated with that
> qreg?
Both the above.
> Or the value currently in a qreg has a type associated with it, but
> the next value stored in that qreg may have a different type?
A qreg has a fixed type. The value stored in that qreg has that type. To
convert it to a different type you need to use an explicit conversion qop.
> > It's up to
> > you to make sure the argument types match those expected by the qop. It's
> > generally fairly obvious from the name. eg. add32 adds I32 values, addf64
> > adds F64 values, etc. The exception is that I64 values can be used in
> > place of I32. The upper 64-bit of outputs are undefined in this case, and
> > the value must be explicitly extended before the full 64 bits are used.
>
> Possible translation: you can feed a qreg containing an I64 value to a qop
> taking an i32 argument, and it'll typecast the sucker down intelligently,
> but if you produce an I32 result and expect to use that qreg's value as an
> I64 argument later, you have to call a sign-extending qop on it first?
Exactly.
If you mix I32,F32 and/or F64 in this way Bad Things will happen.
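
A small model of that rule, assuming a 64-bit register that receives a
32-bit result. The function names are invented for this sketch, and the junk
pattern in the upper half simply stands in for "undefined":

```c
#include <stdint.h>

/* A 32-bit add whose output lands in a 64-bit register. Only the low 32
 * bits are meaningful; the upper half is deliberately filled with junk. */
static uint64_t toy_add32_in_i64(uint64_t a, uint64_t b)
{
    uint32_t low = (uint32_t)a + (uint32_t)b;
    return 0xdeadbeef00000000ull | low;
}

/* Explicit sign extension of the low 32 bits to a full 64-bit value.
 * Flipping and then subtracting the sign bit avoids relying on
 * implementation-defined signed conversions. */
static uint64_t toy_sext32_to_64(uint64_t v)
{
    uint64_t low = v & 0xffffffffull;
    return (low ^ 0x80000000ull) - 0x80000000ull;
}

/* Explicit zero extension of the low 32 bits. */
static uint64_t toy_zext32_to_64(uint64_t v)
{
    return v & 0xffffffffull;
}
```

Reading the raw result of toy_add32_in_i64 as a 64-bit value gives garbage;
only after one of the two extension steps is the full register usable.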
> > The old dyngen ops are actually implemented as a special case qops.
>
> You mean each dyngen op produces multiple qops? (And/or is a bundle of
> qops?)
A dyngen op is a single qop that does magical unknown things.
> > As an example take the arm instruction
> >
> > add, r0, r1, r2, lsl #2
> >
> > This is equivalent to the C expression
> >
> > r0 = r1 + (r2 << 2)
> >
> > The old dyngen translate.c would do:
> >
> > gen_op_movl_T1_r2()
> > gen_op_shll_T1_im(2)
> > gen_op_movl_T0_r1();
> > gen_op_addl(); /* does T0 = T0 + T1 */
> > gen_op_movl_r0_T0
>
> Digging down into target-arm/translate.c, function disas_arm_insn(), I'm...
> still having to take your word for it. All the gen_op_movl_T1 variants I'm
> seeing end with _im which I presume means "immediate". The alternative is
> _cc, but what does that mean? (Presumably not "closed captioned".)
_cc are variants that set the condition codes. I may have got T0 and T1
backwards in the first 3 lines.
> > When fully converted to the new system this would become:
> >
> > int tmp = gen_new_qreg(); /* Allocate a temporary reg. */
> > /* gen_im32 is a helper that allocates a new qreg and
> > initializes it to an immediate value. */
> > gen_op_add32(tmp, QREG_R2, gen_im32(2));
> > gen_op_add32(QREG_R0, QREG_R1, tmp);
>
> Ok (still looking at target-arm/translate.c), I think you're not defining
> anything new here, you're just removing wrappers like gen_op_add_T1_im()
> which just wrap a single call to gen_op_add32(), and untangling the result?
>
> What the heck does gen_intermediate_code() do? It's a wrapper for a
> function that returns the same value and takes the exact same arguments in
> the same order. All that's different is the name. Why does that exist?
Hysterical raisins. ie. nothing useful.
> > One of the changes I've made to target-arm/translate.c is to replace all
> > uses of T2 with new pseudo-regs. In many cases I've left the code structure
> > as it was (using the global T0/T1 temporaries), but replaced the dyngen ops
> > with the equivalent qops. eg. movl and andl now generate mov32 and and32
> > qops.
>
> Um, is my earlier characterization of "unwrapping stuff" at all close?
Not entirely. I'm also replacing fixed locations (T2) with dynamically
allocated qregs.
> > The standard qops are defined in qops.def. A target can also define
> > additional qops in qop-target.def. The target specific qops are to
> > simplify implementation of the i386 static flag propagation pass. See the
> > expand_op_* routines.
>
> Yeah, I looked at that and the macros that generate it in qops.h. There
> seem to be exactly two states (QREG_BLAH and QREGHI_BLAH) which can be
> reached from five different macros. The "offset", "reg", and "mode"
> entries are universally ignored, and all you actually _get_ is a big enum
> of identifiers in a certain order. I have no idea what's going on.
As mentioned above, qregs.def is included elsewhere.
> > For operations that are too complicated to be expressed as qops there is
> > a mechanism for calling helper functions. The m68k target uses this for
> > division and a couple of other things.
>
> Ok, now I'm really lost.
Most x86 instructions set the condition code flags. However most of the time
these flags are ignored. eg. if you have two consecutive add instructions the
first will set the flags, and the second will immediately overwrite them.
qemu contains a back-propagation pass that will remove the code to set the
flags after the first instruction. Currently this is implemented by changing
an addl_cc op into a plain addl op.
The flag-setting code would most likely require several qops to implement, so
it would be much harder to prove it is not needed and remove it. So there is
a mechanism for adding extra target qops, doing the flag elimination pass,
then expanding those to generic qops.
m68k generates the _cc ops necessary for doing this, but is missing the
back-propagation optimization pass.
On RISC targets like ARM most instructions don't set the condition codes, so
we don't bother doing this.
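The back-propagation idea described above can be sketched in C as a backwards walk over one block of ops. This is only an illustration under stated assumptions: the op names, sets_flags, reads_flags and eliminate_dead_flags are hypothetical, flags are assumed live at the end of the block, and the toy model ignores partial flag updates.

```c
#include <stddef.h>

/* Hypothetical op set for illustration; not qemu's real op list. */
enum op { OP_ADDL, OP_ADDL_CC, OP_MOVL, OP_JCC };

/* In this toy model only addl_cc writes the flags, and it writes all of them. */
static int sets_flags(enum op o)  { return o == OP_ADDL_CC; }

/* In this toy model only the conditional jump reads the flags. */
static int reads_flags(enum op o) { return o == OP_JCC; }

/* Walk the block backwards, tracking whether the flags are still needed.
   A flag-setting op whose flags are dead is demoted to its plain variant.
   Returns the number of ops rewritten. */
static size_t eliminate_dead_flags(enum op *ops, size_t n)
{
    int flags_live = 1;              /* assume flags are live at block exit */
    size_t changed = 0;

    for (size_t i = n; i-- > 0; ) {
        if (sets_flags(ops[i])) {
            if (!flags_live) {
                ops[i] = OP_ADDL;    /* nobody reads these flags */
                changed++;
            }
            flags_live = 0;          /* earlier flag values are overwritten here */
        }
        if (reads_flags(ops[i]))
            flags_live = 1;          /* something before this op must set them */
    }
    return changed;
}
```

For two consecutive addl_cc ops followed by a conditional jump, only the first add is demoted, which matches the "second add immediately overwrites the flags" case described above.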
> > The implementation makes fairly heavy use of the C preprocessor to
> > generate code from .def files. There's also a small shell script that
> > pulls the definitions of the helper routines out of qop-helper.c.
>
> Ah, hang on. There's target_reginfo in translate-all.c, that's using some
> of the other values. So what the heck does translate-all.c do? (Shared
> code called by all the platform-dependent translate functions?)
There are three fairly independent stages:
1) target-*/translate.c converts guest code into qops.
2) translate-all.c messes about with those qops a bit (allocates host
registers, etc).
3) translate-op.c,translate-qop.c and target-*/ turns those qops into host
code.
> > The debug dumps can be quite useful. In particular -d in_asm,op will dump
> > the input asm and the resulting OPs.
>
> I'll have to find a system with gcc3 installed on it so I can actually try
> this out. (Hmmm, I have a Red Hat 9 image I run under qemu, maybe it would
> build under that?)
Probably.
> > For converting targets you can probably ignore most of the translate-all
> > and host-*/ changes. These implement generating code from the qops.
>
> Ok, this implies that qops are a new thing. Which looking at the code sort
> of supports. Which means I don't understand what's going on at all.
qops and dyngen ops are both small "functions" that are represented in a
similar way. The difference is that dyngen ops are target specific fixed
functions, whereas qops are generic parameterized functions.
While they are really separate things, the details have been chosen so it
should be possible to adapt the existing translate.c code rather than having
to rewrite it from scratch. Decoding x86 instruction semantics is
complicated :-)
Many of the simpler dyngen ops can be replaced with a single qop. Others can
be replaced with a sequence of a few qops. Some of the more complicated ones
may need to be moved into helper functions.
> > This works
> > by the host defining a set of "hard" qregs that correspond to host CPU
> > registers, and constraints for the operands of each qop. Then we do
> > register allocation and spilling to satisfy those constraints. The qops
> > can then be assembled directly into binary code.
>
> I need to re-read this later. My brain's full and I'm deeply confused.
I started off by saying qops were effectively instructions for an imaginary
machine. translate-all.c rearranges them so they match up very closely with
the instructions available on the host. Once this has been done turning them
into binary code is relatively simple.
> > There are also mechanisms for implementing floating point and 64-bit
> > arithmetic even if the target doesn't support this natively. The target
> > code doesn't need to worry about this, it just generates 64-bit/fp qops
> > and they will be decomposed as necessary.
>
> The implementation calls the appropriate host functions to handle the
> floating point, using soft-float if necessary? (Under the old dyngen thing
> outputting blocks of gcc-produced code, I could understand how that works.
> But if you're outputting assembly directly... I'm back in the "totally
> lost" area again, I think.)
Err, sort of. There's a couple of different layers.
In translate.c you'll do something like
tmp = gen_new_qreg(QMODE_F32);
gen_op_addf32(tmp, QREG_FOO, QREG_BAR);
If the host implements the floating point qops 'natively' then this will work
exactly the same as the integer qops and end up as host floating point
instructions. Currently this is not implemented for any hosts.
If native host FP is not available qemu will include appropriate bits so that
after macro expansion and inlining you end up with:
tmp = gen_new_qreg(QMODE_I32);
gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR);
and the addf32 helper does the floating point addition using the "softfloat"
library. The qemu softfloat library implementation may actually use hardware
floating point rather than doing everything manually.
Likewise if the host doesn't have 64-bit operations gen_op_and64 will actually
expand to a pair of and32 operations.
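As a rough C sketch of the and64 decomposition just mentioned, assuming a 64-bit qreg is modelled as a lo/hi pair of 32-bit halves: struct pair64, split64, join64, and32 and and64_as_pair are hypothetical names for illustration, not the names used in the patch.

```c
#include <stdint.h>

/* Hypothetical representation of a 64-bit qreg on a 32-bit host. */
struct pair64 { uint32_t lo, hi; };

/* The only primitive the 32-bit host is assumed to provide. */
static uint32_t and32(uint32_t a, uint32_t b) { return a & b; }

static struct pair64 split64(uint64_t v)
{
    struct pair64 p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

static uint64_t join64(struct pair64 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* A 64-bit AND expressed as two independent 32-bit ANDs. */
static struct pair64 and64_as_pair(struct pair64 a, struct pair64 b)
{
    struct pair64 r = { and32(a.lo, b.lo), and32(a.hi, b.hi) };
    return r;
}
```

Bitwise operations split this cleanly because the halves do not interact; a 64-bit add would additionally need the carry out of the low half fed into the high half.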
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 22:08 ` Paul Brook
@ 2006-10-31 22:31 ` Laurent Desnogues
2006-10-31 23:00 ` Paul Brook
2006-11-01 0:00 ` Rob Landley
1 sibling, 1 reply; 43+ messages in thread
From: Laurent Desnogues @ 2006-10-31 22:31 UTC (permalink / raw)
To: qemu-devel
Paul Brook a écrit :
> Replacing the pregenerated blocks with hand written assembly isn't feasible.
> Each target has its own set of ops, and each host would need its own assembly
> implementation of those ops. Multiply 11 targets by 11 hosts and you get an
> unmaintainable mess :-)
Shouldn't you have 11+11 and not 11*11, given your intermediate
representation? And of these 11+11, 11 have to be written
anyway (target). Or did I miss something?
> On RISC targets like ARM most instructions don't set the condition codes, so
> we don't bother doing this.
Except for ARM Thumb ISA which always sets flags. ARM is a bad
RISC example :)
I was wondering if you did some profiling to know how much time
is spent in disas_arm_insn. Of course the profiling results
would be very different for a Linux boot or a synthetic benchmark
(which makes me think that you don't support MMU, do you?).
There is a very nice trick to speed up decoding of ARM
instructions: pick up bits 20-27 and 4-7 and you (almost) get
one instruction per case entry; of course this means using a
generator to write the 4096 entries, but the result was good for
my interpreted ISS, reaching 44 M i/s on an Opteron @2.4GHz
without any compiler dependent trick (such as gcc jump to labels).
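The index computation behind this trick can be sketched in C as follows. The names arm_decode_index and ARM_DECODE_ENTRIES are hypothetical; only the bit positions (20-27 and 4-7) come from the description above, and the generated 4096-entry table itself is not shown.

```c
#include <stdint.h>

/* 8 bits (27..20) plus 4 bits (7..4) give a 12-bit index. */
enum { ARM_DECODE_ENTRIES = 1 << 12 };

/* Build the table index from a 32-bit ARM instruction word. */
static unsigned arm_decode_index(uint32_t insn)
{
    unsigned hi = (insn >> 20) & 0xffu;   /* bits 27..20 */
    unsigned lo = (insn >> 4) & 0xfu;     /* bits 7..4   */
    return (hi << 4) | lo;
}
```

For example, 0xE0810102 (the encoding of add r0, r1, r2, lsl #2) maps to index 0x080. An interpreter would then switch on this index or use it to index an array of handler pointers.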
Laurent
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 22:31 ` Laurent Desnogues
@ 2006-10-31 23:00 ` Paul Brook
0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-31 23:00 UTC (permalink / raw)
To: qemu-devel
On Tuesday 31 October 2006 22:31, Laurent Desnogues wrote:
> Paul Brook a écrit :
> > Replacing the pregenerated blocks with hand written assembly isn't
> > feasible. Each target has its own set of ops, and each host would need
> > its own assembly implementation of those ops. Multiply 11 targets by 11
> > hosts and you get an unmaintainable mess :-)
>
> Shouldn't you have 11+11 and not 11*11, given your intermediate
> representation? And of these 11+11, 11 have to be written
> anyway (target). Or did I miss something?
If you use qops (which is a target and host independent intermediate
representation) it's 11 + 11. If you just replace the existing dyngen op.c
with hand written assembly it's 11 * 11.
> > On RISC targets like ARM most instructions don't set the condition codes,
> > so we don't bother doing this.
>
> Except for ARM Thumb ISA which always sets flags. ARM is a bad
> RISC example :)
Bah. Details :-)
> I was wondering if you did some profiling to know how much time
> is spent in disas_arm_insn. Of course the profiling results
> would be very different for a Linux boot or a synthetic benchmark
The qop generator does add some overhead to the code translation. I haven't
done proper benchmarks, but in most cases it doesn't seem to be too bad
(maybe 10%). I'm hoping we can get most of that back.
> (which makes me think that you don't support MMU, do you?).
qemu does implement a MMU.
Currently this still uses the dyngen code, but that's fixable.
> There is a very nice trick to speed up decoding of ARM
> instructions: pick up bits 20-27 and 4-7 and you (almost) get
> one instruction per case entry; of course this means using a
> generator to write the 4096 entries, but the result was good for
> my interpreted ISS, reaching 44 M i/s on an Opteron @2.4GHz
> without any compiler dependent trick (such as gcc jump to labels).
qemu generally gets 100-200MIPS on my 2GHz Opteron.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 19:02 ` Paul Brook
2006-10-31 20:41 ` Rob Landley
@ 2006-10-31 23:17 ` Rob Landley
2006-11-01 0:01 ` Paul Brook
1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-31 23:17 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
On Tuesday 31 October 2006 2:02 pm, Paul Brook wrote:
> As an example take the arm instruction
>
> add r0, r1, r2, lsl #2
>
> This is equivalent to the C expression
>
> r0 = r1 + (r2 << 2)
...
> When fully converted to the new system this would become:
>
> int tmp = gen_new_qreg(); /* Allocate a temporary reg. */
> /* gen_im32 is a helper that allocates a new qreg and
> initializes it to an immediate value. */
> gen_op_add32(tmp, QREG_R2, gen_im32(2));
> gen_op_add32(QREG_R0, QREG_R1, tmp);
I forgot to ask:
Where's the shift? I think the above code means you generate an immediate
value (the 2), add it to R2 with the result going in a spill register, and
then add the spill register to R1, with the result going to R0. Should that
middle line be some kind of gen_op_lshift32() instead of gen_op_add32()?
Do qregs ever get freed? (I'm guessing gen_new_qreg() lasts until the end of
the translated block, and then the next block has its own set of qregs?)
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 22:08 ` Paul Brook
2006-10-31 22:31 ` Laurent Desnogues
@ 2006-11-01 0:00 ` Rob Landley
2006-11-01 0:29 ` Paul Brook
1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-11-01 0:00 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
On Tuesday 31 October 2006 5:08 pm, Paul Brook wrote:
> On Tuesday 31 October 2006 20:41, Rob Landley wrote:
> > Welcome to Stupid Question Theatre! With your host, Paul Brook. Today's
> > contestant is: Rob Landley. How dumb will it get?
Bonus round!
> > I thought what you were doing was replacing the pregenerated blocks with
> > hand-coded assembly statements, but your description here seems to be
> > about changing the disassembly routines that figure out which qops to
> > string together in part 2.
>
> Replacing the pregenerated blocks with hand written assembly isn't feasible.
> Each target has its own set of ops, and each host would need its own
> assembly implementation of those ops. Multiply 11 targets by 11 hosts and
> you get an unmaintainable mess :-)
Actually it sounds additive rather than multiplicative. Does each target have
an entirely unrelated set of ops, or is there a shared set of primitive ops
plus some oddballs?
But backing up and just accepting that for a moment, in theory what you need
is some way to compile a C function to machine code, and then unwrap that
function into a .raw file containing just the machine code. So the only
per-compiler thing would be this unwrapper thingy. But I already know that
doesn't work because it doesn't explain the "unable to find spill register"
problem. Presumably, just beating the right .raw contents out of the
compiler is nontrivial, let alone unwrapping it...
> It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> dyngen uses 3 fixed registers as scratch workspace. For qop purposes these
> are part of the guest CPU state. They're only there to aid conversion of
> the translation code, they'll go away eventually.
Presumably the m68k target is pure qop, and hasn't got this sort of thing?
> > My brain hurts a lot now. I'm just letting you know. What is all this
> > complication actually trying to accomplish?
>
> Generation of 3 different things (QREG_* constants, the target_reginfo
> structure, and qreg_names) from a single source. This avoids having to keep 3
> big hairy arrays in sync with each other.
> It's also used to implement 64-bit qregs as a pair of 32-bit qregs on 32-bit
> hosts.
Ok, the QREG_* constants are for the intermediate code the decompiler stuff
generates. I have no idea what target_reginfo and qreg_names are for, but
maybe it'll come to me as I read the code...
> > Or the value currently in a qreg has a type associated with it, but
> > the next value stored in that qreg may have a different type?
>
> A qreg has a fixed type. The value stored in that qreg has that type. To
> convert it to a different type you need to use an explicit conversion qop.
So values don't have types, the qregs the values are _in_ have types. But I
thought there were an unlimited number of them (well, 1024 or so), and
they're dynamically allocated (at least some of the time). How does it keep
track of the type of a given qreg? (When you convert, you copy values from
one qreg into another?)
> > Possible translation: you can feed a qreg containing an I64 value to a qop
> > taking an i32 argument, and it'll typecast the sucker down intelligently,
> > but if you produce an I32 result and expect to use that qreg's value as an
> > I64 argument later, you have to call a sign-extending qop on it first?
>
> Exactly.
> If you mix I32,F32 and/or F64 in this way Bad Things will happen.
Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?
> > seeing end with _im which I presume means "immediate". The alternative is
> > _cc, but what does that mean? (Presumably not "closed captioned".)
>
> _cc are variants that set the condition codes. I may have got T0 and T1
> backwards in the first 3 lines.
Ah!
Is this written down anywhere? I've read Fabrice's paper and the design
documentation, and I'm not remembering this. It's quite possible I missed it
when my brain filled up, though.
> > Um, is my earlier characterization of "unwrapping stuff" at all close?
>
> Not entirely. I'm also replacing fixed locations (T2) with dynamically
> allocated qregs.
The dynamic allocation buys you what? (Less spilling?)
> > Ok, now I'm really lost.
>
> Most x86 instructions set the condition code flags. However most of the time
> these flags are ignored. eg. if you have two consecutive add instructions the
> first will set the flags, and the second will immediately overwrite them.
>
> qemu contains a back-propagation pass that will remove the code to set the
> flags after the first instruction. Currently this is implemented by changing
> an addl_cc op into a plain addl op.
I actually understood that. Yay!
> The flag-setting code would most likely require several qops to implement,
> so it would be much harder to prove it is not needed and remove it. So there
> is a mechanism for adding extra target qops, doing the flag elimination
> pass, then expanding those to generic qops.
Um, wouldn't the flag setting code be fairly straightforward as a qop that
comes right _before_ the other op, as in "set the flags for doing this with
these registers", that does nothing but set the flags (I.E. it wouldn't
modify the contents of any of the registers, so it could be immediately followed
by the appropriate add or shift or so on), and then the flag setting pass
could just turn all the ones that weren't needed into QOP_NULL?
Or is that what's happening now? (Do QOPs ever modify their input registers,
or only the output one?)
> > Ah, hang on. There's target_reginfo in translate-all.c, that's using some
> > of the other values. So what the heck does translate-all.c do? (Shared
> > code called by all the platform-dependent translate functions?)
>
> There are three fairly independent stages:
> 1) target-*/translate.c converts guest code into qops.
> 2) translate-all.c messes about with those qops a bit (allocates host
> registers, etc).
> 3) translate-op.c,translate-qop.c and target-*/ turns those qops into host
> code.
Is pass 2 where the flag elimination pass goes (and presumably any other
optimizations that might get added)? No, that can't be the case or the m68k
code wouldn't need its own implementation of the flag elimination pass...
> > > For converting targets you can probably ignore most of the translate-all
> > > and host-*/ changes. These implement generating code from the qops.
> >
> > Ok, this implies that qops are a new thing. Which looking at the code
> > sort of supports. Which means I don't understand what's going on at all.
>
> qops and dyngen ops are both small "functions" that are represented in a
> similar way. The difference is that dyngen ops are target specific fixed
> functions, whereas qops are generic parameterized functions.
So the 11x11 exponential complexity of qemu producing its own assembly output
might not be as much of a problem after switching to qops?
Possibly some of the common qops can have an asm block for 'em, and the rest
can go through the contortions target-*/op.c is currently doing with
(glue(glue(blah))) and so on.
> While they are really separate things, the details have been chosen so it
> should be possible to adapt the existing translate.c code rather than having
> to rewrite it from scratch. Decoding x86 instruction semantics is
> complicated :-)
Yay iterative transformation with regression testing. (And nothing says
regression testing like booting a Linux distro under the sucker.)
> Many of the simpler dyngen ops can be replaced with a single qop. Others can
> be replaced with a sequence of a few qops. Some of the more complicated ones
> may need to be moved into helper functions.
At some point, I hope to understand helper functions. But I'm not there yet.
> > I need to re-read this later. My brain's full and I'm deeply confused.
>
> I started off by saying qops were effectively instructions for an imaginary
> machine. translate-all.c rearranges them so they match up very closely with
> the instructions available on the host. Once this has been done turning them
> into binary code is relatively simple.
I sort of thought this is what it was already doing, but apparently not...
> > The implementation calls the appropriate host functions to handle the
> > floating point, using soft-float if necessary? (Under the old dyngen
thing
> > outputting blocks of gcc-produced code, I could understand how that works.
> > But if you're outputting assembly directly... I'm back in the "totally
> > lost" aread again, I think.)
>
> Err, sort of. There's a couple of different layers.
>
> In translate.c you'll do something like
>
> tmp = gen_new_qreg(QMODE_F32);
> gen_op_addf32(tmp, QREG_FOO, QREG_BAR);
>
> If the host implements the floating point qops 'natively' then this will
> work exactly the same as the integer qops and end up as host floating point
> instructions. Currently this is not implemented for any hosts.
Ok.
> If native host FP is not available qemu will include appropriate bits so
> that after macro expansion and inlining you end up with:
>
> tmp = gen_new_qreg(QMODE_I32);
> gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR);
>
> and the addf32 helper does the floating point addition using the "softfloat"
> library. The qemu softfloat library implementation may actually use hardware
> floating point rather than doing everything manually.
No reason (except speed) the code output into a translation block can't do
function calls. I think.
> Likewise if the host doesn't have 64-bit operations gen_op_and64 will
> actually expand to a pair of and32 operations.
Ok.
I'm still trying to follow a translation all the way from source to target.
Just getting application emulation to do "hello world" is pretty darn
complicated. Your dump mode earlier sounded highly interesting. (It's on my
todo list.)
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-10-31 23:17 ` Rob Landley
@ 2006-11-01 0:01 ` Paul Brook
0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-11-01 0:01 UTC (permalink / raw)
To: qemu-devel
> Where's the shift? I think the above code means you generate an immediate
> value (the 2), add it to R2 with the result going in a spill register, and
> then add the spill register to R1, with the result going to R0. Should
> that middle line be some kind of gen_op_lshift32() instead of
> gen_op_add32()?
Yes.
> Do qregs ever get freed? (I'm guessing gen_new_qreg() lasts until the end
> of the translated block, and then the next block has its own set of qregs?)
Correct.
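With both answers in hand, the corrected sequence can be sketched as a toy qop emitter in C. Everything here is a hypothetical stand-in: struct tb, emit, run and gen_op_shl32 are invented for illustration (the thread only says the middle op should be "some kind of" left shift), and the tiny interpreter exists only to check the emitted ops. Temporaries are handed out from a per-block counter, reflecting the point that they last until the end of the translated block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical qop representation, for illustration only. */
enum qopcode { QOP_MOVI32, QOP_ADD32, QOP_SHL32 };
struct qop { enum qopcode op; int dst, a, b; uint32_t imm; };

enum { QREG_R0, QREG_R1, QREG_R2, NUM_FIXED_QREGS, MAX_QREGS = 64, MAX_QOPS = 64 };

struct tb {                      /* one translated block */
    struct qop ops[MAX_QOPS];
    int nops;
    int next_qreg;               /* temporaries are numbered per block */
};

/* Starting a new block forgets all temporaries of the previous one. */
static void tb_start(struct tb *tb)
{
    tb->nops = 0;
    tb->next_qreg = NUM_FIXED_QREGS;
}

static int gen_new_qreg(struct tb *tb)
{
    assert(tb->next_qreg < MAX_QREGS);
    return tb->next_qreg++;
}

static void emit(struct tb *tb, enum qopcode op, int dst, int a, int b, uint32_t imm)
{
    assert(tb->nops < MAX_QOPS);
    struct qop q = { op, dst, a, b, imm };
    tb->ops[tb->nops++] = q;
}

/* Allocate a new qreg and initialize it to an immediate value. */
static int gen_im32(struct tb *tb, uint32_t v)
{
    int r = gen_new_qreg(tb);
    emit(tb, QOP_MOVI32, r, 0, 0, v);
    return r;
}

static void gen_op_add32(struct tb *tb, int d, int a, int b) { emit(tb, QOP_ADD32, d, a, b, 0); }
static void gen_op_shl32(struct tb *tb, int d, int a, int b) { emit(tb, QOP_SHL32, d, a, b, 0); }

/* add r0, r1, r2, lsl #2  ==>  r0 = r1 + (r2 << 2) */
static void gen_add_r0_r1_r2_lsl2(struct tb *tb)
{
    int tmp = gen_new_qreg(tb);
    gen_op_shl32(tb, tmp, QREG_R2, gen_im32(tb, 2));   /* shift, not add */
    gen_op_add32(tb, QREG_R0, QREG_R1, tmp);
}

/* Toy interpreter used only to check the emitted sequence. */
static void run(const struct tb *tb, uint32_t *regs)
{
    for (int i = 0; i < tb->nops; i++) {
        const struct qop *q = &tb->ops[i];
        switch (q->op) {
        case QOP_MOVI32: regs[q->dst] = q->imm; break;
        case QOP_ADD32:  regs[q->dst] = regs[q->a] + regs[q->b]; break;
        case QOP_SHL32:  regs[q->dst] = regs[q->a] << (regs[q->b] & 31); break;
        }
    }
}
```

With r1 = 10 and r2 = 3, running the three emitted ops leaves 22 in r0. Calling tb_start again resets the temporary counter, so the next block starts with its own set of qregs.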
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-11-01 0:00 ` Rob Landley
@ 2006-11-01 0:29 ` Paul Brook
2006-11-01 1:51 ` Rob Landley
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-11-01 0:29 UTC (permalink / raw)
To: qemu-devel
> Actually it sounds additive rather than multiplicative. Does each target
> have an entirely unrelated set of ops, or is there a shared set of
> primitive ops plus some oddballs?
The shared set of primitive ops is basically qops :-)
You probably could figure out a single common set of qops, then write assembly
and glue them together like we do with dyngen. However once you've done that
you've implemented most of what's needed for fully dynamic qops, so it
doesn't really seem worth it.
> But backing up and just accepting that for a moment, in theory what you
> need is some way to compile a C function to machine code, and then unwrap
> that function into a .raw file containing just the machine code. So the
> only per-compiler thing would be this unwrapper thingy.
Right.
> But I already know
> that doesn't work because it doesn't explain the "unable to find spill
> register" problem.
That's a separate gcc bug. It gets stuck when you tell it not to use half the
registers, then ask it to do 64-bit math. This is one of the reasons
eliminating the fixed registers is a good idea.
> > It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> > dyngen uses 3 fixed registers as scratch workspace. For qop purposes these
> > are part of the guest CPU state. They're only there to aid conversion of
> > the translation code, they'll go away eventually.
>
> Presumably the m68k target is pure qop, and hasn't got this sort of thing?
Correct.
There is one use of T0 left for communicating with the TB chaining code, but
that's it and will probably go away eventually.
> > > Or the value currently in a qreg has a type associated with it, but
> > > the next value stored in that qreg may have a different type?
> >
> > A qreg has a fixed type. The value stored in that qreg has that type. To
> > convert it to a different type you need to use an explicit conversion
> > qop.
>
> So values don't have types, the qregs the values are _in_ have types. But
> I thought there were an unlimited number of them (well, 1024 or so), and
> they're dynamically allocated (at least some of the time). How does it
> keep track of the type of a given qreg? (When you convert, you copy values
> from one qreg into another?)
Yes. Conversion is just like any other qop. It reads one qreg, and writes the
result to a different qreg which happens to be a different type.
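One way to picture "the type belongs to the qreg" is a side table indexed by qreg number that is filled in at allocation time. The C sketch below is an assumption-laden toy, not the real data structure: qreg_mode, qreg_bits, op_i32_to_f32 and qreg_as_f32 are hypothetical names, and the ops here execute directly instead of generating code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum qmode { QMODE_I32, QMODE_F32 };

enum { MAX_QREGS = 16 };
static enum qmode qreg_mode[MAX_QREGS];   /* fixed at allocation time */
static uint32_t   qreg_bits[MAX_QREGS];   /* raw 32-bit contents */
static int        num_qregs;

/* Allocating a qreg records its type once; it never changes afterwards. */
static int gen_new_qreg(enum qmode mode)
{
    assert(num_qregs < MAX_QREGS);
    qreg_mode[num_qregs] = mode;
    return num_qregs++;
}

/* A conversion is an ordinary op: it reads an I32 qreg and writes the
   converted value into a different qreg whose type is F32. */
static void op_i32_to_f32(int dst, int src)
{
    assert(sizeof(float) == sizeof(uint32_t));
    assert(qreg_mode[src] == QMODE_I32 && qreg_mode[dst] == QMODE_F32);
    int32_t i;
    memcpy(&i, &qreg_bits[src], sizeof i);
    float f = (float)i;
    memcpy(&qreg_bits[dst], &f, sizeof f);
}

/* Read an F32 qreg back as a host float. */
static float qreg_as_f32(int r)
{
    assert(qreg_mode[r] == QMODE_F32);
    float f;
    memcpy(&f, &qreg_bits[r], sizeof f);
    return f;
}
```

The source qreg keeps both its type and its value; the converted value lives in the second qreg.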
> > > Possible translation: you can feed a qreg containing an I64 value to a
> > > qop taking an i32 argument, and it'll typecast the sucker down
> > > intelligently, but if you produce an I32 result and expect to use that
> > > qreg's value as an I64 argument later, you have to call a
> > > sign-extending qop on it first?
> >
> > Exactly.
> > If you mix I32,F32 and/or F64 in this way Bad Things will happen.
>
> Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?
Or qemu will get confused and crash.
> > > seeing end with _im which I presume means "immediate". The alternative
> > > is _cc, but what does that mean? (Presumably not "closed captioned".)
> >
> > _cc are variants that set the condition codes. I may have got T0 and T1
> > backwards in the first 3 lines.
>
> Ah!
>
> Is this written down anywhere? I've read Fabrice's paper and the design
> documentation, and I'm not remembering this. It's quite possible I missed
> it when my brain filled up, though.
Dunno.
> > > Um, is my earlier characterization of "unwrapping stuff" at all close?
> >
> > Not entirely. I'm also replacing fixed locations (T2) with dynamically
> > allocated qregs.
>
> The dynamic allocation buys you what? (Less spilling?)
More-or-less. It makes it easier to optimize. The code generator can pick what
to put in registers, or even not put them there at all, instead of having to
do things exactly how you told it.
It also means you don't need to reserve that register, avoiding the gcc
"unable to find spill register" bug you mentioned above.
> > Most x86 instructions set the condition code flags. However most of the
> > time these flags are ignored. eg. if you have two consecutive add
> > instructions the first will set the flags, and the second will
> > immediately overwrite them.
> >
> > qemu contains a back-propagation pass that will remove the code to set
> > the flags after the first instruction. Currently this is implemented by
> > changing an addl_cc op into a plain addl op.
>
> I actually understood that. Yay!
>
> > The flag-setting code would most likely require several qops to
> > implement, so it would be much harder to prove it is not needed and remove
> > it. So there is a mechanism for adding extra target qops, doing the flag
> > elimination pass, then expanding those to generic qops.
>
> Um, wouldn't the flag setting code be fairly straightforward as a qop that
> comes right _before_ the other op, as in "set the flags for doing this with
> these registers", that does nothing but set the flags (I.E. it wouldn't
> modify the contents of any of the registers, so it could be immediately
> followed by the appropriate add or shift or so on), and then the flag
> setting pass could just turn all the ones that weren't needed into
> QOP_NULL?
Theoretically possible, but not so easy in practice. Especially when you get
things like partial flag clobbers, and lazy flag evaluation. Doing it as a
target specific hack is much simpler and quicker.
> Or is that what's happening now? (Do QOPs ever modify their input
> registers, or only the output one?)
The generic qops never modify inputs, and never read outputs. Inputs and
outputs can be the same qreg.
> > > Ah, hang on. There's target_reginfo in translate-all.c, that's using
> > > some of the other values. So what the heck does translate-all.c do?
> > > (Shared code called by all the platform-dependent translate functions?)
> >
> > There are three fairly independent stages:
> > 1) target-*/translate.c converts guest code into qops.
> > 2) translate-all.c messes about with those qops a bit (allocates host
> > registers, etc).
> > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > host code.
>
> Is pass 2 where the flag elimination pass goes (and presumably any other
> optimizations that might get added)? No, that can't be the case or the
> m68k code wouldn't need its own implementation of the flag elimination
> pass...
Flag elimination is at the end of step 1.
> > > > For converting targets you can probably ignore most of the
> > > > translate-all and host-*/ changes. These implement generating code
> > > > from the qops.
> > >
> > > Ok, this implies that qops are a new thing. Which looking at the code
> > > sort of supports. Which means I don't understand what's going on at all.
> >
> > qops and dyngen ops are both small "functions" that are represented in a
> > similar way. The difference is that dyngen ops are target specific fixed
> > functions, whereas qops are generic parameterized functions.
>
> So the 11x11 exponential complexity of qemu producing its own assembly
> output might not be as much of a problem after switching to qops?
Right. The exponential complexity is if you write the assembly by hand instead
of using gcc to generate it.
> Possibly some of the common qops can have an asm block for 'em, and the
> rest can go through the contortions target-*/op.c is currently doing with
> (glue(glue(blah))) and so on.
Currently we know how to generate code directly for all qops. Anything more
complicated must be either put in a helper function or split into multiple
qops.
> > While they are really separate things, the details have been chosen so it
> > should be possible to adapt the existing translate.c code rather than
> > having to rewrite it from scratch. Decoding x86 instruction semantics is
> > complicated :-)
>
> Yay iterative transformation with regression testing. (And nothing says
> regression testing like booting a Linux distro under the sucker.)
Exactly.
> > Many of the simpler dyngen ops can be replaced with a single qop. Others
> > can be replaced with a sequence of a few qops. Some of the more
> > complicated ones may need to be moved into helper functions.
>
> At some point, I hope to understand helper functions. But I'm not there
> yet.
>
> > > I need to re-read this later. My brain's full and I'm deeply confused.
> >
> > I started off by saying qops were effectively instructions for an
> > imaginary machine. translate-all.c rearranges them so they match up very
> > closely with the instructions available on the host. Once this has been
> > done turning them into binary code is relatively simple.
>
> I sort of thought this is what it was already doing, but apparently not...
We're getting confused with tenses. I mean that once translate-all.c has
rearranged the qops, we *do* generate host instructions from them without too
much effort.
> > If native host FP is not available qemu will include appropriate bits so
> > that
> > after macro expansion and inlining you end up with:
> >
> > tmp = gen_new_qreg(QMODE_I32);
> > gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR).
> >
> > and the addf32 helper does the floating point addition using the
> > "softfloat" library. The qemu softfloat library implementation may
> > actually use hardware floating point rather than doing everything
> > manually.
>
> No reason (except speed) the code output into a translation block can't do
> function calls. I think.
That's exactly what a helper function is. Calling functions is complicated, so
I've restricted the functions that can be called to explicitly declared
helper functions.
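A rough sketch of that restriction, not the actual qemu code: translated code
can only call out through a fixed table of declared helpers. The names
helper_id, helper_table, HELPER_ADDF32, call_helper and f32_bits are made up
for illustration, and the helper here simply uses host floating point.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper IDs; only functions listed here may be called
   from translated code. */
enum helper_id { HELPER_ADDF32, HELPER_COUNT };

typedef uint32_t (*helper_fn)(uint32_t a, uint32_t b);

/* Illustrative helper: add two float32 values passed as raw bit patterns. */
static uint32_t helper_addf32(uint32_t a, uint32_t b)
{
    float fa, fb, fr;
    uint32_t r;
    memcpy(&fa, &a, sizeof fa);
    memcpy(&fb, &b, sizeof fb);
    fr = fa + fb;
    memcpy(&r, &fr, sizeof r);
    return r;
}

static const helper_fn helper_table[HELPER_COUNT] = {
    [HELPER_ADDF32] = helper_addf32,
};

/* What a "call helper" op boils down to: an indirect call through the
   table, with the ID checked against the declared set. */
static uint32_t call_helper(enum helper_id id, uint32_t a, uint32_t b)
{
    assert(id < HELPER_COUNT);
    return helper_table[id](a, b);
}

/* Convenience for building float32 bit patterns in examples. */
static uint32_t f32_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Anything not in the table cannot be reached from generated code, which keeps
the calling convention handling in one place.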
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-11-01 0:29 ` Paul Brook
@ 2006-11-01 1:51 ` Rob Landley
2006-11-01 3:22 ` Paul Brook
0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-11-01 1:51 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
On Tuesday 31 October 2006 7:29 pm, Paul Brook wrote:
> > Actually it sounds additive rather than multiplicative. Does each target
> > have an entirely unrelated set of ops, or is there a shared set of
> > primitive ops plus some oddballs?
>
> The shared set of primitive ops is basically qops :-)
> You probably could figure out a single common set of qops, then write assembly
> and glue them together like we do with dyngen. However once you've done that
> you've implemented most of what's needed for fully dynamic qops, so it
> doesn't really seem worth it.
I missed a curve. What's "fully dynamic qops"? (There's no translation
cache?)
> > But I already know
> > that doesn't work because it doesn't explain the "unable to find spill
> > register" problem.
>
> That a separate gcc bug. It gets stuck when you tell it not to use half the
> registers, then ask it to do 64-bit math. This is one of the reasons
> eliminating the fixed registers is a good idea.
Sigh. The problems motivating me to learn the code are highly esoteric
breakage, yet I'm still not quite up to the task of understanding what's
going on when all this works _right_. Grumble...
> > > It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> > > dyngen
> > > uses 3 fixed register as scratch workspace. for qop purposes these are
> > > part of the guest CPU state. They're only there to aid conversion of the
> > > translation code, they'll go away eventually.
> >
> > Presumably the m68k target is pure qop, and hasn't got this sort of thing?
>
> Correct.
> There is one use of T0 left for communicating with the TB chaining code, but
> that's it and will probably go away eventually.
Any idea where I can get a toolchain that can output a "hello world" program
for m68k nommu? (Or perhaps you have a statically linked "hello world"
program for the platform lying around?)
Building toolchains is one of my other hobbies but it's a royal pain because
in order to get "hello world" to compile and link you have to supply kernel
headers, build binutils and gcc with various configuration options and path
overrides and such, build uClibc with the result and get them all talking to
each other. I.E. you've got to do hours of work before you get to the first
real "did it work" point, and then backtrack to figure out why the answer is
usually "no". (Prebuilt binary toolchains are useful just to narrow down the
number of possible things that could be broken when you first try out a new
platform.)
> > > > Possible translation: you can feed a qreg containing an I64 value to a
> > > > qop taking an i32 argument, and it'll typecast the sucker down
> > > > intelligently, but if you produce an I32 result and expect to use that
> > > > qreg's value as an I64 argument later, you have to call a
> > > > sign-extending qop on it first?
> > >
> > > Exactly.
> > > If you mix I32,F32 and/or F64 in this way Bad Things will happen.
> >
> > Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?
>
> Or qemu will get confused and crash.
I've had that happen without qops, although not recently. (I have this nasty
habit of trying Ubuntu's PPC and x86-64 distros under qemu with each new
release. They usually fail in amusing new ways.)
> > > > seeing end with _im which I presume means "immediate". The alternative
> > > > is _cc, but what does that mean? (Presumably not "closed captioned".)
> > >
> > > _cc are variants that set the condition codes. I may have got T0 and T1
> > > backwards in the first 3 lines.
> >
> > Ah!
> >
> > Is this written down anywhere? I've read Fabrice's paper and the design
> > documentation, and I'm not remembering this. It's quite possible I missed
> > it when my brain filled up, though.
>
> Dunno.
So if at any point I actually understand this stuff, I need to write
documentation? (I can do part 2, part 1 the jury's still out on...)
> It also means you don't need to reserve that register, avoiding the gcc
> unable to find spill register bug you mentioned above.
I'm all for it.
> > Um, wouldn't the flag setting code be fairly straightforward as a qop that
> > comes right _before_ the other op, as in "set the flags for doing this with
> > these registers", that does nothing but set the flags (I.E. it wouldn't
> > modify the contents of any the registers, so it could be immediately
> > followed by the appropriate add or shift or so on), and then the flag
> > setting pass could just turn all the ones that weren't needed into
> > QOP_NULL?
>
> Theoretically possible, but not so easy in practice. Especially when you get
> things like partial flag clobbers, and lazy flag evaluation. Doing it as a
> target specific hack is much simpler and quicker.
I think I know what partial flag clobbers are (although if you're working your
way back, in theory you could handle it with a mask of exposed bits), but
what's lazy flag evaluation? (I thought that was the point of eliminating
the unused flag setting. Are you saying the hardware also does this and we
have to emulate that?)
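The "mask of exposed bits" idea can be sketched roughly as below. This is only
an illustration: the flag bits, the op layout and eliminate_dead_flags are all
hypothetical, and it assumes every flag is live at the end of the block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits and op layout, for illustration only. */
enum { FLAG_C = 1, FLAG_Z = 2, FLAG_N = 4, FLAG_V = 8, FLAG_ALL = 15 };

enum op_kind { OP_NULL, OP_SETFLAGS, OP_OTHER };

struct op {
    enum op_kind kind;
    unsigned flags_read;     /* flags this op consumes */
    unsigned flags_written;  /* flags this op overwrites */
};

/* Walk the block backwards, tracking which flags a later op may still
   read. A flag-setting op with none of its written flags live becomes
   OP_NULL. Returns the number of ops removed. */
static size_t eliminate_dead_flags(struct op *ops, size_t n)
{
    unsigned live = FLAG_ALL;   /* assume all flags are needed at block end */
    size_t removed = 0;

    assert(ops != NULL || n == 0);
    for (size_t i = n; i-- > 0; ) {
        struct op *o = &ops[i];
        if (o->kind == OP_SETFLAGS && (o->flags_written & live) == 0) {
            o->kind = OP_NULL;
            removed++;
            continue;
        }
        live &= ~o->flags_written;  /* these are redefined here */
        live |= o->flags_read;      /* these must survive from earlier ops */
    }
    return removed;
}
```

A partial clobber shows up as an op whose flags_written covers only some bits,
so earlier writers of the remaining bits stay live. The sketch keeps a whole
op if any one of its flags is still needed; it does not narrow what the op
computes.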
> > Or is that what's happening now? (Do QOPs ever modify their input
> > registers, or only the output one?)
>
> The generic qops never modify inputs, and never read outputs. Inputs and
> outputs can be the same qreg.
Hm.
> > > There are three fairly independent stages:
> > > 1) target-*/translate.c converts guest code into qops.
> > > 2) translate-all.c messes about with those qops a bit (allocates host
> > > registers, etc).
> > > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > > host code.
> >
> > Is pass 2 where the flag elimination pass goes (and presumably any other
> > optimizations that might get added)? No, that can't be the case or the
> > m68k code wouldn't need its own implementation of the flag elimination
> > pass...
>
> Flag elimination is at the end of step 1.
Because it's platform specific?
> > > qops and dyngen ops are both small "functions" that are represented in a
> > > similar way. The difference is that dyngen ops are target specific fixed
> > > functions, whereas qops are generic parameterized functions.
> >
> > So the 11x11 exponential complexity of qemu producing its own assembly
> > output might not be as much of a problem after switching to qops?
>
> RIght. The exponential complexity is if you write the assembly by hand
> instead of using gcc to generate it.
The exponential complexity is if you have to write different code for each
combination of host and target. If every target disassembles to the same set
of target QOPs, then you could have a hand-written assembly version of each
QOP for each host platform, and still have N rather than N^2 of them.
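As a back-of-the-envelope check of that counting argument (the function names
are made up, and the 11 comes from the "11x11" figure quoted above):

```c
#include <assert.h>

/* One hand-written translator per (target, host) combination. */
static unsigned pairwise_translators(unsigned targets, unsigned hosts)
{
    assert(targets > 0 && hosts > 0);
    return targets * hosts;
}

/* One front end per target plus one back end per host, meeting at a
   shared set of ops. */
static unsigned shared_op_components(unsigned targets, unsigned hosts)
{
    assert(targets > 0 && hosts > 0);
    return targets + hosts;
}
```

With 11 targets and 11 hosts that is 121 pairings against 22 components.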
And I still wanna use tcc to generate it, someday. :)
> > Possibly some of the common qops can have an asm block for 'em, and the
> > rest can go through the contortions target-*/op.c is currently doing with
> > (glue(glue(blah))) and so on.
>
> Currently we know how to generate code direcly for all qops. Anything more
> complicated must be either put in a helper function or split into multiple
> qops.
Split into multiple qops I can understand.
> > > I started off by saying qops were effectively instructions for an
> > > imaginary machine. translate-all.c rearranges them so they match up very
> > > closely with the instructions available on the host. Once this has been
> > > done turning them into binary code is relatively simple.
> >
> > I sort of thought this is what it was already doing, but apparently not...
>
> We're getting confused with tenses. I mean this once translate-all.c has
> rearranged the qops we *do* generate host instructions from them without too
> much effort.
By "already doing" I meant I thought the 0.8.2 code was doing this, before your
new tree switching everything over to qops. (Trying to read dyngen.c reminds
me of reading cgi code that outputs html with embedded javascript.)
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-11-01 1:51 ` Rob Landley
@ 2006-11-01 3:22 ` Paul Brook
2006-11-01 16:34 ` Rob Landley
0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-11-01 3:22 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
On Wednesday 01 November 2006 01:51, Rob Landley wrote:
> On Tuesday 31 October 2006 7:29 pm, Paul Brook wrote:
> > > Actually it sounds additive rather than multiplicative. Does each
> > > target have an entirely unrelated set of ops, or is there a shared set
> > > of primitive ops plus some oddballs?
> >
> > The shared set of primitive ops is basically qops :-)
> > You probably could figure out a single common set of qops, then write assembly
> > and glue them together like we do with dyngen. However once you've done
> > that you've implemented most of what's needed for fully dynamic qops, so
> > it doesn't really seem worth it.
>
> I missed a curve. What's "fully dynamic qops"? (There's no translation
> cache?)
I mean all the qop stuff I've implemented.
> > > > It corresponds to "T0" in dyngen. In addition to the actual CPU
> > > > state, dyngen
> > > > uses 3 fixed register as scratch workspace. for qop purposes these
> > > > are part of the guest CPU state. They're only there to aid conversion
> > > > of the translation code, they'll go away eventually.
> > >
> > > Presumably the m68k target is pure qop, and hasn't got this sort of
> > > thing?
> >
> > Correct.
> > There is one use of T0 left for communicating with the TB chaining code,
> > but that's it and will probably go away eventually.
>
> Any idea where I can get a toolchain that can output a "hello world"
> program for m68k nommu? (Or perhaps you have a statically linked "hello
> world" program for the platform lying around?)
Funnily enough I do :-)
http://www.codesourcery.com/gnu_toolchains/coldfire/
> > Theoretically possible, but not so easy in practice. Especially when you
> > get things like partial flag clobbers, and lazy flag evaluation. Doing it
> > as a target specific hack is much simpler and quicker.
>
> I think I know what partial flag clobbers are (although if you're working
> your way back, in theory you could handle it with a mask of exposed bits),
> but what's lazy flag evaulation? (I thought that was the point of
> eliminating the unused flag setting. Are you saying the hardware also does
> this and we have to emulate that?)
Lazy flag evaluation is where you don't bother calculating the actual flags
when executing the flag-setting instruction. Instead you save the
operands/result and compute the flags when you actually need them.
> > > > There are three fairly independent stages:
> > > > 1) target-*/translate.c converts guest code into qops.
> > > > 2) translate-all.c messes about with those qops a bit (allocates host
> > > > registers, etc).
> > > > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > > > host code.
> > >
> > > Is pass 2 where the flag elimination pass goes (and presumably any
> > > other optimizations that might get added)? No, that can't be the case
> > > or the m68k code wouldn't need its own implementation of the flag
> > > elimination pass...
> >
> > Flag elimination is at the end of step 1.
>
> Because it's platform specific?
Yes.
> > > > qops and dyngen ops are both small "functions" that are represented
> > > > in a similar way. The difference is that dyngen ops are target
> > > > specific fixed functions, whereas qops are generic parameterized
> > > > functions.
> > >
> > > So the 11x11 exponential complexity of qemu producing its own assembly
> > > output might not be as much of a problem after switching to qops?
> >
> > RIght. The exponential complexity is if you write the assembly by hand
> > instead of using gcc to generate it.
>
> The exponential complexity is if you have to write different code for each
> combination of host and target. If every target disassembles to the same
> set of target QOPs, then you could have a hand-written assembly version of
> each QOP for each host platform, and still have N rather than N^2 of them.
Right, but by the time you've got everything to use the same set of ops you
may as well teach qemu how to generate code instead of using potted
fragments.
Using hand-written assembly fragments probably doesn't make qemu any faster,
it just removes the gcc dependency. Using qops also allows qemu to generate
better (faster) translated code.
Paul
* Re: [Qemu-devel] qemu vs gcc4
2006-11-01 3:22 ` Paul Brook
@ 2006-11-01 16:34 ` Rob Landley
2006-11-01 17:01 ` Paul Brook
0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-11-01 16:34 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel
On Tuesday 31 October 2006 10:22 pm, Paul Brook wrote:
> > > and glue them together like we do with dyngen. However once you've done
> > > that you've implemented most of what's needed for fully dynamic qops, so
> > > it doesn't really seem worth it.
> >
> > I missed a curve. What's "fully dynamic qops"? (There's no translation
> > cache?)
>
> I mean all the qop stuff I've implemented.
Still lost. Where does the "fully dynamic" part come in?
> > Any idea where I can get a toolchain that can output a "hello world"
> > program for m68k nommu? (Or perhaps you have a statically linked "hello
> > world" program for the platform lying around?)
>
> Funnily enough I do :-)
> http://www.codesourcery.com/gnu_toolchains/coldfire/
Woot.
This download page was designed by your company's legal department, I take it?
(There's no such thing as GNU/Linux, and you don't have to accept the GPL
because it's based on copyright law, not contract law, so "informed consent"
is not the basis for enforcement.)
> > > Theoretically possible, but not so easy in practice. Especially when you
> > > get things like partial flag clobbers, and lazy flag evaluation. Doing it
> > > as a target specific hack is much simpler and quicker.
> >
> > I think I know what partial flag clobbers are (although if you're working
> > your way back, in theory you could handle it with a mask of exposed bits),
> > but what's lazy flag evaulation? (I thought that was the point of
> > eliminating the unused flag setting. Are you saying the hardware also does
> > this and we have to emulate that?)
>
> Lazy flag evaluation is where you don't bother calculating the actual flags
> when executing the flag-setting instruction. Instead you save the
> operands/result and compute the flags when you actually need them.
Such as when the computation's in a loop but you only test after exiting the
loop for other reasons? I'd have to see examples to figure out how it would
make sense to optimize that...
> > The exponential complexity is if you have to write different code for each
> > combination of host and target. If every target disassembles to the same
> > set of target QOPs, then you could have a hand-written assembly version of
> > each QOP for each host platform, and still have N rather than N^2 of them.
>
> Right, but by the time you've got everything to use the same set of ops you
> may as well teach qemu how to generate code instead of using potted
> fragments.
I'm thinking of "here's the code for performing this QOP on this processor".
I'm not sure what distinction you're making.
> Using hand-written assembly fragments probably doesn't make qemu any faster,
> it just removes the gcc dependency.
Which seems like a good thing to me. (Or at least the gcc _version_
dependency.)
> Using qops also allows qemu to generate better (faster) translated code.
Currently, I'm interested in building qemu under compiler versions I actually
have installed, without having to apply patches that the patch authors
consider too disgusting to integrate.
> Paul
Rob
--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
* Re: [Qemu-devel] qemu vs gcc4
2006-11-01 16:34 ` Rob Landley
@ 2006-11-01 17:01 ` Paul Brook
0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-11-01 17:01 UTC (permalink / raw)
To: Rob Landley; +Cc: qemu-devel
On Wednesday 01 November 2006 16:34, Rob Landley wrote:
> On Tuesday 31 October 2006 10:22 pm, Paul Brook wrote:
> > > > and glue them together like we do with dyngen. However once you've
> > > > done that you've implemented most of what's needed for fully dynamic
> > > > qops, so it doesn't really seem worth it.
> > >
> > > I missed a curve. What's "fully dynamic qops"? (There's no
> > > translation cache?)
> >
> > I mean all the qop stuff I've implemented.
>
> Still lost. Where does the "fully dynamic" part come in?
Generating code instead of blindly glueing together precompiled fragments.
> > > Any idea where I can get a toolchain that can output a "hello world"
> > > program for m68k nommu? (Or perhaps you have a statically linked
> > > "hello world" program for the platform lying around?)
> >
> > Funnily enough I do :-)
> > http://www.codesourcery.com/gnu_toolchains/coldfire/
>
> Woot.
>
> This download page was designed by your company's legal department, I take
> it? (There's no such thing as GNU/Linux, and you don't have to accept the
> GPL because it's based on copyright law, not contract law, so "informed
> consent" is not the basis for enforcement.)
I suspect it's more to do with due diligence on our part. While it might not
have any legal meaning on its own, it makes it much harder for people to
claim they infringed copyright accidentally.
> > Lazy flag evaluation is where you don't bother calculating the actual
> > flags when executing the flag-setting instruction. Instead you save the
> > operands/result and compute the flags when you actually need them.
>
> Such as when the computation's in a loop but you only test after exiting
> the loop for other reasons? I'd have to see examples to figure out how it
> would make sense to optimize that...
Yes, or when only some flags are used, e.g.:
add %eax, %ebx
adc %ecx, %edx
The adc instruction only uses the carry flag, then clobbers all the rest. A
naive implementation would evaluate all the flags after the add. With lazy
evaluation we evaluate just the carry flag before the adc, and know we don't
have to bother calculating the other flags. It also avoids having to evaluate
the flags at a TB boundary.
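A minimal sketch of the add/adc case, assuming a made-up lazy_flags structure
rather than qemu's real flag state: the add only records its operand and
result, and the carry is worked out when the adc asks for it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lazy-flag state: remember the last flag-setting operation
   and its values instead of computing every flag eagerly. */
enum cc_op { CC_OP_NONE, CC_OP_ADD32 };

struct lazy_flags {
    enum cc_op op;
    uint32_t src;   /* one operand of the add */
    uint32_t res;   /* result of the add */
};

/* Emulated "add": do the arithmetic and record just enough to recover
   the flags later. No flag is computed here. */
static uint32_t emu_add(struct lazy_flags *cc, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    cc->op = CC_OP_ADD32;
    cc->src = b;
    cc->res = r;
    return r;
}

/* Compute the carry flag only when something asks for it. */
static unsigned lazy_carry(const struct lazy_flags *cc)
{
    assert(cc->op == CC_OP_ADD32);
    return cc->res < cc->src;   /* unsigned wraparound means carry out */
}

/* Emulated "adc": needs only the carry from the preceding add.
   This sketch does not record adc's own flag results. */
static uint32_t emu_adc(const struct lazy_flags *cc, uint32_t a, uint32_t b)
{
    return a + b + lazy_carry(cc);
}
```

The zero, sign and overflow flags of the add are never evaluated in this
sequence, which is the saving described above.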
> > > The exponential complexity is if you have to write different code for
> > > each combination of host and target. If every target disassembles to
> > > the same set of target QOPs, then you could have a hand-written
> > > assembly version of each QOP for each host platform, and still have N
> > > rather than N^2 of them.
> >
> > Right, but by the time you've got everything to use the same set of ops
> > you may as well teach qemu how to generate code instead of using potted
> > fragments.
>
> I'm thinking of "here's the code for performing this QOP on this
> processor". I'm not sure what distinction you're making.
The difference is whether qemu knows how to generate code for that target, or
is blindly glueing binary blobs together.
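To illustrate the distinction with a sketch (x86 host only; the function and
array names are invented and this is not how the qemu tree is organised): a
fragment is a fixed run of bytes with the registers baked in, while a
generator builds the instruction from its operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x86 32-bit register numbers as used in the ModRM byte. */
enum x86_reg { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };

/* Fragment approach: a precompiled blob for "add %eax, %ebx" only.
   The emitter copies bytes without knowing what they mean. */
static const uint8_t frag_add_eax_ebx[] = { 0x01, 0xC3 };

static size_t emit_fragment(uint8_t *out)
{
    memcpy(out, frag_add_eax_ebx, sizeof frag_add_eax_ebx);
    return sizeof frag_add_eax_ebx;
}

/* Generator approach: encode "add src, dst" (dst += src) for any pair of
   registers. Opcode 0x01 is ADD r/m32, r32; the ModRM byte has mod=11,
   reg=src, rm=dst. */
static size_t emit_add_reg_reg(uint8_t *out, enum x86_reg src, enum x86_reg dst)
{
    assert(src <= 7 && dst <= 7);
    out[0] = 0x01;
    out[1] = (uint8_t)(0xC0 | (src << 3) | dst);
    return 2;
}
```

Because the generator knows the encoding, a register allocator can pick the
operands; with the blob, everything has to be shuffled into the fixed
registers first.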
Paul
Thread overview: 43+ messages
2006-10-20 18:53 [Qemu-devel] qemu vs gcc4 K. Richard Pixley
2006-10-22 22:06 ` Johannes Schindelin
2006-10-23 8:16 ` Martin Guy
2006-10-23 12:20 ` Paul Brook
2006-10-23 13:59 ` Avi Kivity
2006-10-23 14:10 ` Paul Brook
2006-10-23 14:28 ` Avi Kivity
2006-10-23 14:31 ` Paul Brook
2006-10-23 14:35 ` Avi Kivity
2006-10-23 17:41 ` K. Richard Pixley
2006-10-23 17:58 ` Paul Brook
2006-10-23 18:04 ` K. Richard Pixley
2006-10-23 18:20 ` Laurent Desnogues
2006-10-23 18:37 ` Paul Brook
2006-10-24 23:39 ` Rob Landley
2006-10-25 0:24 ` Paul Brook
2006-10-25 19:39 ` Rob Landley
2006-10-26 18:09 ` Daniel Jacobowitz
2006-10-31 16:53 ` Rob Landley
2006-10-31 19:02 ` Paul Brook
2006-10-31 20:41 ` Rob Landley
2006-10-31 22:08 ` Paul Brook
2006-10-31 22:31 ` Laurent Desnogues
2006-10-31 23:00 ` Paul Brook
2006-11-01 0:00 ` Rob Landley
2006-11-01 0:29 ` Paul Brook
2006-11-01 1:51 ` Rob Landley
2006-11-01 3:22 ` Paul Brook
2006-11-01 16:34 ` Rob Landley
2006-11-01 17:01 ` Paul Brook
2006-10-31 23:17 ` Rob Landley
2006-11-01 0:01 ` Paul Brook
2006-10-30 4:35 ` Rob Landley
2006-10-30 14:56 ` Paul Brook
2006-10-30 16:31 ` Rob Landley
2006-10-30 16:50 ` Paul Brook
2006-10-30 22:54 ` Stephen Torri
2006-10-30 23:13 ` Paul Brook
2006-10-23 1:27 ` Rob Landley
2006-10-23 1:44 ` Paul Brook
2006-10-23 1:45 ` Johannes Schindelin
2006-10-23 17:53 ` K. Richard Pixley
2006-10-23 18:08 ` Rob Landley