[Qemu-devel] qemu vs gcc4

* [Qemu-devel] qemu vs gcc4
@ 2006-10-20 18:53 K. Richard Pixley
  2006-10-22 22:06 ` Johannes Schindelin
  2006-10-23  1:27 ` Rob Landley
  0 siblings, 2 replies; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-20 18:53 UTC (permalink / raw)
  To: qemu-devel

Could someone please explain the issue with gcc4?  Or point me 
to an existing explanation?

I mean, I understand that qemu is believed to be building incorrectly 
with gcc4.  But what is the failure mode folks have been seeing?  And 
what's being done about it or what needs to be done about it?  Is this 
an issue for all targets or is it x86 specific?

Thanks,
--rich

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-20 18:53 [Qemu-devel] qemu vs gcc4 K. Richard Pixley
@ 2006-10-22 22:06 ` Johannes Schindelin
  2006-10-23  8:16   ` Martin Guy
  2006-10-23  1:27 ` Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2006-10-22 22:06 UTC (permalink / raw)
  To: K. Richard Pixley; +Cc: qemu-devel

Hi K. Richard,

On Fri, 20 Oct 2006, K. Richard Pixley wrote:

> Could someone please explain the issue with gcc4?  Or point me 
> to an existing explanation?

The issue is that gcc4 optimizes better, but this breaks assumptions of 
QEmu.

Example: The basic idea (simplified!) of QEmu is writing C functions which 
implement the instructions of the target CPU. Then, code to be emulated is 
translated by chaining the _compiled_ functions (corresponding to the 
target code) together, but _leaving_ out the return instruction at the end 
of the function (otherwise, the resulting code would return already after 
the first emulated instruction).

Now, gcc4 can produce code with several return instructions (with no 
option to turn that off, as far as I understand). You cannot cut them out, 
and therefore you cannot chain the simple functions.
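
A minimal sketch of the failure, assuming a toy encoding in which every byte is one "instruction" and 0xC3 stands in for the host return. The names append_op, count_rets and stray_rets_after_chaining are hypothetical and are not qemu or dyngen code. It only illustrates why "copy the op and drop the final return" stops working once a return can appear anywhere else in the op:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for host machine code: each byte is one "instruction".
 * OP_RET plays the role of the host return instruction. Real x86 code
 * cannot be scanned byte by byte like this, because 0xC3 may also occur
 * inside other instructions. */
enum { OP_WORK = 0x01, OP_RET = 0xC3 };

/* Naive dyngen-style copy: append an op's body to the translation
 * buffer, assuming the only return is the final byte, which is dropped.
 * Returns the number of bytes appended. */
size_t append_op(unsigned char *out, const unsigned char *op, size_t len)
{
    assert(len > 0 && op[len - 1] == OP_RET);
    memcpy(out, op, len - 1);
    return len - 1;
}

/* Count return instructions left in a translated block. Any non-zero
 * result means execution would leave the block early. */
size_t count_rets(const unsigned char *code, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (code[i] == OP_RET)
            n++;
    return n;
}

/* Chain two ops and report how many stray returns survive. */
size_t stray_rets_after_chaining(const unsigned char *a, size_t alen,
                                 const unsigned char *b, size_t blen)
{
    unsigned char buf[64];
    size_t n = 0;

    assert(alen + blen <= sizeof buf);
    n += append_op(buf + n, a, alen);
    n += append_op(buf + n, b, blen);
    return count_rets(buf, n);
}
```

An op with a single trailing return chains cleanly; an op with an extra return in the middle leaves that return inside the chained block.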

There seem to be other issues, too, like not being able to correctly link 
the user emulation code, but I am not that sure about it.

> And what's being done about it or what needs to be done about it?

Paul started to implement a hand-written translator, which does not depend 
on gcc, but I guess that project is stalled.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-20 18:53 [Qemu-devel] qemu vs gcc4 K. Richard Pixley
  2006-10-22 22:06 ` Johannes Schindelin
@ 2006-10-23  1:27 ` Rob Landley
  2006-10-23  1:44   ` Paul Brook
  2006-10-23  1:45   ` Johannes Schindelin
  1 sibling, 2 replies; 43+ messages in thread
From: Rob Landley @ 2006-10-23  1:27 UTC (permalink / raw)
  To: qemu-devel

On Friday 20 October 2006 2:53 pm, K. Richard Pixley wrote:
> Could someone please explain the issue with gcc4?  Or point me 
> to an existing explanation?
> 
> I mean, I understand that qemu is believed to be building incorrectly 
> with gcc4.  But what is the failure mode folks have been seeing?  And 
> what's being done about it or what needs to be done about it?  Is this 
> an issue for all targets or is it x86 specific?

There's a patch to fix it in http://busybox.net/downloads/qemu (which works 
for 0.8.0 through 0.8.2, dunno about current cvs).  This is a collection of 
four different patches I got from a gentoo web page via google.

Basically, gcc changed in a way that broke qemu.  There's been an open bug 
report in gcc ever since, but the GCC developers really aren't interested in 
backwards compatibility.  (Heck, gcc 4.0 breaks building bash 2.05b).  The 
qemu developers aren't interested in applying ugly patches to support gcc 4.x 
until gcc 3.x becomes so obsolete nobody ships it anymore.  (And considering 
that there are still some niche embedded boards that have hacked up versions 
of gcc 2.95 targeting them and nothing else, I wouldn't be surprised if in 
five years we have your main compiler and the compiler to build qemu, ala 
kgcc under Red Hat 7.  *shrug*)

I was pondering trying to get tcc to build qemu, and even made a mercurial copy 
of CVS and started collecting old patches from the list (since CVS hadn't had 
a single patch checked into it in eight months and there were other old 
patches from a full year ago which I needed to apply before the thing would 
work on Ubuntu 6.06).  But Fabrice showed back up on Tuesday and checked in a 
patch, and now I've got a fork that's out of sync with mainline.  Since I 
have no desire to be in Fabrice's way if he still has any interest in the 
project, I've mothballed my fork and moved on to other things...

The current state of TCC trying to build qemu-0.8.2 is that it blows up on the 
very first file.  Just getting it to compile the full source (let alone 
actually work) seems like a significant undertaking, but I was trying it as a 
learning experience so who knows how much work it actually is.  Might be 
simple if you know what you're doing...

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23  1:27 ` Rob Landley
@ 2006-10-23  1:44   ` Paul Brook
  2006-10-23  1:45   ` Johannes Schindelin
  1 sibling, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-23  1:44 UTC (permalink / raw)
  To: qemu-devel

> Basically, gcc changed in a way that broke qemu.  There's been an open bug
> report in gcc ever since, but the GCC developers really aren't interested
> in backwards compatibility. 

That's not entirely true. There are two problems:

- qemu makes assumptions about the layout of the code gcc generates. This 
works by chance on older gcc. This affects all hosts, and is not a gcc bug.

- qemu reserves several registers for its own use. On architecturally crippled 
hosts (i.e. x86) this means we hit really obscure gcc bugs because gcc runs 
out of registers. This is a gcc bug, but is also relatively easy to work 
around.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23  1:27 ` Rob Landley
  2006-10-23  1:44   ` Paul Brook
@ 2006-10-23  1:45   ` Johannes Schindelin
  2006-10-23 17:53     ` K. Richard Pixley
  2006-10-23 18:08     ` Rob Landley
  1 sibling, 2 replies; 43+ messages in thread
From: Johannes Schindelin @ 2006-10-23  1:45 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

Hi Rob,

On Sun, 22 Oct 2006, Rob Landley wrote:

> Basically, gcc changed in a way that broke qemu.

Yes, they did. But even if I understand your frustration (which I share), 
I also understand the gcc people. After all, using gcc to create the 
blocks for dynamic translation is a _hack_. The result of a compiler run, 
though, should work and run -- as fast as possible. So basically, the gcc 
people want to achieve a different goal from what we misuse their program 
for.

> I was pondering trying to get tcc to build qemu,

(since tcc only supports x86 targets, this is not really a solution.)

> and even made a mercurial copy [...] But Fabrice showed back up on 
> Tuesday and checked in a patch, and now I've got a fork that's out of 
> sync with mainline.

I do not really know Mercurial, but it should make it really easy to merge 
two branches (as far as I have been told).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-22 22:06 ` Johannes Schindelin
@ 2006-10-23  8:16   ` Martin Guy
  2006-10-23 12:20     ` Paul Brook
  2006-10-23 17:41     ` K. Richard Pixley
  0 siblings, 2 replies; 43+ messages in thread
From: Martin Guy @ 2006-10-23  8:16 UTC (permalink / raw)
  To: qemu-devel

> Now, gcc4 can produce code with several return instructions (with no
> option to turn that off, as far as I understand). You cannot cut them out,
> and therefore you cannot chain the simple functions.

...unless you also map return instructions within the generated
functions into branches to the soon-to-be-dropped final "return"? Not
that I know anything about qemu internals mind u...

    M

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23  8:16   ` Martin Guy
@ 2006-10-23 12:20     ` Paul Brook
  2006-10-23 13:59       ` Avi Kivity
  2006-10-23 17:41     ` K. Richard Pixley
  1 sibling, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-23 12:20 UTC (permalink / raw)
  To: qemu-devel

On Monday 23 October 2006 09:16, Martin Guy wrote:
> > Now, gcc4 can produce code with several return instructions (with no
> > option to turn that off, as far as I understand). You cannot cut them out,
> > and therefore you cannot chain the simple functions.
>
> ...unless you also map return instructions within the generated
> functions into branches to the soon-to-be-dropped final "return"? Not
> that I know anything about qemu internals mind u...

That's exactly what my gcc4 hacks do.

It gets complicated because x86 uses variable length insn encodings so you 
don't know where insn boundaries are, and a jmp instruction is larger than a 
ret instruction so it's not always possible to do a straight replacement.
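
To make the size mismatch concrete, here is a small sketch that works on a plain byte buffer. It relies on the standard x86 encodings (near ret is the single byte 0xC3, jmp rel8 is 0xEB plus one displacement byte, jmp rel32 is 0xE9 plus four). The helper names emit_jmp_rel32 and fits_in_place are hypothetical, and this is not the code from the gcc4 hacks mentioned above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { X86_RET = 0xC3, X86_JMP_REL8 = 0xEB, X86_JMP_REL32 = 0xE9 };
enum { RET_LEN = 1, JMP_REL8_LEN = 2, JMP_REL32_LEN = 5 };

/* Write "jmp rel32" at code[at] so that it lands on code[target].
 * The displacement is relative to the first byte after the jmp.
 * Returns the number of bytes written. */
size_t emit_jmp_rel32(uint8_t *code, size_t at, size_t target)
{
    int64_t disp = (int64_t)target - (int64_t)(at + JMP_REL32_LEN);
    uint32_t u;

    assert(disp >= INT32_MIN && disp <= INT32_MAX);
    u = (uint32_t)(int32_t)disp;

    code[at] = X86_JMP_REL32;
    code[at + 1] = (uint8_t)(u & 0xff);          /* little-endian rel32 */
    code[at + 2] = (uint8_t)((u >> 8) & 0xff);
    code[at + 3] = (uint8_t)((u >> 16) & 0xff);
    code[at + 4] = (uint8_t)((u >> 24) & 0xff);
    return JMP_REL32_LEN;
}

/* A ret can only be overwritten in place by a jump of jmp_len bytes if
 * enough overwritable padding ("slack") directly follows it. */
int fits_in_place(size_t slack, size_t jmp_len)
{
    return RET_LEN + slack >= jmp_len;
}
```

With no padding after a ret, neither jump form fits, which is why a straight replacement is not always possible. The sketch also assumes the position of the ret is already known, which is the other half of the difficulty described above.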

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 12:20     ` Paul Brook
@ 2006-10-23 13:59       ` Avi Kivity
  2006-10-23 14:10         ` Paul Brook
  0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2006-10-23 13:59 UTC (permalink / raw)
  To: paul; +Cc: qemu-devel

Paul Brook wrote:
> On Monday 23 October 2006 09:16, Martin Guy wrote:
>   
>>> Now, gcc4 can produce code with several return instructions (with no
>>> option to turn that off, as far as I understand). You cannot cut them out,
>>> and therefore you cannot chain the simple functions.
>>>       
>> ...unless you also map return instructions within the generated
>> functions into branches to the soon-to-be-dropped final "return"? Not
>> that I know anything about qemu internals mind u...
>>     
>
> That's exactly what my gcc4 hacks do.
>
> It gets complicated because x86 uses variable length insn encodings so you 
> don't know where insn boundaries are, and a jmp instruction is larger than a 
> ret instruction so it's not always possible to do a straight replacement.
>   

how about

void some_generated_instruction(u32 a1, u32 s2)
{
       // code
       asm volatile ( "" );
}


that will force the code to fall through to the null asm code, avoiding 
premature returns.

if the code uses 'return' explicitly, turn it to a goto just before the 
'asm volatile'.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 13:59       ` Avi Kivity
@ 2006-10-23 14:10         ` Paul Brook
  2006-10-23 14:28           ` Avi Kivity
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-23 14:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

> > That's exactly what my gcc4 hacks do.
> >
> > It gets complicated because x86 uses variable length insn encodings so
> > you don't know where insn boundaries are, and a jmp instruction is larger
> > than a ret instruction so it's not always possible to do a straight
> > replacement.
>
> how about
>
> void some_generated_instruction(u32 a1, u32 s2)
> {
>        // code
>        asm volatile ( "" );
> }
>
>
> that will force the code to fall through to the null asm code, avoiding
> premature returns.
>
> if the code uses 'return' explicitly, turn it to a goto just before the
> 'asm volatile'.

We already do that. It doesn't stop gcc putting the return in the middle of 
the function.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 14:10         ` Paul Brook
@ 2006-10-23 14:28           ` Avi Kivity
  2006-10-23 14:31             ` Paul Brook
  0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2006-10-23 14:28 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

Paul Brook wrote:
>>> That's exactly what my gcc4 hacks do.
>>>
>>> It gets complicated because x86 uses variable length insn encodings so
>>> you don't know where insn boundaries are, and a jmp instruction is larger
>>> than a ret instruction so it's not always possible to do a straight
>>> replacement.
>>>       
>> how about
>>
>> void some_generated_instruction(u32 a1, u32 s2)
>> {
>>        // code
>>        asm volatile ( "" );
>> }
>>
>>
>> that will force the code to fall through to the null asm code, avoiding
>> premature returns.
>>
>> if the code uses 'return' explicitly, turn it to a goto just before the
>> 'asm volatile'.
>>     
>
> We already do that. It doesn't stop gcc putting the return in the middle of 
> the function.
>
> Paul
>   
void f1();
void f2();

void f(int *z, int x, int y)
{
    if (x) {
        *z = x;
        f1();
    } else {
        *z = y;
        f2();
    }
    asm volatile ("");
}

works, with gcc -O2 -fno-reorder-blocks. removing either the asm or the 
-f flag doesn't.  No idea if it's consistent across architectures.

(the function calls are there to prevent cmov optimizations)




-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 14:28           ` Avi Kivity
@ 2006-10-23 14:31             ` Paul Brook
  2006-10-23 14:35               ` Avi Kivity
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-23 14:31 UTC (permalink / raw)
  To: qemu-devel

> > We already do that. It doesn't stop gcc putting the return in the middle
> > of the function.
> >
> > Paul
>
> void f1();
> void f2();
>
> void f(int *z, int x, int y)
> {
>     if (x) {
>         *z = x;
>         f1();
>     } else {
>         *z = y;
>         f2();
>     }
>     asm volatile ("");
> }
>
> works, with gcc -O2 -fno-reorder-blocks. removing either the asm or the
> -f flag doesn't.  No idea if it's consistent across architectures.

It doesn't work reliably though. We already do everything you mention above.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 14:31             ` Paul Brook
@ 2006-10-23 14:35               ` Avi Kivity
  0 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2006-10-23 14:35 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

Paul Brook wrote:
>>> We already do that. It doesn't stop gcc putting the return in the middle
>>> of the function.
>>>
>>> Paul
>>>       
>> void f1();
>> void f2();
>>
>> void f(int *z, int x, int y)
>> {
>>     if (x) {
>>         *z = x;
>>         f1();
>>     } else {
>>         *z = y;
>>         f2();
>>     }
>>     asm volatile ("");
>> }
>>
>> works, with gcc -O2 -fno-reorder-blocks. removing either the asm or the
>> -f flag doesn't.  No idea if it's consistent across architectures.
>>     
>
> It doesn't work reliably though. We already do everything you mention above.
>   

Okay.  Sorry for pestering :)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23  8:16   ` Martin Guy
  2006-10-23 12:20     ` Paul Brook
@ 2006-10-23 17:41     ` K. Richard Pixley
  2006-10-23 17:58       ` Paul Brook
  1 sibling, 1 reply; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-23 17:41 UTC (permalink / raw)
  To: Martin Guy; +Cc: qemu-devel

Martin Guy wrote:
>> Now, gcc4 can produce code with several return instructions (with no
>> option to turn that off, as far as I understand). You cannot cut them 
>> out,
>> and therefore you cannot chain the simple functions.
>
> ...unless you also map return instructions within the generated
> functions into branches to the soon-to-be-dropped final "return"? Not
> that I know anything about qemu internals mind u... 
Seems to me one could also map them into jumps to a null function.

Although, all told, it would seem to me that what might be called for 
here is a new gcc target.  A gcc target specifically for generating qemu 
code.  That would just simply generate whatever qemu wanted for function 
postamble.

It would probably mean separating out the code intended to run as native 
code from the code intended to run on behalf of the emulated target, and 
it would mean that you'd need a "gcc-qemu" to build the latter, but it 
would solve the problem permanently.  It could also then be done in a 
cpu independent fashion such that any gcc target port might be converted 
trivially into a gcc target-for-qemu port.  This should also make the 
chaining task much simpler and since that would seem to need to be done 
at run time, this could easily be a performance enhancement as well.

I see two real downsides to this approach.  The first is that qemu 
becomes wed to gcc.  That seems to be a de facto requirement now, but 
using a custom gcc target would make that marriage pretty permanent.  
Creating qemu targets for other compilers would be near impossible, 
although if the code were properly separated, you could still use a 
non-gcc target for the intended-for-host instructions.

The second downside is that some of the qemu support stuff would no 
longer be in the qemu code distribution.  Instead, it would be in gcc.  
This opens the possibility for version skew problems and authority over 
maintenance issues in the long term.  Administratively, it'd be an 
additional load.

--rich

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23  1:45   ` Johannes Schindelin
@ 2006-10-23 17:53     ` K. Richard Pixley
  2006-10-23 18:08     ` Rob Landley
  1 sibling, 0 replies; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-23 17:53 UTC (permalink / raw)
  To: qemu-devel


Johannes Schindelin wrote:
> On Sun, 22 Oct 2006, Rob Landley wrote:
>   
>> Basically, gcc changed in a way that broke qemu.
>>     
> Yes, they did. But even if I understand your frustration (which I share), 
> I also understand the gcc people. After all, using gcc to create the 
> blocks for dynamic translation is a _hack_.
Yes, it is a hack.  And short of some guarantees from gcc, (which we 
don't have), it is destined to be an ongoing issue.
> The result of a compiler run, 
> though, should work and run -- as fast as possible. So basically, the gcc 
> people want to achieve a different goal from what we misuse their program 
> for.
Creating a qemu variant target for gcc would address both of these 
concerns.  It would introduce new ones, of course, but it would address 
these two.

--rich


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 17:41     ` K. Richard Pixley
@ 2006-10-23 17:58       ` Paul Brook
  2006-10-23 18:04         ` K. Richard Pixley
  2006-10-30  4:35         ` Rob Landley
  0 siblings, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-23 17:58 UTC (permalink / raw)
  To: qemu-devel

On Monday 23 October 2006 18:41, K. Richard Pixley wrote:
> Martin Guy wrote:
> >> Now, gcc4 can produce code with several return instructions (with no
> >> option to turn that off, as far as I understand). You cannot cut them
> >> out,
> >> and therefore you cannot chain the simple functions.
> >
> > ...unless you also map return instructions within the generated
> > functions into branches to the soon-to-be-dropped final "return"? Not
> > that I know anything about qemu internals mind u...
>
> Seems to me one could also map them into jumps to a null function.

That doesn't work because you need to free the stack frame.

> Although, all told, it would seem to me that what might be called for
> here is a new gcc target.  A gcc target specifically for generating qemu
> code.  That would just simply generate whatever qemu wanted for function
> postamble.

Better to just teach qemu how to generate code.
In fact I've already done most of the infrastructure (and a fair amount of the 
legwork) for this. The only major missing function is code to do softmmu 
load/store ops.
https://nowt.dyndns.org/

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 17:58       ` Paul Brook
@ 2006-10-23 18:04         ` K. Richard Pixley
  2006-10-23 18:20           ` Laurent Desnogues
  2006-10-23 18:37           ` Paul Brook
  2006-10-30  4:35         ` Rob Landley
  1 sibling, 2 replies; 43+ messages in thread
From: K. Richard Pixley @ 2006-10-23 18:04 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

Paul Brook wrote:
> Better to just teach qemu how to generate code.
> In fact I've already done most of the infrastructure (and a fair amount of the 
> legwork) for this. The only major missing function is code to do softmmu 
> load/store ops.
> https://nowt.dyndns.org/
Well, perhaps.  Except that with gcc, we get to leverage the ongoing gcc 
optimizations, bug fixes,  new cpu support, debugger support, etc.  
Granted, not all of these are going to be relevant to the qemu 
environment, but in a contest between gcc generated code and qemu 
generated code, I'll bet on gcc most days.

No doubt there are times when a gcc optimization takes so long that it 
costs more time to optimize than would be won back by the running code.  
Presumably, qemu generated code would be able to make better decisions 
here.  Except that we're not talking about using gcc in real time, are 
we?  So essentially we have near infinite time for optimizations.

--rich

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23  1:45   ` Johannes Schindelin
  2006-10-23 17:53     ` K. Richard Pixley
@ 2006-10-23 18:08     ` Rob Landley
  1 sibling, 0 replies; 43+ messages in thread
From: Rob Landley @ 2006-10-23 18:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: qemu-devel

On Sunday 22 October 2006 9:45 pm, Johannes Schindelin wrote:
> > I was pondering trying to get tcc to build qemu,
> 
> (since tcc only supports x86 targets, this is not really a solution.)

No, it supports arm as well.  (And I merged a recent patch to support arm 
EABI.)  I remember hearing about a PPC patch (although I never tracked that 
down), and I was looking into what I needed to do to make it support x86-64.

> > and even made a mercurial copy [...] But Fabrice showed back up on 
> > Tuesday and checked in a patch, and now I've got a fork that's out of 
> > sync with mainline.
> 
> I do not really know Mercurial, but it should make it really easy to merge 
> two branches (as far as I have been told).

That's the general idea, yes.  (In this case what was merged is a reworking of 
a patch I already merged, which I could essentially ignore for now.)  The 
problem is at a higher level: I'd created a fork based off a project that 
looked abandoned, but it turned out not to be abandoned, so the fork looks 
like a bad idea in retrospect.  *shrug*  No shortage of other projects to 
work on.  (Like QEMU: I still haven't managed to install the x86_64 version 
of ubuntu.  An older version hung when it got to the desktop, in last week's 
version I couldn't even get the bios to bring up grub.  Need to thump on it 
again, but I'm not quite sure how to debug this.)

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 18:04         ` K. Richard Pixley
@ 2006-10-23 18:20           ` Laurent Desnogues
  2006-10-23 18:37           ` Paul Brook
  1 sibling, 0 replies; 43+ messages in thread
From: Laurent Desnogues @ 2006-10-23 18:20 UTC (permalink / raw)
  To: qemu-devel

K. Richard Pixley wrote:
> Well, perhaps.  Except that with gcc, we get to leverage the ongoing gcc 
> optimizations, bug fixes,  new cpu support, debugger support, etc.  
> Granted, not all of these are going to be relevant to the qemu 
> environment, but in a contest between gcc generated code and qemu 
> generated code, I'll bet on gcc most days.
> 
> No doubt there are times when a gcc optimization takes so long that it 
> costs more time to optimize than would be won back by the running code.  
> Presumably, qemu generated code would be able to make better decisions 
> here.  Except that we're not talking about using gcc in real time, are 
> we?  So essentially we have near infinite time for optimizations.

One emulated instruction is a small C function with very little
opportunity for optimization.

On top of that, for instance, calculating flags can be done much
more efficiently at assembly level by using host flags.
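
As a rough illustration of the cost being described, this is what computing the condition flags for a 32-bit add looks like in portable C. The FLAG_* names and add32_flags are hypothetical and are not taken from qemu's op.c. On a host that sets equivalent flags as a side effect of its own add instruction, hand-written assembly could read them back directly instead of recomputing each one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for the emulated CPU. */
enum { FLAG_C = 1, FLAG_Z = 2, FLAG_N = 4, FLAG_V = 8 };

/* Add two 32-bit values and derive carry, zero, negative and signed
 * overflow flags using only C-level arithmetic. */
uint32_t add32_flags(uint32_t a, uint32_t b, unsigned *flags)
{
    uint32_t r = (uint32_t)(a + b);
    unsigned f = 0;

    assert(flags != NULL);
    if (r < a)
        f |= FLAG_C;   /* unsigned wrap-around means a carry out */
    if (r == 0)
        f |= FLAG_Z;
    if (r >> 31)
        f |= FLAG_N;   /* sign bit of the result */
    if (~(a ^ b) & (a ^ r) & 0x80000000u)
        f |= FLAG_V;   /* operands share a sign, result does not */

    *flags = f;
    return r;
}
```

Each flag costs at least one extra comparison or bit operation here, whereas the host add has typically already produced the same information.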

All that gcc brings here is portability.


			Laurent

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 18:04         ` K. Richard Pixley
  2006-10-23 18:20           ` Laurent Desnogues
@ 2006-10-23 18:37           ` Paul Brook
  2006-10-24 23:39             ` Rob Landley
  2006-10-31 16:53             ` Rob Landley
  1 sibling, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-23 18:37 UTC (permalink / raw)
  To: qemu-devel

On Monday 23 October 2006 19:04, K. Richard Pixley wrote:
> Paul Brook wrote:
> > Better to just teach qemu how to generate code.
> > In fact I've already done most of the infrastructure (and a fair amount
> > of the legwork) for this. The only major missing function is code to do
> > softmmu load/store ops.
> > https://nowt.dyndns.org/
>
> Well, perhaps.  Except that with gcc, we get to leverage the ongoing gcc
> optimizations, bug fixes,  new cpu support, debugger support, etc.
> Granted, not all of these are going to be relevant to the qemu
> environment, but in a contest between gcc generated code and qemu
> generated code, I'll bet on gcc most days.
>
> No doubt there are times when a gcc optimization takes so long that it
> costs more time to optimize than would be won back by the running code.  
> Presumably, qemu generated code would be able to make better decisions
> here.  Except that we're not talking about using gcc in real time, are
> we?  So essentially we have near infinite time for optimizations.

The code we're talking about (op.c) is sufficiently small and simple that 
there's nothing the compiler can do with it. In fact many of the ops map 
directly onto a single assembly instruction.

To get better translated code we need to do inter-op optimization as code is 
translated (even if it's only simple things like register allocation). This 
requires qemu be able to generate code at runtime.
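
A toy sketch of what "inter-op optimization" can mean, assuming a hypothetical micro-op list with a single temporary register T0. The micro_op type, OP_MOVI, OP_ADDI, run_ops and fold_constants are invented for illustration and are not the ops or the backend discussed here. The point is only that a translator which sees the op sequence can merge neighbouring ops, which is impossible when each op is a precompiled blob pasted in unchanged:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical micro-ops, all acting on one temporary register T0. */
typedef enum { OP_MOVI, OP_ADDI } opcode;
typedef struct { opcode op; uint32_t imm; } micro_op;

/* Reference interpreter: returns T0 after running the ops. */
uint32_t run_ops(const micro_op *ops, size_t n, uint32_t t0)
{
    for (size_t i = 0; i < n; i++) {
        if (ops[i].op == OP_MOVI)
            t0 = ops[i].imm;      /* T0 = imm */
        else
            t0 += ops[i].imm;     /* T0 += imm */
    }
    return t0;
}

/* Fold every "addi" into the op before it: "movi a; addi b" becomes
 * "movi a+b", and "addi a; addi b" becomes "addi a+b". Works in place
 * and returns the new number of ops. */
size_t fold_constants(micro_op *ops, size_t n)
{
    size_t out = 0;

    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && ops[i].op == OP_ADDI)
            ops[out - 1].imm += ops[i].imm;
        else
            ops[out++] = ops[i];
    }
    return out;
}
```

Running the folded list through run_ops gives the same T0 as the original list, with fewer ops to emit host code for.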

Using the gcc backends for dynamic code generation isn't a realistic option. 
They're simply too heavyweight to be used "in real time".  qemu needs to be 
able to efficiently generate short, simple code blocks.  Most of the gcc 
infrastructure is for optimizations that take longer to run than we're ever 
going to get back in improved performance.

I did look at integrating an existing JIT compiler into qemu, but couldn't 
find one that fitted nicely, and allowed an incremental conversion.

It turns out that qemu already does most of the hard work, and a code 
generation backend is fairly simple. The diff for my current implementation 
is <2k lines of common code, plus <1k lines for each of x86, amd64 and ppc32 
hosts.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 18:37           ` Paul Brook
@ 2006-10-24 23:39             ` Rob Landley
  2006-10-25  0:24               ` Paul Brook
  2006-10-31 16:53             ` Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-24 23:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> It turns out that qemu already does most of the hard work, and a code 
> generation backend is fairly simple. The diff for my current implementation 
> is <2k lines of common code, plus <1k lines for each of x86, amd64 and ppc32 
> hosts.

My understanding is that the version you linked to with your new backend 
currently _only_ supports coldfire/m68k?

Do you have a quick "here's how you try it out" thing?  (For example, when I 
first show people qemu I boot a knoppix cd image under it.  Fast and 
shiny. :)

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-24 23:39             ` Rob Landley
@ 2006-10-25  0:24               ` Paul Brook
  2006-10-25 19:39                 ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-25  0:24 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

On Wednesday 25 October 2006 00:39, Rob Landley wrote:
> On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> > It turns out that qemu already does most of the hard work, and a code
> > generation backend is fairly simple. The diff for my current
> > implementation is <2k lines of common code, plus <1k lines for each of
> > x86, amd64 and ppc32 hosts.
>
> My understanding is that the version you linked to with your new backend
> currently _only_ supports coldfire/m68k?

ColdFire is the only target that uses it exclusively.  Arm is currently a 
hybrid of dyngen and the new backend.  So is i386, to a lesser extent.  Other 
targets have minimal changes necessary to make them work.

> Do you have a quick "here's how you try it out" thing?  (For example, when
> I first show people qemu I boot a knoppix cd image under it.  Fast and
> shiny. :)

One of my goals when writing it was to be able to reuse most of the existing 
qemu code. There should be no user-visible impact. Unless you already 
understand how qemu/dyngen works it's not going to mean a lot to you. The end 
result is very similar, just a slightly different strategy for getting there.

In theory it should allow better performance, but that's still a way off.

https://nowt.dyndns.org/ has patches against cvs (though they may be slightly 
out of date), and a complete svn repository you can check out. Build it just 
like normal qemu.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-25  0:24               ` Paul Brook
@ 2006-10-25 19:39                 ` Rob Landley
  2006-10-26 18:09                   ` Daniel Jacobowitz
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-25 19:39 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Tuesday 24 October 2006 8:24 pm, Paul Brook wrote:
> ColdFire is the only target that uses it exclusively.  Arm is currently a 
> hybrid of dyngen and the new backend.  So is i386, to a lesser extent. 
> Other targets have minimal changes necessary to make them work.

Ok.

> > Do you have a quick "here's how you try it out" thing?  (For example, when
> > I first show people qemu I boot a knoppix cd image under it.  Fast and
> > shiny. :)
> 
> One of my goals when writing it was to be able to reuse most of the existing 
> qemu code. There should be no user-visible impact. Unless you already 
> understand how qemu/dyngen works it's not going to mean a lot to you.

I read Fabrice's presentation, and looked through the code a bit, 
but "understand" is _way_ too strong a word. :)

> The end result is very similar, just a slightly different strategy for
> getting there. 

A strategy that might work with gcc 4.x? :)

> In theory it should allow better performance, but that's still a way off.

I was poking at the tcc code to generate stuff and optimize a couple weeks 
ago.  I don't suppose there's any possible re-use between the two?

> https://nowt.dyndns.org/ has patches against cvs (though they may be
> slightly out of date), and a complete svn repository you can checkout. Build
> it just like normal qemu.

Which in my case means applying the patch to get it to build with gcc 4.x, 
which does indeed apply without rejects to your svn repository.  
Unfortunately, the result doesn't build:

gcc -Wall -O2 -g -fno-strict-aliasing -I. -I.. -I/home/landley/qemu/nowt.dyndns.org/qemu/target-sparc -I/home/landley/qemu/nowt.dyndns.org/qemu -I/home/landley/qemu/nowt.dyndns.org/qemu/host-i386 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -I/home/landley/qemu/nowt.dyndns.org/qemu/fpu -I/home/landley/qemu/nowt.dyndns.org/qemu/slirp -c -o 
tcx.o /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In 
function ‘tcx_draw_line32’:
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:94: error: invalid lvalue in 
increment
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In 
function ‘tcx_draw_line16’:
/home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:106: error: invalid lvalue in 
increment
make[1]: *** [tcx.o] Error 1
make[1]: Leaving directory 
`/home/landley/qemu/nowt.dyndns.org/qemu/sparc-softmmu'
make: *** [subdir-sparc-softmmu] Error 2

I don't have gcc 3.x installed on my laptop.

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-25 19:39                 ` Rob Landley
@ 2006-10-26 18:09                   ` Daniel Jacobowitz
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel Jacobowitz @ 2006-10-26 18:09 UTC (permalink / raw)
  To: qemu-devel

On Wed, Oct 25, 2006 at 03:39:18PM -0400, Rob Landley wrote:
> gcc -Wall -O2 -g -fno-strict-aliasing -I. -I.. -I/home/landley/qemu/nowt.dyndns.org/qemu/target-sparc -I/home/landley/qemu/nowt.dyndns.org/qemu -I/home/landley/qemu/nowt.dyndns.org/qemu/host-i386 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -I/home/landley/qemu/nowt.dyndns.org/qemu/fpu -I/home/landley/qemu/nowt.dyndns.org/qemu/slirp -c -o 
> tcx.o /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In 
> function ‘tcx_draw_line32’:
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:94: error: invalid lvalue in 
> increment
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c: In 
> function ‘tcx_draw_line16’:
> /home/landley/qemu/nowt.dyndns.org/qemu/hw/tcx.c:106: error: invalid lvalue in 
> increment

This is an unrelated problem, and much easier to fix.  Don't increment
casts.
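
For reference, a minimal sketch of what that means in practice. gcc 4 no
longer accepts a cast as an lvalue, so a construct like
`*((uint32_t *)d)++ = val;` is rejected with the error quoted above. The
helper name store32_advance() below is hypothetical and is not taken from
tcx.c; it only shows one portable way to get the same effect.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the actual tcx.c code.
 *
 * Old style (rejected by gcc 4):   *((uint32_t *)d)++ = val;
 * Portable style: store through the byte pointer, then advance it
 * explicitly by the size of the value that was stored.
 */
static uint8_t *store32_advance(uint8_t *d, uint32_t val)
{
    assert(d != NULL);
    memcpy(d, &val, sizeof(val));   /* no alignment or aliasing assumptions */
    return d + sizeof(val);         /* caller keeps the advanced pointer */
}
```

A caller would write `d = store32_advance(d, val);` wherever the old code
incremented the cast pointer.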

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 17:58       ` Paul Brook
  2006-10-23 18:04         ` K. Richard Pixley
@ 2006-10-30  4:35         ` Rob Landley
  2006-10-30 14:56           ` Paul Brook
  1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-30  4:35 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

On Monday 23 October 2006 1:58 pm, Paul Brook wrote:
> > Although, all told, it would seem to me that what might be called for
> > here is a new gcc target.  A gcc target specifically for generating qemu
> > code.  That would just simply generate whatever qemu wanted for function
> > postamble.
> 
> Better to just teach qemu how to generate code.
> In fact I've already done most of the infrastructure (and a fair amount of
> the legwork) for this. The only major missing function is code to do softmmu 
> load/store ops.
> https://nowt.dyndns.org/

So given that one of the reasons for doing this would be getting away from 
depending on specific and increasingly out of date versions of gcc to build the 
thing, what would be involved in getting this version to build under gcc-4.x?

(I tried, and it didn't...)

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-30  4:35         ` Rob Landley
@ 2006-10-30 14:56           ` Paul Brook
  2006-10-30 16:31             ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-30 14:56 UTC (permalink / raw)
  To: qemu-devel

On Monday 30 October 2006 04:35, Rob Landley wrote:
> On Monday 23 October 2006 1:58 pm, Paul Brook wrote:
> > > Although, all told, it would seem to me that what might be called for
> > > here is a new gcc target.  A gcc target specifically for generating
> > > qemu code.  That would just simply generate whatever qemu wanted for
> > > function postamble.
> >
> > Better to just teach qemu how to generate code.
> > In fact I've already done most of the infrastructure (and a fair amount
> > of the legwork) for this. The only major missing function is code to do
> > softmmu load/store ops.
> > https://nowt.dyndns.org/
>
> So given that one of the reasons for doing this would be getting away from
> depending on specific and increasingly out of date versions of gcc to build
> the thing, what would be involved in getting this version to build under
> gcc-4.x?

Should work pretty much out the box. Obviously if you build anything other 
than m68k then all bets are off.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-30 14:56           ` Paul Brook
@ 2006-10-30 16:31             ` Rob Landley
  2006-10-30 16:50               ` Paul Brook
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-30 16:31 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Monday 30 October 2006 9:56 am, Paul Brook wrote:
> On Monday 30 October 2006 04:35, Rob Landley wrote:
> > On Monday 23 October 2006 1:58 pm, Paul Brook wrote:
> > > > Although, all told, it would seem to me that what might be called for
> > > > here is a new gcc target.  A gcc target specifically for generating
> > > > qemu code.  That would just simply generate whatever qemu wanted for
> > > > function postamble.
> > >
> > > Better to just teach qemu how to generate code.
> > > In fact I've already done most of the infrastructure (and a fair amount
> > > of the legwork) for this. The only major missing function is code to do
> > > softmmu load/store ops.
> > > https://nowt.dyndns.org/
> >
> > So given that one of the reasons for doing this would be getting away from
> > depending on specific and increasingly out of date versions of gcc to build
> > the thing, what would be involved in getting this version to build under
> > gcc-4.x?
> 
> Should work pretty much out the box. Obviously if you build anything other 
> than m68k then all bets are off.

It didn't get to "work", it broke building.  (The frightening part is that my 
patch at http://busybox.net/downloads/qemu applied to your version without 
rejects, but although that helped it get farther, it didn't finish.)

I just did a standard "./configure --shutupaboutthecompilerversion; make; make 
install".  (x86 is the first target I'm interested in, as it's the easiest to 
test and you said it's using at least some of the new code...)

> Paul

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-30 16:31             ` Rob Landley
@ 2006-10-30 16:50               ` Paul Brook
  2006-10-30 22:54                 ` Stephen Torri
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-10-30 16:50 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

> > > So given that one of the reasons for doing this would be getting away
> > > from depending on specific and increasingly out of date versions of gcc
> > > to build the thing, what would be involved in getting this version to
> > > build under gcc-4.x?
> >
> > Should work pretty much out the box. Obviously if you build anything
> > other than m68k then all bets are off.
>
> It didn't get to "work", it broke building.  (The frightening part is that my
> patch at http://busybox.net/downloads/qemu applied to your version without
> rejects, but although that helped it get farther, it didn't finish.)
>
> I just did a standard "./configure --shutupaboutthecompilerversion; make;
> make install".  (x86 is the first target I'm interested in, as it's the
> easiest to test and you said it's using at least some of the new code...)

As I said before, the x86 target is a hybrid of the new and old code. ie. if 
it didn't work before it probably won't work after. configure 
with --target-list=m68k-user and it should work fine with gcc4.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-30 16:50               ` Paul Brook
@ 2006-10-30 22:54                 ` Stephen Torri
  2006-10-30 23:13                   ` Paul Brook
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Torri @ 2006-10-30 22:54 UTC (permalink / raw)
  To: qemu-devel

> As I said before, the x86 target is a hybrid of the new and old code. ie. if 
> it didn't work before it probably won't work after. configure 
> with --target-list=m68k-user and it should work fine with gcc4.
> 
> Paul

I need an x86 instruction set simulator that can step-by-step execute
code and allow me access to the internals. This is why I have looked at
qemu because of a recommendation from a developer of Ptlsim. It was
suggested that qemu would be lighter weight for what I need. So what do
you suggest I use for an x86 instruction set simulator?

Stephen
-- 
PhD. Student
Auburn University
Department of Computer Science and Software Engineering
107 Dunstan Hall
Auburn, AL 36849-5347  U.S.A.
(334) 844-4330 (O)
torrisa@auburn.edu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-30 22:54                 ` Stephen Torri
@ 2006-10-30 23:13                   ` Paul Brook
  0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-30 23:13 UTC (permalink / raw)
  To: qemu-devel

On Monday 30 October 2006 22:54, Stephen Torri wrote:
> > As I said before, the x86 target is a hybrid of the new and old code. ie.
> > if it didn't work before it probably won't work after. configure
> > with --target-list=m68k-user and it should work fine with gcc4.
> >
> > Paul
>
> I need an x86 instruction set simulator that can step-by-step execute
> code and allow me access to the internals. This is why I have looked at
> qemu because of a recommendation from a developer of Ptlsim. It was
> suggested that qemu would be lighter weight for what I need. So what do
> you suggest I use for an x86 instruction set simulator?

Use qemu, and build it with gcc3.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-23 18:37           ` Paul Brook
  2006-10-24 23:39             ` Rob Landley
@ 2006-10-31 16:53             ` Rob Landley
  2006-10-31 19:02               ` Paul Brook
  1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-31 16:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> > > Better to just teach qemu how to generate code.
> > > In fact I've already done most of the infrastructure (and a fair amount
> > > of the legwork) for this. The only major missing function is code to do
> > > softmmu load/store ops.
> > > https://nowt.dyndns.org/

I looked at the big diff between that and mainline, and couldn't make heads 
nor tails of it in the half-hour I spent on it.  I also looked at the svn 
history, but there's apparently a year and change of it.

I don't suppose there's a design document somewhere?  Or could you quickly 
explain "old one did this, new one does this, the code path diverges here, 
start reading at this point and expect this and this to happen, and if you go 
read this unrelated documentation to get up to speed it might help..."

I'd like to add enough of the new code generation stuff to the existing 
targets so it doesn't break when built with gcc4, but so far my interest here 
greatly outstrips my ability.  I don't even know where to start...

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 16:53             ` Rob Landley
@ 2006-10-31 19:02               ` Paul Brook
  2006-10-31 20:41                 ` Rob Landley
  2006-10-31 23:17                 ` Rob Landley
  0 siblings, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-31 19:02 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

On Tuesday 31 October 2006 16:53, Rob Landley wrote:
> On Monday 23 October 2006 2:37 pm, Paul Brook wrote:
> > > > Better to just teach qemu how to generate code.
> > > > In fact I've already done most of the infrastructure (and a fair
> > > > amount of the legwork) for this. The only major missing function is
> > > > code to do softmmu load/store ops.
> > > > https://nowt.dyndns.org/
>
> I looked at the big diff between that and mainline, and couldn't make heads
> nor tails of it in the half-hour I spent on it.  I also looked at the svn
> history, but there's apparently a year and change of it.
>
> I don't suppose there's a design document somewhere?  Or could you quickly
> explain "old one did this, new one does this, the code path diverges here,
> start reading at this point and expect this and this to happen, and if you
> go read this unrelated documentation to get up to speed it might help..."

Not really.

The basic principle is very similar. Guest code is decomposed into an 
intermediate form consisting of simple operations, then native code is 
generated from those operations.

In the existing dyngen implementation most operands to ops are implicit, with 
only a few ops taking explicit arguments. The principle with the new system 
is that all operands are explicit.

The intermediate representation used by the code generator resembles an 
imaginary machine. This machine has various different instructions (qops), 
and a nominally infinite register file (qregs). Each qop takes zero or more 
arguments, each of which may be an input or output.

In addition to dynamically allocated qregs there are a fixed set of qregs that 
map onto the guest CPU state. This is to simplify code generation.

Each qreg has a particular type (32/64 bit, integer or float). It's up to you 
to make sure the argument types match those expected by the qop. It's 
generally fairly obvious from the name. eg. add32 adds I32 values, addf64 
adds F64 values, etc. The exception is that I64 values can be used in place 
of I32. The upper 32 bits of outputs are undefined in this case, and the value 
must be explicitly extended before the full 64 bits are used.
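
A rough illustration of that last rule in plain C, not code from the tree.
Every toy_* name is made up for the example, and the junk written into the
high half is only a stand-in for "undefined": a 32-bit op that writes into a
64-bit register leaves only the low half meaningful, so the value has to be
sign- or zero-extended before it is used as a full 64-bit quantity.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a 32-bit add whose operands and result live in 64-bit
 * registers. Inputs are truncated to 32 bits; the high half of the
 * result is filled with junk on purpose to mimic "undefined". */
static uint64_t toy_add32_in_i64(uint64_t a, uint64_t b)
{
    uint32_t low = (uint32_t)a + (uint32_t)b;
    return 0xDEADBEEF00000000ull | low;
}

/* Explicit sign extension of the low 32 bits to 64 bits. */
static uint64_t toy_sext32(uint64_t v)
{
    uint64_t low = v & 0xFFFFFFFFull;
    return (low ^ 0x80000000ull) - 0x80000000ull;
}

/* Explicit zero extension of the low 32 bits to 64 bits. */
static uint64_t toy_zext32(uint64_t v)
{
    return v & 0xFFFFFFFFull;
}
```

Using the raw result of toy_add32_in_i64() as a 64-bit value would pick up
the junk; passing it through toy_sext32() or toy_zext32() first gives a
well-defined value.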

The old dyngen ops are actually implemented as special-case qops.

As an example take the arm instruction

  add r0, r1, r2, lsl #2

This is equivalent to the C expression

 r0 = r1 + (r2 << 2)

The old dyngen translate.c would do:

  gen_op_movl_T1_r2();
  gen_op_shll_T1_im(2);
  gen_op_movl_T0_r1();
  gen_op_addl(); /* does T0 = T0 + T1 */
  gen_op_movl_r0_T0();

When fully converted to the new system this would become:

  int tmp = gen_new_qreg(); /* Allocate a temporary reg.  */
  /* gen_im32 is a helper that allocates a new qreg and
     initializes it to an immediate value.  */
  gen_op_shl32(tmp, QREG_R2, gen_im32(2)); /* tmp = r2 << 2 */
  gen_op_add32(QREG_R0, QREG_R1, tmp);
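
To make the "all operands are explicit" idea concrete, here is a small
self-contained sketch. It is not the real qop code: every toy_* name, the
op set, and the fixed array sizes are assumptions made for the example.
toy_new_qreg() and toy_im32() only loosely mirror gen_new_qreg() and
gen_im32(). The translator records three-address ops for
r0 = r1 + (r2 << 2), and a trivial interpreter stands in for the host code
generator.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_QREGS 32
#define TOY_MAX_QOPS  32

/* Fixed qregs mapping guest state; index 0 is reserved, like QREG_NULL. */
enum { TOY_R0 = 1, TOY_R1, TOY_R2, TOY_FIRST_TEMP };

enum toy_opcode { TOY_MOVI32, TOY_ADD32, TOY_SHL32 };

/* One op: every input and output is named explicitly. */
struct toy_qop {
    enum toy_opcode op;
    int dest, src1, src2;
    uint32_t imm;               /* used by TOY_MOVI32 only */
};

struct toy_ctx {
    struct toy_qop ops[TOY_MAX_QOPS];
    int nops;
    int next_qreg;              /* next free temporary */
};

static void toy_init(struct toy_ctx *c)
{
    c->nops = 0;
    c->next_qreg = TOY_FIRST_TEMP;
}

/* Allocate a fresh temporary qreg. */
static int toy_new_qreg(struct toy_ctx *c)
{
    assert(c->next_qreg < TOY_MAX_QREGS);
    return c->next_qreg++;
}

static void toy_emit(struct toy_ctx *c, enum toy_opcode op,
                     int dest, int src1, int src2, uint32_t imm)
{
    assert(c->nops < TOY_MAX_QOPS);
    struct toy_qop q = { op, dest, src1, src2, imm };
    c->ops[c->nops++] = q;
}

/* Allocate a qreg and emit an op that loads an immediate into it. */
static int toy_im32(struct toy_ctx *c, uint32_t value)
{
    int r = toy_new_qreg(c);
    toy_emit(c, TOY_MOVI32, r, 0, 0, value);
    return r;
}

/* Translate "r0 = r1 + (r2 << 2)" into explicit-operand ops. */
static void toy_translate_add_lsl2(struct toy_ctx *c)
{
    int tmp = toy_new_qreg(c);
    toy_emit(c, TOY_SHL32, tmp, TOY_R2, toy_im32(c, 2), 0);
    toy_emit(c, TOY_ADD32, TOY_R0, TOY_R1, tmp, 0);
}

/* Stand-in for the host backend: interpret the recorded ops. */
static void toy_run(const struct toy_ctx *c, uint32_t regs[TOY_MAX_QREGS])
{
    for (int i = 0; i < c->nops; i++) {
        const struct toy_qop *q = &c->ops[i];
        switch (q->op) {
        case TOY_MOVI32:
            regs[q->dest] = q->imm;
            break;
        case TOY_ADD32:
            regs[q->dest] = regs[q->src1] + regs[q->src2];
            break;
        case TOY_SHL32:
            regs[q->dest] = regs[q->src1] << (regs[q->src2] & 31);
            break;
        }
    }
}
```

Because no op touches implicit temporaries, a backend is free to assign
each qreg to whatever host register (or stack slot) suits it.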

One of the changes I've made to target-arm/translate.c is to replace all uses 
of T2 with new pseudo-regs. In many cases I've left the code structure as it 
was (using the global T0/T1 temporaries), but replaced the dyngen ops with 
the equivalent qops. eg. movl and andl now generate mov32 and and32 qops.

The standard qops are defined in qops.def. A target can also define additional 
qops in qop-target.def. The target specific qops are there to simplify 
implementation of the i386 static flag propagation pass; see the expand_op_* 
routines.

For operations that are too complicated to be expressed as qops there is a 
mechanism for calling helper functions. The m68k target uses this for 
division and a couple of other things.

The implementation makes fairly heavy use of the C preprocessor to generate 
code from .def files. There's also a small shell script that pulls the 
definitions of the helper routines out of qop-helper.c.

The debug dumps can be quite useful. In particular -d in_asm,op will dump the 
input asm and the resulting OPs.

For converting targets you can probably ignore most of the translate-all and 
host-*/ changes. These implement generating code from the qops. This works by 
the host defining a set of "hard" qregs that correspond to host CPU 
registers, and constraints for the operands of each qop. Then we do register 
allocation and spilling to satisfy those constraints. The qops can then be 
assembled directly into binary code.

There are also mechanisms for implementing floating point and 64-bit arithmetic 
even if the target doesn't support this natively. The target code doesn't 
need to worry about this, it just generates 64-bit/fp qops and they will be 
decomposed as necessary.
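
As a sketch of what such a decomposition could look like for a 64-bit add
on a 32-bit host (the toy_* names are made up, and this is plain C rather
than the actual qop expansion): the value is kept as a pair of 32-bit
halves, the low halves are added first, and the carry is propagated into
the high halves.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value represented as two 32-bit halves. */
struct toy_pair32 {
    uint32_t lo;
    uint32_t hi;
};

static struct toy_pair32 toy_split64(uint64_t v)
{
    struct toy_pair32 p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

static uint64_t toy_join64(struct toy_pair32 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit add built only from 32-bit operations. */
static struct toy_pair32 toy_add64_via_32(struct toy_pair32 a,
                                          struct toy_pair32 b)
{
    struct toy_pair32 r;
    r.lo = a.lo + b.lo;
    uint32_t carry = (r.lo < a.lo);   /* unsigned wraparound means carry out */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```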

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 19:02               ` Paul Brook
@ 2006-10-31 20:41                 ` Rob Landley
  2006-10-31 22:08                   ` Paul Brook
  2006-10-31 23:17                 ` Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-31 20:41 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

Welcome to Stupid Question Theatre!  With your host, Paul Brook.  Today's 
contestant is: Rob Landley.  How dumb will it get?

On Tuesday 31 October 2006 2:02 pm, Paul Brook wrote:
> The basic principle is very similar. Guest code is decomposed into an 
> intermediate form consisting of simple operations, then native code is 
> generated from those operations.

I got that part.  It's the how I'm still head-scratching over.

The disassembly routines seem relatively compiler-independent, but I'm under 
the impression that turning the intermediate result (the string of qops) into 
large blocks of translated code involves gluing together a bunch of smaller 
blocks of pregenerated code.  These pregenerated blocks were spit out by gcc 
and are where all the compiler dependencies that aren't clear bugs come 
from.

I thought what you were doing was replacing the pregenerated blocks with 
hand-coded assembly statements, but your description here seems to be about 
changing the disassembly routines that figure out which qops to string 
together in part 2.

> In the existing dyngen implementation most operands to ops are implicit,
> with only a few ops taking explicit arguments. The principle with the new
> system is that all operands are explicit.

Having looked ahead to your example before replying to this, I think I 
understand that part now.  (Just barely.)

> The intermediate representation used by the code generator resembles an 
> imaginary machine. This machine has various different instructions (qops), 
> and a nominally infinite register file (qregs).

Each qreg is represented as an integer index?

> Each qop takes zero or more arguments, each of which may be an input or
> output.

The input or output is always one of these qreg indexes?  (Some of the 
existing ones seem to take immediate values...)

> In addition to dynamically allocated qregs there are a fixed set of qregs
> that map onto the guest CPU state. This is to simplify code generation.

These are indexes 0, 1, and 2?

Ok, looking at target-arm/translate.c, we have:

static inline void gen_op_addl_T0_T1(void)
{
    gen_op_add32(QREG_T0, QREG_T0, QREG_T1);
}

So what is QREG_T0 anyway?  This is hard to grep for. 'find . | grep -v svn | 
xargs grep "QREG_T0"' doesn't produce anything useful, so there's got to be 
preprocessor concatenation stuff with ## going on, let's try just QREG on the 
*.h files, and yup at the start of qop.h there's this:

enum target_qregs {
    QREG_NULL,
#define DEFO32(name, offset) QREG_ ## name,
#define DEFO64(name, offset) DEFO32(name, offset)
#define DEFF32(name, reg) DEFO32(name, reg)
#define DEFF64(name, reg) DEFO32(name, reg)
#define DEFR(name, reg, mode) DEFO32(name, reg)
#include "qregs.def"

And that has "DEFR(T0, AREG1, QMODE_I32)" which...  Ok, DEFR() discards the 
third argument ("mode") completely, and then DEFO32() discards the second 
argument (offset), and what's left is just the name, so it's position 
dependent (so why have the darn macros at ALL?)

My brain hurts a lot now.  I'm just letting you know.  What is all this 
complication actually trying to accomplish?

> Each qreg has a particular type (32/64 bit, integer or float).

You mean each qop's arguments have a particular type, and the arguments are 
always in qregs?  Or each qreg has a type permanently associated with that 
qreg?  Or the value currently in a qreg has a type associated with it, but 
the next value stored in that qreg may have a different type?

> It's up to 
> you to make sure the argument types match those expected by the qop. It's 
> generally fairly obvious from the name. eg. add32 adds I32 values, addf64 
> adds F64 values, etc. The exception is that I64 values can be used in place 
> of I32. The upper 32 bits of outputs are undefined in this case, and the
> value must be explicitly extended before the full 64 bits are used.

Possible translation: you can feed a qreg containing an I64 value to a qop 
taking an i32 argument, and it'll typecast the sucker down intelligently, but 
if you produce an I32 result and expect to use that qreg's value as an I64 
argument later, you have to call a sign-extending qop on it first?

> The old dyngen ops are actually implemented as a special case qops.

You mean each dyngen op produces multiple qops?  (And/or is a bundle of qops?)

> As an example take the arm instruction
> 
>   add r0, r1, r2, lsl #2
> 
> This is equivalent to the C expression
> 
>  r0 = r1 + (r2 << 2)
> 
> The old dyngen translate.c would do:
> 
>   gen_op_movl_T1_r2();
>   gen_op_shll_T1_im(2);
>   gen_op_movl_T0_r1();
>   gen_op_addl(); /* does T0 = T0 + T1 */
>   gen_op_movl_r0_T0();

Digging down into target-arm/translate.c, function disas_arm_insn(), I'm...  
still having to take your word for it.  All the gen_op_movl_T1 variants I'm 
seeing end with _im which I presume means "immediate".  The alternative is 
_cc, but what does that mean?  (Presumably not "closed captioned".)

> When fully converted to the new system this would become:
> 
>   int tmp = gen_new_qreg(); /* Allocate a temporary reg.  */
>   /* gen_im32 is a helper that allocates a new qreg and
>      initializes it to an immediate value.  */
>   gen_op_shl32(tmp, QREG_R2, gen_im32(2)); /* tmp = r2 << 2 */
>   gen_op_add32(QREG_R0, QREG_R1, tmp);

Ok (still looking at target-arm/translate.c), I think you're not defining 
anything new here, you're just removing wrappers like gen_op_add_T1_im() 
which just wrap a single call to gen_op_add32(), and untangling the result?

What the heck does gen_intermediate_code() do?  It's a wrapper for a function 
that returns the same value and takes the exact same arguments in the same 
order.  All that's different is the name.  Why does that exist?

> One of the changes I've made to target-arm/translate.c is to replace all uses
> of T2 with new pseudo-regs. In many cases I've left the code structure as it 
> was (using the global T0/T1 temporaries), but replaced the dyngen ops with 
> the equivalent qops. eg. movl and andl now generate mov32 and and32 qops.

Um, is my earlier characterization of "unwrapping stuff" at all close?

> The standard qops are defined in qops.def. A target can also define
> additional qops in qop-target.def. The target specific qops are to simplify 
> implementation the i386 static flag propagation pass. the expand_op_* 
> routines.

Yeah, I looked at that and the macros that generate it in qops.h.  There seem 
to be exactly two states (QREG_BLAH and QREGHI_BLAH) which can be reached 
from five different macros.  The "offset", "reg", and "mode" entries are 
universally ignored, and all you actually _get_ is a big enum of identifiers 
in a certain order.  I have no idea what's going on.

> For operations that are too complicated to be expressed as qops there is a 
> mechanism for calling helper functions. The m68k target uses this for 
> division and a couple of other things.

Ok, now I'm really lost.

> The implementation makes fairly heavy use of the C preprocessor to generate 
> code from .def files. There's also a small shell script that pulls the 
> definitions of the helper routines out of qop-helper.c.

Ah, hang on.  There's target_reginfo in translate-all.c, that's using some of 
the other values.  So what the heck does translate-all.c do?  (Shared code 
called by all the platform-dependent translate functions?)

> The debug dumps can be quite useful. In particular -d in_asm,op will dump
> the input asm and the resulting OPs.

I'll have to find a system with gcc3 installed on it so I can actually try 
this out.  (Hmmm, I have a Red Hat 9 image I run under qemu, maybe it would 
build under that?)

> For converting targets you can probably ignore most of the translate-all and 
> host-*/ changes. These implement generating code from the qops.

Ok, this implies that qops are a new thing.  Which looking at the code sort of 
supports.  Which means I don't understand what's going on at all.

> This works 
> by the host defining a set of "hard" qregs that correspond to host CPU 
> registers, and constraints for the operands of each qop. Then we do register 
> allocation and spilling to satisfy those constraints. The qops can then be 
> assembled directly into binary code.

I need to re-read this later.  My brain's full and I'm deeply confused.

> There are also mechanisms for implementing floating point and 64-bit
> arithmetic even if the target doesn't support this natively. The target code
> doesn't need to worry about this, it just generates 64-bit/fp qops and they
> will be decomposed as necessary.

The implementation calls the appropriate host functions to handle the floating 
point, using soft-float if necessary?  (Under the old dyngen thing outputting 
blocks of gcc-produced code, I could understand how that works.  But if 
you're outputting assembly directly...  I'm back in the "totally lost" area 
again, I think.)

> Paul

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 20:41                 ` Rob Landley
@ 2006-10-31 22:08                   ` Paul Brook
  2006-10-31 22:31                     ` Laurent Desnogues
  2006-11-01  0:00                     ` Rob Landley
  0 siblings, 2 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-31 22:08 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

On Tuesday 31 October 2006 20:41, Rob Landley wrote:
> Welcome to Stupid Question Theatre!  With your host, Paul Brook.  Today's
> contestant is: Rob Landley.  How dumb will it get?
>
> On Tuesday 31 October 2006 2:02 pm, Paul Brook wrote:
> > The basic principle is very similar. Guest code is decomposed into an
> > intermediate form consisting of simple operations, then native code is
> > generated from those operations.
>
> I got that part.  It's the how I'm still head-scratching over.
>
> The disassembly routines seem relatively compiler-independent, but I'm
> under the impression that turning the intermediate result (the string of
> qops) into large blocks of translated code involves gluing together a bunch
> of smaller blocks of pregenerated code.  These pregenerated blocks were
> spit out by gcc and are where all the compiler dependencies that aren't
> clear bugs come from.

Correct.

> I thought what you were doing was replacing the pregenerated blocks with
> hand-coded assembly statements, but your description here seems to be about
> changing the disassembly routines that figure out which qops to string
> together in part 2.

Replacing the pregenerated blocks with hand written assembly isn't feasible. 
Each target has its own set of ops, and each host would need its own assembly 
implementation of those ops. Multiply 11 targets by 11 hosts and you get an 
unmaintainable mess :-)

> > In the existing dyngen implementation most operands to ops are implicit,
> > with only a few ops taking explicit arguments. The principle with the new
> > system is that all operands are explicit.
>
> Having looked ahead to your example before replying to this, I think I
> understand that part now.  (Just barely.)
>
> > The intermediate representation used by the code generator resembles an
> > imaginary machine. This machine has various different instructions
> > (qops), and a nominally infinite register file (qregs).
>
> Each qreg is represented as an integer index?

Yes.

> > Each qop takes zero or more arguments, each of which may be an input or
> > output.
>
> The input or output is always one of these qreg indexes?  (Some of the
> existing ones seem to take immediate values...)

It is always a qreg.
Potentially we could decide that some qregs are constants rather than 
variables, and use that information for code generation, but that's a 
slightly different issue.

> > In addition to dynamically allocated qregs there are a fixed set of qregs
> > that map onto the guest CPU state. This is to simplify code generation.
>
> These are indexes 0, 1, and 2?

They are defined by the code you quote below. However this is an implementation 
detail, and could change. You should use the named constants.

> Ok, looking at target-arm/translate.c, we have:
>
> static inline void gen_op_addl_T0_T1(void)
> {
>     gen_op_add32(QREG_T0, QREG_T0, QREG_T1);
> }
>
> So what is QREG_T0 anyway?  This is hard to grep for. 'find . | grep -v svn
> | xargs grep "QREG_T0"' doesn't produce anything useful, so there's got to
> be preprocessor concatenation stuff with ## going on, let's try just QREG
> on the *.h files, and yup at the start of qop.h there's this:

It corresponds to "T0" in dyngen. In addition to the actual CPU state, dyngen 
uses 3 fixed registers as scratch workspace. For qop purposes these are part 
of the guest CPU state. They're only there to aid conversion of the 
translation code; they'll go away eventually.

> enum target_qregs {
>     QREG_NULL,
> #define DEFO32(name, offset) QREG_ ## name,
> #define DEFO64(name, offset) DEFO32(name, offset)
> #define DEFF32(name, reg) DEFO32(name, reg)
> #define DEFF64(name, reg) DEFO32(name, reg)
> #define DEFR(name, reg, mode) DEFO32(name, reg)
> #include "qregs.def"
>
> And that has "DEFR(T0, AREG1, QMODE_I32)" which...  Ok, DEFR() discards the
> third argument ("mode") completely, and then DEFO32() discards the second
> argument (offset), and what's left is just the name, so it's position
> dependent (so why have the darn macros at ALL?)

Because qregs.def is included in at least two other places. This is the C 
preprocessor trickery I mentioned :-)

> My brain hurts a lot now.  I'm just letting you know.  What is all this
> complication actually trying to accomplish?

Generation of 3 different things (QREG_* constants, the target_reginfo 
structure, and qreg_names) from a single source. This avoids having to keep 3 
big hairy arrays in sync with each other.
It's also used to implement 64-bit qregs as a pair of 32-bit qregs on 32-bit 
hosts.
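
To make the single-source idea concrete, here is a minimal sketch of the same 
preprocessor pattern. It is an illustration under assumptions, not the actual 
qemu code: the register list is an inline macro (DEMO_QREGS) rather than an 
#include of qregs.def so that it compiles on its own, and every name, offset 
and field below is made up.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical register list; stands in for the contents of qregs.def.
   Each entry is (name, offset, width in bits).  */
#define DEMO_QREGS(X) \
    X(T0, 0, 32)      \
    X(T1, 4, 32)      \
    X(D0, 8, 64)

/* Expansion 1: an enum of constants, keeping only the name.  */
enum demo_qreg {
    DEMO_QREG_NULL,
#define AS_ENUM(name, offset, bits) DEMO_QREG_##name,
    DEMO_QREGS(AS_ENUM)
#undef AS_ENUM
    DEMO_QREG_COUNT
};

/* Expansion 2: a table of printable names, in the same order.  */
static const char *const demo_qreg_names[] = {
    "NULL",
#define AS_NAME(name, offset, bits) #name,
    DEMO_QREGS(AS_NAME)
#undef AS_NAME
};

/* Expansion 3: a table of per-register info, again in the same order.  */
struct demo_reginfo {
    int offset;
    int bits;
};

static const struct demo_reginfo demo_reginfo_table[] = {
    { 0, 0 },
#define AS_INFO(name, offset, bits) { offset, bits },
    DEMO_QREGS(AS_INFO)
#undef AS_INFO
};

/* Look up a register index by name; returns DEMO_QREG_NULL if unknown.  */
static int demo_qreg_by_name(const char *name)
{
    assert(name != NULL);
    for (int i = 1; i < DEMO_QREG_COUNT; i++) {
        if (strcmp(demo_qreg_names[i], name) == 0)
            return i;
    }
    return DEMO_QREG_NULL;
}
```

Adding a register to the list updates the enum and both tables at once, which 
is why each individual expansion looks like it throws most of its arguments 
away.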

> > Each qreg has a particular type (32/64 bit, integer or float).
>
> You mean each qop's arguments have a particular type, and the arguments are
> always in qregs?  Or each qreg has a type permanently associated with that
> qreg?  

Both the above.

> Or the value currently in a qreg has a type associated with it, but 
> the next value stored in that qreg may have a different type?

A qreg has a fixed type. The value stored in that qreg has that type. To 
convert it to a different type you need to use an explicit conversion qop.

> > It's up to
> > you to make sure the argument types match those expected by the qop. It's
> > generally fairly obvious from the name. eg. add32 adds I32 values, addf64
> > adds F64 values, etc. The exception is that I64 values can be used in
> > place of I32. The upper 32 bits of outputs are undefined in this case, and
> > the value must be explicitly extended before the full 64 bits are used.
>
> Possible translation: you can feed a qreg containing an I64 value to a qop
> taking an i32 argument, and it'll typecast the sucker down intelligently,
> but if you produce an I32 result and expect to use that qreg's value as an
> I64 argument later, you have to call a sign-extending qop on it first?

Exactly.
If you mix I32,F32 and/or F64 in this way Bad Things will happen.

> > The old dyngen ops are actually implemented as special case qops.
>
> You mean each dyngen op produces multiple qops?  (And/or is a bundle of
> qops?)

A dyngen op is a single qop that does magical unknown things.

> > As an example take the arm instruction
> >
> >   add r0, r1, r2, lsl #2
> >
> > This is equivalent to the C expression
> >
> >  r0 = r1 + (r2 << 2)
> >
> > The old dyngen translate.c would do:
> >
> >   gen_op_movl_T1_r2();
> >   gen_op_shll_T1_im(2);
> >   gen_op_movl_T0_r1();
> >   gen_op_addl(); /* does T0 = T0 + T1 */
> >   gen_op_movl_r0_T0();
>
> Digging down into target-arm/translate.c, function disas_arm_insn(), I'm...
> still having to take your word for it.  All the gen_op_movl_T1 variants I'm
> seeing end with _im which I presume means "immediate".  The alternative is
> _cc, but what does that mean?  (Presumably not "closed captioned".)

_cc are variants that set the condition codes. I may have got T0 and T1 
backwards in the first 3 lines.

> > When fully converted to the new system this would become:
> >
> >   int tmp = gen_new_qreg(); /* Allocate a temporary reg.  */
> >   /* gen_im32 is a helper that allocates a new qreg and
> >      initializes it to an immediate value.  */
> >   gen_op_add32(tmp, QREG_R2, gen_im32(2));
> >   gen_op_add32(QREG_R0, QREG_R1, tmp);
>
> Ok (still looking at target-arm/translate.c), I think you're not defining
> anything new here, you're just removing wrappers like gen_op_add_T1_im()
> which just wrap a single call to gen_op_add32(), and untangling the result?
>
> What the heck does gen_intermediate_code() do?  It's a wrapper for a
> function that returns the same value and takes the exact same arguments in
> the same order.  All that's different is the name.  Why does that exist?

Hysterical raisins. ie. nothing useful.

> > One of the changes I've made to target-arm/translate.c is to replace all
> > uses of T2 with new pseudo-regs. In many cases I've left the code structure
> > as it was (using the global T0/T1 temporaries), but replaced the dyngen ops
> > with the equivalent qops. eg. movl and andl now generate mov32 and and32
> > qops.
>
> Um, is my earlier characterization of "unwrapping stuff" at all close?

Not entirely. I'm also replacing fixed locations (T2) with dynamically 
allocated qregs.

> > The standard qops are defined in qops.def. A target can also define
> > additional qops in qop-target.def. The target specific qops are to
> > simplify implementation of the i386 static flag propagation pass. See the
> > expand_op_* routines.
>
> Yeah, I looked at that and the macros that generate it in qops.h.  There
> seem to be exactly two states (QREG_BLAH and QREGHI_BLAH) which can be
> reached from five different macros.  The "offset", "reg", and "mode"
> entries are universally ignored, and all you actually _get_ is a big enum
> of identifiers in a certain order.  I have no idea what's going on.

As mentioned above, qregs.def is included elsewhere.

> > For operations that are too complicated to be expressed as qops there is
> > a mechanism for calling helper functions. The m68k target uses this for
> > division and a couple of other things.
>
> Ok, now I'm really lost.

Most x86 instructions set the condition code flags. However most of the time 
these flags are ignored. eg. if you have two consecutive add instructions the 
first will set the flags, and the second will immediately overwrite them.

qemu contains a back-propagation pass that will remove the code to set the 
flags after the first instruction. Currently this is implemented by changing 
an addl_cc op into a plain addl op.

The flag-setting code would most likely require several qops to implement, so 
it would be much harder to prove it is not needed and remove it. So there is 
a mechanism for adding extra target qops, doing the flag elimination pass, 
then expanding those to generic qops.

m68k generates the _cc ops necessary for doing this, but is missing the 
back-propagation optimization pass.

On RISC targets like ARM most instructions don't set the condition codes, so 
we don't bother doing this.
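
As a rough sketch of what such a back-propagation pass looks like (this is 
not the qemu implementation; the op names are invented, and it assumes every 
_cc op overwrites all of the flags, which sidesteps partial flag updates):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op kinds for the illustration.  */
enum demo_op {
    DEMO_ADD,     /* add, leaves the flags alone */
    DEMO_ADD_CC,  /* add, also sets all the condition codes */
    DEMO_JCC,     /* conditional jump: reads the condition codes */
    DEMO_MOV      /* neither reads nor writes the flags */
};

/* Walk the block backwards, tracking whether anything later on still
   needs the flags.  A flag-setting op whose flags are dead is demoted
   to the plain variant.  Returns the number of ops demoted.  */
static size_t demo_eliminate_flags(enum demo_op *ops, size_t n)
{
    /* Be conservative: assume the flags are read after the block ends.  */
    int flags_live = 1;
    size_t demoted = 0;

    assert(ops != NULL || n == 0);
    for (size_t i = n; i-- > 0; ) {
        switch (ops[i]) {
        case DEMO_ADD_CC:
            if (!flags_live) {
                ops[i] = DEMO_ADD;
                demoted++;
            }
            /* Whether kept or demoted, nothing before this point can
               have its flags observed through this op.  */
            flags_live = 0;
            break;
        case DEMO_JCC:
            flags_live = 1;
            break;
        default:
            break;
        }
    }
    return demoted;
}
```

For the sequence add_cc, add_cc, jcc, add_cc, mov this demotes only the first 
add_cc: the second one feeds the jump, and the last one is kept because the 
flags are assumed live at the end of the block.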

> > The implementation makes fairly heavy use of the C preprocessor to
> > generate code from .def files. There's also a small shell script that
> > pulls the definitions of the helper routines out of qop-helper.c
>
> Ah, hang on.  There's target_reginfo in translate-all.c, that's using some
> of the other values.  So what the heck does translate-all.c do?  (Shared
> code called by all the platform-dependent translate functions?)

There are three fairly independent stages:
1) target-*/translate.c converts guest code into qops.
2) translate-all.c messes about with those qops a bit (allocates host 
registers, etc).
3) translate-op.c, translate-qop.c and target-*/ turn those qops into host 
code.

> > The debug dumps can be quite useful. In particular -d in_asm,op will dump
> > the input asm and the resulting OPs.
>
> I'll have to find a system with gcc3 installed on it so I can actually try
> this out.  (Hmmm, I have a Red Hat 9 image I run under qemu, maybe it would
> build under that?)

Probably.

> > For converting targets you can probably ignore most of the translate-all
> > and host-*/ changes. These implement generating code from the qops.
>
> Ok, this implies that qops are a new thing.  Which looking at the code sort
> of supports.  Which means I don't understand what's going on at all.

qops and dyngen ops are both small "functions" that are represented in a 
similar way. The difference is that dyngen ops are target specific fixed 
functions, whereas qops are generic parameterized functions.

While they are really separate things, the details have been chosen so it 
should be possible to adapt the existing translate.c code rather than having 
to rewrite it from scratch. Decoding x86 instruction semantics is 
complicated :-)

Many of the simpler dyngen ops can be replaced with a single qop. Others can 
be replaced with a sequence of a few qops. Some of the more complicated ones 
may need to be moved into helper functions.

> > This works
> > by the host defining a set of "hard" qregs that correspond to host CPU
> > registers, and constraints for the operands of each qop. Then we do
> > register allocation and spilling to satisfy those constraints. The qops
> > can then be assembled directly into binary code.
>
> I need to re-read this later.  My brain's full and I'm deeply confused.

I started off by saying qops were effectively instructions for an imaginary 
machine. translate-all.c rearranges them so they match up very closely with 
the instructions available on the host. Once this has been done turning them 
into binary code is relatively simple.

> > There are also mechanisms for implementing floating point and 64-bit
> > arithmetic even if the target doesn't support this natively. The target
> > code doesn't need to worry about this, it just generates 64-bit/fp qops
> > and they will be decomposed as necessary.
>
> The implementation calls the appropriate host functions to handle the
> floating point, using soft-float if necessary?  (Under the old dyngen thing
> outputting blocks of gcc-produced code, I could understand how that works. 
> But if you're outputting assembly directly...  I'm back in the "totally
> lost" area again, I think.)

Err, sort of. There's a couple of different layers.

In translate.c you'll do something like

  tmp = gen_new_qreg(QMODE_F32);
  gen_op_addf32(tmp, QREG_FOO, QREG_BAR);

If the host implements the floating point qops 'natively' then this will work 
exactly the same as the integer qops and end up as host floating point 
instructions. Currently this is not implemented for any hosts.

If native host FP is not available qemu will include appropriate bits so that 
after macro expansion and inlining you end up with:

  tmp = gen_new_qreg(QMODE_I32);
  gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR);

and the addf32 helper does the floating point addition using the "softfloat" 
library. The qemu softfloat library implementation may actually use hardware 
floating point rather than doing everything manually.

Likewise if the host doesn't have 64-bit operations gen_op_and64 will actually 
expand to a pair of and32 operations.
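
A minimal sketch of that decomposition, in plain C rather than qops. The 
struct and function names are hypothetical; the only point is that a bitwise 
64-bit operation splits cleanly into two independent 32-bit halves (this 
would not be true of add, which needs a carry between the halves):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of a 64-bit value on a host with 32-bit registers.  */
struct demo_pair64 {
    uint32_t lo;
    uint32_t hi;
};

static struct demo_pair64 demo_split(uint64_t v)
{
    struct demo_pair64 p;
    p.lo = (uint32_t)(v & 0xffffffffu);
    p.hi = (uint32_t)(v >> 32);
    return p;
}

static uint64_t demo_join(struct demo_pair64 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* Stand-in for a 32-bit "and" operation.  */
static uint32_t demo_and32(uint32_t a, uint32_t b)
{
    return a & b;
}

/* A 64-bit "and" expressed as a pair of 32-bit "and" operations.  */
static struct demo_pair64 demo_and64(struct demo_pair64 a, struct demo_pair64 b)
{
    struct demo_pair64 r;
    r.lo = demo_and32(a.lo, b.lo);
    r.hi = demo_and32(a.hi, b.hi);
    return r;
}

/* Check the decomposition against the native 64-bit operator.  */
static void demo_and64_selfcheck(void)
{
    uint64_t x = 0xffff0000ff00ff00u;
    uint64_t y = 0x0f0f0f0f12345678u;
    assert(demo_join(demo_and64(demo_split(x), demo_split(y))) == (x & y));
}
```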

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 22:08                   ` Paul Brook
@ 2006-10-31 22:31                     ` Laurent Desnogues
  2006-10-31 23:00                       ` Paul Brook
  2006-11-01  0:00                     ` Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Laurent Desnogues @ 2006-10-31 22:31 UTC (permalink / raw)
  To: qemu-devel

Paul Brook wrote:
> Replacing the pregenerated blocks with hand written assembly isn't feasible. 
> Each target has its own set of ops, and each host would need its own assembly 
> implementation of those ops. Multiply 11 targets by 11 hosts and you get an 
> unmaintainable mess :-)

Shouldn't you have 11+11 and not 11*11, given your intermediate
representation?  And of these 11+11, 11 have to be written
anyway (target).  Or did I miss something?

> On RISC targets like ARM most instructions don't set the condition codes, so 
> we don't bother doing this.

Except for ARM Thumb ISA which always sets flags.  ARM is a bad
RISC example :)

I was wondering if you did some profiling to know how much time
is spent in disas_arm_insn.  Of course the profiling results
would be very different for a Linux boot or a synthetic benchmark
(which makes me think that you don't support MMU, do you?).
There is a very nice trick to speed up decoding of ARM
instructions:  pick up bits 20-27 and 4-7 and you (almost) get
one instruction per case entry;  of course this means using a
generator to write the 4096 entries, but the result was good for
my interpreted ISS, reaching 44 M i/s on an Opteron @2.4GHz
without any compiler dependent trick (such as gcc jump to labels).
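
A small sketch of the index computation, for concreteness. The table here is 
hypothetical and nearly empty (a real generator would fill all 4096 slots); 
only the bit extraction follows directly from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TABLE_SIZE 4096

/* Hypothetical instruction classes for the illustration.  */
enum demo_kind {
    DEMO_KIND_UNHANDLED = 0,
    DEMO_KIND_ADD,
    DEMO_KIND_MUL
};

/* Build a 12-bit index from bits 27..20 and bits 7..4 of an ARM
   instruction word.  The condition field (bits 31..28) is ignored.  */
static unsigned demo_arm_decode_index(uint32_t insn)
{
    unsigned hi = (insn >> 20) & 0xffu; /* bits 27..20 */
    unsigned lo = (insn >> 4) & 0xfu;   /* bits 7..4 */
    return (hi << 4) | lo;
}

/* Sparse demo table; unlisted entries default to DEMO_KIND_UNHANDLED.
   Several indexes can map to the same class: bit 7 is the low bit of
   the shift amount, so "add with LSL by immediate" lands on both
   0x080 and 0x088.  */
static const unsigned char demo_table[DEMO_TABLE_SIZE] = {
    [0x080] = DEMO_KIND_ADD,
    [0x088] = DEMO_KIND_ADD,
    [0x009] = DEMO_KIND_MUL
};

static enum demo_kind demo_decode(uint32_t insn)
{
    unsigned idx = demo_arm_decode_index(insn);
    assert(idx < DEMO_TABLE_SIZE);
    return (enum demo_kind)demo_table[idx];
}
```

For example, add r0, r1, r2, lsl #2 (0xE0810102) gives index 0x080, while 
mul r0, r1, r2 (0xE0000291) gives 0x009, even though both have zeros in much 
of bits 27..20.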

			Laurent

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 22:31                     ` Laurent Desnogues
@ 2006-10-31 23:00                       ` Paul Brook
  0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-10-31 23:00 UTC (permalink / raw)
  To: qemu-devel

On Tuesday 31 October 2006 22:31, Laurent Desnogues wrote:
> Paul Brook wrote:
> > Replacing the pregenerated blocks with hand written assembly isn't
> > feasible. Each target has its own set of ops, and each host would need
> > its own assembly implementation of those ops. Multiply 11 targets by 11
> > hosts and you get an unmaintainable mess :-)
>
> Shouldn't you have 11+11 and not 11*11, given your intermediate
> representation?  And of these 11+11, 11 have to be written
> anyway (target).  Or did I miss something?

If you use qops (which is a target and host independent intermediate 
representation) it's 11 + 11. If you just replace the existing dyngen op.c 
with hand written assembly it's 11 * 11.

> > On RISC targets like ARM most instructions don't set the condition codes,
> > so we don't bother doing this.
>
> Except for ARM Thumb ISA which always sets flags.  ARM is a bad
> RISC example :)

Bah. Details :-)

> I was wondering if you did some profiling to know how much time
> is spent in disas_arm_insn.  Of course the profiling results
> would be very different for a Linux boot or a synthetic benchmark

The qop generator does add some overhead to the code translation. I haven't 
done proper benchmarks, but in most cases it doesn't seem to be too bad 
(maybe 10%). I'm hoping we can get most of that back.

> (which makes me think that you don't support MMU, do you?).

qemu does implement an MMU.
Currently this still uses the dyngen code, but that's fixable.

> There is a very nice trick to speed up decoding of ARM
> instructions:  pick up bits 20-27 and 4-7 and you (almost) get
> one instruction per case entry;  of course this means using a
> generator to write the 4096 entries, but the result was good for
> my interpreted ISS, reaching 44 M i/s on an Opteron @2.4GHz
> without any compiler dependent trick (such as gcc jump to labels).

qemu generally gets 100-200MIPS on my 2GHz Opteron.

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 19:02               ` Paul Brook
  2006-10-31 20:41                 ` Rob Landley
@ 2006-10-31 23:17                 ` Rob Landley
  2006-11-01  0:01                   ` Paul Brook
  1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-10-31 23:17 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Tuesday 31 October 2006 2:02 pm, Paul Brook wrote:

> As an example take the arm instruction
> 
>   add r0, r1, r2, lsl #2
> 
> This is equivalent to the C expression
> 
>  r0 = r1 + (r2 << 2)
...
> When fully converted to the new system this would become:
> 
>   int tmp = gen_new_qreg(); /* Allocate a temporary reg.  */
>   /* gen_im32 is a helper that allocates a new qreg and
>      initializes it to an immediate value.  */
>   gen_op_add32(tmp, QREG_R2, gen_im32(2));
>   gen_op_add32(QREG_R0, QREG_R1, tmp);

I forgot to ask:

Where's the shift?  I think the above code means you generate an immediate 
value (the 2), add it to R2 with the result going in a spill register, and 
then add the spill register to R1, with the result going to R0.  Should that 
middle line be some kind of gen_op_lshift32() instead of gen_op_add32()?

Do qregs ever get freed?  (I'm guessing gen_new_qreg() lasts until the end of 
the translated block, and then the next block has its own set of qregs?)

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 22:08                   ` Paul Brook
  2006-10-31 22:31                     ` Laurent Desnogues
@ 2006-11-01  0:00                     ` Rob Landley
  2006-11-01  0:29                       ` Paul Brook
  1 sibling, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-11-01  0:00 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Tuesday 31 October 2006 5:08 pm, Paul Brook wrote:
> On Tuesday 31 October 2006 20:41, Rob Landley wrote:
> > Welcome to Stupid Question Theatre!  With your host, Paul Brook.  Today's
> > contestant is: Rob Landley.  How dumb will it get?

Bonus round!

> > I thought what you were doing was replacing the pregenerated blocks with
> > hand-coded assembly statements, but your description here seems to be
> > about changing the disassembly routines that figure out which qops to
> > string together in part 2.
> 
> Replacing the pregenerated blocks with hand written assembly isn't feasible. 
> Each target has its own set of ops, and each host would need its own
> assembly implementation of those ops. Multiply 11 targets by 11 hosts and
> you get an unmaintainable mess :-)

Actually it sounds additive rather than multiplicative.  Does each target have 
an entirely unrelated set of ops, or is there a shared set of primitive ops 
plus some oddballs?

But backing up and just accepting that for a moment, in theory what you need 
is some way to compile a C function to machine code, and then unwrap that 
function into a .raw file containing just the machine code.  So the only 
per-compiler thing would be this unwrapper thingy.  But I already know that 
doesn't work because it doesn't explain the "unable to find spill register" 
problem.  Presumably, just beating the right .raw contents out of the 
compiler is nontrivial, let alone unwrapping it...

> It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> dyngen uses 3 fixed registers as scratch workspace. For qop purposes these
> are part of the guest CPU state. They're only there to aid conversion of
> the translation code; they'll go away eventually.

Presumably the m68k target is pure qop, and hasn't got this sort of thing?

> > My brain hurts a lot now.  I'm just letting you know.  What is all this
> > complication actually trying to accomplish?
> 
> Generation of 3 different things (QREG_* constants, the target_reginfo 
> structure, and qreg_names) from a single source. This avoids having to keep 3 
> big hairy arrays in sync with each other.
> It's also used to implement 64-bit qregs as a pair of 32-bit qregs on 32-bit 
> hosts.

Ok, the QREG_* constants are for the intermediate code the decompiler stuff 
generates.  I have no idea what target_reginfo and qreg_names are for, but 
maybe it'll come to me as I read the code...

> > Or the value currently in a qreg has a type associated with it, but 
> > the next value stored in that qreg may have a different type?
> 
> A qreg has a fixed type. The value stored in that qreg has that type. To 
> convert it to a different type you need to use an explicit conversion qop.

So values don't have types, the qregs the values are _in_ have types.  But I 
thought there were an unlimited number of them (well, 1024 or so), and 
they're dynamically allocated (at least some of the time).  How does it keep 
track of the type of a given qreg?  (When you convert, you copy values from 
one qreg into another?)

> > Possible translation: you can feed a qreg containing an I64 value to a qop
> > taking an i32 argument, and it'll typecast the sucker down intelligently,
> > but if you produce an I32 result and expect to use that qreg's value as an
> > I64 argument later, you have to call a sign-extending qop on it first?
> 
> Exactly.
> If you mix I32,F32 and/or F64 in this way Bad Things will happen.

Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?

> > seeing end with _im which I presume means "immediate".  The alternative is
> > _cc, but what does that mean?  (Presumably not "closed captioned".)
> 
> _cc are variants that set the condition codes. I may have got T0 and T1 
> backwards in the first 3 lines.

Ah!

Is this written down anywhere?  I've read Fabrice's paper and the design 
documentation, and I'm not remembering this.  It's quite possible I missed it 
when my brain filled up, though.

> > Um, is my earlier characterization of "unwrapping stuff" at all close?
> 
> Not entirely. I'm also replacing fixed locations (T2) with dynamically 
> allocated qregs.

The dynamic allocation buys you what?  (Less spilling?)

> > Ok, now I'm really lost.
> 
> Most x86 instructions set the condition code flags. However most of the time 
> these flags are ignored. eg. if you have two consecutive add instructions the 
> first will set the flags, and the second will immediately overwrite them.
> 
> qemu contains a back-propagation pass that will remove the code to set the 
> flags after the first instruction. Currently this is implemented by changing 
> an addl_cc op into a plain addl op.

I actually understood that.  Yay!

> The flag-setting code would most likely require several qops to implement,
> so it would be much harder to prove it is not needed and remove it. So there
> is a mechanism for adding extra target qops, doing the flag elimination
> pass, then expanding those to generic qops.

Um, wouldn't the flag setting code be fairly straightforward as a qop that 
comes right _before_ the other op, as in "set the flags for doing this with 
these registers", that does nothing but set the flags (I.E. it wouldn't 
modify the contents of any of the registers, so it could be immediately 
followed by the appropriate add or shift or so on), and then the flag setting 
pass could just turn all the ones that weren't needed into QOP_NULL?

Or is that what's happening now?  (Do QOPs ever modify their input registers, 
or only the output one?)

> > Ah, hang on.  There's target_reginfo in translate-all.c, that's using some
> > of the other values.  So what the heck does translate-all.c do?  (Shared
> > code called by all the platform-dependent translate functions?)
> 
> There are three fairly independent stages:
> 1) target-*/translate.c converts guest code into qops.
> 2) translate-all.c messes about with those qops a bit (allocates host 
> registers, etc).
> 3) translate-op.c,translate-qop.c and target-*/ turns those qops into host 
> code.

Is pass 2 where the flag elimination pass goes (and presumably any other 
optimizations that might get added)?  No, that can't be the case or the m68k 
code wouldn't need its own implementation of the flag elimination pass...

> > > For converting targets you can probably ignore most of the translate-all
> > > and host-*/ changes. These implement generating code from the qops.
> >
> > Ok, this implies that qops are a new thing.  Which looking at the code
> > sort of supports.  Which means I don't understand what's going on at all.
> 
> qops and dyngen ops are both small "functions" that are represented in a 
> similar way. The difference is that dyngen ops are target specific fixed 
> functions, whereas qops are generic parameterized functions.

So the 11x11 multiplicative complexity of qemu producing its own assembly output 
might not be as much of a problem after switching to qops?

Possibly some of the common qops can have an asm block for 'em, and the rest 
can go through the contortions target-*/op.c is currently doing with 
(glue(glue(blah))) and so on.

> While they are really separate things, the details have been chosen so it 
> should be possible to adapt the existing translate.c code rather than having 
> to rewrite it from scratch. Decoding x86 instruction semantics is 
> complicated :-)

Yay iterative transformation with regression testing.  (And nothing says 
regression testing like booting a Linux distro under the sucker.)

> Many of the simpler dyngen ops can be replaced with a single qop. Others can 
> be replaced with a sequence of a few qops. Some of the more complicated ones 
> may need to be moved into helper functions.

At some point, I hope to understand helper functions.  But I'm not there yet.

> > I need to re-read this later.  My brain's full and I'm deeply confused.
> 
> I started off by saying qops were effectively instructions for an imaginary 
> machine. translate-all.c rearranges them so they match up very closely with 
> the instructions available on the host. Once this has been done turning them 
> into binary code is relatively simple.

I sort of thought this is what it was already doing, but apparently not...

> > The implementation calls the appropriate host functions to handle the
> > floating point, using soft-float if necessary?  (Under the old dyngen
> > thing outputting blocks of gcc-produced code, I could understand how that
> > works.  But if you're outputting assembly directly...  I'm back in the
> > "totally lost" area again, I think.)
> 
> Err, sort of. There's a couple of different layers.
> 
> In translate.c you'll do something like
> 
>   tmp = gen_new_qreg(QMODE_F32);
>   gen_op_addf32(tmp, QREG_FOO, QREG_BAR);
> 
> If the host implements the floating point qops 'natively' then this will
> work exactly the same as the integer qops and end up as host floating point
> instructions. Currently this is not implemented for any hosts.

Ok.

> If native host FP is not available qemu will include appropriate bits so
> that after macro expansion and inlining you end up with:
> 
>   tmp = gen_new_qreg(QMODE_I32);
>   gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR);
> 
> and the addf32 helper does the floating point addition using the "softfloat" 
> library. The qemu softfloat library implementation may actually use hardware 
> floating point rather than doing everything manually.

No reason (except speed) the code output into a translation block can't do 
function calls.  I think.

> Likewise if the host doesn't have 64-bit operations gen_op_and64 will
> actually expand to a pair of and32 operations.

Ok.

I'm still trying to follow a translation all the way from source to target.  
Just getting application emulation to do "hello world" is pretty darn 
complicated.  Your dump mode earlier sounded highly interesting.  (It's on my 
todo list.)

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-10-31 23:17                 ` Rob Landley
@ 2006-11-01  0:01                   ` Paul Brook
  0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-11-01  0:01 UTC (permalink / raw)
  To: qemu-devel

> Where's the shift?  I think the above code means you generate an immediate
> value (the 2), add it to R2 with the result going in a spill register, and
> then add the spill register to R1, with the result going to R0.  Should
> that middle line be some kind of gen_op_lshift32() instead of
> gen_op_add32()?

Yes.

> Do qregs ever get freed?  (I'm guessing gen_new_qreg() lasts until the end
> of the translated block, and then the next block has its own set of qregs?)

Correct.
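
For reference, a self-contained sketch of the corrected sequence, using a toy 
op recorder and evaluator. None of this is the real qemu API: the demo_* 
names (including the shift op, whose actual qop name is not given in this 
thread) are stand-ins, and the "translation block" is just an array that 
hands out fresh temporaries starting from the same point each time it is 
initialized.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_OPS   16
#define DEMO_MAX_QREGS 16

enum demo_opc { DEMO_MOVI, DEMO_SHL32, DEMO_ADD32 };

struct demo_qop {
    enum demo_opc opc;
    int dst, a, b;
    uint32_t imm;
};

/* Fixed qregs standing in for guest registers; temporaries come after.  */
enum { DEMO_R0, DEMO_R1, DEMO_R2, DEMO_FIRST_TEMP };

struct demo_block {
    struct demo_qop ops[DEMO_MAX_OPS];
    int nops;
    int next_qreg;
};

/* Starting a new block forgets all temporaries from the previous one.  */
static void demo_block_init(struct demo_block *tb)
{
    tb->nops = 0;
    tb->next_qreg = DEMO_FIRST_TEMP;
}

static int demo_new_qreg(struct demo_block *tb)
{
    assert(tb->next_qreg < DEMO_MAX_QREGS);
    return tb->next_qreg++;
}

static void demo_emit(struct demo_block *tb, enum demo_opc opc,
                      int dst, int a, int b, uint32_t imm)
{
    assert(tb->nops < DEMO_MAX_OPS);
    struct demo_qop *op = &tb->ops[tb->nops++];
    op->opc = opc;
    op->dst = dst;
    op->a = a;
    op->b = b;
    op->imm = imm;
}

/* Allocate a new qreg and initialize it to an immediate value.  */
static int demo_im32(struct demo_block *tb, uint32_t value)
{
    int r = demo_new_qreg(tb);
    demo_emit(tb, DEMO_MOVI, r, 0, 0, value);
    return r;
}

static void demo_op_shl32(struct demo_block *tb, int dst, int a, int b)
{
    demo_emit(tb, DEMO_SHL32, dst, a, b, 0);
}

static void demo_op_add32(struct demo_block *tb, int dst, int a, int b)
{
    demo_emit(tb, DEMO_ADD32, dst, a, b, 0);
}

/* add r0, r1, r2, lsl #2  ==>  r0 = r1 + (r2 << 2) */
static void demo_translate_add_lsl2(struct demo_block *tb)
{
    int tmp = demo_new_qreg(tb); /* Allocate a temporary reg.  */
    demo_op_shl32(tb, tmp, DEMO_R2, demo_im32(tb, 2));
    demo_op_add32(tb, DEMO_R0, DEMO_R1, tmp);
}

/* Toy evaluator, only here so the recorded ops can be checked.  */
static void demo_run(const struct demo_block *tb, uint32_t *regs)
{
    for (int i = 0; i < tb->nops; i++) {
        const struct demo_qop *op = &tb->ops[i];
        switch (op->opc) {
        case DEMO_MOVI:
            regs[op->dst] = op->imm;
            break;
        case DEMO_SHL32:
            regs[op->dst] = regs[op->a] << (regs[op->b] & 31);
            break;
        case DEMO_ADD32:
            regs[op->dst] = regs[op->a] + regs[op->b];
            break;
        }
    }
}
```

With r1 = 10 and r2 = 3 the recorded ops compute 10 + (3 << 2) = 22 into r0.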

Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] qemu vs gcc4
  2006-11-01  0:00                     ` Rob Landley
@ 2006-11-01  0:29                       ` Paul Brook
  2006-11-01  1:51                         ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-11-01  0:29 UTC (permalink / raw)
  To: qemu-devel

> Actually it sounds additive rather than multiplicative.  Does each target
> have an entirely unrelated set of ops, or is there a shared set of
> primitive ops plus some oddballs?

The shared set of primitive ops is basically qops :-)
You probably could figure out a single common set of qops, then write assembly 
and glue them together like we do with dyngen. However once you've done that 
you've implemented most of what's needed for fully dynamic qops, so it 
doesn't really seem worth it.

> But backing up and just accepting that for a moment, in theory what you
> need is some way to compile a C function to machine code, and then unwrap
> that function into a .raw file containing just the machine code.  So the
> only per-compiler thing would be this unwrapper thingy.  

Right.

> But I already know 
> that doesn't work because it doesn't explain the "unable to find spill
> register" problem. 

That's a separate gcc bug. It gets stuck when you tell it not to use half the 
registers, then ask it to do 64-bit math. This is one of the reasons 
eliminating the fixed registers is a good idea.

> > It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> > dyngen uses 3 fixed registers as scratch workspace. For qop purposes
> > these are part of the guest CPU state. They're only there to aid
> > conversion of the translation code; they'll go away eventually.
>
> Presumably the m68k target is pure qop, and hasn't got this sort of thing?

Correct.
There is one use of T0 left for communicating with the TB chaining code, but 
that's it and will probably go away eventually.

> > > Or the value currently in a qreg has a type associated with it, but
> > > the next value stored in that qreg may have a different type?
> >
> > A qreg has a fixed type. The value stored in that qreg has that type. To
> > convert it to a different type you need to use an explicit conversion
> > qop.
>
> So values don't have types, the qregs the values are _in_ have types.  But
> I thought there were an unlimited number of them (well, 1024 or so), and
> they're dynamically allocated (at least some of the time).  How does it
> keep track of the type of a given qreg?  (When you convert, you copy values
> from one qreg into another?)

Yes. Conversion is just like any other qop. It reads one qreg, and writes the 
result to a different qreg which happens to be a different type.
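
One way to picture the bookkeeping, as a hedged sketch rather than the actual 
data structures: the mode is recorded once when the qreg is allocated, and 
each qop checks its operands against the modes it expects. All names below 
are hypothetical; the single exception encoded is the "I64 in place of I32" 
rule mentioned earlier in the thread.

```c
#include <assert.h>

enum demo_mode { DEMO_I32, DEMO_I64, DEMO_F32, DEMO_F64 };

#define DEMO_TYPED_MAX_QREGS 1024

/* The mode is stored per qreg at allocation time and never changes.  */
static enum demo_mode demo_qreg_mode[DEMO_TYPED_MAX_QREGS];
static int demo_qreg_count;

static int demo_typed_new_qreg(enum demo_mode mode)
{
    assert(demo_qreg_count < DEMO_TYPED_MAX_QREGS);
    demo_qreg_mode[demo_qreg_count] = mode;
    return demo_qreg_count++;
}

/* Nonzero if a qreg of mode `have` may be used where `want` is expected.  */
static int demo_mode_ok(enum demo_mode want, enum demo_mode have)
{
    if (want == have)
        return 1;
    /* The one exception: an I64 qreg may stand in for an I32 operand.  */
    return want == DEMO_I32 && have == DEMO_I64;
}

/* Operand check for a 32-bit integer add.  */
static int demo_check_add32(int dst, int a, int b)
{
    return demo_mode_ok(DEMO_I32, demo_qreg_mode[dst])
        && demo_mode_ok(DEMO_I32, demo_qreg_mode[a])
        && demo_mode_ok(DEMO_I32, demo_qreg_mode[b]);
}

/* Operand check for a hypothetical int-to-float conversion: it reads an
   integer qreg and writes a separate qreg of floating point mode.  */
static int demo_check_cvt_i32_f32(int dst, int src)
{
    return demo_mode_ok(DEMO_F32, demo_qreg_mode[dst])
        && demo_mode_ok(DEMO_I32, demo_qreg_mode[src]);
}
```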

> > > Possible translation: you can feed a qreg containing an I64 value to a
> > > qop taking an i32 argument, and it'll typecast the sucker down
> > > intelligently, but if you produce an I32 result and expect to use that
> > > qreg's value as an I64 argument later, you have to call a
> > > sign-extending qop on it first?
> >
> > Exactly.
> > If you mix I32,F32 and/or F64 in this way Bad Things will happen.
>
> Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?

Or qemu will get confused and crash.

> > > seeing end with _im which I presume means "immediate".  The alternative
> > > is _cc, but what does that mean?  (Presumably not "closed captioned".)
> >
> > _cc are variants that set the condition codes. I may have got T0 and T1
> > backwards in the first 3 lines.
>
> Ah!
>
> Is this written down anywhere?  I've read Fabrice's paper and the design
> documentation, and I'm not remembering this.  It's quite possible I missed
> it when my brain filled up, though.

Dunno.

> > > Um, is my earlier characterization of "unwrapping stuff" at all close?
> >
> > Not entirely. I'm also replacing fixed locations (T2) with dynamically
> > allocated qregs.
>
> The dynamic allocation buys you what?  (Less spilling?)

More-or-less. It makes it easier to optimize. The code generator can pick what 
to put in registers, or even not put them there at all, instead of having to 
do things exactly how you told it.

It also means you don't need to reserve that register, avoiding the gcc 
"unable to find spill register" bug you mentioned above.

> > Most x86 instructions set the condition code flags. However most of the
> > time these flags are ignored. eg. if you have two consecutive add
> > instructions the first will set the flags, and the second will
> > immediately overwrite them.
> >
> > qemu contains a back-propagation pass that will remove the code to set
> > the flags after the first instruction. Currently this is implemented by
> > changing an addl_cc op into a plain addl op.
>
> I actually understood that.  Yay!
>
> > The flag-setting code would most likely require several qops to
> > implement, so
> > it would be much harder to prove it is not needed and remove it. So there
> > is a mechanism for adding extra target qops, doing the flag elimination
> > pass, then expanding those to generic qops.
>
> Um, wouldn't the flag setting code be fairly straightforward as a qop that
> comes right _before_ the other op, as in "set the flags for doing this with
> these registers", that does nothing but set the flags (I.E. it wouldn't
> modify the contents of any of the registers, so it could be immediately
> followed by the appropriate add or shift or so on), and then the flag
> setting pass could just turn all the ones that weren't needed into
> QOP_NULL?

Theoretically possible, but not so easy in practice. Especially when you get 
things like partial flag clobbers, and lazy flag evaluation. Doing it as a 
target specific hack is much simpler and quicker.

> Or is that what's happening now?  (Do QOPs ever modify their input
> registers, or only the output one?)

The generic qops never modify inputs, and never read outputs. Inputs and 
outputs can be the same qreg.

> > > Ah, hang on.  There's target_reginfo in translate-all.c, that's using
> > > some of the other values.  So what the heck does translate-all.c do? 
> > > (Shared code called by all the platform-dependent translate functions?)
> >
> > There are three fairly independent stages:
> > 1) target-*/translate.c converts guest code into qops.
> > 2) translate-all.c messes about with those qops a bit (allocates host
> > registers, etc).
> > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > host code.
>
> Is pass 2 where the flag elimination pass goes (and presumably any other
> optimizations that might get added)?  No, that can't be the case or the
> m68k code wouldn't need its own implementation of the flag elimination
> pass...

Flag elimination is at the end of step 1.
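
To make the back-propagation idea quoted above concrete, here is a minimal sketch in C. It is not the actual qemu code: the op names and the `eliminate_dead_flags` function are made up for illustration, and it assumes every `_cc` op overwrites all the flags (no partial clobbers). It walks a block backwards and demotes an `addl_cc` to a plain `addl` whenever nothing reads the flags before they are overwritten again.

```c
#include <stddef.h>

/* Hypothetical op kinds for illustration; not qemu's real op list. */
enum op_kind {
    OP_ADDL,     /* add, leaves the flags alone */
    OP_ADDL_CC,  /* add, also writes all the condition codes */
    OP_MOVL,     /* move, neither reads nor writes the flags */
    OP_JCC       /* conditional jump, reads the flags */
};

/*
 * Walk the block backwards, tracking whether the flags are still needed.
 * Returns the number of flag-setting ops that were demoted.
 */
size_t eliminate_dead_flags(enum op_kind *ops, size_t n)
{
    size_t changed = 0;
    /* Be conservative: assume whatever runs after this block reads the flags. */
    int flags_live = 1;

    for (size_t i = n; i-- > 0; ) {
        switch (ops[i]) {
        case OP_JCC:
            flags_live = 1;
            break;
        case OP_ADDL_CC:
            if (!flags_live) {
                ops[i] = OP_ADDL;   /* nobody looks at these flags */
                changed++;
            }
            /* Kept or demoted, flags from earlier ops are dead above here. */
            flags_live = 0;
            break;
        default:
            break;
        }
    }
    return changed;
}
```

Given the sequence `addl_cc, movl, addl_cc, jcc`, only the first `addl_cc` is demoted, because the second one feeds the conditional jump.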

> > > > For converting targets you can probably ignore most of the
> > > > translate-all and host-*/ changes. These implement generating code
> > > > from the qops.
> > >
> > > Ok, this implies that qops are a new thing.  Which looking at the code
> > > sort of supports.  Which means I don't understand what's going on at
> > > all.
> >
> > qops and dyngen ops are both small "functions" that are represented in a
> > similar way. The difference is that dyngen ops are target specific fixed
> > functions, whereas qops are generic parameterized functions.
>
> So the 11x11 exponential complexity of qemu producing its own assembly
> output might not be as much of a problem after switching to qops?

Right. The exponential complexity is if you write the assembly by hand instead 
of using gcc to generate it.

> Possibly some of the common qops can have an asm block for 'em, and the
> rest can go through the contortions target-*/op.c is currently doing with
> (glue(glue(blah))) and so on.

Currently we know how to generate code directly for all qops. Anything more 
complicated must be either put in a helper function or split into multiple 
qops.

> > While they are really separate things, the details have been chosen so it
> > should be possible to adapt the existing translate.c code rather than
> > having to rewrite it from scratch. Decoding x86 instruction semantics is
> > complicated :-)
>
> Yay iterative transformation with regression testing.  (And nothing says
> regression testing like booting a Linux distro under the sucker.)

Exactly.

> > Many of the simpler dyngen ops can be replaced with a single qop. Others
> > can be replaced with a sequence of a few qops. Some of the more
> > complicated ones may need to be moved into helper functions.
>
> At some point, I hope to understand helper functions.  But I'm not there
> yet.
>
> > > I need to re-read this later.  My brain's full and I'm deeply confused.
> >
> > I started off by saying qops were effectively instructions for an
> > imaginary machine. translate-all.c rearranges them so they match up very
> > closely with the instructions available on the host. Once this has been
> > done turning them into binary code is relatively simple.
>
> I sort of thought this is what it was already doing, but apparently not...

We're getting confused with tenses. I mean that once translate-all.c has 
rearranged the qops, we *do* generate host instructions from them without too 
much effort.

> > If native host FP is not available qemu will include appropriate bits so
> > that after macro expansion and inlining you end up with:
> >
> >   tmp = gen_new_qreg(QMODE_I32);
> >   gen_op_helper(HELPER_addf32, tmp, QREG_FOO, QREG_BAR);
> >
> > and the addf32 helper does the floating point addition using the
> > "softfloat" library. The qemu softfloat library implementation may
> > actually use hardware floating point rather than doing everything
> > manually.
>
> No reason (except speed) the code output into a translation block can't do
> function calls.  I think.

That's exactly what a helper function is. Calling functions is complicated, so 
I've restricted the functions that can be called to explicitly declared 
helper functions.
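
As a rough illustration of "explicitly declared" helpers, the sketch below keeps a fixed table of callable functions and dispatches through it. It is written interpreter-style rather than as generated host code, and the names (`helper_table`, `call_helper`, the lower-case `helper_addf32`) are assumptions, not the real qemu interface. It also uses host floating point inside the helper and assumes `float` is a 32-bit type; a softfloat version would do the arithmetic by hand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper IDs: only functions listed here may be called. */
enum helper_id { HELPER_ADDF32, HELPER_COUNT };

typedef uint32_t (*helper_fn)(uint32_t a, uint32_t b);

/* Add two floats passed as raw 32-bit patterns, the way a qreg would hold them. */
uint32_t helper_addf32(uint32_t a, uint32_t b)
{
    float fa, fb, fr;
    uint32_t r;

    memcpy(&fa, &a, sizeof fa);
    memcpy(&fb, &b, sizeof fb);
    fr = fa + fb;               /* host FP here; softfloat would go here instead */
    memcpy(&r, &fr, sizeof r);
    return r;
}

/* The explicit declaration table: nothing outside it can be called. */
static const helper_fn helper_table[HELPER_COUNT] = {
    [HELPER_ADDF32] = helper_addf32,
};

/* What a "call helper" op boils down to once the arguments are in place. */
uint32_t call_helper(enum helper_id id, uint32_t a, uint32_t b)
{
    assert(id < HELPER_COUNT);
    return helper_table[id](a, b);
}
```

Restricting calls to a table like this means the code generator only has to know one calling sequence, and every callable function is visible in one place.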

Paul

* Re: [Qemu-devel] qemu vs gcc4
  2006-11-01  0:29                       ` Paul Brook
@ 2006-11-01  1:51                         ` Rob Landley
  2006-11-01  3:22                           ` Paul Brook
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-11-01  1:51 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Tuesday 31 October 2006 7:29 pm, Paul Brook wrote:
> > Actually it sounds additive rather than multiplicative.  Does each target
> > have an entirely unrelated set of ops, or is there a shared set of
> > primitive ops plus some oddballs?
> 
> The shared set of primitive ops is basically qops :-)
> You probably could figure out a single common set of qops, then write
> assembly and glue them together like we do with dyngen. However once you've
> done that you've implemented most of what's needed for fully dynamic qops,
> so it doesn't really seem worth it.

I missed a curve.  What's "fully dynamic qops"?  (There's no translation 
cache?)

> > But I already know 
> > that doesn't work because it doesn't explain the "unable to find spill
> > register" problem. 
> 
> That's a separate gcc bug. It gets stuck when you tell it not to use half the 
> registers, then ask it to do 64-bit math. This is one of the reasons 
> eliminating the fixed registers is a good idea.

Sigh.  The problems motivating me to learn the code are highly esoteric 
breakage, yet I'm still not quite up to the task of understanding what's 
going on when all this works _right_.  Grumble... 

> > > It corresponds to "T0" in dyngen. In addition to the actual CPU state,
> > > dyngen uses 3 fixed registers as scratch workspace. For qop purposes
> > > these are part of the guest CPU state. They're only there to aid
> > > conversion of the translation code, they'll go away eventually.
> >
> > Presumably the m68k target is pure qop, and hasn't got this sort of thing?
> 
> Correct.
> There is one use of T0 left for communicating with the TB chaining code, but 
> that's it and will probably go away eventually.

Any idea where I can get a toolchain that can output a "hello world" program 
for m68k nommu?  (Or perhaps you have a statically linked "hello world" 
program for the platform lying around?)

Building toolchains is one of my other hobbies but it's a royal pain because 
in order to get "hello world" to compile and link you have to supply kernel 
headers, build binutils and gcc with various configuration options and path 
overrides and such, build uClibc with the result and get them all talking to 
each other.  I.E. you've got to do hours of work before you get to the first 
real "did it work" point, and then backtrack to figure out why the answer is 
usually "no".  (Prebuilt binary toolchains are useful just to narrow down the 
number of possible things that could be broken when you first try out a new 
platform.)

> > > > Possible translation: you can feed a qreg containing an I64 value to a
> > > > qop taking an i32 argument, and it'll typecast the sucker down
> > > > intelligently, but if you produce an I32 result and expect to use that
> > > > qreg's value as an I64 argument later, you have to call a
> > > > sign-extending qop on it first?
> > >
> > > Exactly.
> > > If you mix I32,F32 and/or F64 in this way Bad Things will happen.
> >
> > Presumably just the same kinds of Bad Things as "float f; *(int *)&f;"?
> 
> Or qemu will get confused and crash.

I've had that happen without qops, although not recently.  (I have this nasty 
habit of trying Ubuntu's PPC and x86-64 distros under qemu with each new 
release.  They usually fail in amusing new ways.)

> > > > seeing end with _im which I presume means "immediate".  The
> > > > alternative is _cc, but what does that mean?  (Presumably not
> > > > "closed captioned".)
> > >
> > > _cc are variants that set the condition codes. I may have got T0 and T1
> > > backwards in the first 3 lines.
> >
> > Ah!
> >
> > Is this written down anywhere?  I've read Fabrice's paper and the design
> > documentation, and I'm not remembering this.  It's quite possible I missed
> > it when my brain filled up, though.
> 
> Dunno.

So if at any point I actually understand this stuff, I need to write 
documentation?  (I can do part 2, part 1 the jury's still out on...)

> It also means you don't need to reserve that register, avoiding the gcc
> unable to find spill register bug you mentioned above.

I'm all for it.

> > Um, wouldn't the flag setting code be fairly straightforward as a qop that
> > comes right _before_ the other op, as in "set the flags for doing this
> > with these registers", that does nothing but set the flags (I.E. it
> > wouldn't modify the contents of any of the registers, so it could be
> > immediately followed by the appropriate add or shift or so on), and then
> > the flag setting pass could just turn all the ones that weren't needed
> > into QOP_NULL?
> 
> Theoretically possible, but not so easy in practice. Especially when you get 
> things like partial flag clobbers, and lazy flag evaluation. Doing it as a 
> target specific hack is much simpler and quicker.

I think I know what partial flag clobbers are (although if you're working your 
way back, in theory you could handle it with a mask of exposed bits), but 
what's lazy flag evaluation?  (I thought that was the point of eliminating 
the unused flag setting.  Are you saying the hardware also does this and we 
have to emulate that?)

> > Or is that what's happening now?  (Do QOPs ever modify their input
> > registers, or only the output one?)
> 
> The generic qops never modify inputs, and never read outputs. Inputs and 
> outputs can be the same qreg.

Hm.

> > > There are three fairly independent stages:
> > > 1) target-*/translate.c converts guest code into qops.
> > > 2) translate-all.c messes about with those qops a bit (allocates host
> > > registers, etc).
> > > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > > host code.
> >
> > Is pass 2 where the flag elimination pass goes (and presumably any other
> > optimizations that might get added)?  No, that can't be the case or the
> > m68k code wouldn't need its own implementation of the flag elimination
> > pass...
> 
> Flag elimination is at the end of step 1.

Because it's platform specific?

> > > qops and dyngen ops are both small "functions" that are represented in a
> > > similar way. The difference is that dyngen ops are target specific fixed
> > > functions, whereas qops are generic parameterized functions.
> >
> > So the 11x11 exponential complexity of qemu producing its own assembly
> > output might not be as much of a problem after switching to qops?
> 
> Right. The exponential complexity is if you write the assembly by hand
> instead of using gcc to generate it.

The exponential complexity is if you have to write different code for each 
combination of host and target.  If every target disassembles to the same set 
of target QOPs, then you could have a hand-written assembly version of each 
QOP for each host platform, and still have N rather than N^2 of them.

And I still wanna use tcc to generate it, someday. :)

> > Possibly some of the common qops can have an asm block for 'em, and the
> > rest can go through the contortions target-*/op.c is currently doing with
> > (glue(glue(blah))) and so on.
> 
> Currently we know how to generate code directly for all qops. Anything more 
> complicated must be either put in a helper function or split into multiple 
> qops.

Split into multiple qops I can understand.

> > > I started off by saying qops were effectively instructions for an
> > > imaginary machine. translate-all.c rearranges them so they match up very
> > > closely with the instructions available on the host. Once this has been
> > > done turning them into binary code is relatively simple.
> >
> > I sort of thought this is what it was already doing, but apparently not...
> 
> We're getting confused with tenses. I mean this once translate-all.c has 
> rearranged the qops we *do* generate host instructions from them without too 
> much effort.

By "already doing" I meant I thought the 0.8.2 code was doing this, before your 
new tree switched everything over to qops.  (Trying to read dyngen.c reminds 
me of reading cgi code that outputs html with embedded javascript.)

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

* Re: [Qemu-devel] qemu vs gcc4
  2006-11-01  1:51                         ` Rob Landley
@ 2006-11-01  3:22                           ` Paul Brook
  2006-11-01 16:34                             ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Paul Brook @ 2006-11-01  3:22 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

On Wednesday 01 November 2006 01:51, Rob Landley wrote:
> On Tuesday 31 October 2006 7:29 pm, Paul Brook wrote:
> > > Actually it sounds additive rather than multiplicative.  Does each
> > > target have an entirely unrelated set of ops, or is there a shared set
> > > of primitive ops plus some oddballs?
> >
> > The shared set of primitive ops is basically qops :-)
> > You probably could figure out a single common set of qops, then write
> > assembly and glue them together like we do with dyngen. However once
> > you've done that you've implemented most of what's needed for fully
> > dynamic qops, so it doesn't really seem worth it.
>
> I missed a curve.  What's "fully dynamic qops"?  (There's no translation
> cache?)

I mean all the qop stuff I've implemented.

> > > > It corresponds to "T0" in dyngen. In addition to the actual CPU
> > > > state, dyngen uses 3 fixed registers as scratch workspace. For qop
> > > > purposes these are part of the guest CPU state. They're only there to
> > > > aid conversion of the translation code, they'll go away eventually.
> > >
> > > Presumably the m68k target is pure qop, and hasn't got this sort of
> > > thing?
> >
> > Correct.
> > There is one use of T0 left for communicating with the TB chaining code,
> > but that's it and will probably go away eventually.
>
> Any idea where I can get a toolchain that can output a "hello world"
> program for m68k nommu?  (Or perhaps you have a statically linked "hello
> world" program for the platform lying around?)

Funnily enough I do :-)
http://www.codesourcery.com/gnu_toolchains/coldfire/

> > Theoretically possible, but not so easy in practice. Especially when you
> > get things like partial flag clobbers, and lazy flag evaluation. Doing it
> > as a target specific hack is much simpler and quicker.
>
> I think I know what partial flag clobbers are (although if you're working
> your way back, in theory you could handle it with a mask of exposed bits),
> but what's lazy flag evaluation?  (I thought that was the point of
> eliminating the unused flag setting.  Are you saying the hardware also does
> this and we have to emulate that?)

Lazy flag evaluation is where you don't bother calculating the actual flags 
when executing the flag-setting instruction. Instead you save the 
operands/result and compute the flags when you actually need them.
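
A minimal sketch of the idea in C, for illustration only. The struct layout and function names are assumptions and do not match qemu's actual condition-code state. The flag-setting op just records what it did, and each flag is worked out from that record when somebody asks for it.

```c
#include <stdint.h>

/* Which operation last set the flags (hypothetical, two ops only). */
enum cc_op { CC_OP_ADD32, CC_OP_SUB32 };

/* Saved operands and result; no flag bits are stored at all. */
struct lazy_flags {
    enum cc_op op;
    uint32_t src1, src2, result;
};

/* Flag-setting add: record the operation, compute no flags. */
uint32_t op_add32_cc(struct lazy_flags *cc, uint32_t a, uint32_t b)
{
    cc->op = CC_OP_ADD32;
    cc->src1 = a;
    cc->src2 = b;
    cc->result = a + b;
    return cc->result;
}

/* Flag-setting subtract: same idea. */
uint32_t op_sub32_cc(struct lazy_flags *cc, uint32_t a, uint32_t b)
{
    cc->op = CC_OP_SUB32;
    cc->src1 = a;
    cc->src2 = b;
    cc->result = a - b;
    return cc->result;
}

/* Zero flag, computed only when something needs it. */
int flag_zero(const struct lazy_flags *cc)
{
    return cc->result == 0;
}

/* Carry flag, computed only when something needs it. */
int flag_carry(const struct lazy_flags *cc)
{
    switch (cc->op) {
    case CC_OP_ADD32:
        return cc->result < cc->src1;   /* unsigned wraparound */
    case CC_OP_SUB32:
        return cc->src1 < cc->src2;     /* borrow */
    }
    return 0;
}
```

If the next flag-setting op runs before anyone calls `flag_zero` or `flag_carry`, the saved record is simply overwritten and no flag was ever computed.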

> > > > There are three fairly independent stages:
> > > > 1) target-*/translate.c converts guest code into qops.
> > > > 2) translate-all.c messes about with those qops a bit (allocates host
> > > > registers, etc).
> > > > 3) translate-op.c,translate-qop.c and target-*/ turns those qops into
> > > > host code.
> > >
> > > Is pass 2 where the flag elimination pass goes (and presumably any
> > > other optimizations that might get added)?  No, that can't be the case
> > > or the m68k code wouldn't need its own implementation of the flag
> > > elimination pass...
> >
> > Flag elimination is at the end of step 1.
>
> Because it's platform specific?

Yes.

> > > > qops and dyngen ops are both small "functions" that are represented
> > > > in a similar way. The difference is that dyngen ops are target
> > > > specific fixed functions, whereas qops are generic parameterized
> > > > functions.
> > >
> > > So the 11x11 exponential complexity of qemu producing its own assembly
> > > output might not be as much of a problem after switching to qops?
> >
> > Right. The exponential complexity is if you write the assembly by hand
> > instead of using gcc to generate it.
>
> The exponential complexity is if you have to write different code for each
> combination of host and target.  If every target disassembles to the same
> set of target QOPs, then you could have a hand-written assembly version of
> each QOP for each host platform, and still have N rather than N^2 of them.

Right, but by the time you've got everything to use the same set of ops you 
may as well teach qemu how to generate code instead of using potted 
fragments.

Using hand-written assembly fragments probably doesn't make qemu any faster, 
it just removes the gcc dependency. Using qops also allows qemu to generate 
better (faster) translated code.

Paul

* Re: [Qemu-devel] qemu vs gcc4
  2006-11-01  3:22                           ` Paul Brook
@ 2006-11-01 16:34                             ` Rob Landley
  2006-11-01 17:01                               ` Paul Brook
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2006-11-01 16:34 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Tuesday 31 October 2006 10:22 pm, Paul Brook wrote:

> > > and glue them together like we do with dyngen. However once you've done
> > > that you've implemented most of what's needed for fully dynamic qops, so
> > > it doesn't really seem worth it.
> >
> > I missed a curve.  What's "fully dynamic qops"?  (There's no translation
> > cache?)
> 
> I mean all the qop stuff I've implemented.

Still lost.  Where does the "fully dynamic" part come in?

> > Any idea where I can get a toolchain that can output a "hello world"
> > program for m68k nommu?  (Or perhaps you have a statically linked "hello
> > world" program for the platform lying around?)
> 
> Funnily enough I do :-)
> http://www.codesourcery.com/gnu_toolchains/coldfire/

Woot.

This download page was designed by your company's legal department, I take it?  
(There's no such thing as GNU/Linux, and you don't have to accept the GPL 
because it's based on copyright law, not contract law, so "informed consent" 
is not the basis for enforcement.)

> > > Theoretically possible, but not so easy in practice. Especially when you
> > > get things like partial flag clobbers, and lazy flag evaluation. Doing
> > > it as a target specific hack is much simpler and quicker.
> >
> > I think I know what partial flag clobbers are (although if you're working
> > your way back, in theory you could handle it with a mask of exposed bits),
> > but what's lazy flag evaluation?  (I thought that was the point of
> > eliminating the unused flag setting.  Are you saying the hardware also
> > does this and we have to emulate that?)
> 
> Lazy flag evaluation is where you don't bother calculating the actual flags 
> when executing the flag-setting instruction. Instead you save the 
> operands/result and compute the flags when you actually need them.

Such as when the computation's in a loop but you only test after exiting the 
loop for other reasons?  I'd have to see examples to figure out how it would 
make sense to optimize that...

> > The exponential complexity is if you have to write different code for each
> > combination of host and target.  If every target disassembles to the same
> > set of target QOPs, then you could have a hand-written assembly version of
> > each QOP for each host platform, and still have N rather than N^2 of them.
> 
> Right, but by the time you've got everything to use the same set of ops you 
> may as well teach qemu how to generate code instead of using potted 
> fragments.

I'm thinking of "here's the code for performing this QOP on this processor".  
I'm not sure what distinction you're making.

> Using hand-written assembly fragments probably doesn't make qemu any faster, 
> it just removes the gcc dependency.

Which seems like a good thing to me.  (Or at least the gcc _version_ 
dependency.)

> Using qops also allows qemu to generate better (faster) translated code.

Currently, I'm interested in building qemu under compiler versions I actually 
have installed, without having to apply patches that the patch authors 
consider too disgusting to integrate.

> Paul

Rob
-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery

* Re: [Qemu-devel] qemu vs gcc4
  2006-11-01 16:34                             ` Rob Landley
@ 2006-11-01 17:01                               ` Paul Brook
  0 siblings, 0 replies; 43+ messages in thread
From: Paul Brook @ 2006-11-01 17:01 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel

On Wednesday 01 November 2006 16:34, Rob Landley wrote:
> On Tuesday 31 October 2006 10:22 pm, Paul Brook wrote:
> > > > and glue them together like we do with dyngen. However once you've
> > > > done that you've implemented most of what's needed for fully dynamic
> > > > qops, so it doesn't really seem worth it.
> > >
> > > I missed a curve.  What's "fully dynamic qops"?  (There's no
> > > translation cache?)
> >
> > I mean all the qop stuff I've implemented.
>
> Still lost.  Where does the "fully dynamic" part come in?

Generating code instead of blindly glueing together precompiled fragments.

> > > Any idea where I can get a toolchain that can output a "hello world"
> > > program for m68k nommu?  (Or perhaps you have a statically linked
> > > "hello world" program for the platform lying around?)
> >
> > Funnily enough I do :-)
> > http://www.codesourcery.com/gnu_toolchains/coldfire/
>
> Woot.
>
> This download page was designed by your company's legal department, I take
> it? (There's no such thing as GNU/Linux, and you don't have to accept the
> GPL because it's based on copyright law, not contract law, so "informed
> consent" is not the basis for enforcement.)

I suspect it's more to do with due diligence on our part. While it might not 
have any legal meaning on its own, it makes it much harder for people to 
claim they infringed copyright accidentally. 

> > Lazy flag evaluation is where you don't bother calculating the actual
> > flags when executing the flag-setting instruction. Instead you save the
> > operands/result and compute the flags when you actually need them.
>
> Such as when the computation's in a loop but you only test after exiting
> the loop for other reasons?  I'd have to see examples to figure out how it
> would make sense to optimize that...

Yes, or when only some flags are used. e.g.:

 add %eax, %ebx
 adc %ecx, %edx

The adc instruction only uses the carry flag, then clobbers all the rest. A 
naive implementation would evaluate all the flags after the add. With lazy 
evaluation we evaluate just the carry flag before the adc, and know we don't 
have to bother calculating the other flags. It also avoids having to evaluate 
the flags at a TB boundary.
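
The add/adc pair can be sketched as a 64-bit addition split into two 32-bit halves. This is only an illustration with made-up names (`emu_add`, `emu_adc`, `add64_via_add_adc`), not qemu code. The add saves its operand and result, and the adc derives just the carry from them; zero, sign, overflow and the rest are never calculated.

```c
#include <stdint.h>

/* What the last add left behind for later flag evaluation (hypothetical). */
struct cc_state {
    uint32_t src;
    uint32_t result;
};

/* add: store operand and result instead of computing any flags. */
uint32_t emu_add(struct cc_state *cc, uint32_t dst, uint32_t src)
{
    cc->src = src;
    cc->result = dst + src;
    return cc->result;
}

/* adc: the only flag it needs from the previous add is carry. */
uint32_t emu_adc(const struct cc_state *cc, uint32_t dst, uint32_t src)
{
    /* Unsigned wraparound happened iff the result is below an operand. */
    uint32_t carry_in = cc->result < cc->src;

    /* A fuller emulation would also save state for the flags adc itself sets. */
    return dst + src + carry_in;
}

/* 64-bit add built from the add/adc pair, low half first. */
uint64_t add64_via_add_adc(uint64_t a, uint64_t b)
{
    struct cc_state cc;
    uint32_t lo = emu_add(&cc, (uint32_t)a, (uint32_t)b);
    uint32_t hi = emu_adc(&cc, (uint32_t)(a >> 32), (uint32_t)(b >> 32));

    return ((uint64_t)hi << 32) | lo;
}
```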

> > > The exponential complexity is if you have to write different code for
> > > each combination of host and target.  If every target disassembles to
> > > the same set of target QOPs, then you could have a hand-written
> > > assembly version of each QOP for each host platform, and still have N
> > > rather than N^2 of them.
> >
> > Right, but by the time you've got everything to use the same set of ops
> > you may as well teach qemu how to generate code instead of using potted
> > fragments.
>
> I'm thinking of "here's the code for performing this QOP on this
> processor". I'm not sure what distinction you're making.

The difference is whether qemu knows how to generate code for that target, or 
is blindly glueing binary blobs together.
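
One way to picture the difference, as a sketch only: the function names are invented, and the "blob" is hand-typed here where dyngen would have had gcc compile it. The bytes are the standard x86 encoding of a 32-bit register-to-register add (opcode 0x01 followed by a ModRM byte). The glue version can only copy a fragment with its registers baked in; the generator version knows the encoding and can emit the same instruction for whatever registers the allocator picked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x86 register numbers as they appear in the ModRM byte. */
enum x86_reg { R_EAX = 0, R_ECX = 1, R_EDX = 2, R_EBX = 3 };

/* dyngen style: an opaque fragment, here "add %eax, %ebx". */
static const uint8_t frag_add_eax_ebx[] = { 0x01, 0xC3 };

/* Glue: copy the blob as-is; the register choice cannot change. */
size_t glue_add(uint8_t *out)
{
    memcpy(out, frag_add_eax_ebx, sizeof frag_add_eax_ebx);
    return sizeof frag_add_eax_ebx;
}

/* Generate: emit "add %src, %dst" for any pair of 32-bit registers. */
size_t emit_add_reg_reg(uint8_t *out, enum x86_reg src, enum x86_reg dst)
{
    assert(src <= 7 && dst <= 7);
    out[0] = 0x01;                                  /* ADD r/m32, r32 */
    out[1] = (uint8_t)(0xC0 | (src << 3) | dst);    /* mod=11, reg=src, rm=dst */
    return 2;
}
```

Calling `emit_add_reg_reg(buf, R_EAX, R_EBX)` produces the same two bytes as the fixed fragment, while `emit_add_reg_reg(buf, R_ECX, R_EDX)` produces `01 CA`, which the glue approach could only offer by carrying a second precompiled fragment.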

Paul

end of thread, other threads:[~2006-11-01 17:02 UTC | newest]

Thread overview: 43+ messages
2006-10-20 18:53 [Qemu-devel] qemu vs gcc4 K. Richard Pixley
2006-10-22 22:06 ` Johannes Schindelin
2006-10-23  8:16   ` Martin Guy
2006-10-23 12:20     ` Paul Brook
2006-10-23 13:59       ` Avi Kivity
2006-10-23 14:10         ` Paul Brook
2006-10-23 14:28           ` Avi Kivity
2006-10-23 14:31             ` Paul Brook
2006-10-23 14:35               ` Avi Kivity
2006-10-23 17:41     ` K. Richard Pixley
2006-10-23 17:58       ` Paul Brook
2006-10-23 18:04         ` K. Richard Pixley
2006-10-23 18:20           ` Laurent Desnogues
2006-10-23 18:37           ` Paul Brook
2006-10-24 23:39             ` Rob Landley
2006-10-25  0:24               ` Paul Brook
2006-10-25 19:39                 ` Rob Landley
2006-10-26 18:09                   ` Daniel Jacobowitz
2006-10-31 16:53             ` Rob Landley
2006-10-31 19:02               ` Paul Brook
2006-10-31 20:41                 ` Rob Landley
2006-10-31 22:08                   ` Paul Brook
2006-10-31 22:31                     ` Laurent Desnogues
2006-10-31 23:00                       ` Paul Brook
2006-11-01  0:00                     ` Rob Landley
2006-11-01  0:29                       ` Paul Brook
2006-11-01  1:51                         ` Rob Landley
2006-11-01  3:22                           ` Paul Brook
2006-11-01 16:34                             ` Rob Landley
2006-11-01 17:01                               ` Paul Brook
2006-10-31 23:17                 ` Rob Landley
2006-11-01  0:01                   ` Paul Brook
2006-10-30  4:35         ` Rob Landley
2006-10-30 14:56           ` Paul Brook
2006-10-30 16:31             ` Rob Landley
2006-10-30 16:50               ` Paul Brook
2006-10-30 22:54                 ` Stephen Torri
2006-10-30 23:13                   ` Paul Brook
2006-10-23  1:27 ` Rob Landley
2006-10-23  1:44   ` Paul Brook
2006-10-23  1:45   ` Johannes Schindelin
2006-10-23 17:53     ` K. Richard Pixley
2006-10-23 18:08     ` Rob Landley
