* [Qemu-devel] [PATCH 0/5] tcg/arm: fixes and optimizations
@ 2012-10-09 20:30 Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops Aurelien Jarno
  ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Aurelien Jarno @ 2012-10-09 20:30 UTC (permalink / raw)
To: qemu-devel; +Cc: Aurelien Jarno

This patch series fixes the TCG arm backend for the MIPS target, as well
as for big endian targets when not using the ARMv6+ instruction set. The
corresponding patches are candidates for a stable release.

The last two patches add some optimizations.

Aurelien Jarno (5):
  tcg/arm: fix TLB access in qemu-ld/st ops
  tcg/arm: fix cross-endian qemu_st16
  target-openrisc: remove conflicting definitions from cpu.h
  tcg/arm: optimize tcg_out_goto_label
  tcg/arm: improve direct jump

 exec-all.h            |   24 ++---------
 exec.c                |    4 --
 target-openrisc/cpu.h |   18 ---------
 tcg/arm/tcg-target.c  |  107 ++++++++++++++++++++++++++-----------------------
 4 files changed, 61 insertions(+), 92 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops
  2012-10-09 20:30 [Qemu-devel] [PATCH 0/5] tcg/arm: fixes and optimizations Aurelien Jarno
@ 2012-10-09 20:30 ` Aurelien Jarno
  2012-10-09 21:13   ` Peter Maydell
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 2/5] tcg/arm: fix cross-endian qemu_st16 Aurelien Jarno
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Aurelien Jarno @ 2012-10-09 20:30 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-stable, Aurelien Jarno

The TCG arm backend considers it likely that the offset to the TLB
entries does not exceed 12 bits for mem_index = 0. In practice this is
not true for at least the MIPS target.

The current patch fixes that by loading bits 23-12 with a separate
instruction, and by using loads with address writeback, independently of
the value of mem_index. In total this allows a 24-bit offset, which is a
lot more than needed.

Cc: Andrzej Zaborowski <balrogg@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-stable@nongnu.org
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |   73 +++++++++++++++++++++++++-------------------------
 1 file changed, 37 insertions(+), 36 deletions(-)

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 737200e..6cde512 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -624,6 +624,19 @@ static inline void tcg_out_ld32_12(TCGContext *s, int cond,
                     (rn << 16) | (rd << 12) | ((-im) & 0xfff));
 }
 
+/* Offset pre-increment with base writeback. */
+static inline void tcg_out_ld32_12wb(TCGContext *s, int cond,
+                                     int rd, int rn, tcg_target_long im)
+{
+    if (im >= 0) {
+        tcg_out32(s, (cond << 28) | 0x05b00000 |
+                  (rn << 16) | (rd << 12) | (im & 0xfff));
+    } else {
+        tcg_out32(s, (cond << 28) | 0x05300000 |
+                  (rn << 16) | (rd << 12) | ((-im) & 0xfff));
+    }
+}
+
 static inline void tcg_out_st32_12(TCGContext *s, int cond, int rd,
                                    int rn, tcg_target_long im)
 {
@@ -1056,7 +1069,7 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
 {
     int addr_reg, data_reg, data_reg2, bswap;
 #ifdef CONFIG_SOFTMMU
-    int mem_index, s_bits;
+    int mem_index, s_bits, tlb_offset;
     TCGReg argreg;
 # if TARGET_LONG_BITS == 64
     int addr_reg2;
@@ -1096,19 +1109,14 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
                     TCG_REG_R0, TCG_REG_R8, CPU_TLB_SIZE - 1);
     tcg_out_dat_reg(s, COND_AL, ARITH_ADD, TCG_REG_R0, TCG_AREG0,
                     TCG_REG_R0, SHIFT_IMM_LSL(CPU_TLB_ENTRY_BITS));
-    /* In the
-     * ldr r1 [r0, #(offsetof(CPUArchState, tlb_table[mem_index][0].addr_read))]
-     * below, the offset is likely to exceed 12 bits if mem_index != 0 and
-     * not exceed otherwise, so use an
-     * add r0, r0, #(mem_index * sizeof *CPUArchState.tlb_table)
-     * before.
-     */
-    if (mem_index)
+    /* We assume that the offset is contained within 24 bits. */
+    tlb_offset = offsetof(CPUArchState, tlb_table[mem_index][0].addr_read);
+    if (tlb_offset > 0xfff) {
         tcg_out_dat_imm(s, COND_AL, ARITH_ADD, TCG_REG_R0, TCG_REG_R0,
-                        (mem_index << (TLB_SHIFT & 1)) |
-                        ((16 - (TLB_SHIFT >> 1)) << 8));
-    tcg_out_ld32_12(s, COND_AL, TCG_REG_R1, TCG_REG_R0,
-                    offsetof(CPUArchState, tlb_table[0][0].addr_read));
+                        0xa00 | (tlb_offset >> 12));
+        tlb_offset &= 0xfff;
+    }
+    tcg_out_ld32_12wb(s, COND_AL, TCG_REG_R1, TCG_REG_R0, tlb_offset);
     tcg_out_dat_reg(s, COND_AL, ARITH_CMP, 0, TCG_REG_R1,
                     TCG_REG_R8, SHIFT_IMM_LSL(TARGET_PAGE_BITS));
     /* Check alignment. */
@@ -1116,15 +1124,14 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
         tcg_out_dat_imm(s, COND_EQ, ARITH_TST, 0,
                         addr_reg, (1 << s_bits) - 1);
 # if TARGET_LONG_BITS == 64
-    /* XXX: possibly we could use a block data load or writeback in
-     * the first access. */
-    tcg_out_ld32_12(s, COND_EQ, TCG_REG_R1, TCG_REG_R0,
-                    offsetof(CPUArchState, tlb_table[0][0].addr_read) + 4);
+    /* XXX: possibly we could use a block data load in the first access. */
+    tcg_out_ld32_12(s, COND_EQ, TCG_REG_R1, TCG_REG_R0, 4);
     tcg_out_dat_reg(s, COND_EQ, ARITH_CMP, 0,
                     TCG_REG_R1, addr_reg2, SHIFT_IMM_LSL(0));
 # endif
     tcg_out_ld32_12(s, COND_EQ, TCG_REG_R1, TCG_REG_R0,
-                    offsetof(CPUArchState, tlb_table[0][0].addend));
+                    offsetof(CPUTLBEntry, addend)
+                    - offsetof(CPUTLBEntry, addr_read));
 
     switch (opc) {
     case 0:
@@ -1273,7 +1280,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
 {
     int addr_reg, data_reg, data_reg2, bswap;
 #ifdef CONFIG_SOFTMMU
-    int mem_index, s_bits;
+    int mem_index, s_bits, tlb_offset;
     TCGReg argreg;
 # if TARGET_LONG_BITS == 64
    int addr_reg2;
@@ -1310,19 +1317,14 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
                     TCG_REG_R0, TCG_REG_R8, CPU_TLB_SIZE - 1);
     tcg_out_dat_reg(s, COND_AL, ARITH_ADD, TCG_REG_R0, TCG_AREG0,
                     TCG_REG_R0, SHIFT_IMM_LSL(CPU_TLB_ENTRY_BITS));
-    /* In the
-     * ldr r1 [r0, #(offsetof(CPUArchState, tlb_table[mem_index][0].addr_write))]
-     * below, the offset is likely to exceed 12 bits if mem_index != 0 and
-     * not exceed otherwise, so use an
-     * add r0, r0, #(mem_index * sizeof *CPUArchState.tlb_table)
-     * before.
-     */
-    if (mem_index)
+    /* We assume that the offset is contained within 24 bits. */
+    tlb_offset = offsetof(CPUArchState, tlb_table[mem_index][0].addr_write);
+    if (tlb_offset > 0xfff) {
         tcg_out_dat_imm(s, COND_AL, ARITH_ADD, TCG_REG_R0, TCG_REG_R0,
-                        (mem_index << (TLB_SHIFT & 1)) |
-                        ((16 - (TLB_SHIFT >> 1)) << 8));
-    tcg_out_ld32_12(s, COND_AL, TCG_REG_R1, TCG_REG_R0,
-                    offsetof(CPUArchState, tlb_table[0][0].addr_write));
+                        0xa00 | (tlb_offset >> 12));
+        tlb_offset &= 0xfff;
+    }
+    tcg_out_ld32_12wb(s, COND_AL, TCG_REG_R1, TCG_REG_R0, tlb_offset);
     tcg_out_dat_reg(s, COND_AL, ARITH_CMP, 0, TCG_REG_R1,
                     TCG_REG_R8, SHIFT_IMM_LSL(TARGET_PAGE_BITS));
     /* Check alignment. */
@@ -1330,15 +1332,14 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
         tcg_out_dat_imm(s, COND_EQ, ARITH_TST, 0,
                         addr_reg, (1 << s_bits) - 1);
 # if TARGET_LONG_BITS == 64
-    /* XXX: possibly we could use a block data load or writeback in
-     * the first access. */
-    tcg_out_ld32_12(s, COND_EQ, TCG_REG_R1, TCG_REG_R0,
-                    offsetof(CPUArchState, tlb_table[0][0].addr_write) + 4);
+    /* XXX: possibly we could use a block data load in the first access. */
+    tcg_out_ld32_12(s, COND_EQ, TCG_REG_R1, TCG_REG_R0, 4);
     tcg_out_dat_reg(s, COND_EQ, ARITH_CMP, 0,
                     TCG_REG_R1, addr_reg2, SHIFT_IMM_LSL(0));
 # endif
     tcg_out_ld32_12(s, COND_EQ, TCG_REG_R1, TCG_REG_R0,
-                    offsetof(CPUArchState, tlb_table[0][0].addend));
+                    offsetof(CPUTLBEntry, addend)
+                    - offsetof(CPUTLBEntry, addr_write));
 
     switch (opc) {
     case 0:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
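To make the new sequence concrete: assuming, purely for illustration, a TLB
offset of 0x1e20 for some mem_index (the exact value depends on the target's
CPUArchState layout), the code above would emit roughly:

    add   r0, r0, #0x1000      @ bits 23-12 of tlb_offset (the 0xa00 rotated immediate)
    ldr   r1, [r0, #0xe20]!    @ bits 11-0; writeback leaves r0 pointing at the TLB entry
    cmp   r1, r8, lsl #TARGET_PAGE_BITS

so the later loads of the high part of addr_read and of the addend can use
small offsets relative to the updated r0.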
* Re: [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops 2012-10-09 20:30 ` [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops Aurelien Jarno @ 2012-10-09 21:13 ` Peter Maydell 2012-10-10 10:00 ` Laurent Desnogues 0 siblings, 1 reply; 12+ messages in thread From: Peter Maydell @ 2012-10-09 21:13 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel, qemu-stable On 9 October 2012 21:30, Aurelien Jarno <aurelien@aurel32.net> wrote: > The TCG arm backend considers likely that the offset to the TLB > entries does not exceed 12 bits for mem_index = 0. In practice this is > not true for at list the MIPS target. > > The current patch fixes that by loading the bits 23-12 with a separate > instruction, and using loads with address writeback, independently of > the value of mem_idx. In total this allow a 24-bit offset, which is a > lot more than needed. Would probably be good to assert() that the bits above 24 are zero, though. > Cc: Andrzej Zaborowski <balrogg@gmail.com> > Cc: Peter Maydell <peter.maydell@linaro.org> > Cc: qemu-stable@nongnu.org > Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> > --- > tcg/arm/tcg-target.c | 73 +++++++++++++++++++++++++------------------------- > 1 file changed, 37 insertions(+), 36 deletions(-) > > diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c > index 737200e..6cde512 100644 > --- a/tcg/arm/tcg-target.c > +++ b/tcg/arm/tcg-target.c > @@ -624,6 +624,19 @@ static inline void tcg_out_ld32_12(TCGContext *s, int cond, > (rn << 16) | (rd << 12) | ((-im) & 0xfff)); > } > > +/* Offset pre-increment with base writeback. */ > +static inline void tcg_out_ld32_12wb(TCGContext *s, int cond, > + int rd, int rn, tcg_target_long im) > +{ Worth asserting that rd != rn ? (that's an UNPREDICTABLE case for ldr with writeback) > + if (im >= 0) { > + tcg_out32(s, (cond << 28) | 0x05b00000 | > + (rn << 16) | (rd << 12) | (im & 0xfff)); > + } else { > + tcg_out32(s, (cond << 28) | 0x05300000 | > + (rn << 16) | (rd << 12) | ((-im) & 0xfff)); > + } you could avoid the if() by just setting the U bit using "(im >= 0) << 23" Clearer? Dunno. Looks OK otherwise (though I haven't tested it.) Random aside: we emit a pretty long string of COND_EQ predicated code in these functions, especially in the 64 bit address and byteswapped cases. It might be interesting to see if performance on an A9, say, was any better if we just did a conditional branch over it instead... Probably borderline to no-effect, though. -- PMM ^ permalink raw reply [flat|nested] 12+ messages in thread
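Concretely, the two checks suggested above might look like this (a sketch only,
not part of the posted patch):

    /* in tcg_out_qemu_ld()/tcg_out_qemu_st(), after computing tlb_offset */
    assert((tlb_offset & ~0xffffff) == 0);  /* offset must fit in 24 bits */

    /* in tcg_out_ld32_12wb() */
    assert(rd != rn);  /* ldr with writeback and rd == rn is UNPREDICTABLE */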
* Re: [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops 2012-10-09 21:13 ` Peter Maydell @ 2012-10-10 10:00 ` Laurent Desnogues 2012-10-10 10:08 ` Peter Maydell 0 siblings, 1 reply; 12+ messages in thread From: Laurent Desnogues @ 2012-10-10 10:00 UTC (permalink / raw) To: Peter Maydell; +Cc: qemu-devel, Aurelien Jarno, qemu-stable On Tue, Oct 9, 2012 at 11:13 PM, Peter Maydell <peter.maydell@linaro.org> wrote: > On 9 October 2012 21:30, Aurelien Jarno <aurelien@aurel32.net> wrote: >> The TCG arm backend considers likely that the offset to the TLB >> entries does not exceed 12 bits for mem_index = 0. In practice this is >> not true for at list the MIPS target. >> >> The current patch fixes that by loading the bits 23-12 with a separate >> instruction, and using loads with address writeback, independently of >> the value of mem_idx. In total this allow a 24-bit offset, which is a >> lot more than needed. > > Would probably be good to assert() that the bits above 24 are zero, > though. > >> Cc: Andrzej Zaborowski <balrogg@gmail.com> >> Cc: Peter Maydell <peter.maydell@linaro.org> >> Cc: qemu-stable@nongnu.org >> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> >> --- >> tcg/arm/tcg-target.c | 73 +++++++++++++++++++++++++------------------------- >> 1 file changed, 37 insertions(+), 36 deletions(-) >> >> diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c >> index 737200e..6cde512 100644 >> --- a/tcg/arm/tcg-target.c >> +++ b/tcg/arm/tcg-target.c >> @@ -624,6 +624,19 @@ static inline void tcg_out_ld32_12(TCGContext *s, int cond, >> (rn << 16) | (rd << 12) | ((-im) & 0xfff)); >> } >> >> +/* Offset pre-increment with base writeback. */ >> +static inline void tcg_out_ld32_12wb(TCGContext *s, int cond, >> + int rd, int rn, tcg_target_long im) >> +{ > > Worth asserting that rd != rn ? (that's an UNPREDICTABLE case > for ldr with writeback) > >> + if (im >= 0) { >> + tcg_out32(s, (cond << 28) | 0x05b00000 | >> + (rn << 16) | (rd << 12) | (im & 0xfff)); >> + } else { >> + tcg_out32(s, (cond << 28) | 0x05300000 | >> + (rn << 16) | (rd << 12) | ((-im) & 0xfff)); >> + } > > you could avoid the if() by just setting the U bit using "(im >= 0) << 23" > Clearer? Dunno. You also have to negate the im value so it's not enough to just change the opcode. Laurent > Looks OK otherwise (though I haven't tested it.) > > Random aside: we emit a pretty long string of COND_EQ predicated > code in these functions, especially in the 64 bit address and > byteswapped cases. It might be interesting to see if performance > on an A9, say, was any better if we just did a conditional branch > over it instead... Probably borderline to no-effect, though. > > -- PMM > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops 2012-10-10 10:00 ` Laurent Desnogues @ 2012-10-10 10:08 ` Peter Maydell 0 siblings, 0 replies; 12+ messages in thread From: Peter Maydell @ 2012-10-10 10:08 UTC (permalink / raw) To: Laurent Desnogues; +Cc: qemu-devel, Aurelien Jarno, qemu-stable On 10 October 2012 11:00, Laurent Desnogues <laurent.desnogues@gmail.com> wrote: > On Tue, Oct 9, 2012 at 11:13 PM, Peter Maydell <peter.maydell@linaro.org> wrote: >> On 9 October 2012 21:30, Aurelien Jarno <aurelien@aurel32.net> wrote: >>> + if (im >= 0) { >>> + tcg_out32(s, (cond << 28) | 0x05b00000 | >>> + (rn << 16) | (rd << 12) | (im & 0xfff)); >>> + } else { >>> + tcg_out32(s, (cond << 28) | 0x05300000 | >>> + (rn << 16) | (rd << 12) | ((-im) & 0xfff)); >>> + } >> >> you could avoid the if() by just setting the U bit using "(im >= 0) << 23" >> Clearer? Dunno. > > You also have to negate the im value so it's not enough to just > change the opcode. Doh, of course. Forget what I wrote, then :-) -- PMM ^ permalink raw reply [flat|nested] 12+ messages in thread
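Putting the two remarks together, a branch-free variant would have to both
select the U bit and negate the immediate, along these lines (an untested
sketch, not part of the series):

    static inline void tcg_out_ld32_12wb(TCGContext *s, int cond,
                                         int rd, int rn, tcg_target_long im)
    {
        /* 0x05300000 is ldr with pre-index and writeback; bit 23 is the
           U (add offset) bit, set here when im is non-negative. */
        tcg_out32(s, (cond << 28) | 0x05300000 | ((im >= 0) << 23) |
                  (rn << 16) | (rd << 12) | ((im >= 0 ? im : -im) & 0xfff));
    }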
* [Qemu-devel] [PATCH 2/5] tcg/arm: fix cross-endian qemu_st16
  2012-10-09 20:30 [Qemu-devel] [PATCH 0/5] tcg/arm: fixes and optimizations Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops Aurelien Jarno
@ 2012-10-09 20:30 ` Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 3/5] target-openrisc: remove conflicting definitions from cpu.h Aurelien Jarno
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Aurelien Jarno @ 2012-10-09 20:30 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-stable, Aurelien Jarno

The bswap16 TCG opcode assumes that the high bytes of the temp are equal
to 0 before calling it. The ARM backend implementation relies on this
assumption to slightly optimize the generated code.

The same implementation is called to implement the cross-endian qemu_st16
opcode, where this assumption does not hold anymore. One way to fix that
would be to zero the high bytes before calling it. Given that the store
instruction just ignores them, it is possible to provide a slightly more
optimized version. With ARMv6+ the rev16 instruction does the work
correctly. For lower ARM versions the patch provides a version which
behaves correctly with non-zero high bytes, but fills them with junk.

Cc: Andrzej Zaborowski <balrogg@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-stable@nongnu.org
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 6cde512..3191903 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -596,6 +596,22 @@ static inline void tcg_out_bswap16(TCGContext *s, int cond, int rd, int rn)
     }
 }
 
+/* swap the two low bytes assuming that the two high input bytes and the
+   two high output bit can hold any value. */
+static inline void tcg_out_bswap16st(TCGContext *s, int cond, int rd, int rn)
+{
+    if (use_armv6_instructions) {
+        /* rev16 */
+        tcg_out32(s, 0x06bf0fb0 | (cond << 28) | (rd << 12) | rn);
+    } else {
+        tcg_out_dat_reg(s, cond, ARITH_MOV,
+                        TCG_REG_R8, 0, rn, SHIFT_IMM_LSR(8));
+        tcg_out_dat_imm(s, cond, ARITH_AND, TCG_REG_R8, TCG_REG_R8, 0xff);
+        tcg_out_dat_reg(s, cond, ARITH_ORR,
+                        rd, TCG_REG_R8, rn, SHIFT_IMM_LSL(8));
+    }
+}
+
 static inline void tcg_out_bswap32(TCGContext *s, int cond, int rd, int rn)
 {
     if (use_armv6_instructions) {
@@ -1347,7 +1363,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
         break;
     case 1:
         if (bswap) {
-            tcg_out_bswap16(s, COND_EQ, TCG_REG_R0, data_reg);
+            tcg_out_bswap16st(s, COND_EQ, TCG_REG_R0, data_reg);
             tcg_out_st16_r(s, COND_EQ, TCG_REG_R0, addr_reg, TCG_REG_R1);
         } else {
             tcg_out_st16_r(s, COND_EQ, data_reg, addr_reg, TCG_REG_R1);
@@ -1433,7 +1449,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
         break;
     case 1:
         if (bswap) {
-            tcg_out_bswap16(s, COND_AL, TCG_REG_R0, data_reg);
+            tcg_out_bswap16st(s, COND_AL, TCG_REG_R0, data_reg);
             tcg_out_st16_8(s, COND_AL, TCG_REG_R0, addr_reg, 0);
         } else {
             tcg_out_st16_8(s, COND_AL, data_reg, addr_reg, 0);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
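For reference, a plain C model of what the new ARMv5 fallback computes
(illustration only, mirroring mov r8, rn, lsr #8; and r8, r8, #0xff;
orr rd, r8, rn, lsl #8):

    static uint32_t bswap16st_model(uint32_t rn)
    {
        uint32_t r8 = (rn >> 8) & 0xff;   /* byte 1 of the input */
        return r8 | (rn << 8);            /* low halfword byte-swapped, bits 31:16 are junk */
    }

For example rn = 0x1234ccdd gives 0x34ccddcc: the halfword that reaches the
store is 0xddcc, and the junk in the high bytes is ignored by the 16-bit store.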
* [Qemu-devel] [PATCH 3/5] target-openrisc: remove conflicting definitions from cpu.h
  2012-10-09 20:30 [Qemu-devel] [PATCH 0/5] tcg/arm: fixes and optimizations Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 1/5] tcg/arm: fix TLB access in qemu-ld/st ops Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 2/5] tcg/arm: fix cross-endian qemu_st16 Aurelien Jarno
@ 2012-10-09 20:30 ` Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 4/5] tcg/arm: optimize tcg_out_goto_label Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump Aurelien Jarno
  4 siblings, 0 replies; 12+ messages in thread
From: Aurelien Jarno @ 2012-10-09 20:30 UTC (permalink / raw)
To: qemu-devel; +Cc: Jia Liu, Aurelien Jarno, qemu-stable

On an ARM host, the register definitions from cpu.h clash with
/usr/include/sys/ucontext.h. As they are unused, just remove them.

Cc: Jia Liu <proljc@gmail.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 target-openrisc/cpu.h |   18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/target-openrisc/cpu.h b/target-openrisc/cpu.h
index de21a87..244788c 100644
--- a/target-openrisc/cpu.h
+++ b/target-openrisc/cpu.h
@@ -89,24 +89,6 @@ enum {
 /* Interrupt */
 #define NR_IRQS  32
 
-/* Registers */
-enum {
-    R0 = 0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10,
-    R11, R12, R13, R14, R15, R16, R17, R18, R19, R20,
-    R21, R22, R23, R24, R25, R26, R27, R28, R29, R30,
-    R31
-};
-
-/* Register aliases */
-enum {
-    R_ZERO = R0,
-    R_SP = R1,
-    R_FP = R2,
-    R_LR = R9,
-    R_RV = R11,
-    R_RVH = R12
-};
-
 /* Unit presece register */
 enum {
     UPR_UP = (1 << 0),
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH 4/5] tcg/arm: optimize tcg_out_goto_label
  2012-10-09 20:30 [Qemu-devel] [PATCH 0/5] tcg/arm: fixes and optimizations Aurelien Jarno
  ` (2 preceding siblings ...)
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 3/5] target-openrisc: remove conflicting definitions from cpu.h Aurelien Jarno
@ 2012-10-09 20:30 ` Aurelien Jarno
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump Aurelien Jarno
  4 siblings, 0 replies; 12+ messages in thread
From: Aurelien Jarno @ 2012-10-09 20:30 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, Aurelien Jarno

tcg_out_goto_label is only used inside a TB, so there is no reason for
not using 24-bit branches even for COND_AL.

Cc: Andrzej Zaborowski <balrogg@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 3191903..fafbd5d 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -959,14 +959,9 @@ static inline void tcg_out_goto_label(TCGContext *s, int cond, int label_index)
 {
     TCGLabel *l = &s->labels[label_index];
 
-    if (l->has_value)
+    if (l->has_value) {
         tcg_out_goto(s, cond, l->u.value);
-    else if (cond == COND_AL) {
-        tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
-        tcg_out_reloc(s, s->code_ptr, R_ARM_ABS32, label_index, 31337);
-        s->code_ptr += 4;
     } else {
-        /* Probably this should be preferred even for COND_AL... */
         tcg_out_reloc(s, s->code_ptr, R_ARM_PC24, label_index, 31337);
         tcg_out_b_noaddr(s, cond);
     }
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump
  2012-10-09 20:30 [Qemu-devel] [PATCH 0/5] tcg/arm: fixes and optimizations Aurelien Jarno
  ` (3 preceding siblings ...)
  2012-10-09 20:30 ` [Qemu-devel] [PATCH 4/5] tcg/arm: optimize tcg_out_goto_label Aurelien Jarno
@ 2012-10-09 20:30 ` Aurelien Jarno
  2012-10-10 13:21   ` Laurent Desnogues
  4 siblings, 1 reply; 12+ messages in thread
From: Aurelien Jarno @ 2012-10-09 20:30 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, Aurelien Jarno

Use a ldr pc, [pc, #-4] kind of branch for direct jumps. This removes
the need to flush the icache on TB linking, and allows the limit on the
code generation buffer to be removed.

This improves the boot-up speed of a MIPS guest by 11%.

Cc: Andrzej Zaborowski <balrogg@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 exec-all.h           |   24 ++++--------------------
 exec.c               |    4 ----
 tcg/arm/tcg-target.c |    7 +------
 3 files changed, 5 insertions(+), 30 deletions(-)

diff --git a/exec-all.h b/exec-all.h
index 6516da0..662b916 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -224,26 +224,10 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
 #elif defined(__arm__)
 static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
 {
-#if !QEMU_GNUC_PREREQ(4, 1)
-    register unsigned long _beg __asm ("a1");
-    register unsigned long _end __asm ("a2");
-    register unsigned long _flg __asm ("a3");
-#endif
-
-    /* we could use a ldr pc, [pc, #-4] kind of branch and avoid the flush */
-    *(uint32_t *)jmp_addr =
-        (*(uint32_t *)jmp_addr & ~0xffffff)
-        | (((addr - (jmp_addr + 8)) >> 2) & 0xffffff);
-
-#if QEMU_GNUC_PREREQ(4, 1)
-    __builtin___clear_cache((char *) jmp_addr, (char *) jmp_addr + 4);
-#else
-    /* flush icache */
-    _beg = jmp_addr;
-    _end = jmp_addr + 4;
-    _flg = 0;
-    __asm __volatile__ ("swi 0x9f0002" : : "r" (_beg), "r" (_end), "r" (_flg));
-#endif
+    /* Patch the branch destination. It uses a ldr pc, [pc, #-4] kind
+       of branch so we write absolute address and we don't need to
+       flush icache. */
+    *(uint32_t *)jmp_addr = addr;
 }
 #elif defined(__sparc__)
 void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr);
diff --git a/exec.c b/exec.c
index 7899042..8d115ac 100644
--- a/exec.c
+++ b/exec.c
@@ -546,10 +546,6 @@ static void code_gen_alloc(unsigned long tb_size)
     start = (void *) 0x40000000UL;
     if (code_gen_buffer_size > (512 * 1024 * 1024))
         code_gen_buffer_size = (512 * 1024 * 1024);
-#elif defined(__arm__)
-    /* Keep the buffer no bigger than 16MB to branch between blocks */
-    if (code_gen_buffer_size > 16 * 1024 * 1024)
-        code_gen_buffer_size = 16 * 1024 * 1024;
 #elif defined(__s390x__)
     /* Map the buffer so that we can use direct calls and branches. */
     /* We have a +- 4GB range on the branches; leave some slop. */
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index fafbd5d..e04cfa7 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -1501,14 +1501,9 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_goto_tb:
         if (s->tb_jmp_offset) {
             /* Direct jump method */
-#if defined(USE_DIRECT_JUMP)
-            s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
-            tcg_out_b_noaddr(s, COND_AL);
-#else
             tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
             s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
-            tcg_out32(s, 0);
-#endif
+            tcg_out32(s, (int)s->code_ptr + 4);
         } else {
             /* Indirect jump method */
 #if 1
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread
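Schematically (not literal TCG output), the code emitted for a direct jump
under this scheme is:

    ldr   pc, [pc, #-4]    @ pc reads as insn+8, so this loads the word at insn+4
    .word <destination>    @ initially the address just after the word, i.e. fall through

and tb_set_jmp_target1() only rewrites that literal word. Since the word is
read through the data side by the ldr, no icache maintenance is needed when
TBs are linked or unlinked.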
* Re: [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump 2012-10-09 20:30 ` [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump Aurelien Jarno @ 2012-10-10 13:21 ` Laurent Desnogues 2012-10-10 14:28 ` Aurelien Jarno 0 siblings, 1 reply; 12+ messages in thread From: Laurent Desnogues @ 2012-10-10 13:21 UTC (permalink / raw) To: Aurelien Jarno; +Cc: Peter Maydell, qemu-devel On Tue, Oct 9, 2012 at 10:30 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > Use ldr pc, [pc, #-4] kind of branch for direct jump. This removes the > need to flush the icache on TB linking, and allow to remove the limit > on the code generation buffer. I'm not sure I like it. In general having data in the middle of code will increase I/D cache and I/D TLB pressure. > This improves the boot-up speed of a MIPS guest by 11%. Boot speed is very specific. Did you test some other code? Also what was your host? Testing on a quad core Cortex-A9, using all of your patches (including TCG optimizations), I get this running nbench i386 in user mode: TEST : Iter/sec. : Old Index : New Index : : Pentium 90 : AMD K6/233 --------------------:------------:------------:----------- NUMERIC SORT : 119.48 : 3.06 : 1.01 STRING SORT : 7.7907 : 3.48 : 0.54 BITFIELD : 2.2049e+07 : 3.78 : 0.79 FP EMULATION : 5.094 : 2.44 : 0.56 FOURIER : 483.73 : 0.55 : 0.31 ASSIGNMENT : 1.778 : 6.77 : 1.75 IDEA : 341.43 : 5.22 : 1.55 HUFFMAN : 45.942 : 1.27 : 0.41 NEURAL NET : 0.16667 : 0.27 : 0.11 LU DECOMPOSITION : 5.969 : 0.31 : 0.22 ===================ORIGINAL BYTEMARK RESULTS============== INTEGER INDEX : 3.319 FLOATING-POINT INDEX: 0.357 =======================LINUX DATA BELOW=================== MEMORY INDEX : 0.907 INTEGER INDEX : 0.774 FLOATING-POINT INDEX: 0.198 Not using this patch, I get: TEST : Iter/sec. : Old Index : New Index : : Pentium 90 : AMD K6/233 --------------------:------------:------------:----------- NUMERIC SORT : 121.88 : 3.13 : 1.03 STRING SORT : 7.8438 : 3.50 : 0.54 BITFIELD : 2.2597e+07 : 3.88 : 0.81 FP EMULATION : 5.1424 : 2.47 : 0.57 FOURIER : 466.04 : 0.53 : 0.30 ASSIGNMENT : 1.809 : 6.88 : 1.79 IDEA : 359.28 : 5.50 : 1.63 HUFFMAN : 46.225 : 1.28 : 0.41 NEURAL NET : 0.16644 : 0.27 : 0.11 LU DECOMPOSITION : 5.77 : 0.30 : 0.22 ===================ORIGINAL BYTEMARK RESULTS============== INTEGER INDEX : 3.384 FLOATING-POINT INDEX: 0.349 =======================LINUX DATA BELOW=================== MEMORY INDEX : 0.922 INTEGER INDEX : 0.790 FLOATING-POINT INDEX: 0.193 This patch doesn't bring any speedup in that case. 
I guess we need more testing as a synthetic benchmark is as specific as kernel booting :-) Laurent > Cc: Andrzej Zaborowski <balrogg@gmail.com> > Cc: Peter Maydell <peter.maydell@linaro.org> > Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> > --- > exec-all.h | 24 ++++-------------------- > exec.c | 4 ---- > tcg/arm/tcg-target.c | 7 +------ > 3 files changed, 5 insertions(+), 30 deletions(-) > > diff --git a/exec-all.h b/exec-all.h > index 6516da0..662b916 100644 > --- a/exec-all.h > +++ b/exec-all.h > @@ -224,26 +224,10 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr) > #elif defined(__arm__) > static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr) > { > -#if !QEMU_GNUC_PREREQ(4, 1) > - register unsigned long _beg __asm ("a1"); > - register unsigned long _end __asm ("a2"); > - register unsigned long _flg __asm ("a3"); > -#endif > - > - /* we could use a ldr pc, [pc, #-4] kind of branch and avoid the flush */ > - *(uint32_t *)jmp_addr = > - (*(uint32_t *)jmp_addr & ~0xffffff) > - | (((addr - (jmp_addr + 8)) >> 2) & 0xffffff); > - > -#if QEMU_GNUC_PREREQ(4, 1) > - __builtin___clear_cache((char *) jmp_addr, (char *) jmp_addr + 4); > -#else > - /* flush icache */ > - _beg = jmp_addr; > - _end = jmp_addr + 4; > - _flg = 0; > - __asm __volatile__ ("swi 0x9f0002" : : "r" (_beg), "r" (_end), "r" (_flg)); > -#endif > + /* Patch the branch destination. It uses a ldr pc, [pc, #-4] kind > + of branch so we write absolute address and we don't need to > + flush icache. */ > + *(uint32_t *)jmp_addr = addr; > } > #elif defined(__sparc__) > void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr); > diff --git a/exec.c b/exec.c > index 7899042..8d115ac 100644 > --- a/exec.c > +++ b/exec.c > @@ -546,10 +546,6 @@ static void code_gen_alloc(unsigned long tb_size) > start = (void *) 0x40000000UL; > if (code_gen_buffer_size > (512 * 1024 * 1024)) > code_gen_buffer_size = (512 * 1024 * 1024); > -#elif defined(__arm__) > - /* Keep the buffer no bigger than 16MB to branch between blocks */ > - if (code_gen_buffer_size > 16 * 1024 * 1024) > - code_gen_buffer_size = 16 * 1024 * 1024; > #elif defined(__s390x__) > /* Map the buffer so that we can use direct calls and branches. */ > /* We have a +- 4GB range on the branches; leave some slop. */ > diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c > index fafbd5d..e04cfa7 100644 > --- a/tcg/arm/tcg-target.c > +++ b/tcg/arm/tcg-target.c > @@ -1501,14 +1501,9 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc, > case INDEX_op_goto_tb: > if (s->tb_jmp_offset) { > /* Direct jump method */ > -#if defined(USE_DIRECT_JUMP) > - s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf; > - tcg_out_b_noaddr(s, COND_AL); > -#else > tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4); > s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf; > - tcg_out32(s, 0); > -#endif > + tcg_out32(s, (int)s->code_ptr + 4); > } else { > /* Indirect jump method */ > #if 1 > -- > 1.7.10.4 > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump 2012-10-10 13:21 ` Laurent Desnogues @ 2012-10-10 14:28 ` Aurelien Jarno 2012-10-10 14:43 ` Laurent Desnogues 0 siblings, 1 reply; 12+ messages in thread From: Aurelien Jarno @ 2012-10-10 14:28 UTC (permalink / raw) To: Laurent Desnogues; +Cc: Peter Maydell, qemu-devel On Wed, Oct 10, 2012 at 03:21:48PM +0200, Laurent Desnogues wrote: > On Tue, Oct 9, 2012 at 10:30 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > Use ldr pc, [pc, #-4] kind of branch for direct jump. This removes the > > need to flush the icache on TB linking, and allow to remove the limit > > on the code generation buffer. > > I'm not sure I like it. In general having data in the middle > of code will increase I/D cache and I/D TLB pressure. Agreed. On the other hand, this patch remove the synchronization of the instruction cache for TB linking/unlinking. > > This improves the boot-up speed of a MIPS guest by 11%. > > Boot speed is very specific. Did you test some other code? > Also what was your host? I tested it on a Cortex-A8 machine. I have only tested MIPS, but I can do more tests, like running the openssl testsuite in the emulated guest. > Testing on a quad core Cortex-A9, using all of your patches > (including TCG optimizations), I get this running nbench i386 > in user mode: > > TEST : Iter/sec. : Old Index : New Index > : : Pentium 90 : AMD K6/233 > --------------------:------------:------------:----------- > NUMERIC SORT : 119.48 : 3.06 : 1.01 > STRING SORT : 7.7907 : 3.48 : 0.54 > BITFIELD : 2.2049e+07 : 3.78 : 0.79 > FP EMULATION : 5.094 : 2.44 : 0.56 > FOURIER : 483.73 : 0.55 : 0.31 > ASSIGNMENT : 1.778 : 6.77 : 1.75 > IDEA : 341.43 : 5.22 : 1.55 > HUFFMAN : 45.942 : 1.27 : 0.41 > NEURAL NET : 0.16667 : 0.27 : 0.11 > LU DECOMPOSITION : 5.969 : 0.31 : 0.22 > ===================ORIGINAL BYTEMARK RESULTS============== > INTEGER INDEX : 3.319 > FLOATING-POINT INDEX: 0.357 > =======================LINUX DATA BELOW=================== > MEMORY INDEX : 0.907 > INTEGER INDEX : 0.774 > FLOATING-POINT INDEX: 0.198 > > Not using this patch, I get: > > TEST : Iter/sec. : Old Index : New Index > : : Pentium 90 : AMD K6/233 > --------------------:------------:------------:----------- > NUMERIC SORT : 121.88 : 3.13 : 1.03 > STRING SORT : 7.8438 : 3.50 : 0.54 > BITFIELD : 2.2597e+07 : 3.88 : 0.81 > FP EMULATION : 5.1424 : 2.47 : 0.57 > FOURIER : 466.04 : 0.53 : 0.30 > ASSIGNMENT : 1.809 : 6.88 : 1.79 > IDEA : 359.28 : 5.50 : 1.63 > HUFFMAN : 46.225 : 1.28 : 0.41 > NEURAL NET : 0.16644 : 0.27 : 0.11 > LU DECOMPOSITION : 5.77 : 0.30 : 0.22 > ===================ORIGINAL BYTEMARK RESULTS============== > INTEGER INDEX : 3.384 > FLOATING-POINT INDEX: 0.349 > =======================LINUX DATA BELOW=================== > MEMORY INDEX : 0.922 > INTEGER INDEX : 0.790 > FLOATING-POINT INDEX: 0.193 > > This patch doesn't bring any speedup in that case. > > I guess we need more testing as a synthetic benchmark is as > specific as kernel booting :-) > This doesn't really surprise me. The goal of the patch is to remove the limit of 16MB for the generated code. I really doubt you reach such a limit in user mode unless you use some complex code. On the other hand in system mode, this can be already reached once the whole guest kernel is translated, so cached code is dropped and has to be re-translated regularly. Re-translating guest code is clearly more expensive than the increase of I/D cache and I/D TLB pressure. 
The other way to allow more than 16MB of generated code would be to disable direct jump on ARM. It adds one 32-bit constant loading + one memory load, but then you don't have the I/D cache and TLB issue. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH 5/5] tcg/arm: improve direct jump 2012-10-10 14:28 ` Aurelien Jarno @ 2012-10-10 14:43 ` Laurent Desnogues 0 siblings, 0 replies; 12+ messages in thread From: Laurent Desnogues @ 2012-10-10 14:43 UTC (permalink / raw) To: Aurelien Jarno; +Cc: Peter Maydell, qemu-devel On Wed, Oct 10, 2012 at 4:28 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Wed, Oct 10, 2012 at 03:21:48PM +0200, Laurent Desnogues wrote: >> On Tue, Oct 9, 2012 at 10:30 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > Use ldr pc, [pc, #-4] kind of branch for direct jump. This removes the >> > need to flush the icache on TB linking, and allow to remove the limit >> > on the code generation buffer. >> >> I'm not sure I like it. In general having data in the middle >> of code will increase I/D cache and I/D TLB pressure. > > Agreed. On the other hand, this patch remove the synchronization of > the instruction cache for TB linking/unlinking. TB linking/unlinking should happen less often than code execution. >> > This improves the boot-up speed of a MIPS guest by 11%. >> >> Boot speed is very specific. Did you test some other code? >> Also what was your host? > > I tested it on a Cortex-A8 machine. I have only tested MIPS, but I can > do more tests, like running the openssl testsuite in the emulated guest. Yes, please. [...] > This doesn't really surprise me. The goal of the patch is to remove the > limit of 16MB for the generated code. I really doubt you reach such a > limit in user mode unless you use some complex code. > > On the other hand in system mode, this can be already reached once the > whole guest kernel is translated, so cached code is dropped and has to > be re-translated regularly. Re-translating guest code is clearly more > expensive than the increase of I/D cache and I/D TLB pressure. Ha yes, that's a real problem. What about having some define and/or runtime flag to keep both caches sync and your ldr PC change in QEMU? > The other way to allow more than 16MB of generated code would be to > disable direct jump on ARM. It adds one 32-bit constant loading + one > memory load, but then you don't have the I/D cache and TLB issue. The performance hit would be even worse :-) Laurent ^ permalink raw reply [flat|nested] 12+ messages in thread