qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization
@ 2013-08-26 21:00 Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 1/7] tcg: Tidy generated code for tcg_outN Richard Henderson
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien


The following is a rebase of the i386 portion of the v2 patch set,
which incorporated Aurelien's minor comments on v1.  This pull
request omits the ARM portion of the patch set, as that has yet to
receive any review.

I'd like to get this patch set pulled, because I have three other
patch sets that depend on it.


r~


The following changes since commit f7ad538e1ea130c8b6f3abb06ad6c856242c799e:

  Merge remote-tracking branch 'stefanha/block' into staging (2013-08-26 09:19:50 -0500)

are available in the git repository at:


  git://github.com/rth7680/qemu.git tcg-ool

for you to fetch changes up to 401c227b0a1134245ec61c6c5a9997cfc963c8e4:

  tcg-i386: Use new return-argument ld/st helpers (2013-08-26 13:31:54 -0700)

----------------------------------------------------------------
Richard Henderson (7):
      tcg: Tidy generated code for tcg_outN
      tcg-i386: Add and use tcg_out64
      tcg-i386: Try pc-relative lea for constant formation
      tcg-i386: Tidy qemu_ld/st slow path
      tcg: Add mmu helpers that take a return address argument
      tcg: Tidy softmmu_template.h
      tcg-i386: Use new return-argument ld/st helpers

 include/exec/exec-all.h         |  13 +-
 include/exec/softmmu_defs.h     |  46 +++---
 include/exec/softmmu_template.h | 309 ++++++++++++++++------------------------
 tcg/i386/tcg-target.c           | 274 +++++++++++++++++------------------
 tcg/tcg.c                       |  17 ++-
 5 files changed, 295 insertions(+), 364 deletions(-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PULL 1/7] tcg: Tidy generated code for tcg_outN
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 2/7] tcg-i386: Add and use tcg_out64 Richard Henderson
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

Aliasing was forcing s->code_ptr to be re-read after the store.
Keep the pointer in a local variable to help the compiler.
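
The aliasing problem can be seen in a standalone sketch (the `Ctx` type and
function names below are illustrative, not QEMU's): because the store goes
through a `uint16_t *`, the compiler must assume the store may have modified
`code_ptr` itself and reload the field; caching the pointer in a local lets
it stay in a register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *code_ptr;
    uint8_t buf[16];
} Ctx;

/* Old form: the uint16_t store may alias s->code_ptr as far as the
   compiler can prove, so `s->code_ptr += 2` re-reads the field. */
static void out16_old(Ctx *s, uint16_t v)
{
    *(uint16_t *)s->code_ptr = v;
    s->code_ptr += 2;
}

/* New form: the pointer lives in a local, so the compiler can keep
   it in a register across the store. */
static void out16_new(Ctx *s, uint16_t v)
{
    uint8_t *p = s->code_ptr;
    *(uint16_t *)p = v;
    s->code_ptr = p + 2;
}
```

Both forms emit identical bytes; the difference is only in the code the
compiler can generate for the emitter itself.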

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/tcg.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index dac8224..42c95af 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -121,14 +121,16 @@ static inline void tcg_out8(TCGContext *s, uint8_t v)
 
 static inline void tcg_out16(TCGContext *s, uint16_t v)
 {
-    *(uint16_t *)s->code_ptr = v;
-    s->code_ptr += 2;
+    uint8_t *p = s->code_ptr;
+    *(uint16_t *)p = v;
+    s->code_ptr = p + 2;
 }
 
 static inline void tcg_out32(TCGContext *s, uint32_t v)
 {
-    *(uint32_t *)s->code_ptr = v;
-    s->code_ptr += 4;
+    uint8_t *p = s->code_ptr;
+    *(uint32_t *)p = v;
+    s->code_ptr = p + 4;
 }
 
 /* label relocation processing */
-- 
1.8.1.4


* [Qemu-devel] [PULL 2/7] tcg-i386: Add and use tcg_out64
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 1/7] tcg: Tidy generated code for tcg_outN Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 3/7] tcg-i386: Try pc-relative lea for constant formation Richard Henderson
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

No point in splitting the write into 32-bit pieces.
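
As an aside on the code being removed: the odd-looking `arg >> 31 >> 1`
splits a 32-place shift into two defined steps, since shifting a value by
its full width is undefined behavior in C.  A minimal illustration (my
naming, not QEMU's):

```c
#include <assert.h>
#include <stdint.h>

/* `v >> 32` on a 32-bit operand is undefined behavior in C, so the
   pre-patch code wrote `v >> 31 >> 1`: each step is a defined shift,
   and the composition extracts the high half when v is 64 bits wide
   (and yields 0 when the operand only held 32 significant bits). */
static uint32_t high_half(uint64_t v)
{
    return (uint32_t)(v >> 31 >> 1);
}
```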

Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/i386/tcg-target.c | 3 +--
 tcg/tcg.c             | 7 +++++++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 87eeab3..841bd75 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -552,8 +552,7 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
         tcg_out32(s, arg);
     } else {
         tcg_out_opc(s, OPC_MOVL_Iv + P_REXW + LOWREGMASK(ret), 0, ret, 0);
-        tcg_out32(s, arg);
-        tcg_out32(s, arg >> 31 >> 1);
+        tcg_out64(s, arg);
     }
 }
 
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 42c95af..19bd5a3 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -133,6 +133,13 @@ static inline void tcg_out32(TCGContext *s, uint32_t v)
     s->code_ptr = p + 4;
 }
 
+static inline void tcg_out64(TCGContext *s, uint64_t v)
+{
+    uint8_t *p = s->code_ptr;
+    *(uint64_t *)p = v;
+    s->code_ptr = p + 8;
+}
+
 /* label relocation processing */
 
 static void tcg_out_reloc(TCGContext *s, uint8_t *code_ptr, int type,
-- 
1.8.1.4


* [Qemu-devel] [PULL 3/7] tcg-i386: Try pc-relative lea for constant formation
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 1/7] tcg: Tidy generated code for tcg_outN Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 2/7] tcg-i386: Add and use tcg_out64 Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 4/7] tcg-i386: Tidy qemu_ld/st slow path Richard Henderson
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

Try a 7-byte pc-relative lea before falling back to the full 10-byte movq.
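
The reachability test the patch performs can be sketched as follows
(illustrative helper, not QEMU code): the constant is formable with a
rip-relative lea exactly when its distance from the end of the 7-byte
instruction fits in the signed 32-bit displacement field.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit constant can be formed with a 7-byte rip-relative lea iff
   target - (end of the lea instruction) fits in a signed 32-bit
   displacement; otherwise fall back to the 10-byte movq imm64. */
static int lea_reachable(int64_t target, int64_t code_ptr)
{
    int64_t diff = target - (code_ptr + 7);   /* 7 = lea length */
    return diff == (int32_t)diff;
}
```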

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/i386/tcg-target.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 841bd75..8226171 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -541,19 +541,34 @@ static inline void tcg_out_mov(TCGContext *s, TCGType type,
 static void tcg_out_movi(TCGContext *s, TCGType type,
                          TCGReg ret, tcg_target_long arg)
 {
+    tcg_target_long diff;
+
     if (arg == 0) {
         tgen_arithr(s, ARITH_XOR, ret, ret);
         return;
-    } else if (arg == (uint32_t)arg || type == TCG_TYPE_I32) {
+    }
+    if (arg == (uint32_t)arg || type == TCG_TYPE_I32) {
         tcg_out_opc(s, OPC_MOVL_Iv + LOWREGMASK(ret), 0, ret, 0);
         tcg_out32(s, arg);
-    } else if (arg == (int32_t)arg) {
+        return;
+    }
+    if (arg == (int32_t)arg) {
         tcg_out_modrm(s, OPC_MOVL_EvIz + P_REXW, 0, ret);
         tcg_out32(s, arg);
-    } else {
-        tcg_out_opc(s, OPC_MOVL_Iv + P_REXW + LOWREGMASK(ret), 0, ret, 0);
-        tcg_out64(s, arg);
+        return;
     }
+
+    /* Try a 7 byte pc-relative lea before the 10 byte movq.  */
+    diff = arg - ((tcg_target_long)s->code_ptr + 7);
+    if (diff == (int32_t)diff) {
+        tcg_out_opc(s, OPC_LEA | P_REXW, ret, 0, 0);
+        tcg_out8(s, (LOWREGMASK(ret) << 3) | 5);
+        tcg_out32(s, diff);
+        return;
+    }
+
+    tcg_out_opc(s, OPC_MOVL_Iv + P_REXW + LOWREGMASK(ret), 0, ret, 0);
+    tcg_out64(s, arg);
 }
 
 static inline void tcg_out_pushi(TCGContext *s, tcg_target_long val)
-- 
1.8.1.4


* [Qemu-devel] [PULL 4/7] tcg-i386: Tidy qemu_ld/st slow path
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 3/7] tcg-i386: Try pc-relative lea for constant formation Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument Richard Henderson
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

Use the existing stack space for arguments rather than push/pop.
Use fewer ifdefs and more C ifs.
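
The new layout writes each argument at a fixed offset into the
already-reserved stack area; the running `ofs` in the patch amounts to the
following (a sketch with a made-up helper name, shown for the load slow
path).  Because nothing is pushed, no `stack_adjust` bookkeeping is needed
after the call.

```c
#include <assert.h>

/* Mirror of the `ofs += 4` sequence for the 32-bit ld slow path:
   env, addrlo, an addrhi word only for 64-bit guest addresses, then
   mem_index.  Returns the total bytes of outgoing arguments. */
static int ld_call_arg_bytes(int target_long_bits)
{
    int ofs = 0;
    ofs += 4;                       /* env */
    ofs += 4;                       /* addrlo */
    if (target_long_bits == 64) {
        ofs += 4;                   /* addrhi */
    }
    ofs += 4;                       /* mem_index */
    return ofs;
}
```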

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/i386/tcg-target.c | 165 ++++++++++++++++++++++----------------------------
 1 file changed, 74 insertions(+), 91 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 8226171..fba50f8 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -608,6 +608,14 @@ static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
     tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
 }
 
+static inline void tcg_out_sti(TCGContext *s, TCGType type, TCGReg base,
+                               tcg_target_long ofs, tcg_target_long val)
+{
+    int opc = OPC_MOVL_EvIz + (type == TCG_TYPE_I64 ? P_REXW : 0);
+    tcg_out_modrm_offset(s, opc, 0, base, ofs);
+    tcg_out32(s, val);
+}
+
 static void tcg_out_shifti(TCGContext *s, int subopc, int reg, int count)
 {
     /* Propagate an opcode prefix, such as P_DATA16.  */
@@ -1463,22 +1471,12 @@ static void add_qemu_ldst_label(TCGContext *s,
 /*
  * Generate code for the slow path for a load at the end of block
  */
-static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 {
-    int s_bits;
-    int opc = label->opc;
-    int mem_index = label->mem_index;
-#if TCG_TARGET_REG_BITS == 32
-    int stack_adjust;
-    int addrlo_reg = label->addrlo_reg;
-    int addrhi_reg = label->addrhi_reg;
-#endif
-    int data_reg = label->datalo_reg;
-    int data_reg2 = label->datahi_reg;
-    uint8_t *raddr = label->raddr;
-    uint8_t **label_ptr = &label->label_ptr[0];
-
-    s_bits = opc & 3;
+    int opc = l->opc;
+    int s_bits = opc & 3;
+    TCGReg data_reg;
+    uint8_t **label_ptr = &l->label_ptr[0];
 
     /* resolve label address */
     *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
@@ -1486,22 +1484,27 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
         *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
     }
 
-#if TCG_TARGET_REG_BITS == 32
-    tcg_out_pushi(s, mem_index);
-    stack_adjust = 4;
-    if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, addrhi_reg);
-        stack_adjust += 4;
+    if (TCG_TARGET_REG_BITS == 32) {
+        int ofs = 0;
+
+        tcg_out_st(s, TCG_TYPE_PTR, TCG_AREG0, TCG_REG_ESP, ofs);
+        ofs += 4;
+
+        tcg_out_st(s, TCG_TYPE_I32, l->addrlo_reg, TCG_REG_ESP, ofs);
+        ofs += 4;
+
+        if (TARGET_LONG_BITS == 64) {
+            tcg_out_st(s, TCG_TYPE_I32, l->addrhi_reg, TCG_REG_ESP, ofs);
+            ofs += 4;
+        }
+
+        tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
+    } else {
+        tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+        /* The second argument is already loaded with addrlo.  */
+        tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
+                     l->mem_index);
     }
-    tcg_out_push(s, addrlo_reg);
-    stack_adjust += 4;
-    tcg_out_push(s, TCG_AREG0);
-    stack_adjust += 4;
-#else
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
-    /* The second argument is already loaded with addrlo.  */
-    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], mem_index);
-#endif
 
     /* Code generation of qemu_ld/st's slow path calling MMU helper
 
@@ -1520,18 +1523,10 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
     tcg_out8(s, 5);
     /* Dummy backward jump having information of fast path'pc for MMU helpers */
     tcg_out8(s, OPC_JMP_long);
-    *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+    *(int32_t *)s->code_ptr = (int32_t)(l->raddr - s->code_ptr - 4);
     s->code_ptr += 4;
 
-#if TCG_TARGET_REG_BITS == 32
-    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
-        /* Pop and discard.  This is 2 bytes smaller than the add.  */
-        tcg_out_pop(s, TCG_REG_ECX);
-    } else if (stack_adjust != 0) {
-        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
-    }
-#endif
-
+    data_reg = l->datalo_reg;
     switch(opc) {
     case 0 | 4:
         tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
@@ -1559,10 +1554,10 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
         } else if (data_reg == TCG_REG_EDX) {
             /* xchg %edx, %eax */
             tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
+            tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EAX);
         } else {
             tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
+            tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EDX);
         }
         break;
     default:
@@ -1570,28 +1565,17 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
     }
 
     /* Jump to the code corresponding to next IR of qemu_st */
-    tcg_out_jmp(s, (tcg_target_long)raddr);
+    tcg_out_jmp(s, (tcg_target_long)l->raddr);
 }
 
 /*
  * Generate code for the slow path for a store at the end of block
  */
-static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 {
-    int s_bits;
-    int stack_adjust;
-    int opc = label->opc;
-    int mem_index = label->mem_index;
-    int data_reg = label->datalo_reg;
-#if TCG_TARGET_REG_BITS == 32
-    int data_reg2 = label->datahi_reg;
-    int addrlo_reg = label->addrlo_reg;
-    int addrhi_reg = label->addrhi_reg;
-#endif
-    uint8_t *raddr = label->raddr;
-    uint8_t **label_ptr = &label->label_ptr[0];
-
-    s_bits = opc & 3;
+    int opc = l->opc;
+    int s_bits = opc & 3;
+    uint8_t **label_ptr = &l->label_ptr[0];
 
     /* resolve label address */
     *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
@@ -1599,31 +1583,37 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
         *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
     }
 
-#if TCG_TARGET_REG_BITS == 32
-    tcg_out_pushi(s, mem_index);
-    stack_adjust = 4;
-    if (opc == 3) {
-        tcg_out_push(s, data_reg2);
-        stack_adjust += 4;
-    }
-    tcg_out_push(s, data_reg);
-    stack_adjust += 4;
-    if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, addrhi_reg);
-        stack_adjust += 4;
+    if (TCG_TARGET_REG_BITS == 32) {
+        int ofs = 0;
+
+        tcg_out_st(s, TCG_TYPE_PTR, TCG_AREG0, TCG_REG_ESP, ofs);
+        ofs += 4;
+
+        tcg_out_st(s, TCG_TYPE_I32, l->addrlo_reg, TCG_REG_ESP, ofs);
+        ofs += 4;
+
+        if (TARGET_LONG_BITS == 64) {
+            tcg_out_st(s, TCG_TYPE_I32, l->addrhi_reg, TCG_REG_ESP, ofs);
+            ofs += 4;
+        }
+
+        tcg_out_st(s, TCG_TYPE_I32, l->datalo_reg, TCG_REG_ESP, ofs);
+        ofs += 4;
+
+        if (opc == 3) {
+            tcg_out_st(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_ESP, ofs);
+            ofs += 4;
+        }
+
+        tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
+    } else {
+        tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+        /* The second argument is already loaded with addrlo.  */
+        tcg_out_mov(s, (opc == 3 ? TCG_TYPE_I64 : TCG_TYPE_I32),
+                    tcg_target_call_iarg_regs[2], l->datalo_reg);
+        tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3],
+                     l->mem_index);
     }
-    tcg_out_push(s, addrlo_reg);
-    stack_adjust += 4;
-    tcg_out_push(s, TCG_AREG0);
-    stack_adjust += 4;
-#else
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
-    /* The second argument is already loaded with addrlo.  */
-    tcg_out_mov(s, (opc == 3 ? TCG_TYPE_I64 : TCG_TYPE_I32),
-                tcg_target_call_iarg_regs[2], data_reg);
-    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3], mem_index);
-    stack_adjust = 0;
-#endif
 
     /* Code generation of qemu_ld/st's slow path calling MMU helper
 
@@ -1642,18 +1632,11 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
     tcg_out8(s, 5);
     /* Dummy backward jump having information of fast path'pc for MMU helpers */
     tcg_out8(s, OPC_JMP_long);
-    *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+    *(int32_t *)s->code_ptr = (int32_t)(l->raddr - s->code_ptr - 4);
     s->code_ptr += 4;
 
-    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
-        /* Pop and discard.  This is 2 bytes smaller than the add.  */
-        tcg_out_pop(s, TCG_REG_ECX);
-    } else if (stack_adjust != 0) {
-        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
-    }
-
     /* Jump to the code corresponding to next IR of qemu_st */
-    tcg_out_jmp(s, (tcg_target_long)raddr);
+    tcg_out_jmp(s, (tcg_target_long)l->raddr);
 }
 
 /*
-- 
1.8.1.4


* [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 4/7] tcg-i386: Tidy qemu_ld/st slow path Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-26 22:26   ` Paolo Bonzini
  2013-08-26 23:26   ` Peter Maydell
  2013-08-26 21:00 ` [Qemu-devel] [PULL 6/7] tcg: Tidy softmmu_template.h Richard Henderson
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

Allow the code that tcg generates to be less obtuse, passing in
the return address directly instead of computing it in the helper.

The old entry point is retained, unchanged, as an alternate entry point.

Delete the helper_st*_cmmu prototypes; the implementations did not exist.
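
In C terms the shape of the change is the following (simplified,
hypothetical signatures and a stand-in body; QEMU's GETPC_EXT() is itself
defined in terms of the GCC `__builtin_return_address` intrinsic used
here):

```c
#include <assert.h>
#include <stdint.h>

/* New-style helper: the caller supplies the return address. */
static uint32_t helper_ret_ldl(uint32_t addr, int mmu_idx,
                               uintptr_t retaddr)
{
    (void)retaddr;              /* stand-in body for illustration */
    return addr + (uint32_t)mmu_idx;
}

/* Old entry point, kept as a thin wrapper that computes the return
   address itself and forwards it to the new helper. */
uint32_t helper_ldl(uint32_t addr, int mmu_idx)
{
    return helper_ret_ldl(addr, mmu_idx,
                          (uintptr_t)__builtin_return_address(0));
}
```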

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/exec/softmmu_defs.h     | 46 ++++++++++++++++++++++++++---------------
 include/exec/softmmu_template.h | 42 +++++++++++++++++++++++--------------
 2 files changed, 55 insertions(+), 33 deletions(-)

diff --git a/include/exec/softmmu_defs.h b/include/exec/softmmu_defs.h
index 1f25e33..e55e717 100644
--- a/include/exec/softmmu_defs.h
+++ b/include/exec/softmmu_defs.h
@@ -9,29 +9,41 @@
 #ifndef SOFTMMU_DEFS_H
 #define SOFTMMU_DEFS_H
 
+uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
+                           int mmu_idx, uintptr_t retaddr);
+uint16_t helper_ret_ldw_mmu(CPUArchState *env, target_ulong addr,
+                            int mmu_idx, uintptr_t retaddr);
+uint32_t helper_ret_ldl_mmu(CPUArchState *env, target_ulong addr,
+                            int mmu_idx, uintptr_t retaddr);
+uint64_t helper_ret_ldq_mmu(CPUArchState *env, target_ulong addr,
+                            int mmu_idx, uintptr_t retaddr);
+
+void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
+                        int mmu_idx, uintptr_t retaddr);
+void helper_ret_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
+                        int mmu_idx, uintptr_t retaddr);
+void helper_ret_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                        int mmu_idx, uintptr_t retaddr);
+void helper_ret_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                        int mmu_idx, uintptr_t retaddr);
+
 uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
-                    int mmu_idx);
 uint16_t helper_ldw_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
-                    int mmu_idx);
 uint32_t helper_ldl_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                    int mmu_idx);
 uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                    int mmu_idx);
+
+void helper_stb_mmu(CPUArchState *env, target_ulong addr,
+                    uint8_t val, int mmu_idx);
+void helper_stw_mmu(CPUArchState *env, target_ulong addr,
+                    uint16_t val, int mmu_idx);
+void helper_stl_mmu(CPUArchState *env, target_ulong addr,
+                    uint32_t val, int mmu_idx);
+void helper_stq_mmu(CPUArchState *env, target_ulong addr,
+                    uint64_t val, int mmu_idx);
 
 uint8_t helper_ldb_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stb_cmmu(CPUArchState *env, target_ulong addr, uint8_t val,
-int mmu_idx);
 uint16_t helper_ldw_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stw_cmmu(CPUArchState *env, target_ulong addr, uint16_t val,
-                     int mmu_idx);
 uint32_t helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stl_cmmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                     int mmu_idx);
 uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stq_cmmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                     int mmu_idx);
-#endif
+
+#endif /* SOFTMMU_DEFS_H */
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index 8584902..7d8bcb5 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -78,15 +78,18 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
 }
 
 /* handle all cases except unaligned access which span two pages */
+#ifdef SOFTMMU_CODE_ACCESS
+static
+#endif
 DATA_TYPE
-glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
-                                         int mmu_idx)
+glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
+                                             target_ulong addr, int mmu_idx,
+                                             uintptr_t retaddr)
 {
     DATA_TYPE res;
     int index;
     target_ulong tlb_addr;
     hwaddr ioaddr;
-    uintptr_t retaddr;
 
     /* test if there is match for unaligned or IO access */
     /* XXX: could done more in memory macro in a non portable way */
@@ -98,13 +101,11 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
             /* slow unaligned access (it spans two pages or IO) */
         do_unaligned_access:
-            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
 #endif
@@ -115,7 +116,6 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
             }
 #endif
@@ -124,8 +124,6 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                                 (addr + addend));
         }
     } else {
-        /* the page is not in the TLB : fill it */
-        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
@@ -136,6 +134,14 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
     return res;
 }
 
+DATA_TYPE
+glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
+                                         int mmu_idx)
+{
+    return glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr, mmu_idx,
+                                                        GETPC_EXT());
+}
+
 /* handle all unaligned cases */
 static DATA_TYPE
 glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
@@ -214,13 +220,13 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
     io_mem_write(mr, physaddr, val, 1 << SHIFT);
 }
 
-void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
-                                              target_ulong addr, DATA_TYPE val,
-                                              int mmu_idx)
+void
+glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
+                                             target_ulong addr, DATA_TYPE val,
+                                             int mmu_idx, uintptr_t retaddr)
 {
     hwaddr ioaddr;
     target_ulong tlb_addr;
-    uintptr_t retaddr;
     int index;
 
     index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
@@ -231,12 +237,10 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
         do_unaligned_access:
-            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
 #endif
@@ -247,7 +251,6 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
             }
 #endif
@@ -257,7 +260,6 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
@@ -267,6 +269,14 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
     }
 }
 
+void
+glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
+                                         DATA_TYPE val, int mmu_idx)
+{
+    glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, val, mmu_idx,
+                                                 GETPC_EXT());
+}
+
 /* handles all unaligned cases */
 static void glue(glue(slow_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
                                                    target_ulong addr,
-- 
1.8.1.4


* [Qemu-devel] [PULL 6/7] tcg: Tidy softmmu_template.h
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-26 21:00 ` [Qemu-devel] [PULL 7/7] tcg-i386: Use new return-argument ld/st helpers Richard Henderson
  2013-08-27 21:30 ` [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Aurelien Jarno
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

Avoid a loop in the tlb_fill path; the fill will either succeed or
generate an exception.

Inline the slow_ld/st function; it was a complete copy of the main
helper except for the actual cross-page unaligned code, and the
compiler was inlining it anyway.

Add unlikely markers optimizing for the most common case of simple
tlb miss.

Make sure the compiler can optimize away the unaligned paths for a
1-byte access.
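
The cross-page path now recombines two aligned sub-loads with shifts; for
the little-endian direction the splice reduces to the following (standalone
sketch with illustrative names):

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two halves of an unaligned 32-bit load: res1 holds the
   word at addr & ~3, res2 the next aligned word, and shift_bytes is
   addr & 3.  This path is only taken for genuinely unaligned
   accesses, so shift_bytes is 1..3 and both shifts below are by less
   than the operand width. */
static uint32_t splice_le(uint32_t res1, uint32_t res2,
                          unsigned shift_bytes)
{
    unsigned shift = shift_bytes * 8;
    return (res1 >> shift) | (res2 << (32 - shift));
}
```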

Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/exec/softmmu_template.h | 287 +++++++++++++++-------------------------
 1 file changed, 104 insertions(+), 183 deletions(-)

diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index 7d8bcb5..eaca9e1 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -54,10 +54,6 @@
 #define ADDR_READ addr_read
 #endif
 
-static DATA_TYPE glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
-                                                        target_ulong addr,
-                                                        int mmu_idx,
-                                                        uintptr_t retaddr);
 static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
                                               hwaddr physaddr,
                                               target_ulong addr,
@@ -86,52 +82,67 @@ glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
                                              target_ulong addr, int mmu_idx,
                                              uintptr_t retaddr)
 {
-    DATA_TYPE res;
-    int index;
-    target_ulong tlb_addr;
-    hwaddr ioaddr;
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
+    uintptr_t haddr;
 
-    /* test if there is match for unaligned or IO access */
-    /* XXX: could done more in memory macro in a non portable way */
-    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
- redo:
-    tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
-    if ((addr & TARGET_PAGE_MASK) == (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
-        if (tlb_addr & ~TARGET_PAGE_MASK) {
-            /* IO access */
-            if ((addr & (DATA_SIZE - 1)) != 0)
-                goto do_unaligned_access;
-            ioaddr = env->iotlb[mmu_idx][index];
-            res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
-        } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
-            /* slow unaligned access (it spans two pages or IO) */
-        do_unaligned_access:
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+         != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
 #ifdef ALIGNED_ONLY
+        if ((addr & (DATA_SIZE - 1)) != 0) {
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
 #endif
-            res = glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(env, addr,
-                                                         mmu_idx, retaddr);
-        } else {
-            /* unaligned/aligned access in the same page */
-            uintptr_t addend;
-#ifdef ALIGNED_ONLY
-            if ((addr & (DATA_SIZE - 1)) != 0) {
-                do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
-            }
-#endif
-            addend = env->tlb_table[mmu_idx][index].addend;
-            res = glue(glue(ld, USUFFIX), _raw)((uint8_t *)(intptr_t)
-                                                (addr + addend));
+        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        hwaddr ioaddr;
+        if ((addr & (DATA_SIZE - 1)) != 0) {
+            goto do_unaligned_access;
         }
-    } else {
+        ioaddr = env->iotlb[mmu_idx][index];
+        return glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                    >= TARGET_PAGE_SIZE)) {
+        target_ulong addr1, addr2;
+        DATA_TYPE res1, res2, res;
+        unsigned shift;
+    do_unaligned_access:
 #ifdef ALIGNED_ONLY
-        if ((addr & (DATA_SIZE - 1)) != 0)
-            do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
-        goto redo;
+        addr1 = addr & ~(DATA_SIZE - 1);
+        addr2 = addr1 + DATA_SIZE;
+        res1 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr1,
+                                                            mmu_idx, retaddr);
+        res2 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr2,
+                                                            mmu_idx, retaddr);
+        shift = (addr & (DATA_SIZE - 1)) * 8;
+#ifdef TARGET_WORDS_BIGENDIAN
+        res = (res1 << shift) | (res2 >> ((DATA_SIZE * 8) - shift));
+#else
+        res = (res1 >> shift) | (res2 << ((DATA_SIZE * 8) - shift));
+#endif
+        return res;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+#ifdef ALIGNED_ONLY
+    if ((addr & (DATA_SIZE - 1)) != 0) {
+        do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
     }
-    return res;
+#endif
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    return glue(glue(ld, USUFFIX), _raw)((uint8_t *)haddr);
 }
 
 DATA_TYPE
@@ -142,66 +153,8 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                                         GETPC_EXT());
 }
 
-/* handle all unaligned cases */
-static DATA_TYPE
-glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
-                                       target_ulong addr,
-                                       int mmu_idx,
-                                       uintptr_t retaddr)
-{
-    DATA_TYPE res, res1, res2;
-    int index, shift;
-    hwaddr ioaddr;
-    target_ulong tlb_addr, addr1, addr2;
-
-    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
- redo:
-    tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
-    if ((addr & TARGET_PAGE_MASK) == (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
-        if (tlb_addr & ~TARGET_PAGE_MASK) {
-            /* IO access */
-            if ((addr & (DATA_SIZE - 1)) != 0)
-                goto do_unaligned_access;
-            ioaddr = env->iotlb[mmu_idx][index];
-            res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
-        } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
-        do_unaligned_access:
-            /* slow unaligned access (it spans two pages) */
-            addr1 = addr & ~(DATA_SIZE - 1);
-            addr2 = addr1 + DATA_SIZE;
-            res1 = glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(env, addr1,
-                                                          mmu_idx, retaddr);
-            res2 = glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(env, addr2,
-                                                          mmu_idx, retaddr);
-            shift = (addr & (DATA_SIZE - 1)) * 8;
-#ifdef TARGET_WORDS_BIGENDIAN
-            res = (res1 << shift) | (res2 >> ((DATA_SIZE * 8) - shift));
-#else
-            res = (res1 >> shift) | (res2 << ((DATA_SIZE * 8) - shift));
-#endif
-            res = (DATA_TYPE)res;
-        } else {
-            /* unaligned/aligned access in the same page */
-            uintptr_t addend = env->tlb_table[mmu_idx][index].addend;
-            res = glue(glue(ld, USUFFIX), _raw)((uint8_t *)(intptr_t)
-                                                (addr + addend));
-        }
-    } else {
-        /* the page is not in the TLB : fill it */
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
-        goto redo;
-    }
-    return res;
-}
-
 #ifndef SOFTMMU_CODE_ACCESS
 
-static void glue(glue(slow_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
-                                                   target_ulong addr,
-                                                   DATA_TYPE val,
-                                                   int mmu_idx,
-                                                   uintptr_t retaddr);
-
 static inline void glue(io_write, SUFFIX)(CPUArchState *env,
                                           hwaddr physaddr,
                                           DATA_TYPE val,
@@ -225,48 +178,66 @@ glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
                                              target_ulong addr, DATA_TYPE val,
                                              int mmu_idx, uintptr_t retaddr)
 {
-    hwaddr ioaddr;
-    target_ulong tlb_addr;
-    int index;
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    uintptr_t haddr;
 
-    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
- redo:
-    tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-    if ((addr & TARGET_PAGE_MASK) == (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
-        if (tlb_addr & ~TARGET_PAGE_MASK) {
-            /* IO access */
-            if ((addr & (DATA_SIZE - 1)) != 0)
-                goto do_unaligned_access;
-            ioaddr = env->iotlb[mmu_idx][index];
-            glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
-        } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
-        do_unaligned_access:
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
 #ifdef ALIGNED_ONLY
+        if ((addr & (DATA_SIZE - 1)) != 0) {
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
+        }
 #endif
-            glue(glue(slow_st, SUFFIX), MMUSUFFIX)(env, addr, val,
-                                                   mmu_idx, retaddr);
-        } else {
-            /* aligned/unaligned access in the same page */
-            uintptr_t addend;
+        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        hwaddr ioaddr;
+        if ((addr & (DATA_SIZE - 1)) != 0) {
+            goto do_unaligned_access;
+        }
+        ioaddr = env->iotlb[mmu_idx][index];
+        glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
+        return;
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                     >= TARGET_PAGE_SIZE)) {
+        int i;
+    do_unaligned_access:
 #ifdef ALIGNED_ONLY
-            if ((addr & (DATA_SIZE - 1)) != 0) {
-                do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
-            }
+        do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
+#endif
+        /* XXX: not efficient, but simple */
+        /* Note: relies on the fact that tlb_fill() does not remove the
+         * previous page from the TLB cache.  */
+        for (i = DATA_SIZE - 1; i >= 0; i--) {
+#ifdef TARGET_WORDS_BIGENDIAN
+            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
+#else
+            uint8_t val8 = val >> (i * 8);
 #endif
-            addend = env->tlb_table[mmu_idx][index].addend;
-            glue(glue(st, SUFFIX), _raw)((uint8_t *)(intptr_t)
-                                         (addr + addend), val);
+            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+                                            mmu_idx, retaddr);
         }
-    } else {
-        /* the page is not in the TLB : fill it */
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
 #ifdef ALIGNED_ONLY
-        if ((addr & (DATA_SIZE - 1)) != 0)
-            do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
-#endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
-        goto redo;
+    if ((addr & (DATA_SIZE - 1)) != 0) {
+        do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
     }
+#endif
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    glue(glue(st, SUFFIX), _raw)((uint8_t *)haddr, val);
 }
 
 void
@@ -277,56 +248,6 @@ glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                                  GETPC_EXT());
 }
 
-/* handles all unaligned cases */
-static void glue(glue(slow_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
-                                                   target_ulong addr,
-                                                   DATA_TYPE val,
-                                                   int mmu_idx,
-                                                   uintptr_t retaddr)
-{
-    hwaddr ioaddr;
-    target_ulong tlb_addr;
-    int index, i;
-
-    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
- redo:
-    tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-    if ((addr & TARGET_PAGE_MASK) == (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
-        if (tlb_addr & ~TARGET_PAGE_MASK) {
-            /* IO access */
-            if ((addr & (DATA_SIZE - 1)) != 0)
-                goto do_unaligned_access;
-            ioaddr = env->iotlb[mmu_idx][index];
-            glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
-        } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
-        do_unaligned_access:
-            /* XXX: not efficient, but simple */
-            /* Note: relies on the fact that tlb_fill() does not remove the
-             * previous page from the TLB cache.  */
-            for(i = DATA_SIZE - 1; i >= 0; i--) {
-#ifdef TARGET_WORDS_BIGENDIAN
-                glue(slow_stb, MMUSUFFIX)(env, addr + i,
-                                          val >> (((DATA_SIZE - 1) * 8) - (i * 8)),
-                                          mmu_idx, retaddr);
-#else
-                glue(slow_stb, MMUSUFFIX)(env, addr + i,
-                                          val >> (i * 8),
-                                          mmu_idx, retaddr);
-#endif
-            }
-        } else {
-            /* aligned/unaligned access in the same page */
-            uintptr_t addend = env->tlb_table[mmu_idx][index].addend;
-            glue(glue(st, SUFFIX), _raw)((uint8_t *)(intptr_t)
-                                         (addr + addend), val);
-        }
-    } else {
-        /* the page is not in the TLB : fill it */
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
-        goto redo;
-    }
-}
-
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
 #undef READ_ACCESS_TYPE
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [Qemu-devel] [PULL 7/7] tcg-i386: Use new return-argument ld/st helpers
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
                   ` (5 preceding siblings ...)
  2013-08-26 21:00 ` [Qemu-devel] [PULL 6/7] tcg: Tidy softmmu_template.h Richard Henderson
@ 2013-08-26 21:00 ` Richard Henderson
  2013-08-28 22:55   ` Richard W.M. Jones
  2013-08-27 21:30 ` [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Aurelien Jarno
  7 siblings, 1 reply; 23+ messages in thread
From: Richard Henderson @ 2013-08-26 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien

Discontinue the jump-around-jump-to-jump scheme, trading it for a single
immediate move instruction.  The two extra jumps always consume 7 bytes,
whereas the immediate move is either 5 or 7 bytes depending on where the
code_gen_buffer gets located.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/exec/exec-all.h |  13 +-----
 tcg/i386/tcg-target.c   | 103 ++++++++++++++++++++++--------------------------
 2 files changed, 49 insertions(+), 67 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 5920f73..b70028a 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -326,18 +326,9 @@ extern uintptr_t tci_tb_ptr;
    (6) jump to corresponding code of the next of fast path
  */
 # if defined(__i386__) || defined(__x86_64__)
-/* To avoid broken disassembling, long jmp is used for embedding fast path pc,
-   so that the destination is the next code of fast path, though this jmp is
-   never executed.
-
-   call MMU helper
-   jmp POST_PROC (2byte)    <- GETRA()
-   jmp NEXT_CODE (5byte)
-   POST_PROCESS ...         <- GETRA() + 7
- */
 #  define GETRA() ((uintptr_t)__builtin_return_address(0))
-#  define GETPC_LDST() ((uintptr_t)(GETRA() + 7 + \
-                                    *(int32_t *)((void *)GETRA() + 3) - 1))
+/* The return address argument for ldst is passed directly.  */
+#  define GETPC_LDST()  (abort(), 0)
 # elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
 #  define GETRA() ((uintptr_t)__builtin_return_address(0))
 #  define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() - 4)) - 1))
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index fba50f8..12a7ca3 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -190,11 +190,11 @@ static int target_parse_constraint(TCGArgConstraint *ct, const char **pct_str)
         /* qemu_ld/st address constraint */
     case 'L':
         ct->ct |= TCG_CT_REG;
-#if TCG_TARGET_REG_BITS == 64
+        if (TCG_TARGET_REG_BITS == 64) {
             tcg_regset_set32(ct->u.regs, 0, 0xffff);
-#else
+        } else {
             tcg_regset_set32(ct->u.regs, 0, 0xff);
-#endif
+        }
         tcg_regset_reset_reg(ct->u.regs, TCG_REG_L0);
         tcg_regset_reset_reg(ct->u.regs, TCG_REG_L1);
         break;
@@ -1025,22 +1025,24 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)
 
 #include "exec/softmmu_defs.h"
 
-/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
-   int mmu_idx) */
-static const void *qemu_ld_helpers[4] = {
-    helper_ldb_mmu,
-    helper_ldw_mmu,
-    helper_ldl_mmu,
-    helper_ldq_mmu,
+/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
+ *                                     int mmu_idx, uintptr_t ra)
+ */
+static const void * const qemu_ld_helpers[4] = {
+    helper_ret_ldb_mmu,
+    helper_ret_ldw_mmu,
+    helper_ret_ldl_mmu,
+    helper_ret_ldq_mmu,
 };
 
-/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
-   uintxx_t val, int mmu_idx) */
-static const void *qemu_st_helpers[4] = {
-    helper_stb_mmu,
-    helper_stw_mmu,
-    helper_stl_mmu,
-    helper_stq_mmu,
+/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
+ *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
+ */
+static const void * const qemu_st_helpers[4] = {
+    helper_ret_stb_mmu,
+    helper_ret_stw_mmu,
+    helper_ret_stl_mmu,
+    helper_ret_stq_mmu,
 };
 
 static void add_qemu_ldst_label(TCGContext *s,
@@ -1468,6 +1470,12 @@ static void add_qemu_ldst_label(TCGContext *s,
     }
 }
 
+/* See the GETPC definition in include/exec/exec-all.h.  */
+static inline uintptr_t do_getpc(uint8_t *raddr)
+{
+    return (uintptr_t)raddr - 1;
+}
+
 /*
  * Generate code for the slow path for a load at the end of block
  */
@@ -1499,33 +1507,20 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
         }
 
         tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
+        ofs += 4;
+
+        tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, do_getpc(l->raddr));
     } else {
-        tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+        tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
         /* The second argument is already loaded with addrlo.  */
         tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
                      l->mem_index);
+        tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
+                     do_getpc(l->raddr));
     }
 
-    /* Code generation of qemu_ld/st's slow path calling MMU helper
-
-       PRE_PROC ...
-       call MMU helper
-       jmp POST_PROC (2b) : short forward jump <- GETRA()
-       jmp next_code (5b) : dummy long backward jump which is never executed
-       POST_PROC ... : do post-processing <- GETRA() + 7
-       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
-    */
-
     tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
 
-    /* Jump to post-processing code */
-    tcg_out8(s, OPC_JMP_short);
-    tcg_out8(s, 5);
-    /* Dummy backward jump having information of fast path'pc for MMU helpers */
-    tcg_out8(s, OPC_JMP_long);
-    *(int32_t *)s->code_ptr = (int32_t)(l->raddr - s->code_ptr - 4);
-    s->code_ptr += 4;
-
     data_reg = l->datalo_reg;
     switch(opc) {
     case 0 | 4:
@@ -1606,36 +1601,32 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
         }
 
         tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
+        ofs += 4;
+
+        tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, do_getpc(l->raddr));
     } else {
-        tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+        uintptr_t pc;
+
+        tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
         /* The second argument is already loaded with addrlo.  */
         tcg_out_mov(s, (opc == 3 ? TCG_TYPE_I64 : TCG_TYPE_I32),
                     tcg_target_call_iarg_regs[2], l->datalo_reg);
         tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3],
                      l->mem_index);
-    }
 
-    /* Code generation of qemu_ld/st's slow path calling MMU helper
-
-       PRE_PROC ...
-       call MMU helper
-       jmp POST_PROC (2b) : short forward jump <- GETRA()
-       jmp next_code (5b) : dummy long backward jump which is never executed
-       POST_PROC ... : do post-processing <- GETRA() + 7
-       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
-    */
+        pc = do_getpc(l->raddr);
+        if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
+            tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[4], pc);
+        } else if (pc == (int32_t)pc) {
+            tcg_out_sti(s, TCG_TYPE_PTR, TCG_REG_ESP, 0, pc);
+        } else {
+            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RAX, pc);
+            tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_RAX, TCG_REG_ESP, 0);
+        }
+    }
 
     tcg_out_calli(s, (tcg_target_long)qemu_st_helpers[s_bits]);
 
-    /* Jump to post-processing code */
-    tcg_out8(s, OPC_JMP_short);
-    tcg_out8(s, 5);
-    /* Dummy backward jump having information of fast path'pc for MMU helpers */
-    tcg_out8(s, OPC_JMP_long);
-    *(int32_t *)s->code_ptr = (int32_t)(l->raddr - s->code_ptr - 4);
-    s->code_ptr += 4;
-
-    /* Jump to the code corresponding to next IR of qemu_st */
     tcg_out_jmp(s, (tcg_target_long)l->raddr);
 }
 
-- 
1.8.1.4


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-26 21:00 ` [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument Richard Henderson
@ 2013-08-26 22:26   ` Paolo Bonzini
  2013-08-26 22:34     ` Richard Henderson
  2013-08-26 23:26   ` Peter Maydell
  1 sibling, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-08-26 22:26 UTC (permalink / raw)
  To: Richard Henderson; +Cc: aurelien, qemu-devel, anthony

On 26/08/2013 23:00, Richard Henderson wrote:
> Allow the code that tcg generates to be less obtuse, passing in
> the return address directly instead of computing it in the helper.
> 
> Maintain the old entrance point unchanged as an alternate entry point.
> 
> Delete the helper_st*_cmmu prototypes; the implementations did not exist.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>

Something that can be done on top of this patch: what about moving the
"-1" to helper_ret_*?  It is common to pretty much all the targets
(except that ARM uses -2), and it would allow some simplifications.  For
example, I played with return address helpers on 32-bit PPC, and you
could use a

    li   rN, retaddr
    mtlr rN
    b    st_trampoline[i]

sequence instead of one of

    li   rN, retaddr
    mtlr rN
    bl   st_trampoline[i]
    b    retaddr

or

    li   rN, retaddr
    mtlr rN
    addi rN, rN, -1
    b    st_trampoline[i]

Paolo


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-26 22:26   ` Paolo Bonzini
@ 2013-08-26 22:34     ` Richard Henderson
  2013-08-27 10:46       ` Aurelien Jarno
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Henderson @ 2013-08-26 22:34 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aurelien, qemu-devel, anthony

On 08/26/2013 03:26 PM, Paolo Bonzini wrote:
> Something that can be done on top of this patch: what about moving the
> "-1" to helper_ret_*?  It is common to pretty much all the targets
> (except that ARM uses -2), and it would allow some simplifications.

I suppose so, yes.

>     li   rN, retaddr
>     mtlr rN
>     b    st_trampoline[i]
> 
> sequence instead of one of
> 
>     li   rN, retaddr
>     mtlr rN
>     bl   st_trampoline[i]
>     b    retaddr

This sort of thing is very difficult to evaluate, because of the
cpu's return address prediction stack.  I have so far avoided it.

The only cpus that I believe can make good use of tail calls into
the memory helpers are those with predicated stores and calls, i.e.
arm and ia64.


r~


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-26 21:00 ` [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument Richard Henderson
  2013-08-26 22:26   ` Paolo Bonzini
@ 2013-08-26 23:26   ` Peter Maydell
  2013-08-27 10:47     ` Aurelien Jarno
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Maydell @ 2013-08-26 23:26 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Aurelien Jarno, QEMU Developers, Anthony Liguori

On 26 August 2013 22:00, Richard Henderson <rth@twiddle.net> wrote:
> Allow the code that tcg generates to be less obtuse, passing in
> the return address directly instead of computing it in the helper.

> +uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
> +                           int mmu_idx, uintptr_t retaddr);

>  uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);

I thought the reason we did it this way round was to avoid having
so many arguments to helpers that we overflowed out of registers and
onto the stack on some calling conventions? Or does this not make
much difference in practice?

thanks
-- PMM


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-26 22:34     ` Richard Henderson
@ 2013-08-27 10:46       ` Aurelien Jarno
  2013-08-27 14:53         ` Richard Henderson
  0 siblings, 1 reply; 23+ messages in thread
From: Aurelien Jarno @ 2013-08-27 10:46 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Paolo Bonzini, qemu-devel, anthony

On Mon, Aug 26, 2013 at 03:34:15PM -0700, Richard Henderson wrote:
> On 08/26/2013 03:26 PM, Paolo Bonzini wrote:
> > Something that can be done on top of this patch: what about moving the
> > "-1" to helper_ret_*?  It is common to pretty much all the targets
> > (except that ARM uses -2), and it would allow some simplifications.
> 
> I suppose so, yes.
> 
> >     li   rN, retaddr
> >     mtlr rN
> >     b    st_trampoline[i]
> > 
> > sequence instead of one of
> > 
> >     li   rN, retaddr
> >     mtlr rN
> >     bl   st_trampoline[i]
> >     b    retaddr
> 
> This sort of thing is very difficult to evaluate, because of the
> cpu's return address prediction stack.  I have so far avoided it.
> 
> The only cpus that I believe can make good use of tail calls into
> the memory helpers are those with predicated stores and calls, i.e.
> arm and ia64.
> 

On the other hand, calling the helper is the exception rather than the
rule (that's why the calls have been moved to the end of the TB), so we
should not focus too much on generating fast code, but rather on small
code, to use the caches (both TB and CPU caches) more efficiently.

Therefore even on x86, if we move the -1 to the helper level, it should
be possible to use a tail call for the stores, something like:

    mov    %r14,%rdi
    mov    %ebx,%edx
    xor    %ecx,%ecx
    lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
    pushq  %r8
    jmpq   0x7f25526757a0

Instead of:

    mov    %r14,%rdi
    mov    %ebx,%edx
    xor    %ecx,%ecx
    lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
    callq  0x7f25526757a0
    jmpq   0x7f2541a6f69b
    
-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-26 23:26   ` Peter Maydell
@ 2013-08-27 10:47     ` Aurelien Jarno
  0 siblings, 0 replies; 23+ messages in thread
From: Aurelien Jarno @ 2013-08-27 10:47 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers, Anthony Liguori, Richard Henderson

On Tue, Aug 27, 2013 at 12:26:00AM +0100, Peter Maydell wrote:
> On 26 August 2013 22:00, Richard Henderson <rth@twiddle.net> wrote:
> > Allow the code that tcg generates to be less obtuse, passing in
> > the return address directly instead of computing it in the helper.
> 
> > +uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
> > +                           int mmu_idx, uintptr_t retaddr);
> 
> >  uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
> 
> I thought the reason we did it this way round was to avoid having
> so many arguments to helpers that we overflowed out of registers and
> onto the stack on some calling conventions? Or does this not make
> much difference in practice?
> 

That was indeed the idea, but I think it is worthwhile to provide this
alternative for architectures that can pass enough arguments in registers.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-27 10:46       ` Aurelien Jarno
@ 2013-08-27 14:53         ` Richard Henderson
  2013-08-27 15:43           ` Aurelien Jarno
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Henderson @ 2013-08-27 14:53 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Paolo Bonzini, qemu-devel, anthony

On 08/27/2013 03:46 AM, Aurelien Jarno wrote:
> On the other hand, calling the helper is the exception rather than the
> rule (that's why the calls have been moved to the end of the TB), so we
> should not focus too much on generating fast code, but rather on small
> code, to use the caches (both TB and CPU caches) more efficiently.
> 
> Therefore even on x86, if we move the -1 to the helper level, it should
> be possible to use a tail call for the stores, something like:
> 
>     mov    %r14,%rdi
>     mov    %ebx,%edx
>     xor    %ecx,%ecx
>     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
>     pushq  %r8
>     jmpq   0x7f25526757a0
> 
> Instead of:
> 
>     mov    %r14,%rdi
>     mov    %ebx,%edx
>     xor    %ecx,%ecx
>     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
>     callq  0x7f25526757a0
>     jmpq   0x7f2541a6f69b

Fair enough.  I'll have a go at some follow-ups then.


r~


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-27 14:53         ` Richard Henderson
@ 2013-08-27 15:43           ` Aurelien Jarno
  2013-08-27 15:53             ` Richard Henderson
  0 siblings, 1 reply; 23+ messages in thread
From: Aurelien Jarno @ 2013-08-27 15:43 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Paolo Bonzini, qemu-devel, anthony

On Tue, Aug 27, 2013 at 07:53:56AM -0700, Richard Henderson wrote:
> On 08/27/2013 03:46 AM, Aurelien Jarno wrote:
> > On the other hand, calling the helper is the exception rather than the
> > rule (that's why the calls have been moved to the end of the TB), so we
> > should not focus too much on generating fast code, but rather on small
> > code, to use the caches (both TB and CPU caches) more efficiently.
> > 
> > Therefore even on x86, if we move the -1 to the helper level, it should
> > be possible to use a tail call for the stores, something like:
> > 
> >     mov    %r14,%rdi
> >     mov    %ebx,%edx
> >     xor    %ecx,%ecx
> >     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
> >     pushq  %r8
> >     jmpq   0x7f25526757a0
> > 
> > Instead of:
> > 
> >     mov    %r14,%rdi
> >     mov    %ebx,%edx
> >     xor    %ecx,%ecx
> >     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
> >     callq  0x7f25526757a0
> >     jmpq   0x7f2541a6f69b
> 
> Fair enough.  I'll have a go at some follow-ups then.
> 

I think this can also be done as a second step. Do you want to create a
version 3, or should I just process the current pull request, with you
providing additional patches later?

-- 
Aurelien Jarno	                        GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument
  2013-08-27 15:43           ` Aurelien Jarno
@ 2013-08-27 15:53             ` Richard Henderson
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2013-08-27 15:53 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Paolo Bonzini, qemu-devel, anthony

On 08/27/2013 08:43 AM, Aurelien Jarno wrote:
> I think this can also be done in a second time. Do you want to create a
> version 3, or should I just process the current pull request and you
> will provide additional patches later?

I think it might be better to provide follow-up patches.


r~


* Re: [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization
  2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
                   ` (6 preceding siblings ...)
  2013-08-26 21:00 ` [Qemu-devel] [PULL 7/7] tcg-i386: Use new return-argument ld/st helpers Richard Henderson
@ 2013-08-27 21:30 ` Aurelien Jarno
  7 siblings, 0 replies; 23+ messages in thread
From: Aurelien Jarno @ 2013-08-27 21:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, anthony

On Mon, Aug 26, 2013 at 02:00:05PM -0700, Richard Henderson wrote:
> 
> The following is a rebase of the i386 portion of the v2 patch set.
> That incorporated the minor comments from Aurelien from v1.  This
> pull request omits the ARM portion of the patch set, as that has
> yet to receive any review.
> 
> I'd like to get this patch set pulled, because I have three other
> patch sets that depend on this.
> 
> 
> r~
> 
> 
> The following changes since commit f7ad538e1ea130c8b6f3abb06ad6c856242c799e:
> 
>   Merge remote-tracking branch 'stefanha/block' into staging (2013-08-26 09:19:50 -0500)
> 
> are available in the git repository at:
> 
> 
>   git://github.com/rth7680/qemu.git tcg-ool
> 
> for you to fetch changes up to 401c227b0a1134245ec61c6c5a9997cfc963c8e4:
> 
>   tcg-i386: Use new return-argument ld/st helpers (2013-08-26 13:31:54 -0700)
> 
> ----------------------------------------------------------------
> Richard Henderson (7):
>       tcg: Tidy generated code for tcg_outN
>       tcg-i386: Add and use tcg_out64
>       tcg-i386: Try pc-relative lea for constant formation
>       tcg-i386: Tidy qemu_ld/st slow path
>       tcg: Add mmu helpers that take a return address argument
>       tcg: Tidy softmmu_template.h
>       tcg-i386: Use new return-argument ld/st helpers
> 
>  include/exec/exec-all.h         |  13 +-
>  include/exec/softmmu_defs.h     |  46 +++---
>  include/exec/softmmu_template.h | 309 ++++++++++++++++------------------------
>  tcg/i386/tcg-target.c           | 274 +++++++++++++++++------------------
>  tcg/tcg.c                       |  17 ++-
>  5 files changed, 295 insertions(+), 364 deletions(-)

Thanks, pulled.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net

* Re: [Qemu-devel] [PULL 7/7] tcg-i386: Use new return-argument ld/st helpers
  2013-08-26 21:00 ` [Qemu-devel] [PULL 7/7] tcg-i386: Use new return-argument ld/st helpers Richard Henderson
@ 2013-08-28 22:55   ` Richard W.M. Jones
  2013-08-29 15:21     ` [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST Richard Henderson
  0 siblings, 1 reply; 23+ messages in thread
From: Richard W.M. Jones @ 2013-08-28 22:55 UTC (permalink / raw)
  To: Richard Henderson; +Cc: aurelien, qemu-devel, anthony

On Mon, Aug 26, 2013 at 02:00:12PM -0700, Richard Henderson wrote:
> Discontinue the jump-around-jump-to-jump scheme, trading it for a single
> immediate move instruction.  The two extra jumps always consume 7 bytes,
> whereas the immediate move is either 5 or 7 bytes depending on where the
> code_gen_buffer gets located.

This seems to have broken qemu-system-ppc64:

https://bugs.launchpad.net/qemu/+bug/1218098

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

* [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST
  2013-08-28 22:55   ` Richard W.M. Jones
@ 2013-08-29 15:21     ` Richard Henderson
  2013-08-29 15:28       ` Paolo Bonzini
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Richard Henderson @ 2013-08-29 15:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: anthony, aurelien, rjones

Indeed, remove it entirely and remove the is_tcg_gen_code check
from GETPC_EXT.

Fixes https://bugs.launchpad.net/qemu/+bug/1218098 wherein a call
to a "normal" helper function performed a sequence of tail calls
all the way into the memory helper functions, leading to a stack
frame in which the memory helper function appeared to be called
directly from tcg.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/exec/exec-all.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

This is actually conclusive proof that using is_tcg_gen_code, and
thus any scheme that requires GETPC_LDST, is fundamentally flawed.

Thankfully, the follow-up patch sets that I've already posted give
a framework for properly fixing this.  I've already posted a cleanup
for ARM to fix this.  I have pending but unposted patch sets for
AArch64 and PowerPC.

In the meantime, please apply this fix for x86_64 asap.


r~



diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b70028a..ffb69a4 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -326,9 +326,7 @@ extern uintptr_t tci_tb_ptr;
    (6) jump to corresponding code of the next of fast path
  */
 # if defined(__i386__) || defined(__x86_64__)
-#  define GETRA() ((uintptr_t)__builtin_return_address(0))
-/* The return address argument for ldst is passed directly.  */
-#  define GETPC_LDST()  (abort(), 0)
+#  define GETPC_EXT()  GETPC()
 # elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
 #  define GETRA() ((uintptr_t)__builtin_return_address(0))
 #  define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() - 4)) - 1))
@@ -349,7 +347,7 @@ static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
                                    not the start of the next opcode  */
     return ra;
 }
-#elif defined(__aarch64__)
+# elif defined(__aarch64__)
 #  define GETRA()       ((uintptr_t)__builtin_return_address(0))
 #  define GETPC_LDST()  tcg_getpc_ldst(GETRA())
 static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
@@ -367,7 +365,9 @@ static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
 #  error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
 # endif
 bool is_tcg_gen_code(uintptr_t pc_ptr);
-# define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
+# ifndef GETPC_EXT
+#  define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
+# endif
 #else
 # define GETPC_EXT() GETPC()
 #endif
-- 
1.8.1.4

* Re: [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST
  2013-08-29 15:21     ` [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST Richard Henderson
@ 2013-08-29 15:28       ` Paolo Bonzini
  2013-08-29 15:32         ` Richard Henderson
  2013-08-29 15:50       ` Richard W.M. Jones
  2013-08-29 19:13       ` Aurelien Jarno
  2 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2013-08-29 15:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: aurelien, qemu-devel, anthony, rjones

Il 29/08/2013 17:21, Richard Henderson ha scritto:
> I have pending but unposted patch sets for AArch64 and PowerPC.

I also have something working for PPC since I have old hardware...  I'm
waiting for the API to settle before posting patches.

Paolo

* Re: [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST
  2013-08-29 15:28       ` Paolo Bonzini
@ 2013-08-29 15:32         ` Richard Henderson
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2013-08-29 15:32 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aurelien, qemu-devel, anthony, rjones

On 08/29/2013 08:28 AM, Paolo Bonzini wrote:
> I also have something working for PPC since I have old hardware...  I'm
> waiting for the API to settle before posting patches.

I've gotten the admins to install enough of a 32-bit environment
on the GCC compile farm's power7 box to test there, but it will be
nice to have confirmation on old hardware too.


r~

* Re: [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST
  2013-08-29 15:21     ` [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST Richard Henderson
  2013-08-29 15:28       ` Paolo Bonzini
@ 2013-08-29 15:50       ` Richard W.M. Jones
  2013-08-29 19:13       ` Aurelien Jarno
  2 siblings, 0 replies; 23+ messages in thread
From: Richard W.M. Jones @ 2013-08-29 15:50 UTC (permalink / raw)
  To: Richard Henderson; +Cc: anthony, qemu-devel, aurelien

On Thu, Aug 29, 2013 at 08:21:37AM -0700, Richard Henderson wrote:
> Indeed, remove it entirely and remove the is_tcg_gen_code check
> from GETPC_EXT.
> 
> Fixes https://bugs.launchpad.net/qemu/+bug/1218098 wherein a call
> to a "normal" helper function performed a sequence of tail calls
> all the way into the memory helper functions, leading to a stack
> frame in which the memory helper function appeared to be called
> directly from tcg.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>

This fixes the bug I saw.

Tested-by: Richard W.M. Jones <rjones@redhat.com>

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)

* Re: [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST
  2013-08-29 15:21     ` [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST Richard Henderson
  2013-08-29 15:28       ` Paolo Bonzini
  2013-08-29 15:50       ` Richard W.M. Jones
@ 2013-08-29 19:13       ` Aurelien Jarno
  2 siblings, 0 replies; 23+ messages in thread
From: Aurelien Jarno @ 2013-08-29 19:13 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, anthony, rjones

On Thu, Aug 29, 2013 at 08:21:37AM -0700, Richard Henderson wrote:
> Indeed, remove it entirely and remove the is_tcg_gen_code check
> from GETPC_EXT.
> 
> Fixes https://bugs.launchpad.net/qemu/+bug/1218098 wherein a call
> to a "normal" helper function performed a sequence of tail calls
> all the way into the memory helper functions, leading to a stack
> frame in which the memory helper function appeared to be called
> directly from tcg.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
>  include/exec/exec-all.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> This is actually conclusive proof that using is_tcg_gen_code, and
> thus any scheme that requires GETPC_LDST, is fundamentally flawed.
> 
> Thankfully, the follow-up patch sets that I've already posted give
> a framework for properly fixing this.  I've already posted a cleanup
> for ARM to fix this.  I have pending but unposted patch sets for
> AArch64 and PowerPC.
> 
> In the meantime, please apply this fix for x86_64 asap.
> 
> 
> r~
> 
> 
> 
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index b70028a..ffb69a4 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -326,9 +326,7 @@ extern uintptr_t tci_tb_ptr;
>     (6) jump to corresponding code of the next of fast path
>   */
>  # if defined(__i386__) || defined(__x86_64__)
> -#  define GETRA() ((uintptr_t)__builtin_return_address(0))
> -/* The return address argument for ldst is passed directly.  */
> -#  define GETPC_LDST()  (abort(), 0)
> +#  define GETPC_EXT()  GETPC()
>  # elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
>  #  define GETRA() ((uintptr_t)__builtin_return_address(0))
>  #  define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() - 4)) - 1))
> @@ -349,7 +347,7 @@ static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
>                                     not the start of the next opcode  */
>      return ra;
>  }
> -#elif defined(__aarch64__)
> +# elif defined(__aarch64__)
>  #  define GETRA()       ((uintptr_t)__builtin_return_address(0))
>  #  define GETPC_LDST()  tcg_getpc_ldst(GETRA())
>  static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
> @@ -367,7 +365,9 @@ static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
>  #  error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
>  # endif
>  bool is_tcg_gen_code(uintptr_t pc_ptr);
> -# define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
> +# ifndef GETPC_EXT
> +#  define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
> +# endif
>  #else
>  # define GETPC_EXT() GETPC()
>  #endif

Thanks, applied.


-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net

end of thread, other threads:[~2013-08-29 19:13 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-26 21:00 [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Richard Henderson
2013-08-26 21:00 ` [Qemu-devel] [PULL 1/7] tcg: Tidy generated code for tcg_outN Richard Henderson
2013-08-26 21:00 ` [Qemu-devel] [PULL 2/7] tcg-i386: Add and use tcg_out64 Richard Henderson
2013-08-26 21:00 ` [Qemu-devel] [PULL 3/7] tcg-i386: Try pc-relative lea for constant formation Richard Henderson
2013-08-26 21:00 ` [Qemu-devel] [PULL 4/7] tcg-i386: Tidy qemu_ld/st slow path Richard Henderson
2013-08-26 21:00 ` [Qemu-devel] [PULL 5/7] tcg: Add mmu helpers that take a return address argument Richard Henderson
2013-08-26 22:26   ` Paolo Bonzini
2013-08-26 22:34     ` Richard Henderson
2013-08-27 10:46       ` Aurelien Jarno
2013-08-27 14:53         ` Richard Henderson
2013-08-27 15:43           ` Aurelien Jarno
2013-08-27 15:53             ` Richard Henderson
2013-08-26 23:26   ` Peter Maydell
2013-08-27 10:47     ` Aurelien Jarno
2013-08-26 21:00 ` [Qemu-devel] [PULL 6/7] tcg: Tidy softmmu_template.h Richard Henderson
2013-08-26 21:00 ` [Qemu-devel] [PULL 7/7] tcg-i386: Use new return-argument ld/st helpers Richard Henderson
2013-08-28 22:55   ` Richard W.M. Jones
2013-08-29 15:21     ` [Qemu-devel] [PATCH] tcg-i386: Remove abort from GETPC_LDST Richard Henderson
2013-08-29 15:28       ` Paolo Bonzini
2013-08-29 15:32         ` Richard Henderson
2013-08-29 15:50       ` Richard W.M. Jones
2013-08-29 19:13       ` Aurelien Jarno
2013-08-27 21:30 ` [Qemu-devel] [PULL 0/7] Improve tcg ldst optimization Aurelien Jarno
