* [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
@ 2012-10-31  7:04 Yeongkyoon Lee
From: Yeongkyoon Lee @ 2012-10-31  7:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Here is the 8th version of the series optimizing TCG qemu_ld/st code generation.

v8:
 - Rebase
 
v7:
  - Rebase and fix typos

v6:
  - Remove the extra return-address argument from the MMU helpers;
    instead, embed the fast path address in the slow path for the helpers to use
  - Change some bitwise operations to structure bitfields
  - Rename the function that finalizes TB code generation

v5:
  - Remove RFC tag

v4:
  - Remove the CONFIG_SOFTMMU pre-condition from configure
  - Instead, add CONFIG_SOFTMMU conditions to the TCG sources
  - Remove some unnecessary comments

v3:
  - Support CONFIG_TCG_PASS_AREG0
    (expected to give a larger performance gain than the other changes)
  - Remove the configure option "--enable-ldst-optimization"
  - Enable the optimization by default on i386 and x86_64 hosts
  - Fix some typos and run checkpatch.pl before committing
  - Test i386, arm and sparc softmmu targets on i386 and x86_64 hosts
  - Test linux-user-test-0.3

v2:
  - Follow the QEMU submission rules

v1:
  - Initial commit request

I think the code generated from qemu_ld/st IRs is relatively heavy: up to 12
instructions for the TLB hit case on an i386 host.
This patch series improves the quality of the code generated for TCG qemu_ld/st
IRs by reducing jumps and improving locality.
The main idea is simple and has already been described in the comments in
tcg-target.c: separate the slow path (the TLB miss case) and generate it at the
end of the TB.

For example, the code generated from qemu_ld changes as follows.
Before:
(1) TLB check
(2) If hit, fall through; else jump to TLB miss case (5)
(3) TLB hit case: Load value from host memory
(4) Jump to next code (6)
(5) TLB miss case: call MMU helper
(6) ... (next code)

After:
(1) TLB check
(2) If hit, fall through; else jump to TLB miss case (5)
(3) TLB hit case: Load value from host memory
(4) ... (next code)
...
(5) TLB miss case: call MMU helper
(6) Jump to (8)
(7) [embedded addr of (4)] <- never executed but read by MMU helpers
(8) Return to next code (4)
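
As an illustration only (not part of the patches), here is a minimal C sketch
of how an MMU helper can recover the fast path address (4) from the embedded
jump at (7); the byte offsets assume the two-jump i386 layout introduced in
patch 2, and the function name is hypothetical:

    #include <stdint.h>

    /* Slow-path tail on i386/x86_64 after the helper call:
     *   call MMU_helper
     *   jmp  post_process   ; 2-byte short jmp  <- return address (ra)
     *   jmp  next_code      ; 5-byte long jmp, never executed
     * The long jmp's 32-bit relative operand sits at ra + 3 and is taken
     * relative to the end of the 7-byte pair at ra + 7; subtracting 1
     * keeps the result inside the calling TB, as GETPC() does. */
    static uintptr_t fast_path_pc(uintptr_t ra)
    {
        int32_t rel = *(int32_t *)(ra + 3);
        return ra + 7 + rel - 1;
    }

This mirrors the GETPC_LDST() macro added in patch 2.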

Below are some performance results measured on top of qemu 1.0.
Although there is some measurement error, the results are not negligible.

* EEMBC CoreMark (before -> after)
  - Guest: i386, Linux (Tizen platform)
  - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
  - Results: 1135.6 -> 1179.9 (+3.9%)

* nbench (before -> after)
  - Guest: i386, Linux (linux-0.2.img included in QEMU source)
  - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
  - Results
    . MEMORY INDEX: 1.6782 -> 1.6818 (+0.2%)
    . INTEGER INDEX: 1.8258 -> 1.877 (+2.8%)
    . FLOATING-POINT INDEX: 0.5944 -> 0.5954 (+0.2%)

Summarized features:
 - The changes are wrapped in the macro "CONFIG_QEMU_LDST_OPTIMIZATION" and
   are enabled by default on i386/x86_64 hosts
 - Forcibly removing the macro will cause a compilation error on i386/x86_64 hosts
 - There are no implementations for hosts other than i386/x86_64 yet

In addition, I have tried to remove the generated code that calls the MMU
helpers for the TLB miss case from the end of the TB, but have not found a good
solution yet.
In my opinion, TLB hit performance could degrade if that calling code were
removed, because runtime parameters, such as the data, the MMU index and the
return address, would have to be set up in registers or on the stack even
though they are not used in the TLB hit case.
This remains an open issue.

Yeongkyoon Lee (3):
  configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st
    optimization
  tcg: Add extended GETPC mechanism for MMU helpers with ldst
    optimization
  tcg: Optimize qemu_ld/st by generating slow paths at the end of a
    block

 configure             |    6 +
 exec-all.h            |   36 +++++
 exec.c                |   11 ++
 softmmu_template.h    |   16 +-
 tcg/i386/tcg-target.c |  404 ++++++++++++++++++++++++++++++++++---------------
 tcg/tcg.c             |   12 ++
 tcg/tcg.h             |   30 ++++
 7 files changed, 381 insertions(+), 134 deletions(-)

-- 
1.7.9.5


* [Qemu-devel] [PATCH v8 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization
From: Yeongkyoon Lee @ 2012-10-31  7:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Enable CONFIG_QEMU_LDST_OPTIMIZATION for the TCG qemu_ld/st optimization only
when the host is i386 or x86_64.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
 configure |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/configure b/configure
index 9c6ac87..4be984e 100755
--- a/configure
+++ b/configure
@@ -3847,6 +3847,12 @@ upper() {
     echo "$@"| LC_ALL=C tr '[a-z]' '[A-Z]'
 }
 
+case "$cpu" in
+  i386|x86_64)
+    echo "CONFIG_QEMU_LDST_OPTIMIZATION=y" >> $config_target_mak
+  ;;
+esac
+
 echo "TARGET_SHORT_ALIGNMENT=$target_short_alignment" >> $config_target_mak
 echo "TARGET_INT_ALIGNMENT=$target_int_alignment" >> $config_target_mak
 echo "TARGET_LONG_ALIGNMENT=$target_long_alignment" >> $config_target_mak
-- 
1.7.9.5


* [Qemu-devel] [PATCH v8 2/3] tcg: Add extended GETPC mechanism for MMU helpers with ldst optimization
From: Yeongkyoon Lee @ 2012-10-31  7:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Add GETPC_EXT, which is used by MMU helpers to selectively calculate the code
address of the guest memory access, depending on whether the helper was called
from qemu_ld/st optimized code or from a C function. Currently, only i386 and
x86_64 hosts are supported.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
 exec-all.h         |   36 ++++++++++++++++++++++++++++++++++++
 exec.c             |   11 +++++++++++
 softmmu_template.h |   16 ++++++++--------
 3 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/exec-all.h b/exec-all.h
index 2ea0e4f..ad6d22b 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -310,6 +310,42 @@ extern uintptr_t tci_tb_ptr;
 # define GETPC() ((uintptr_t)__builtin_return_address(0) - 1)
 #endif
 
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* The qemu_ld/st optimization splits code generation into a fast and a slow
+   path; thus, it needs special handling for an MMU helper called from the
+   slow path, to get the fast path's pc without any additional argument.
+   It uses a tricky solution which embeds the fast path pc into the slow path.
+
+   Code flow in slow path:
+   (1) pre-process
+   (2) call MMU helper
+   (3) jump to (5)
+   (4) fast path information (implementation specific)
+   (5) post-process (e.g. stack adjust)
+   (6) jump to corresponding code of the next of fast path
+ */
+# if defined(__i386__) || defined(__x86_64__)
+/* To avoid confusing the disassembler, a long jmp is used for embedding the
+   fast path pc, so that its destination is the code following the fast path,
+   though this jmp is never executed.
+
+   call MMU helper
+   jmp POST_PROC (2byte)    <- GETRA()
+   jmp NEXT_CODE (5byte)
+   POST_PROCESS ...         <- GETRA() + 7
+ */
+#  define GETRA() ((uintptr_t)__builtin_return_address(0))
+#  define GETPC_LDST() ((uintptr_t)(GETRA() + 7 + \
+                                    *(int32_t *)((void *)GETRA() + 3) - 1))
+# else
+#  error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
+# endif
+bool is_tcg_gen_code(uintptr_t pc_ptr);
+# define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
+#else
+# define GETPC_EXT() GETPC()
+#endif
+
 #if !defined(CONFIG_USER_ONLY)
 
 struct MemoryRegion *iotlb_to_region(hwaddr index);
diff --git a/exec.c b/exec.c
index b0ed593..1089250 100644
--- a/exec.c
+++ b/exec.c
@@ -1387,6 +1387,17 @@ void tb_link_page(TranslationBlock *tb,
     mmap_unlock();
 }
 
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* check whether the given addr is in TCG generated code buffer or not */
+bool is_tcg_gen_code(uintptr_t tc_ptr)
+{
+    /* This can be called during code generation, so code_gen_buffer_max_size
+       is used instead of code_gen_ptr for the upper boundary check */
+    return (tc_ptr >= (uintptr_t)code_gen_buffer &&
+            tc_ptr < (uintptr_t)(code_gen_buffer + code_gen_buffer_max_size));
+}
+#endif
+
 /* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
    tb[1].tc_ptr. Return NULL if not found */
 TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
diff --git a/softmmu_template.h b/softmmu_template.h
index 20d6bab..ce30d8b 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -111,13 +111,13 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
             /* slow unaligned access (it spans two pages or IO) */
         do_unaligned_access:
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
 #endif
@@ -128,7 +128,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC();
+                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
             }
 #endif
@@ -138,7 +138,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC();
+        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
@@ -257,12 +257,12 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
         do_unaligned_access:
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
 #endif
@@ -273,7 +273,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC();
+                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
             }
 #endif
@@ -283,7 +283,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC();
+        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
-- 
1.7.9.5


* [Qemu-devel] [PATCH v8 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
From: Yeongkyoon Lee @ 2012-10-31  7:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Add optimized TCG qemu_ld/st generation which places the code for the TLB miss
cases at the end of a block, after generating the other IRs.
Currently, this optimization supports only i386 and x86_64 hosts.
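
As a self-contained miniature of the record-then-patch scheme used here
(hypothetical code, for illustration only; the real implementation patches
x86 branch operands via label_ptr and s->code_ptr rather than array slots):

    #include <stdint.h>
    #include <stdio.h>

    enum { BUF_SIZE = 64 };
    static int32_t buf[BUF_SIZE];   /* stands in for the code buffer */
    static int ptr;                 /* next free slot, like s->code_ptr */

    /* Fast path: emit an unresolved branch operand and remember its slot,
       like tcg_out_tlb_load() + add_qemu_ldst_label(). */
    static int emit_fast_path(void)
    {
        buf[ptr] = 0;               /* hole for the jcc displacement */
        return ptr++;
    }

    /* Block end: emit the slow path and patch the recorded hole, like
       tcg_out_tb_finalize() -> tcg_out_qemu_ld/st_slow_path(). */
    static void finalize(int label)
    {
        buf[label] = ptr - label;   /* relative jump to the slow path */
        buf[ptr++] = -1;            /* slow path body placeholder */
    }

    int main(void)
    {
        int label = emit_fast_path();
        /* ... the rest of the block's fast-path code goes here ... */
        finalize(label);
        printf("branch at slot %d skips %d slots\n", label, (int)buf[label]);
        return 0;
    }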

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
 tcg/i386/tcg-target.c |  404 ++++++++++++++++++++++++++++++++++---------------
 tcg/tcg.c             |   12 ++
 tcg/tcg.h             |   30 ++++
 3 files changed, 320 insertions(+), 126 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index e45a5a0..ed2f000 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1002,6 +1002,17 @@ static const void *qemu_st_helpers[4] = {
     helper_stq_mmu,
 };
 
+static void add_qemu_ldst_label(TCGContext *s,
+                                int is_ld,
+                                int opc,
+                                int data_reg,
+                                int data_reg2,
+                                int addrlo_reg,
+                                int addrhi_reg,
+                                int mem_index,
+                                uint8_t *raddr,
+                                uint8_t **label_ptr);
+
 /* Perform the TLB load and compare.
 
    Inputs:
@@ -1060,19 +1071,19 @@ static inline void tcg_out_tlb_load(TCGContext *s, int addrlo_idx,
 
     tcg_out_mov(s, type, r1, addrlo);
 
-    /* jne label1 */
-    tcg_out8(s, OPC_JCC_short + JCC_JNE);
+    /* jne slow_path */
+    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
     label_ptr[0] = s->code_ptr;
-    s->code_ptr++;
+    s->code_ptr += 4;
 
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r0), addrhi */
         tcg_out_modrm_offset(s, OPC_CMP_GvEv, args[addrlo_idx+1], r0, 4);
 
-        /* jne label1 */
-        tcg_out8(s, OPC_JCC_short + JCC_JNE);
+        /* jne slow_path */
+        tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
         label_ptr[1] = s->code_ptr;
-        s->code_ptr++;
+        s->code_ptr += 4;
     }
 
     /* TLB Hit.  */
@@ -1193,10 +1204,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
     int addrlo_idx;
 #if defined(CONFIG_SOFTMMU)
     int mem_index, s_bits;
-#if TCG_TARGET_REG_BITS == 32
-    int stack_adjust;
-#endif
-    uint8_t *label_ptr[3];
+    uint8_t *label_ptr[2];
 #endif
 
     data_reg = args[0];
@@ -1216,87 +1224,17 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
     /* TLB Hit.  */
     tcg_out_qemu_ld_direct(s, data_reg, data_reg2, TCG_REG_L1, 0, 0, opc);
 
-    /* jmp label2 */
-    tcg_out8(s, OPC_JMP_short);
-    label_ptr[2] = s->code_ptr;
-    s->code_ptr++;
-
-    /* TLB Miss.  */
-
-    /* label1: */
-    *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
-    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
-        *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
-    }
-
-    /* XXX: move that code at the end of the TB */
-#if TCG_TARGET_REG_BITS == 32
-    tcg_out_pushi(s, mem_index);
-    stack_adjust = 4;
-    if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, args[addrlo_idx + 1]);
-        stack_adjust += 4;
-    }
-    tcg_out_push(s, args[addrlo_idx]);
-    stack_adjust += 4;
-    tcg_out_push(s, TCG_AREG0);
-    stack_adjust += 4;
-#else
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
-    /* The second argument is already loaded with addrlo.  */
-    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], mem_index);
-#endif
-
-    tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
-
-#if TCG_TARGET_REG_BITS == 32
-    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
-        /* Pop and discard.  This is 2 bytes smaller than the add.  */
-        tcg_out_pop(s, TCG_REG_ECX);
-    } else if (stack_adjust != 0) {
-        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
-    }
-#endif
-
-    switch(opc) {
-    case 0 | 4:
-        tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
-        break;
-    case 1 | 4:
-        tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
-        break;
-    case 0:
-        tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
-        break;
-    case 1:
-        tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
-        break;
-    case 2:
-        tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-        break;
-#if TCG_TARGET_REG_BITS == 64
-    case 2 | 4:
-        tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
-        break;
-#endif
-    case 3:
-        if (TCG_TARGET_REG_BITS == 64) {
-            tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
-        } else if (data_reg == TCG_REG_EDX) {
-            /* xchg %edx, %eax */
-            tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
-        }
-        break;
-    default:
-        tcg_abort();
-    }
-
-    /* label2: */
-    *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
+    /* Record the current context of a load into the ldst label list */
+    add_qemu_ldst_label(s,
+                        1,
+                        opc,
+                        data_reg,
+                        data_reg2,
+                        args[addrlo_idx],
+                        args[addrlo_idx + 1],
+                        mem_index,
+                        s->code_ptr,
+                        label_ptr);
 #else
     {
         int32_t offset = GUEST_BASE;
@@ -1393,8 +1331,7 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     int addrlo_idx;
 #if defined(CONFIG_SOFTMMU)
     int mem_index, s_bits;
-    int stack_adjust;
-    uint8_t *label_ptr[3];
+    uint8_t *label_ptr[2];
 #endif
 
     data_reg = args[0];
@@ -1414,20 +1351,221 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     /* TLB Hit.  */
     tcg_out_qemu_st_direct(s, data_reg, data_reg2, TCG_REG_L1, 0, 0, opc);
 
-    /* jmp label2 */
+    /* Record the current context of a store into the ldst label list */
+    add_qemu_ldst_label(s,
+                        0,
+                        opc,
+                        data_reg,
+                        data_reg2,
+                        args[addrlo_idx],
+                        args[addrlo_idx + 1],
+                        mem_index,
+                        s->code_ptr,
+                        label_ptr);
+#else
+    {
+        int32_t offset = GUEST_BASE;
+        int base = args[addrlo_idx];
+        int seg = 0;
+
+        /* ??? We assume all operations have left us with register contents
+           that are zero extended.  So far this appears to be true.  If we
+           want to enforce this, we can either do an explicit zero-extension
+           here, or (if GUEST_BASE == 0, or a segment register is in use)
+           use the ADDR32 prefix.  For now, do nothing.  */
+        if (GUEST_BASE && guest_base_flags) {
+            seg = guest_base_flags;
+            offset = 0;
+        } else if (TCG_TARGET_REG_BITS == 64 && offset != GUEST_BASE) {
+            tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L1, GUEST_BASE);
+            tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L1, base);
+            base = TCG_REG_L1;
+            offset = 0;
+        }
+
+        tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, seg, opc);
+    }
+#endif
+}
+
+#if defined(CONFIG_SOFTMMU)
+/*
+ * Record the context of a call to the out-of-line helper code for the slow
+ * path of a load or store, so that we can later generate the correct helper code
+ */
+static void add_qemu_ldst_label(TCGContext *s,
+                                int is_ld,
+                                int opc,
+                                int data_reg,
+                                int data_reg2,
+                                int addrlo_reg,
+                                int addrhi_reg,
+                                int mem_index,
+                                uint8_t *raddr,
+                                uint8_t **label_ptr)
+{
+    int idx;
+    TCGLabelQemuLdst *label;
+
+    if (s->nb_qemu_ldst_labels >= TCG_MAX_QEMU_LDST) {
+        tcg_abort();
+    }
+
+    idx = s->nb_qemu_ldst_labels++;
+    label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[idx];
+    label->is_ld = is_ld;
+    label->opc = opc;
+    label->datalo_reg = data_reg;
+    label->datahi_reg = data_reg2;
+    label->addrlo_reg = addrlo_reg;
+    label->addrhi_reg = addrhi_reg;
+    label->mem_index = mem_index;
+    label->raddr = raddr;
+    label->label_ptr[0] = label_ptr[0];
+    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+        label->label_ptr[1] = label_ptr[1];
+    }
+}
+
+/*
+ * Generate code for the slow path for a load at the end of block
+ */
+static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int opc = label->opc;
+    int mem_index = label->mem_index;
+#if TCG_TARGET_REG_BITS == 32
+    int stack_adjust;
+    int addrlo_reg = label->addrlo_reg;
+    int addrhi_reg = label->addrhi_reg;
+#endif
+    int data_reg = label->datalo_reg;
+    int data_reg2 = label->datahi_reg;
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
+    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+        *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
+    }
+
+#if TCG_TARGET_REG_BITS == 32
+    tcg_out_pushi(s, mem_index);
+    stack_adjust = 4;
+    if (TARGET_LONG_BITS == 64) {
+        tcg_out_push(s, addrhi_reg);
+        stack_adjust += 4;
+    }
+    tcg_out_push(s, addrlo_reg);
+    stack_adjust += 4;
+    tcg_out_push(s, TCG_AREG0);
+    stack_adjust += 4;
+#else
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+    /* The second argument is already loaded with addrlo.  */
+    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], mem_index);
+#endif
+
+    /* Code generation of qemu_ld/st's slow path calling MMU helper
+
+       PRE_PROC ...
+       call MMU helper
+       jmp POST_PROC (2b) : short forward jump <- GETRA()
+       jmp next_code (5b) : dummy long backward jump which is never executed
+       POST_PROC ... : do post-processing <- GETRA() + 7
+       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
+    */
+
+    tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
+
+    /* Jump to post-processing code */
     tcg_out8(s, OPC_JMP_short);
-    label_ptr[2] = s->code_ptr;
-    s->code_ptr++;
+    tcg_out8(s, 5);
+    /* Dummy backward jump carrying the fast path's pc for MMU helpers */
+    tcg_out8(s, OPC_JMP_long);
+    *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+    s->code_ptr += 4;
+
+#if TCG_TARGET_REG_BITS == 32
+    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
+        /* Pop and discard.  This is 2 bytes smaller than the add.  */
+        tcg_out_pop(s, TCG_REG_ECX);
+    } else if (stack_adjust != 0) {
+        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
+    }
+#endif
+
+    switch(opc) {
+    case 0 | 4:
+        tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
+        break;
+    case 1 | 4:
+        tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
+        break;
+    case 0:
+        tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
+        break;
+    case 1:
+        tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
+        break;
+    case 2:
+        tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+        break;
+#if TCG_TARGET_REG_BITS == 64
+    case 2 | 4:
+        tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
+        break;
+#endif
+    case 3:
+        if (TCG_TARGET_REG_BITS == 64) {
+            tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
+        } else if (data_reg == TCG_REG_EDX) {
+            /* xchg %edx, %eax */
+            tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
+        } else {
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
+        }
+        break;
+    default:
+        tcg_abort();
+    }
 
-    /* TLB Miss.  */
+    /* Jump to the code corresponding to the next IR after qemu_ld */
+    tcg_out_jmp(s, (tcg_target_long)raddr);
+}
 
-    /* label1: */
-    *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
+/*
+ * Generate code for the slow path for a store at the end of block
+ */
+static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int stack_adjust;
+    int opc = label->opc;
+    int mem_index = label->mem_index;
+    int data_reg = label->datalo_reg;
+#if TCG_TARGET_REG_BITS == 32
+    int data_reg2 = label->datahi_reg;
+    int addrlo_reg = label->addrlo_reg;
+    int addrhi_reg = label->addrhi_reg;
+#endif
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
-        *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
+        *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
     }
 
-    /* XXX: move that code at the end of the TB */
 #if TCG_TARGET_REG_BITS == 32
     tcg_out_pushi(s, mem_index);
     stack_adjust = 4;
@@ -1438,10 +1576,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     tcg_out_push(s, data_reg);
     stack_adjust += 4;
     if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, args[addrlo_idx + 1]);
+        tcg_out_push(s, addrhi_reg);
         stack_adjust += 4;
     }
-    tcg_out_push(s, args[addrlo_idx]);
+    tcg_out_push(s, addrlo_reg);
     stack_adjust += 4;
     tcg_out_push(s, TCG_AREG0);
     stack_adjust += 4;
@@ -1454,8 +1592,26 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     stack_adjust = 0;
 #endif
 
+    /* Code generation of qemu_ld/st's slow path calling MMU helper
+
+       PRE_PROC ...
+       call MMU helper
+       jmp POST_PROC (2b) : short forward jump <- GETRA()
+       jmp next_code (5b) : dummy long backward jump which is never executed
+       POST_PROC ... : do post-processing <- GETRA() + 7
+       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
+    */
+
     tcg_out_calli(s, (tcg_target_long)qemu_st_helpers[s_bits]);
 
+    /* Jump to post-processing code */
+    tcg_out8(s, OPC_JMP_short);
+    tcg_out8(s, 5);
+    /* Dummy backward jump carrying the fast path's pc for MMU helpers */
+    tcg_out8(s, OPC_JMP_long);
+    *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+    s->code_ptr += 4;
+
     if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
         /* Pop and discard.  This is 2 bytes smaller than the add.  */
         tcg_out_pop(s, TCG_REG_ECX);
@@ -1463,33 +1619,29 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
         tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
     }
 
-    /* label2: */
-    *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
-#else
-    {
-        int32_t offset = GUEST_BASE;
-        int base = args[addrlo_idx];
-        int seg = 0;
+    /* Jump to the code corresponding to the next IR after qemu_st */
+    tcg_out_jmp(s, (tcg_target_long)raddr);
+}
 
-        /* ??? We assume all operations have left us with register contents
-           that are zero extended.  So far this appears to be true.  If we
-           want to enforce this, we can either do an explicit zero-extension
-           here, or (if GUEST_BASE == 0, or a segment register is in use)
-           use the ADDR32 prefix.  For now, do nothing.  */
-        if (GUEST_BASE && guest_base_flags) {
-            seg = guest_base_flags;
-            offset = 0;
-        } else if (TCG_TARGET_REG_BITS == 64 && offset != GUEST_BASE) {
-            tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L1, GUEST_BASE);
-            tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L1, base);
-            base = TCG_REG_L1;
-            offset = 0;
+/*
+ * Generate TB finalization at the end of block
+ */
+void tcg_out_tb_finalize(TCGContext *s)
+{
+    int i;
+    TCGLabelQemuLdst *label;
+
+    /* qemu_ld/st slow paths */
+    for (i = 0; i < s->nb_qemu_ldst_labels; i++) {
+        label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[i];
+        if (label->is_ld) {
+            tcg_out_qemu_ld_slow_path(s, label);
+        } else {
+            tcg_out_qemu_st_slow_path(s, label);
         }
-
-        tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, seg, opc);
     }
-#endif
 }
+#endif  /* CONFIG_SOFTMMU */
 
 static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
                               const TCGArg *args, const int *const_args)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index c3a7f19..7f50da2 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -299,6 +299,14 @@ void tcg_func_start(TCGContext *s)
 
     gen_opc_ptr = gen_opc_buf;
     gen_opparam_ptr = gen_opparam_buf;
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Initialize qemu_ld/st labels to assist generation of TLB miss case
+       code at the end of the TB */
+    s->qemu_ldst_labels = tcg_malloc(sizeof(TCGLabelQemuLdst) *
+                                     TCG_MAX_QEMU_LDST);
+    s->nb_qemu_ldst_labels = 0;
+#endif
 }
 
 static inline void tcg_temp_alloc(TCGContext *s, int n)
@@ -2314,6 +2322,10 @@ static inline int tcg_gen_code_common(TCGContext *s, uint8_t *gen_code_buf,
 #endif
     }
  the_end:
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Generate TB finalization at the end of block */
+    tcg_out_tb_finalize(s);
+#endif
     return -1;
 }
 
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a6c9256..9c9d083 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -188,6 +188,24 @@ typedef tcg_target_ulong TCGArg;
    are aliases for target_ulong and host pointer sized values respectively.
  */
 
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Macros/structures for qemu_ld/st IR code optimization:
+   TCG_MAX_QEMU_LDST is defined the same as OPC_BUF_SIZE in exec-all.h. */
+#define TCG_MAX_QEMU_LDST       640
+
+typedef struct TCGLabelQemuLdst {
+    int is_ld:1;            /* qemu_ld: 1, qemu_st: 0 */
+    int opc:4;
+    int addrlo_reg;         /* reg index for low word of guest virtual addr */
+    int addrhi_reg;         /* reg index for high word of guest virtual addr */
+    int datalo_reg;         /* reg index for low word to be loaded or stored */
+    int datahi_reg;         /* reg index for high word to be loaded or stored */
+    int mem_index;          /* soft MMU memory index */
+    uint8_t *raddr;         /* gen code addr of the next IR of qemu_ld/st IR */
+    uint8_t *label_ptr[2];  /* label pointers to be updated */
+} TCGLabelQemuLdst;
+#endif
+
 #ifdef CONFIG_DEBUG_TCG
 #define DEBUG_TCGV 1
 #endif
@@ -431,6 +449,13 @@ struct TCGContext {
     int temps_in_use;
     int goto_tb_issue_mask;
 #endif
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* label info for qemu_ld/st IRs;
+       the labels help to generate TLB miss case code at the end of the TB */
+    TCGLabelQemuLdst *qemu_ldst_labels;
+    int nb_qemu_ldst_labels;
+#endif
 };
 
 extern TCGContext tcg_ctx;
@@ -634,3 +659,8 @@ extern uint8_t *code_gen_prologue;
 #endif
 
 void tcg_register_jit(void *buf, size_t buf_size);
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Generate TB finalization at the end of block */
+void tcg_out_tb_finalize(TCGContext *s);
+#endif
-- 
1.7.9.5


* Re: [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
From: malc @ 2012-11-02  5:35 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: blauwirbel, qemu-devel, aurelien, rth

On Wed, 31 Oct 2012, Yeongkyoon Lee wrote:

> Here is the 8th version of the series optimizing TCG qemu_ld/st code generation.
> 
> v8:
>  - Rebase

[..snip..]

FWIW, here's a ppc32 implementation of your idea; thanks for explaining
the motivation behind certain aspects in our private discussion.
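
For comparison with the x86 trick, here is a minimal sketch of the recovery
that the GETPC_LDST() hunk below performs (illustrative names; unlike i386,
which embeds a pc-relative jump operand, this embeds the absolute address as
a literal word after a "b +8" that skips over it):

    #include <stdint.h>

    /* ppc32 slow-path tail, from the hunks below:
     *   bl    MMU_helper
     *   b     +8           <- return address (ra); jumps over the literal
     *   .long raddr        ; fast path continuation, read by the helper
     */
    static uintptr_t ppc_fast_path_pc(uintptr_t ra)
    {
        return (uintptr_t)(*(int32_t *)(ra + 4) - 1);
    }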

diff --git a/configure b/configure
index 4a54a8b..e42ad64 100755
--- a/configure
+++ b/configure
@@ -3844,7 +3844,7 @@ upper() {
 }
 
 case "$cpu" in
-  i386|x86_64)
+  i386|x86_64|ppc)
     echo "CONFIG_QEMU_LDST_OPTIMIZATION=y" >> $config_target_mak
   ;;
 esac
diff --git a/exec-all.h b/exec-all.h
index ad6d22b..c3b3e50 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -337,6 +337,9 @@ extern uintptr_t tci_tb_ptr;
 #  define GETRA() ((uintptr_t)__builtin_return_address(0))
 #  define GETPC_LDST() ((uintptr_t)(GETRA() + 7 + \
                                     *(int32_t *)((void *)GETRA() + 3) - 1))
+# elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
+#  define GETRA() ((uintptr_t)__builtin_return_address(0))
+#  define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() + 4)) - 1))
 # else
 #  error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
 # endif
diff --git a/tcg/ppc/tcg-target.c b/tcg/ppc/tcg-target.c
index 60b7b92..ec10fa8 100644
--- a/tcg/ppc/tcg-target.c
+++ b/tcg/ppc/tcg-target.c
@@ -39,8 +39,6 @@ static uint8_t *tb_ret_addr;
 #define LR_OFFSET 4
 #endif
 
-#define FAST_PATH
-
 #ifndef GUEST_BASE
 #define GUEST_BASE 0
 #endif
@@ -520,6 +518,37 @@ static void tcg_out_call (TCGContext *s, tcg_target_long arg, int const_arg)
 
 #if defined(CONFIG_SOFTMMU)
 
+static void add_qemu_ldst_label (TCGContext *s,
+                                 int is_ld,
+                                 int opc,
+                                 int data_reg,
+                                 int data_reg2,
+                                 int addrlo_reg,
+                                 int addrhi_reg,
+                                 int mem_index,
+                                 uint8_t *raddr,
+                                 uint8_t *label_ptr)
+{
+    int idx;
+    TCGLabelQemuLdst *label;
+
+    if (s->nb_qemu_ldst_labels >= TCG_MAX_QEMU_LDST) {
+        tcg_abort();
+    }
+
+    idx = s->nb_qemu_ldst_labels++;
+    label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[idx];
+    label->is_ld = is_ld;
+    label->opc = opc;
+    label->datalo_reg = data_reg;
+    label->datahi_reg = data_reg2;
+    label->addrlo_reg = addrlo_reg;
+    label->addrhi_reg = addrhi_reg;
+    label->mem_index = mem_index;
+    label->raddr = raddr;
+    label->label_ptr[0] = label_ptr;
+}
+
 #include "../../softmmu_defs.h"
 
 /* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
@@ -541,34 +570,11 @@ static const void * const qemu_st_helpers[4] = {
 };
 #endif
 
-static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
+static void tcg_out_tlb_check (TCGContext *s, int r0, int r1, int r2,
+                               int addr_reg, int addr_reg2, int s_bits,
+                               int offset1, int offset2, uint8_t **label_ptr)
 {
-    int addr_reg, data_reg, data_reg2, r0, r1, rbase, bswap;
-#ifdef CONFIG_SOFTMMU
-    int mem_index, s_bits, r2, ir;
-    void *label1_ptr, *label2_ptr;
-#if TARGET_LONG_BITS == 64
-    int addr_reg2;
-#endif
-#endif
-
-    data_reg = *args++;
-    if (opc == 3)
-        data_reg2 = *args++;
-    else
-        data_reg2 = 0;
-    addr_reg = *args++;
-
-#ifdef CONFIG_SOFTMMU
-#if TARGET_LONG_BITS == 64
-    addr_reg2 = *args++;
-#endif
-    mem_index = *args;
-    s_bits = opc & 3;
-    r0 = 3;
-    r1 = 4;
-    r2 = 0;
-    rbase = 0;
+    uint16_t retranst;
 
     tcg_out32 (s, (RLWINM
                    | RA (r0)
@@ -582,7 +588,7 @@ static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
     tcg_out32 (s, (LWZU
                    | RT (r1)
                    | RA (r0)
-                   | offsetof (CPUArchState, tlb_table[mem_index][0].addr_read)
+                   | offset1
                    )
         );
     tcg_out32 (s, (RLWINM
@@ -600,77 +606,57 @@ static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
     tcg_out32 (s, CMP | BF (6) | RA (addr_reg2) | RB (r1));
     tcg_out32 (s, CRAND | BT (7, CR_EQ) | BA (6, CR_EQ) | BB (7, CR_EQ));
 #endif
+    *label_ptr = s->code_ptr;
+    retranst = ((uint16_t *) s->code_ptr)[1] & ~3;
+    tcg_out32 (s, BC | BI (7, CR_EQ) | retranst | BO_COND_FALSE);
 
-    label1_ptr = s->code_ptr;
-#ifdef FAST_PATH
-    tcg_out32 (s, BC | BI (7, CR_EQ) | BO_COND_TRUE);
-#endif
-
-    /* slow path */
-    ir = 3;
-    tcg_out_mov (s, TCG_TYPE_I32, ir++, TCG_AREG0);
-#if TARGET_LONG_BITS == 32
-    tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
-#else
-#ifdef TCG_TARGET_CALL_ALIGN_ARGS
-    ir |= 1;
-#endif
-    tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg2);
-    tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
-#endif
-    tcg_out_movi (s, TCG_TYPE_I32, ir, mem_index);
-
-    tcg_out_call (s, (tcg_target_long) qemu_ld_helpers[s_bits], 1);
-    switch (opc) {
-    case 0|4:
-        tcg_out32 (s, EXTSB | RA (data_reg) | RS (3));
-        break;
-    case 1|4:
-        tcg_out32 (s, EXTSH | RA (data_reg) | RS (3));
-        break;
-    case 0:
-    case 1:
-    case 2:
-        if (data_reg != 3)
-            tcg_out_mov (s, TCG_TYPE_I32, data_reg, 3);
-        break;
-    case 3:
-        if (data_reg == 3) {
-            if (data_reg2 == 4) {
-                tcg_out_mov (s, TCG_TYPE_I32, 0, 4);
-                tcg_out_mov (s, TCG_TYPE_I32, 4, 3);
-                tcg_out_mov (s, TCG_TYPE_I32, 3, 0);
-            }
-            else {
-                tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
-                tcg_out_mov (s, TCG_TYPE_I32, 3, 4);
-            }
-        }
-        else {
-            if (data_reg != 4) tcg_out_mov (s, TCG_TYPE_I32, data_reg, 4);
-            if (data_reg2 != 3) tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
-        }
-        break;
-    }
-    label2_ptr = s->code_ptr;
-    tcg_out32 (s, B);
-
-    /* label1: fast path */
-#ifdef FAST_PATH
-    reloc_pc14 (label1_ptr, (tcg_target_long) s->code_ptr);
-#endif
-
-    /* r0 now contains &env->tlb_table[mem_index][index].addr_read */
+    /* r0 now contains &env->tlb_table[mem_index][index].addr_x */
     tcg_out32 (s, (LWZ
                    | RT (r0)
                    | RA (r0)
-                   | (offsetof (CPUTLBEntry, addend)
-                      - offsetof (CPUTLBEntry, addr_read))
-                   ));
+                   | offset2
+                   )
+        );
     /* r0 = env->tlb_table[mem_index][index].addend */
     tcg_out32 (s, ADD | RT (r0) | RA (r0) | RB (addr_reg));
     /* r0 = env->tlb_table[mem_index][index].addend + addr */
 
+}
+
+static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
+{
+    int addr_reg, addr_reg2, data_reg, data_reg2, r0, r1, rbase, bswap;
+#ifdef CONFIG_SOFTMMU
+    int mem_index, s_bits, r2;
+    uint8_t *label_ptr;
+#endif
+
+    data_reg = *args++;
+    if (opc == 3)
+        data_reg2 = *args++;
+    else
+        data_reg2 = 0;
+    addr_reg = *args++;
+
+#ifdef CONFIG_SOFTMMU
+#if TARGET_LONG_BITS == 64
+    addr_reg2 = *args++;
+#else
+    addr_reg2 = 0;
+#endif
+    mem_index = *args;
+    s_bits = opc & 3;
+    r0 = 3;
+    r1 = 4;
+    r2 = 0;
+    rbase = 0;
+
+    tcg_out_tlb_check (
+        s, r0, r1, r2, addr_reg, addr_reg2, s_bits,
+        offsetof (CPUArchState, tlb_table[mem_index][0].addr_read),
+        offsetof (CPUTLBEntry, addend) - offsetof (CPUTLBEntry, addr_read),
+        &label_ptr
+        );
 #else  /* !CONFIG_SOFTMMU */
     r0 = addr_reg;
     r1 = 3;
@@ -736,21 +722,26 @@ static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
         }
         break;
     }
-
 #ifdef CONFIG_SOFTMMU
-    reloc_pc24 (label2_ptr, (tcg_target_long) s->code_ptr);
+    add_qemu_ldst_label (s,
+                         1,
+                         opc,
+                         data_reg,
+                         data_reg2,
+                         addr_reg,
+                         addr_reg2,
+                         mem_index,
+                         s->code_ptr,
+                         label_ptr);
 #endif
 }
 
 static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
 {
-    int addr_reg, r0, r1, data_reg, data_reg2, bswap, rbase;
+    int addr_reg, addr_reg2, r0, r1, data_reg, data_reg2, bswap, rbase;
 #ifdef CONFIG_SOFTMMU
-    int mem_index, r2, ir;
-    void *label1_ptr, *label2_ptr;
-#if TARGET_LONG_BITS == 64
-    int addr_reg2;
-#endif
+    int mem_index, r2;
+    uint8_t *label_ptr;
 #endif
 
     data_reg = *args++;
@@ -763,6 +754,8 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
 #ifdef CONFIG_SOFTMMU
 #if TARGET_LONG_BITS == 64
     addr_reg2 = *args++;
+#else
+    addr_reg2 = 0;
 #endif
     mem_index = *args;
     r0 = 3;
@@ -770,41 +763,89 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
     r2 = 0;
     rbase = 0;
 
-    tcg_out32 (s, (RLWINM
-                   | RA (r0)
-                   | RS (addr_reg)
-                   | SH (32 - (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
-                   | MB (32 - (CPU_TLB_ENTRY_BITS + CPU_TLB_BITS))
-                   | ME (31 - CPU_TLB_ENTRY_BITS)
-                   )
-        );
-    tcg_out32 (s, ADD | RT (r0) | RA (r0) | RB (TCG_AREG0));
-    tcg_out32 (s, (LWZU
-                   | RT (r1)
-                   | RA (r0)
-                   | offsetof (CPUArchState, tlb_table[mem_index][0].addr_write)
-                   )
-        );
-    tcg_out32 (s, (RLWINM
-                   | RA (r2)
-                   | RS (addr_reg)
-                   | SH (0)
-                   | MB ((32 - opc) & 31)
-                   | ME (31 - TARGET_PAGE_BITS)
-                   )
+    tcg_out_tlb_check (
+        s, r0, r1, r2, addr_reg, addr_reg2, opc & 3,
+        offsetof (CPUArchState, tlb_table[mem_index][0].addr_write),
+        offsetof (CPUTLBEntry, addend) - offsetof (CPUTLBEntry, addr_write),
+        &label_ptr
         );
+#else  /* !CONFIG_SOFTMMU */
+    r0 = addr_reg;
+    r1 = 3;
+    rbase = GUEST_BASE ? TCG_GUEST_BASE_REG : 0;
+#endif
 
-    tcg_out32 (s, CMP | (7 << 23) | RA (r2) | RB (r1));
-#if TARGET_LONG_BITS == 64
-    tcg_out32 (s, LWZ | RT (r1) | RA (r0) | 4);
-    tcg_out32 (s, CMP | BF (6) | RA (addr_reg2) | RB (r1));
-    tcg_out32 (s, CRAND | BT (7, CR_EQ) | BA (6, CR_EQ) | BB (7, CR_EQ));
+#ifdef TARGET_WORDS_BIGENDIAN
+    bswap = 0;
+#else
+    bswap = 1;
 #endif
+    switch (opc) {
+    case 0:
+        tcg_out32 (s, STBX | SAB (data_reg, rbase, r0));
+        break;
+    case 1:
+        if (bswap)
+            tcg_out32 (s, STHBRX | SAB (data_reg, rbase, r0));
+        else
+            tcg_out32 (s, STHX | SAB (data_reg, rbase, r0));
+        break;
+    case 2:
+        if (bswap)
+            tcg_out32 (s, STWBRX | SAB (data_reg, rbase, r0));
+        else
+            tcg_out32 (s, STWX | SAB (data_reg, rbase, r0));
+        break;
+    case 3:
+        if (bswap) {
+            tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
+            tcg_out32 (s, STWBRX | SAB (data_reg,  rbase, r0));
+            tcg_out32 (s, STWBRX | SAB (data_reg2, rbase, r1));
+        }
+        else {
+#ifdef CONFIG_USE_GUEST_BASE
+            tcg_out32 (s, STWX | SAB (data_reg2, rbase, r0));
+            tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
+            tcg_out32 (s, STWX | SAB (data_reg,  rbase, r1));
+#else
+            tcg_out32 (s, STW | RS (data_reg2) | RA (r0));
+            tcg_out32 (s, STW | RS (data_reg) | RA (r0) | 4);
+#endif
+        }
+        break;
+    }
 
-    label1_ptr = s->code_ptr;
-#ifdef FAST_PATH
-    tcg_out32 (s, BC | BI (7, CR_EQ) | BO_COND_TRUE);
+#ifdef CONFIG_SOFTMMU
+    add_qemu_ldst_label (s,
+                         0,
+                         opc,
+                         data_reg,
+                         data_reg2,
+                         addr_reg,
+                         addr_reg2,
+                         mem_index,
+                         s->code_ptr,
+                         label_ptr);
 #endif
+}
+
+#if defined(CONFIG_SOFTMMU)
+static void tcg_out_qemu_ld_slow_path (TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int ir;
+    int opc = label->opc;
+    int mem_index = label->mem_index;
+    int data_reg = label->datalo_reg;
+    int data_reg2 = label->datahi_reg;
+    int addr_reg = label->addrlo_reg;
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    reloc_pc14 (label_ptr[0], (tcg_target_long) s->code_ptr);
 
     /* slow path */
     ir = 3;
@@ -815,7 +856,75 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
 #ifdef TCG_TARGET_CALL_ALIGN_ARGS
     ir |= 1;
 #endif
-    tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg2);
+    tcg_out_mov (s, TCG_TYPE_I32, ir++, label->addrhi_reg);
+    tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
+#endif
+    tcg_out_movi (s, TCG_TYPE_I32, ir, mem_index);
+    tcg_out_call (s, (tcg_target_long) qemu_ld_helpers[s_bits], 1);
+    tcg_out32 (s, B | 8);
+    tcg_out32 (s, (tcg_target_long) raddr);
+    switch (opc) {
+    case 0|4:
+        tcg_out32 (s, EXTSB | RA (data_reg) | RS (3));
+        break;
+    case 1|4:
+        tcg_out32 (s, EXTSH | RA (data_reg) | RS (3));
+        break;
+    case 0:
+    case 1:
+    case 2:
+        if (data_reg != 3)
+            tcg_out_mov (s, TCG_TYPE_I32, data_reg, 3);
+        break;
+    case 3:
+        if (data_reg == 3) {
+            if (data_reg2 == 4) {
+                tcg_out_mov (s, TCG_TYPE_I32, 0, 4);
+                tcg_out_mov (s, TCG_TYPE_I32, 4, 3);
+                tcg_out_mov (s, TCG_TYPE_I32, 3, 0);
+            }
+            else {
+                tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
+                tcg_out_mov (s, TCG_TYPE_I32, 3, 4);
+            }
+        }
+        else {
+            if (data_reg != 4) tcg_out_mov (s, TCG_TYPE_I32, data_reg, 4);
+            if (data_reg2 != 3) tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
+        }
+        break;
+    }
+    /* Jump to the code corresponding to next IR of qemu_st */
+    tcg_out_b (s, 0, (tcg_target_long) raddr);
+}
+
+static void tcg_out_qemu_st_slow_path (TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int ir;
+    int opc = label->opc;
+    int mem_index = label->mem_index;
+    int data_reg = label->datalo_reg;
+    int data_reg2 = label->datahi_reg;
+    int addr_reg = label->addrlo_reg;
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    reloc_pc14 (label_ptr[0], (tcg_target_long) s->code_ptr);
+
+    /* slow path */
+    ir = 3;
+    tcg_out_mov (s, TCG_TYPE_I32, ir++, TCG_AREG0);
+#if TARGET_LONG_BITS == 32
+    tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
+#else
+#ifdef TCG_TARGET_CALL_ALIGN_ARGS
+    ir |= 1;
+#endif
+    tcg_out_mov (s, TCG_TYPE_I32, ir++, label->addrhi_reg);
     tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
 #endif
 
@@ -851,74 +960,28 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
 
     tcg_out_movi (s, TCG_TYPE_I32, ir, mem_index);
     tcg_out_call (s, (tcg_target_long) qemu_st_helpers[opc], 1);
-    label2_ptr = s->code_ptr;
-    tcg_out32 (s, B);
-
-    /* label1: fast path */
-#ifdef FAST_PATH
-    reloc_pc14 (label1_ptr, (tcg_target_long) s->code_ptr);
-#endif
-
-    tcg_out32 (s, (LWZ
-                   | RT (r0)
-                   | RA (r0)
-                   | (offsetof (CPUTLBEntry, addend)
-                      - offsetof (CPUTLBEntry, addr_write))
-                   ));
-    /* r0 = env->tlb_table[mem_index][index].addend */
-    tcg_out32 (s, ADD | RT (r0) | RA (r0) | RB (addr_reg));
-    /* r0 = env->tlb_table[mem_index][index].addend + addr */
-
-#else  /* !CONFIG_SOFTMMU */
-    r0 = addr_reg;
-    r1 = 3;
-    rbase = GUEST_BASE ? TCG_GUEST_BASE_REG : 0;
-#endif
+    tcg_out32 (s, B | 8);
+    tcg_out32 (s, (tcg_target_long) raddr);
+    tcg_out_b (s, 0, (tcg_target_long) raddr);
+}
 
-#ifdef TARGET_WORDS_BIGENDIAN
-    bswap = 0;
-#else
-    bswap = 1;
-#endif
-    switch (opc) {
-    case 0:
-        tcg_out32 (s, STBX | SAB (data_reg, rbase, r0));
-        break;
-    case 1:
-        if (bswap)
-            tcg_out32 (s, STHBRX | SAB (data_reg, rbase, r0));
-        else
-            tcg_out32 (s, STHX | SAB (data_reg, rbase, r0));
-        break;
-    case 2:
-        if (bswap)
-            tcg_out32 (s, STWBRX | SAB (data_reg, rbase, r0));
-        else
-            tcg_out32 (s, STWX | SAB (data_reg, rbase, r0));
-        break;
-    case 3:
-        if (bswap) {
-            tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
-            tcg_out32 (s, STWBRX | SAB (data_reg,  rbase, r0));
-            tcg_out32 (s, STWBRX | SAB (data_reg2, rbase, r1));
+void tcg_out_tb_finalize(TCGContext *s)
+{
+    int i;
+    TCGLabelQemuLdst *label;
+
+    /* qemu_ld/st slow paths */
+    for (i = 0; i < s->nb_qemu_ldst_labels; i++) {
+        label = (TCGLabelQemuLdst *) &s->qemu_ldst_labels[i];
+        if (label->is_ld) {
+            tcg_out_qemu_ld_slow_path (s, label);
         }
         else {
-#ifdef CONFIG_USE_GUEST_BASE
-            tcg_out32 (s, STWX | SAB (data_reg2, rbase, r0));
-            tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
-            tcg_out32 (s, STWX | SAB (data_reg,  rbase, r1));
-#else
-            tcg_out32 (s, STW | RS (data_reg2) | RA (r0));
-            tcg_out32 (s, STW | RS (data_reg) | RA (r0) | 4);
-#endif
+            tcg_out_qemu_st_slow_path (s, label);
         }
-        break;
     }
-
-#ifdef CONFIG_SOFTMMU
-    reloc_pc24 (label2_ptr, (tcg_target_long) s->code_ptr);
-#endif
 }
+#endif
 
 static void tcg_target_qemu_prologue (TCGContext *s)
 {


-- 
mailto:av1474@comtv.ru


* Re: [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
From: Blue Swirl @ 2012-11-03 12:52 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: qemu-devel, aurelien, rth

On Wed, Oct 31, 2012 at 7:04 AM, Yeongkyoon Lee
<yeongkyoon.lee@samsung.com> wrote:
> Here is the 8th version of the series optimizing TCG qemu_ld/st code generation.

Thanks, applied all.

[..snip..]

