* [Qemu-devel] [PATCH v6 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
From: Yeongkyoon Lee @ 2012-10-16  7:23 UTC
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Hi all,

Here is the 6th version of the series optimizing TCG qemu_ld/st code generation.

v6:
  - Remove the extra return-address argument from the MMU helpers
    Instead, embed the fast path address in the slow path for the helpers to use
  - Change some bitwise operations to bitfields in a structure
  - Rename the function that handles finalization of TB code generation

v5:
  - Remove RFC tag

v4:
  - Remove the CONFIG_SOFTMMU precondition from configure
  - Instead, add CONFIG_SOFTMMU conditions to the TCG sources
  - Remove some unnecessary comments

v3:
  - Support CONFIG_TCG_PASS_AREG0
    (expected to get more performance enhancement than others)
  - Remove the configure option "--enable-ldst-optimization"
  - Make the optimization the default on i386 and x86_64 hosts
  - Fix some typos and run checkpatch.pl before committing
  - Test i386, arm and sparc softmmu targets on i386 and x86_64 hosts
  - Test linux-user-test-0.3

v2:
  - Follow the QEMU submission rules

v1:
  - Initial commit request

I think the code generated from qemu_ld/st IRs is relatively heavy: up to 12
instructions for the TLB hit case on an i386 host.
This patch series improves the quality of the code generated for TCG qemu_ld/st
IRs by reducing jumps and improving locality.
The main idea is simple and has already been described in the comments in
tcg-target.c: separate out the slow path (the TLB miss case) and generate it at
the end of the TB.

For example, the code generated for qemu_ld changes as follows.
Before:
(1) TLB check
(2) If hit fall through, else jump to TLB miss case (5)
(3) TLB hit case: Load value from host memory
(4) Jump to next code (6)
(5) TLB miss case: call MMU helper
(6) ... (next code)

After:
(1) TLB check
(2) If hit fall through, else jump to TLB miss case (5)
(3) TLB hit case: Load value from host memory
(4) ... (next code)
...
(5) TLB miss case: call MMU helper
(6) Jump to (8)
(7) [embedded addr of (4)] <- never executed but read by MMU helpers
(8) Return to next code (4)
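
To make the deferred emission concrete, below is a compilable toy sketch (not
QEMU code; all names are illustrative) of the record-then-finalize scheme that
patch 3 implements with add_qemu_ldst_label() and tcg_out_tb_finalize():

    /* Toy sketch of the deferred slow-path scheme: each qemu_ld/st records
       its context during the main pass, and every slow path is emitted in
       one batch when the block is finalized. */
    #include <stdio.h>

    enum { MAX_LDST = 640 };        /* mirrors TCG_MAX_QEMU_LDST in patch 3 */

    typedef struct {
        int is_ld;                  /* 1 for qemu_ld, 0 for qemu_st */
        unsigned long raddr;        /* code address just after the fast path */
    } LdstLabel;

    static LdstLabel labels[MAX_LDST];
    static int nb_labels;

    static void gen_fast_path(int is_ld, unsigned long next_code)
    {
        /* the real code emits the TLB check and the direct access here,
           then records the context for the slow path */
        labels[nb_labels++] = (LdstLabel){ .is_ld = is_ld, .raddr = next_code };
    }

    static void finalize_block(void)    /* cf. tcg_out_tb_finalize() */
    {
        int i;
        for (i = 0; i < nb_labels; i++) {   /* all slow paths at TB end */
            printf("slow path %d: %s, returns to %#lx\n",
                   i, labels[i].is_ld ? "ld" : "st", labels[i].raddr);
        }
        nb_labels = 0;
    }

    int main(void)
    {
        gen_fast_path(1, 0x1000);   /* a load */
        gen_fast_path(0, 0x1010);   /* a store */
        finalize_block();
        return 0;
    }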

The following are some performance results measured on QEMU 1.0.
Although there is some measurement error, the results are not negligible.

* EEMBC CoreMark (before -> after)
  - Guest: i386, Linux (Tizen platform)
  - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
  - Results: 1135.6 -> 1179.9 (+3.9%)

* nbench (before -> after)
  - Guest: i386, Linux (linux-0.2.img included in QEMU source)
  - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
  - Results
    . MEMORY INDEX: 1.6782 -> 1.6818 (+0.2%)
    . INTEGER INDEX: 1.8258 -> 1.877 (+2.8%)
    . FLOATING-POINT INDEX: 0.5944 -> 0.5954 (+0.2%)

Summarized features:
 - The changes are wrapped in the macro "CONFIG_QEMU_LDST_OPTIMIZATION" and
   are enabled by default on i386/x86_64 hosts
 - Forcibly removing the macro will cause a compilation error on i386/x86_64
   hosts
 - There are no implementations for hosts other than i386/x86_64 yet

In addition, I have tried to remove the generated code that calls the MMU
helpers for the TLB miss case from the end of the TB, but have not found a good
solution yet.
In my opinion, TLB hit performance could degrade if that calling code were
removed, because runtime parameters, such as the data, MMU index and return
address, would need to be set up in registers or on the stack even though they
are not used in the TLB hit case.
This remains a further issue.

Yeongkyoon Lee (3):
  configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st
    optimization
  tcg: Add extended GETPC mechanism for MMU helpers with ldst
    optimization
  tcg: Optimize qemu_ld/st by generating slow paths at the end of a
    block

 configure             |    6 +
 exec-all.h            |   36 +++++
 exec.c                |   11 ++
 softmmu_template.h    |   16 +-
 tcg/i386/tcg-target.c |  415 +++++++++++++++++++++++++++++++++----------------
 tcg/tcg.c             |   12 ++
 tcg/tcg.h             |   30 ++++
 7 files changed, 385 insertions(+), 141 deletions(-)

--
1.7.5.4


* [Qemu-devel] [PATCH v6 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization
From: Yeongkyoon Lee @ 2012-10-16  7:23 UTC
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Enable CONFIG_QEMU_LDST_OPTIMIZATION for the TCG qemu_ld/st optimization only
when the host is i386 or x86_64.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>

---
 configure |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 353d788..8b15111 100755
--- a/configure
+++ b/configure
@@ -3848,6 +3848,12 @@ upper() {
     echo "$@"| LC_ALL=C tr '[a-z]' '[A-Z]'
 }

+case "$cpu" in
+  i386|x86_64)
+    echo "CONFIG_QEMU_LDST_OPTIMIZATION=y" >> $config_target_mak
+  ;;
+esac
+
 echo "TARGET_SHORT_ALIGNMENT=$target_short_alignment" >> $config_target_mak
 echo "TARGET_INT_ALIGNMENT=$target_int_alignment" >> $config_target_mak
 echo "TARGET_LONG_ALIGNMENT=$target_long_alignment" >> $config_target_mak
--
1.7.5.4


* [Qemu-devel] [PATCH v6 2/3] tcg: Add extended GETPC mechanism for MMU helpers with ldst optimization
From: Yeongkyoon Lee @ 2012-10-16  7:23 UTC
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Add GETPC_EXT, which is used by MMU helpers to calculate the code address of
the guest memory access, depending on whether they were called from optimized
qemu_ld/st code or from a C function. Currently, only i386 and x86_64 hosts
are supported.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>

---
 exec-all.h         |   36 ++++++++++++++++++++++++++++++++++++
 exec.c             |   11 +++++++++++
 softmmu_template.h |   16 ++++++++--------
 3 files changed, 55 insertions(+), 8 deletions(-)
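
The GETPC_LDST() macro below reads *(int32_t *)(GETRA() + 3) and adds 7
because of the slow-path byte layout: the helper's return address (GETRA())
lands on the 2-byte short jmp, the 5-byte long jmp starts at GETRA() + 2 with
its rel32 operand at GETRA() + 3, and a rel32 jump's target is measured from
the end of the instruction. A standalone sketch of that arithmetic follows;
recover_fast_path_pc() is illustrative, not a QEMU function:

    /* Illustrative sketch of the GETPC_LDST() arithmetic; not QEMU code.
       ra is the MMU helper's return address, which points at the 2-byte
       short jmp emitted right after the helper call. */
    #include <stdint.h>

    uintptr_t recover_fast_path_pc(uintptr_t ra)
    {
        /* rel32 operand of the 5-byte long jmp (opcode at ra + 2,
           displacement at ra + 3) */
        int32_t rel32 = *(int32_t *)(ra + 3);
        /* target = end of the long jmp (ra + 7) + rel32; subtract 1 so the
           result points within the guest access, matching GETPC() */
        return ra + 7 + rel32 - 1;
    }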

diff --git a/exec-all.h b/exec-all.h
index 6516da0..9eda604 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -311,6 +311,42 @@ extern uintptr_t tci_tb_ptr;
 # define GETPC() ((uintptr_t)__builtin_return_address(0) - 1)
 #endif

+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* The qemu_ld/st optimization splits code generation into fast and slow
+   paths; thus, an MMU helper called from the slow path needs special
+   handling to obtain the fast path's pc without any additional argument.
+   The trick used here embeds the fast path pc into the slow path.
+
+   Code flow in slow path:
+   (1) pre-process
+   (2) call MMU helper
+   (3) jump to (5)
+   (4) fast path information (implementation specific)
+   (5) post-process (e.g. stack adjust)
+   (6) jump to the code following the fast path
+ */
+# if defined(__i386__) || defined(__x86_64__)
+/* To avoid confusing the disassembler, a long jmp is used to embed the fast
+   path pc; its destination is the code following the fast path, though the
+   jmp itself is never executed.
+
+   call MMU helper
+   jmp POST_PROC (2 bytes)  <- GETRA()
+   jmp NEXT_CODE (5 bytes)
+   POST_PROCESS ...         <- GETRA() + 7
+ */
+#  define GETRA() ((uintptr_t)__builtin_return_address(0))
+#  define GETPC_LDST() ((uintptr_t)(GETRA() + 7 + \
+                                    *(int32_t *)((void *)GETRA() + 3) - 1))
+# else
+#  error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
+# endif
+bool is_tcg_gen_code(uintptr_t pc_ptr);
+# define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
+#else
+# define GETPC_EXT() GETPC()
+#endif
+
 #if !defined(CONFIG_USER_ONLY)

 struct MemoryRegion *iotlb_to_region(target_phys_addr_t index);
diff --git a/exec.c b/exec.c
index 7899042..8a825a9 100644
--- a/exec.c
+++ b/exec.c
@@ -1379,6 +1379,17 @@ void tb_link_page(TranslationBlock *tb,
     mmap_unlock();
 }

+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* check whether the given addr is in the TCG-generated code buffer */
+bool is_tcg_gen_code(uintptr_t tc_ptr)
+{
+    /* This can be called during code generation, so code_gen_buffer_max_size
+       is used instead of code_gen_ptr for the upper boundary check */
+    return (tc_ptr >= (uintptr_t)code_gen_buffer &&
+            tc_ptr < (uintptr_t)(code_gen_buffer + code_gen_buffer_max_size));
+}
+#endif
+
 /* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
    tb[1].tc_ptr. Return NULL if not found */
 TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
diff --git a/softmmu_template.h b/softmmu_template.h
index e2490f0..d23de8c 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -111,13 +111,13 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
             /* slow unaligned access (it spans two pages or IO) */
         do_unaligned_access:
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
 #endif
@@ -128,7 +128,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC();
+                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
             }
 #endif
@@ -138,7 +138,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC();
+        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
@@ -257,12 +257,12 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
         do_unaligned_access:
-            retaddr = GETPC();
+            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
 #endif
@@ -273,7 +273,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC();
+                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
             }
 #endif
@@ -283,7 +283,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC();
+        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
--
1.7.5.4


* [Qemu-devel] [PATCH v6 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
From: Yeongkyoon Lee @ 2012-10-16  7:23 UTC
  To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth

Add optimized TCG qemu_ld/st generation that places the code for the TLB miss
cases at the end of a block, after the other IRs have been generated.
Currently, this optimization supports only i386 and x86_64 hosts.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>

---
 tcg/i386/tcg-target.c |  415 +++++++++++++++++++++++++++++++++----------------
 tcg/tcg.c             |   12 ++
 tcg/tcg.h             |   30 ++++
 3 files changed, 324 insertions(+), 133 deletions(-)
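
One detail worth noting before the diff: the fast path now emits conditional
jumps with 32-bit displacement placeholders, which the slow-path generators
later patch with expressions like s->code_ptr - label_ptr[0] - 4. The "- 4"
reflects that an x86 rel32 is relative to the first byte after the 4-byte
displacement field. A minimal sketch of that patching, with hypothetical
names:

    /* Hypothetical helper showing the rel32 patching used in this patch:
       "field" is the address of the jcc's 4-byte displacement and "target"
       is the slow-path entry point being patched in. */
    #include <stdint.h>

    void patch_jcc_rel32(uint8_t *field, uint8_t *target)
    {
        /* the displacement is measured from the byte after the field */
        *(int32_t *)field = (int32_t)(target - field - 4);
    }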

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 4952c05..51abe85 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1001,6 +1001,17 @@ static const void *qemu_st_helpers[4] = {
     helper_stq_mmu,
 };

+static void add_qemu_ldst_label(TCGContext *s,
+                                int is_ld,
+                                int opc,
+                                int data_reg,
+                                int data_reg2,
+                                int addrlo_reg,
+                                int addrhi_reg,
+                                int mem_index,
+                                uint8_t *raddr,
+                                uint8_t **label_ptr);
+
 /* Perform the TLB load and compare.

    Inputs:
@@ -1059,19 +1070,19 @@ static inline void tcg_out_tlb_load(TCGContext *s, int addrlo_idx,

     tcg_out_mov(s, type, r0, addrlo);

-    /* jne label1 */
-    tcg_out8(s, OPC_JCC_short + JCC_JNE);
+    /* jne slow_path */
+    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
     label_ptr[0] = s->code_ptr;
-    s->code_ptr++;
+    s->code_ptr += 4;

     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r1), addrhi */
         tcg_out_modrm_offset(s, OPC_CMP_GvEv, args[addrlo_idx+1], r1, 4);

-        /* jne label1 */
-        tcg_out8(s, OPC_JCC_short + JCC_JNE);
+        /* jne slow_path */
+        tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
         label_ptr[1] = s->code_ptr;
-        s->code_ptr++;
+        s->code_ptr += 4;
     }

     /* TLB Hit.  */
@@ -1169,12 +1180,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
     int addrlo_idx;
 #if defined(CONFIG_SOFTMMU)
     int mem_index, s_bits;
-#if TCG_TARGET_REG_BITS == 64
-    int arg_idx;
-#else
-    int stack_adjust;
-#endif
-    uint8_t *label_ptr[3];
+    uint8_t *label_ptr[2];
 #endif

     data_reg = args[0];
@@ -1194,93 +1200,17 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
     /* TLB Hit.  */
     tcg_out_qemu_ld_direct(s, data_reg, data_reg2, TCG_REG_L0, 0, opc);

-    /* jmp label2 */
-    tcg_out8(s, OPC_JMP_short);
-    label_ptr[2] = s->code_ptr;
-    s->code_ptr++;
-
-    /* TLB Miss.  */
-
-    /* label1: */
-    *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
-    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
-        *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
-    }
-
-    /* XXX: move that code at the end of the TB */
-#if TCG_TARGET_REG_BITS == 32
-    tcg_out_pushi(s, mem_index);
-    stack_adjust = 4;
-    if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, args[addrlo_idx + 1]);
-        stack_adjust += 4;
-    }
-    tcg_out_push(s, args[addrlo_idx]);
-    stack_adjust += 4;
-    tcg_out_push(s, TCG_AREG0);
-    stack_adjust += 4;
-#else
-    /* The first argument is already loaded with addrlo.  */
-    arg_idx = 1;
-    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[arg_idx],
-                 mem_index);
-    /* XXX/FIXME: suboptimal */
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[3], TCG_REG_L2);
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[2], TCG_REG_L1);
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[1], TCG_REG_L0);
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
-#endif
-
-    tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
-
-#if TCG_TARGET_REG_BITS == 32
-    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
-        /* Pop and discard.  This is 2 bytes smaller than the add.  */
-        tcg_out_pop(s, TCG_REG_ECX);
-    } else if (stack_adjust != 0) {
-        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
-    }
-#endif
-
-    switch(opc) {
-    case 0 | 4:
-        tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
-        break;
-    case 1 | 4:
-        tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
-        break;
-    case 0:
-        tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
-        break;
-    case 1:
-        tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
-        break;
-    case 2:
-        tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-        break;
-#if TCG_TARGET_REG_BITS == 64
-    case 2 | 4:
-        tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
-        break;
-#endif
-    case 3:
-        if (TCG_TARGET_REG_BITS == 64) {
-            tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
-        } else if (data_reg == TCG_REG_EDX) {
-            /* xchg %edx, %eax */
-            tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
-        }
-        break;
-    default:
-        tcg_abort();
-    }
-
-    /* label2: */
-    *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
+    /* Record the current context of a load into an ldst label */
+    add_qemu_ldst_label(s,
+                        1,
+                        opc,
+                        data_reg,
+                        data_reg2,
+                        args[addrlo_idx],
+                        args[addrlo_idx + 1],
+                        mem_index,
+                        s->code_ptr,
+                        label_ptr);
 #else
     {
         int32_t offset = GUEST_BASE;
@@ -1372,8 +1302,7 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     int addrlo_idx;
 #if defined(CONFIG_SOFTMMU)
     int mem_index, s_bits;
-    int stack_adjust;
-    uint8_t *label_ptr[3];
+    uint8_t *label_ptr[2];
 #endif

     data_reg = args[0];
@@ -1393,20 +1322,225 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     /* TLB Hit.  */
     tcg_out_qemu_st_direct(s, data_reg, data_reg2, TCG_REG_L0, 0, opc);

-    /* jmp label2 */
+    /* Record the current context of a store into an ldst label */
+    add_qemu_ldst_label(s,
+                        0,
+                        opc,
+                        data_reg,
+                        data_reg2,
+                        args[addrlo_idx],
+                        args[addrlo_idx + 1],
+                        mem_index,
+                        s->code_ptr,
+                        label_ptr);
+#else
+    {
+        int32_t offset = GUEST_BASE;
+        int base = args[addrlo_idx];
+
+        if (TCG_TARGET_REG_BITS == 64) {
+            /* ??? We assume all operations have left us with register
+               contents that are zero extended.  So far this appears to
+               be true.  If we want to enforce this, we can either do
+               an explicit zero-extension here, or (if GUEST_BASE == 0)
+               use the ADDR32 prefix.  For now, do nothing.  */
+
+            if (offset != GUEST_BASE) {
+                tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L0, GUEST_BASE);
+                tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L0, base);
+                base = TCG_REG_L0;
+                offset = 0;
+            }
+        }
+
+        tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, opc);
+    }
+#endif
+}
+
+#if defined(CONFIG_SOFTMMU)
+/*
+ * Record the context of a call to the out of line helper code for the slow path
+ * for a load or store, so that we can later generate the correct helper code
+ */
+static void add_qemu_ldst_label(TCGContext *s,
+                                int is_ld,
+                                int opc,
+                                int data_reg,
+                                int data_reg2,
+                                int addrlo_reg,
+                                int addrhi_reg,
+                                int mem_index,
+                                uint8_t *raddr,
+                                uint8_t **label_ptr)
+{
+    int idx;
+    TCGLabelQemuLdst *label;
+
+    if (s->nb_qemu_ldst_labels >= TCG_MAX_QEMU_LDST) {
+        tcg_abort();
+    }
+
+    idx = s->nb_qemu_ldst_labels++;
+    label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[idx];
+    label->is_ld = is_ld;
+    label->opc = opc;
+    label->datalo_reg = data_reg;
+    label->datahi_reg = data_reg2;
+    label->addrlo_reg = addrlo_reg;
+    label->addrhi_reg = addrhi_reg;
+    label->mem_index = mem_index;
+    label->raddr = raddr;
+    label->label_ptr[0] = label_ptr[0];
+    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+        label->label_ptr[1] = label_ptr[1];
+    }
+}
+
+/*
+ * Generate code for the slow path for a load at the end of block
+ */
+static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int opc = label->opc;
+    int mem_index = label->mem_index;
+#if TCG_TARGET_REG_BITS == 32
+    int stack_adjust;
+    int addrlo_reg = label->addrlo_reg;
+    int addrhi_reg = label->addrhi_reg;
+#endif
+    int data_reg = label->datalo_reg;
+    int data_reg2 = label->datahi_reg;
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
+    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+        *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
+    }
+
+#if TCG_TARGET_REG_BITS == 32
+    tcg_out_pushi(s, mem_index);
+    stack_adjust = 4;
+    if (TARGET_LONG_BITS == 64) {
+        tcg_out_push(s, addrhi_reg);
+        stack_adjust += 4;
+    }
+    tcg_out_push(s, addrlo_reg);
+    stack_adjust += 4;
+    tcg_out_push(s, TCG_AREG0);
+    stack_adjust += 4;
+#else
+    /* The first argument is already loaded with addrlo.  */
+    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[1],
+                 mem_index);
+    /* XXX/FIXME: suboptimal */
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[3], TCG_REG_L2);
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[2], TCG_REG_L1);
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[1], TCG_REG_L0);
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+#endif
+
+    /* Code generation of qemu_ld/st's slow path calling MMU helper
+
+       PRE_PROC ...
+       call MMU helper
+       jmp POST_PROC (2b) : short forward jump <- GETRA()
+       jmp next_code (5b) : dummy long backward jump which is never executed
+       POST_PROC ... : do post-processing <- GETRA() + 7
+       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
+    */
+
+    tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
+
+    /* Jump to post-processing code */
     tcg_out8(s, OPC_JMP_short);
-    label_ptr[2] = s->code_ptr;
-    s->code_ptr++;
+    tcg_out8(s, 5);
+    /* Dummy backward jump embedding the fast path's pc for the MMU helpers */
+    tcg_out8(s, OPC_JMP_long);
+    *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+    s->code_ptr += 4;

-    /* TLB Miss.  */
+#if TCG_TARGET_REG_BITS == 32
+    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
+        /* Pop and discard.  This is 2 bytes smaller than the add.  */
+        tcg_out_pop(s, TCG_REG_ECX);
+    } else if (stack_adjust != 0) {
+        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
+    }
+#endif

-    /* label1: */
-    *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
+    switch (opc) {
+    case 0 | 4:
+        tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
+        break;
+    case 1 | 4:
+        tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
+        break;
+    case 0:
+        tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
+        break;
+    case 1:
+        tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
+        break;
+    case 2:
+        tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+        break;
+#if TCG_TARGET_REG_BITS == 64
+    case 2 | 4:
+        tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
+        break;
+#endif
+    case 3:
+        if (TCG_TARGET_REG_BITS == 64) {
+            tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
+        } else if (data_reg == TCG_REG_EDX) {
+            /* xchg %edx, %eax */
+            tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
+        } else {
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
+        }
+        break;
+    default:
+        tcg_abort();
+    }
+
+    /* Jump to the code corresponding to the next IR of qemu_ld */
+    tcg_out_jmp(s, (tcg_target_long)raddr);
+}
+
+/*
+ * Generate code for the slow path for a store at the end of block
+ */
+static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int stack_adjust;
+    int opc = label->opc;
+    int mem_index = label->mem_index;
+    int data_reg = label->datalo_reg;
+#if TCG_TARGET_REG_BITS == 32
+    int data_reg2 = label->datahi_reg;
+    int addrlo_reg = label->addrlo_reg;
+    int addrhi_reg = label->addrhi_reg;
+#endif
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
-        *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
+        *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
     }

-    /* XXX: move that code at the end of the TB */
 #if TCG_TARGET_REG_BITS == 32
     tcg_out_pushi(s, mem_index);
     stack_adjust = 4;
@@ -1417,10 +1551,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     tcg_out_push(s, data_reg);
     stack_adjust += 4;
     if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, args[addrlo_idx + 1]);
+        tcg_out_push(s, addrhi_reg);
         stack_adjust += 4;
     }
-    tcg_out_push(s, args[addrlo_idx]);
+    tcg_out_push(s, addrlo_reg);
     stack_adjust += 4;
     tcg_out_push(s, TCG_AREG0);
     stack_adjust += 4;
@@ -1436,8 +1570,26 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
 #endif

+    /* Code generation of qemu_ld/st's slow path calling MMU helper
+
+       PRE_PROC ...
+       call MMU helper
+       jmp POST_PROC (2b) : short forward jump <- GETRA()
+       jmp next_code (5b) : dummy long backward jump which is never executed
+       POST_PROC ... : do post-processing <- GETRA() + 7
+       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
+    */
+
     tcg_out_calli(s, (tcg_target_long)qemu_st_helpers[s_bits]);

+    /* Jump to post-processing code */
+    tcg_out8(s, OPC_JMP_short);
+    tcg_out8(s, 5);
+    /* Dummy backward jump embedding the fast path's pc for the MMU helpers */
+    tcg_out8(s, OPC_JMP_long);
+    *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+    s->code_ptr += 4;
+
     if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
         /* Pop and discard.  This is 2 bytes smaller than the add.  */
         tcg_out_pop(s, TCG_REG_ECX);
@@ -1445,32 +1597,29 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
         tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
     }

-    /* label2: */
-    *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
-#else
-    {
-        int32_t offset = GUEST_BASE;
-        int base = args[addrlo_idx];
-
-        if (TCG_TARGET_REG_BITS == 64) {
-            /* ??? We assume all operations have left us with register
-               contents that are zero extended.  So far this appears to
-               be true.  If we want to enforce this, we can either do
-               an explicit zero-extension here, or (if GUEST_BASE == 0)
-               use the ADDR32 prefix.  For now, do nothing.  */
+    /* Jump to the code corresponding to the next IR of qemu_st */
+    tcg_out_jmp(s, (tcg_target_long)raddr);
+}

-            if (offset != GUEST_BASE) {
-                tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L0, GUEST_BASE);
-                tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L0, base);
-                base = TCG_REG_L0;
-                offset = 0;
-            }
+/*
+ * Generate TB finalization at the end of block
+ */
+void tcg_out_tb_finalize(TCGContext *s)
+{
+    int i;
+    TCGLabelQemuLdst *label;
+
+    /* qemu_ld/st slow paths */
+    for (i = 0; i < s->nb_qemu_ldst_labels; i++) {
+        label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[i];
+        if (label->is_ld) {
+            tcg_out_qemu_ld_slow_path(s, label);
+        } else {
+            tcg_out_qemu_st_slow_path(s, label);
         }
-
-        tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, opc);
     }
-#endif
 }
+#endif  /* CONFIG_SOFTMMU */

 static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
                               const TCGArg *args, const int *const_args)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 32cd0c6..d224f61 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -303,6 +303,14 @@ void tcg_func_start(TCGContext *s)

     gen_opc_ptr = gen_opc_buf;
     gen_opparam_ptr = gen_opparam_buf;
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Initialize qemu_ld/st labels to assist code generation for TLB miss
+       cases at the end of the TB */
+    s->qemu_ldst_labels = tcg_malloc(sizeof(TCGLabelQemuLdst) *
+                                     TCG_MAX_QEMU_LDST);
+    s->nb_qemu_ldst_labels = 0;
+#endif
 }

 static inline void tcg_temp_alloc(TCGContext *s, int n)
@@ -2164,6 +2172,10 @@ static inline int tcg_gen_code_common(TCGContext *s, uint8_t *gen_code_buf,
 #endif
     }
  the_end:
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Generate TB finalization at the end of block */
+    tcg_out_tb_finalize(s);
+#endif
     return -1;
 }

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 7bafe0e..c930d1a 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -188,6 +188,24 @@ typedef tcg_target_ulong TCGArg;
    are aliases for target_ulong and host pointer sized values respectively.
  */

+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Macros/structures for qemu_ld/st IR code optimization:
+   TCG_MAX_QEMU_LDST is defined to be the same as OPC_BUF_SIZE in exec-all.h. */
+#define TCG_MAX_QEMU_LDST       640
+
+typedef struct TCGLabelQemuLdst {
+    int is_ld:1;            /* qemu_ld: 1, qemu_st: 0 */
+    int opc:4;
+    int addrlo_reg;         /* reg index for low word of guest virtual addr */
+    int addrhi_reg;         /* reg index for high word of guest virtual addr */
+    int datalo_reg;         /* reg index for low word to be loaded or stored */
+    int datahi_reg;         /* reg index for high word to be loaded or stored */
+    int mem_index;          /* soft MMU memory index */
+    uint8_t *raddr;         /* gen code addr of the next IR of qemu_ld/st IR */
+    uint8_t *label_ptr[2];  /* label pointers to be updated */
+} TCGLabelQemuLdst;
+#endif
+
 #ifdef CONFIG_DEBUG_TCG
 #define DEBUG_TCGV 1
 #endif
@@ -422,6 +440,13 @@ struct TCGContext {
     int temps_in_use;
     int goto_tb_issue_mask;
 #endif
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* label info for qemu_ld/st IRs;
+       the labels help to generate TLB miss case code at the end of the TB */
+    TCGLabelQemuLdst *qemu_ldst_labels;
+    int nb_qemu_ldst_labels;
+#endif
 };

 extern TCGContext tcg_ctx;
@@ -625,3 +650,8 @@ extern uint8_t code_gen_prologue[];
 #endif

 void tcg_register_jit(void *buf, size_t buf_size);
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Generate TB finalization at the end of block */
+void tcg_out_tb_finalize(TCGContext *s);
+#endif
--
1.7.5.4


* Re: [Qemu-devel] [PATCH v6 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
From: Richard Henderson @ 2012-10-17 23:44 UTC
  To: Yeongkyoon Lee; +Cc: blauwirbel, qemu-devel, aurelien

On 2012-10-16 17:23, Yeongkyoon Lee wrote:
> +    /* Code generation of qemu_ld/st's slow path calling MMU helper
> +
> +       PRE_PROC ...
> +       call MMU helper
> +       jmp POST_PROC (2b) : short forward jump <- GETRA()
> +       jmp next_code (5b) : dummy long backward jump which is never executed
> +       POST_PROC ... : do post-processing <- GETRA() + 7
> +       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
> +    */

Is this jump over jump really any better than passing next_code
as another function argument?

In 32-bit mode
	push $next_code
In 64-bit mode
	leaq next_code(%rip),%r8


r~


* Re: [Qemu-devel] [PATCH v6 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
From: Yeongkyoon Lee @ 2012-10-18  2:58 UTC
  To: Richard Henderson; +Cc: blauwirbel, qemu-devel, aurelien

On 2012-10-18 08:44, Richard Henderson wrote:
> On 2012-10-16 17:23, Yeongkyoon Lee wrote:
>> +    /* Code generation of qemu_ld/st's slow path calling MMU helper
>> +
>> +       PRE_PROC ...
>> +       call MMU helper
>> +       jmp POST_PROC (2b) : short forward jump <- GETRA()
>> +       jmp next_code (5b) : dummy long backward jump which is never executed
>> +       POST_PROC ... : do post-processing <- GETRA() + 7
>> +       jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
>> +    */
> Is this jump over jump really any better than passing next_code
> as another function argument?
>
> In 32-bit mode
> 	push $next_code
> In 64-bit mode
> 	leaq next_code(%rip),%r8
>
>
> r~
>

The one advantage is that the MMU helpers are not fragmented; that is, we
still keep the same helper prototypes.
In my opinion, the performance cost of using jmp instead of push or the like
is negligible, because it is only executed on the slow path.

^ permalink raw reply	[flat|nested] 6+ messages in thread
