* [Qemu-devel] [PATCH v8 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization
2012-10-31 7:04 [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
@ 2012-10-31 7:04 ` Yeongkyoon Lee
2012-10-31 7:04 ` [Qemu-devel] [PATCH v8 2/3] tcg: Add extended GETPC mechanism for MMU helpers with ldst optimization Yeongkyoon Lee
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Yeongkyoon Lee @ 2012-10-31 7:04 UTC
To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth
Enable CONFIG_QEMU_LDST_OPTIMIZATION for the TCG qemu_ld/st optimization only
when the host is i386 or x86_64.
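For reference, on a qualifying host the build then ends up with
CONFIG_QEMU_LDST_OPTIMIZATION=y in the per-target config-target.mak, and the C
sources test it together with CONFIG_SOFTMMU. A minimal sketch of that guard
pattern (the #if line matches the later patches; the comment body is
illustrative):

    #if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
    /* ldst-optimized definitions are only compiled in here */
    #endif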
Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
configure | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/configure b/configure
index 9c6ac87..4be984e 100755
--- a/configure
+++ b/configure
@@ -3847,6 +3847,12 @@ upper() {
echo "$@"| LC_ALL=C tr '[a-z]' '[A-Z]'
}
+case "$cpu" in
+ i386|x86_64)
+ echo "CONFIG_QEMU_LDST_OPTIMIZATION=y" >> $config_target_mak
+ ;;
+esac
+
echo "TARGET_SHORT_ALIGNMENT=$target_short_alignment" >> $config_target_mak
echo "TARGET_INT_ALIGNMENT=$target_int_alignment" >> $config_target_mak
echo "TARGET_LONG_ALIGNMENT=$target_long_alignment" >> $config_target_mak
--
1.7.9.5
* [Qemu-devel] [PATCH v8 2/3] tcg: Add extended GETPC mechanism for MMU helpers with ldst optimization
2012-10-31 7:04 [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
2012-10-31 7:04 ` [Qemu-devel] [PATCH v8 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization Yeongkyoon Lee
@ 2012-10-31 7:04 ` Yeongkyoon Lee
2012-10-31 7:04 ` [Qemu-devel] [PATCH v8 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block Yeongkyoon Lee
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Yeongkyoon Lee @ 2012-10-31 7:04 UTC
To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth
Add GETPC_EXT, which is used by MMU helpers to selectively calculate the code
address of the guest memory access depending on whether they were called from
qemu_ld/st optimized code or from a plain C function. Currently, it supports
only i386 and x86_64 hosts.
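As a worked illustration of the x86 recovery arithmetic, here is a
hypothetical standalone helper mirroring GETPC_LDST(); this exact function is
not part of the patch:

    #include <stdint.h>

    /* The slow path places, right after the call to the MMU helper:
           ra + 0: EB 05        jmp short +5  (to the post-processing code)
           ra + 2: E9 <rel32>   jmp rel32     (to the code after the fast
                                               path; never executed)
           ra + 7: ...          post-processing
       The long jmp's destination is (ra + 2) + 5 + rel32 = ra + 7 + rel32,
       and the usual GETPC() convention subtracts 1 so that the result points
       inside the fast path's instruction. */
    static uintptr_t ldst_fast_path_pc(uintptr_t ra)
    {
        int32_t rel32 = *(int32_t *)(ra + 3); /* displacement of the E9 jmp */
        return ra + 7 + rel32 - 1;
    }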
Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
exec-all.h | 36 ++++++++++++++++++++++++++++++++++++
exec.c | 11 +++++++++++
softmmu_template.h | 16 ++++++++--------
3 files changed, 55 insertions(+), 8 deletions(-)
diff --git a/exec-all.h b/exec-all.h
index 2ea0e4f..ad6d22b 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -310,6 +310,42 @@ extern uintptr_t tci_tb_ptr;
# define GETPC() ((uintptr_t)__builtin_return_address(0) - 1)
#endif
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* The qemu_ld/st optimization splits code generation into fast and slow paths,
+ so an MMU helper called from the slow path needs special handling to get the
+ fast path's pc without any additional argument.
+ It uses a tricky solution which embeds the fast path pc into the slow path.
+
+ Code flow in slow path:
+ (1) pre-process
+ (2) call MMU helper
+ (3) jump to (5)
+ (4) fast path information (implementation specific)
+ (5) post-process (e.g. stack adjust)
+ (6) jump to corresponding code of the next of fast path
+ */
+# if defined(__i386__) || defined(__x86_64__)
+/* To avoid confusing disassemblers, a long jmp is used for embedding the fast
+ path pc, so that its destination is the code following the fast path, though
+ this jmp is never executed.
+
+ call MMU helper
+ jmp POST_PROC (2 bytes) <- GETRA()
+ jmp NEXT_CODE (5 bytes)
+ POST_PROC ... <- GETRA() + 7
+ */
+# define GETRA() ((uintptr_t)__builtin_return_address(0))
+# define GETPC_LDST() ((uintptr_t)(GETRA() + 7 + \
+ *(int32_t *)((void *)GETRA() + 3) - 1))
+# else
+# error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
+# endif
+bool is_tcg_gen_code(uintptr_t pc_ptr);
+# define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
+#else
+# define GETPC_EXT() GETPC()
+#endif
+
#if !defined(CONFIG_USER_ONLY)
struct MemoryRegion *iotlb_to_region(hwaddr index);
diff --git a/exec.c b/exec.c
index b0ed593..1089250 100644
--- a/exec.c
+++ b/exec.c
@@ -1387,6 +1387,17 @@ void tb_link_page(TranslationBlock *tb,
mmap_unlock();
}
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* check whether the given addr is in the TCG generated code buffer or not */
+bool is_tcg_gen_code(uintptr_t tc_ptr)
+{
+ /* This can be called during code generation, so code_gen_buffer_max_size
+ is used instead of code_gen_ptr for the upper boundary check */
+ return (tc_ptr >= (uintptr_t)code_gen_buffer &&
+ tc_ptr < (uintptr_t)(code_gen_buffer + code_gen_buffer_max_size));
+}
+#endif
+
/* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
tb[1].tc_ptr. Return NULL if not found */
TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
diff --git a/softmmu_template.h b/softmmu_template.h
index 20d6bab..ce30d8b 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -111,13 +111,13 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
/* IO access */
if ((addr & (DATA_SIZE - 1)) != 0)
goto do_unaligned_access;
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
ioaddr = env->iotlb[mmu_idx][index];
res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
} else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
/* slow unaligned access (it spans two pages or IO) */
do_unaligned_access:
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
#ifdef ALIGNED_ONLY
do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
#endif
@@ -128,7 +128,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
uintptr_t addend;
#ifdef ALIGNED_ONLY
if ((addr & (DATA_SIZE - 1)) != 0) {
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
}
#endif
@@ -138,7 +138,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
}
} else {
/* the page is not in the TLB : fill it */
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
#ifdef ALIGNED_ONLY
if ((addr & (DATA_SIZE - 1)) != 0)
do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
@@ -257,12 +257,12 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
/* IO access */
if ((addr & (DATA_SIZE - 1)) != 0)
goto do_unaligned_access;
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
ioaddr = env->iotlb[mmu_idx][index];
glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
} else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
do_unaligned_access:
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
#ifdef ALIGNED_ONLY
do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
#endif
@@ -273,7 +273,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
uintptr_t addend;
#ifdef ALIGNED_ONLY
if ((addr & (DATA_SIZE - 1)) != 0) {
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
}
#endif
@@ -283,7 +283,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
}
} else {
/* the page is not in the TLB : fill it */
- retaddr = GETPC();
+ retaddr = GETPC_EXT();
#ifdef ALIGNED_ONLY
if ((addr & (DATA_SIZE - 1)) != 0)
do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
--
1.7.9.5
* [Qemu-devel] [PATCH v8 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
2012-10-31 7:04 [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
2012-10-31 7:04 ` [Qemu-devel] [PATCH v8 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization Yeongkyoon Lee
2012-10-31 7:04 ` [Qemu-devel] [PATCH v8 2/3] tcg: Add extended GETPC mechanism for MMU helpers with ldst optimization Yeongkyoon Lee
@ 2012-10-31 7:04 ` Yeongkyoon Lee
2012-11-02 5:35 ` [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs malc
2012-11-03 12:52 ` Blue Swirl
4 siblings, 0 replies; 6+ messages in thread
From: Yeongkyoon Lee @ 2012-10-31 7:04 UTC
To: qemu-devel; +Cc: blauwirbel, Yeongkyoon Lee, aurelien, rth
Add optimized TCG qemu_ld/st generation which places the code for TLB miss
cases at the end of a block, after the other IRs have been generated.
Currently, this optimization supports only i386 and x86_64 hosts.
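In outline the mechanism works like the following toy model, which compiles
and runs on its own (all names here are illustrative stand-ins, not QEMU's
API): fast paths are emitted in IR order, each TLB-miss context is merely
recorded, and the slow paths are emitted together when the block is finalized.

    #include <stdio.h>

    #define MAX_LABELS 640                /* mirrors TCG_MAX_QEMU_LDST */

    typedef struct { int is_ld; int ir_index; } ToyLabel;

    static ToyLabel labels[MAX_LABELS];
    static int nb_labels;

    static void emit_qemu_ldst(int is_ld, int ir_index)
    {
        /* fast path goes out inline; the slow-path context is recorded */
        printf("IR %d: TLB check + fast path (%s)\n",
               ir_index, is_ld ? "ld" : "st");
        if (nb_labels >= MAX_LABELS) {    /* real code calls tcg_abort() */
            return;
        }
        labels[nb_labels].is_ld = is_ld;
        labels[nb_labels].ir_index = ir_index;
        nb_labels++;
    }

    static void tb_finalize(void)
    {
        /* all recorded slow paths land at the end of the block */
        for (int i = 0; i < nb_labels; i++) {
            printf("end of TB: slow path for IR %d (%s), jmp back\n",
                   labels[i].ir_index, labels[i].is_ld ? "ld" : "st");
        }
    }

    int main(void)
    {
        nb_labels = 0;                    /* per-TB reset, as tcg_func_start */
        emit_qemu_ldst(1, 0);
        emit_qemu_ldst(0, 1);
        tb_finalize();                    /* as tcg_out_tb_finalize */
        return 0;
    }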
Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
tcg/i386/tcg-target.c | 404 ++++++++++++++++++++++++++++++++++---------------
tcg/tcg.c | 12 ++
tcg/tcg.h | 30 ++++
3 files changed, 320 insertions(+), 126 deletions(-)
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index e45a5a0..ed2f000 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1002,6 +1002,17 @@ static const void *qemu_st_helpers[4] = {
helper_stq_mmu,
};
+static void add_qemu_ldst_label(TCGContext *s,
+ int is_ld,
+ int opc,
+ int data_reg,
+ int data_reg2,
+ int addrlo_reg,
+ int addrhi_reg,
+ int mem_index,
+ uint8_t *raddr,
+ uint8_t **label_ptr);
+
/* Perform the TLB load and compare.
Inputs:
@@ -1060,19 +1071,19 @@ static inline void tcg_out_tlb_load(TCGContext *s, int addrlo_idx,
tcg_out_mov(s, type, r1, addrlo);
- /* jne label1 */
- tcg_out8(s, OPC_JCC_short + JCC_JNE);
+ /* jne slow_path */
+ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
label_ptr[0] = s->code_ptr;
- s->code_ptr++;
+ s->code_ptr += 4;
if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
/* cmp 4(r0), addrhi */
tcg_out_modrm_offset(s, OPC_CMP_GvEv, args[addrlo_idx+1], r0, 4);
- /* jne label1 */
- tcg_out8(s, OPC_JCC_short + JCC_JNE);
+ /* jne slow_path */
+ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
label_ptr[1] = s->code_ptr;
- s->code_ptr++;
+ s->code_ptr += 4;
}
/* TLB Hit. */
@@ -1193,10 +1204,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
int addrlo_idx;
#if defined(CONFIG_SOFTMMU)
int mem_index, s_bits;
-#if TCG_TARGET_REG_BITS == 32
- int stack_adjust;
-#endif
- uint8_t *label_ptr[3];
+ uint8_t *label_ptr[2];
#endif
data_reg = args[0];
@@ -1216,87 +1224,17 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
/* TLB Hit. */
tcg_out_qemu_ld_direct(s, data_reg, data_reg2, TCG_REG_L1, 0, 0, opc);
- /* jmp label2 */
- tcg_out8(s, OPC_JMP_short);
- label_ptr[2] = s->code_ptr;
- s->code_ptr++;
-
- /* TLB Miss. */
-
- /* label1: */
- *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
- if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
- *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
- }
-
- /* XXX: move that code at the end of the TB */
-#if TCG_TARGET_REG_BITS == 32
- tcg_out_pushi(s, mem_index);
- stack_adjust = 4;
- if (TARGET_LONG_BITS == 64) {
- tcg_out_push(s, args[addrlo_idx + 1]);
- stack_adjust += 4;
- }
- tcg_out_push(s, args[addrlo_idx]);
- stack_adjust += 4;
- tcg_out_push(s, TCG_AREG0);
- stack_adjust += 4;
-#else
- tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
- /* The second argument is already loaded with addrlo. */
- tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], mem_index);
-#endif
-
- tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
-
-#if TCG_TARGET_REG_BITS == 32
- if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
- /* Pop and discard. This is 2 bytes smaller than the add. */
- tcg_out_pop(s, TCG_REG_ECX);
- } else if (stack_adjust != 0) {
- tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
- }
-#endif
-
- switch(opc) {
- case 0 | 4:
- tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
- break;
- case 1 | 4:
- tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
- break;
- case 0:
- tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
- break;
- case 1:
- tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
- break;
- case 2:
- tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
- break;
-#if TCG_TARGET_REG_BITS == 64
- case 2 | 4:
- tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
- break;
-#endif
- case 3:
- if (TCG_TARGET_REG_BITS == 64) {
- tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
- } else if (data_reg == TCG_REG_EDX) {
- /* xchg %edx, %eax */
- tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
- tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
- } else {
- tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
- tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
- }
- break;
- default:
- tcg_abort();
- }
-
- /* label2: */
- *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
+    /* Record the current context of a load into an ldst label */
+ add_qemu_ldst_label(s,
+ 1,
+ opc,
+ data_reg,
+ data_reg2,
+ args[addrlo_idx],
+ args[addrlo_idx + 1],
+ mem_index,
+ s->code_ptr,
+ label_ptr);
#else
{
int32_t offset = GUEST_BASE;
@@ -1393,8 +1331,7 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
int addrlo_idx;
#if defined(CONFIG_SOFTMMU)
int mem_index, s_bits;
- int stack_adjust;
- uint8_t *label_ptr[3];
+ uint8_t *label_ptr[2];
#endif
data_reg = args[0];
@@ -1414,20 +1351,221 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
/* TLB Hit. */
tcg_out_qemu_st_direct(s, data_reg, data_reg2, TCG_REG_L1, 0, 0, opc);
- /* jmp label2 */
+    /* Record the current context of a store into an ldst label */
+ add_qemu_ldst_label(s,
+ 0,
+ opc,
+ data_reg,
+ data_reg2,
+ args[addrlo_idx],
+ args[addrlo_idx + 1],
+ mem_index,
+ s->code_ptr,
+ label_ptr);
+#else
+ {
+ int32_t offset = GUEST_BASE;
+ int base = args[addrlo_idx];
+ int seg = 0;
+
+ /* ??? We assume all operations have left us with register contents
+ that are zero extended. So far this appears to be true. If we
+ want to enforce this, we can either do an explicit zero-extension
+ here, or (if GUEST_BASE == 0, or a segment register is in use)
+ use the ADDR32 prefix. For now, do nothing. */
+ if (GUEST_BASE && guest_base_flags) {
+ seg = guest_base_flags;
+ offset = 0;
+ } else if (TCG_TARGET_REG_BITS == 64 && offset != GUEST_BASE) {
+ tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L1, GUEST_BASE);
+ tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L1, base);
+ base = TCG_REG_L1;
+ offset = 0;
+ }
+
+ tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, seg, opc);
+ }
+#endif
+}
+
+#if defined(CONFIG_SOFTMMU)
+/*
+ * Record the context of a call to the out-of-line helper code for the slow
+ * path of a load or store, so that we can later generate the correct helper code
+ */
+static void add_qemu_ldst_label(TCGContext *s,
+ int is_ld,
+ int opc,
+ int data_reg,
+ int data_reg2,
+ int addrlo_reg,
+ int addrhi_reg,
+ int mem_index,
+ uint8_t *raddr,
+ uint8_t **label_ptr)
+{
+ int idx;
+ TCGLabelQemuLdst *label;
+
+ if (s->nb_qemu_ldst_labels >= TCG_MAX_QEMU_LDST) {
+ tcg_abort();
+ }
+
+ idx = s->nb_qemu_ldst_labels++;
+ label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[idx];
+ label->is_ld = is_ld;
+ label->opc = opc;
+ label->datalo_reg = data_reg;
+ label->datahi_reg = data_reg2;
+ label->addrlo_reg = addrlo_reg;
+ label->addrhi_reg = addrhi_reg;
+ label->mem_index = mem_index;
+ label->raddr = raddr;
+ label->label_ptr[0] = label_ptr[0];
+ if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+ label->label_ptr[1] = label_ptr[1];
+ }
+}
+
+/*
+ * Generate code for the slow path of a load at the end of the block
+ */
+static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+ int s_bits;
+ int opc = label->opc;
+ int mem_index = label->mem_index;
+#if TCG_TARGET_REG_BITS == 32
+ int stack_adjust;
+ int addrlo_reg = label->addrlo_reg;
+ int addrhi_reg = label->addrhi_reg;
+#endif
+ int data_reg = label->datalo_reg;
+ int data_reg2 = label->datahi_reg;
+ uint8_t *raddr = label->raddr;
+ uint8_t **label_ptr = &label->label_ptr[0];
+
+ s_bits = opc & 3;
+
+ /* resolve label address */
+ *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
+ if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+ *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
+ }
+
+#if TCG_TARGET_REG_BITS == 32
+ tcg_out_pushi(s, mem_index);
+ stack_adjust = 4;
+ if (TARGET_LONG_BITS == 64) {
+ tcg_out_push(s, addrhi_reg);
+ stack_adjust += 4;
+ }
+ tcg_out_push(s, addrlo_reg);
+ stack_adjust += 4;
+ tcg_out_push(s, TCG_AREG0);
+ stack_adjust += 4;
+#else
+ tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+ /* The second argument is already loaded with addrlo. */
+ tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], mem_index);
+#endif
+
+ /* Code generation of qemu_ld/st's slow path calling MMU helper
+
+ PRE_PROC ...
+ call MMU helper
+ jmp POST_PROC (2b) : short forward jump <- GETRA()
+ jmp next_code (5b) : dummy long backward jump which is never executed
+ POST_PROC ... : do post-processing <- GETRA() + 7
+ jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
+ */
+
+ tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
+
+ /* Jump to post-processing code */
tcg_out8(s, OPC_JMP_short);
- label_ptr[2] = s->code_ptr;
- s->code_ptr++;
+ tcg_out8(s, 5);
+    /* Dummy backward jump carrying the fast path's pc for MMU helpers */
+ tcg_out8(s, OPC_JMP_long);
+ *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+ s->code_ptr += 4;
+
+#if TCG_TARGET_REG_BITS == 32
+ if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
+ /* Pop and discard. This is 2 bytes smaller than the add. */
+ tcg_out_pop(s, TCG_REG_ECX);
+ } else if (stack_adjust != 0) {
+ tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
+ }
+#endif
+
+ switch(opc) {
+ case 0 | 4:
+ tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
+ break;
+ case 1 | 4:
+ tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
+ break;
+ case 0:
+ tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
+ break;
+ case 1:
+ tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
+ break;
+ case 2:
+ tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+ break;
+#if TCG_TARGET_REG_BITS == 64
+ case 2 | 4:
+ tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
+ break;
+#endif
+ case 3:
+ if (TCG_TARGET_REG_BITS == 64) {
+ tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
+ } else if (data_reg == TCG_REG_EDX) {
+ /* xchg %edx, %eax */
+ tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
+ tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
+ } else {
+ tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+ tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
+ }
+ break;
+ default:
+ tcg_abort();
+ }
- /* TLB Miss. */
+    /* Jump to the code corresponding to the next IR of qemu_ld */
+ tcg_out_jmp(s, (tcg_target_long)raddr);
+}
- /* label1: */
- *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
+/*
+ * Generate code for the slow path of a store at the end of the block
+ */
+static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+ int s_bits;
+ int stack_adjust;
+ int opc = label->opc;
+ int mem_index = label->mem_index;
+ int data_reg = label->datalo_reg;
+#if TCG_TARGET_REG_BITS == 32
+ int data_reg2 = label->datahi_reg;
+ int addrlo_reg = label->addrlo_reg;
+ int addrhi_reg = label->addrhi_reg;
+#endif
+ uint8_t *raddr = label->raddr;
+ uint8_t **label_ptr = &label->label_ptr[0];
+
+ s_bits = opc & 3;
+
+ /* resolve label address */
+ *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
- *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
+ *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
}
- /* XXX: move that code at the end of the TB */
#if TCG_TARGET_REG_BITS == 32
tcg_out_pushi(s, mem_index);
stack_adjust = 4;
@@ -1438,10 +1576,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
tcg_out_push(s, data_reg);
stack_adjust += 4;
if (TARGET_LONG_BITS == 64) {
- tcg_out_push(s, args[addrlo_idx + 1]);
+ tcg_out_push(s, addrhi_reg);
stack_adjust += 4;
}
- tcg_out_push(s, args[addrlo_idx]);
+ tcg_out_push(s, addrlo_reg);
stack_adjust += 4;
tcg_out_push(s, TCG_AREG0);
stack_adjust += 4;
@@ -1454,8 +1592,26 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
stack_adjust = 0;
#endif
+ /* Code generation of qemu_ld/st's slow path calling MMU helper
+
+ PRE_PROC ...
+ call MMU helper
+ jmp POST_PROC (2b) : short forward jump <- GETRA()
+ jmp next_code (5b) : dummy long backward jump which is never executed
+ POST_PROC ... : do post-processing <- GETRA() + 7
+ jmp next_code : jump to the code corresponding to next IR of qemu_ld/st
+ */
+
tcg_out_calli(s, (tcg_target_long)qemu_st_helpers[s_bits]);
+ /* Jump to post-processing code */
+ tcg_out8(s, OPC_JMP_short);
+ tcg_out8(s, 5);
+    /* Dummy backward jump carrying the fast path's pc for MMU helpers */
+ tcg_out8(s, OPC_JMP_long);
+ *(int32_t *)s->code_ptr = (int32_t)(raddr - s->code_ptr - 4);
+ s->code_ptr += 4;
+
if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
/* Pop and discard. This is 2 bytes smaller than the add. */
tcg_out_pop(s, TCG_REG_ECX);
@@ -1463,33 +1619,29 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
}
- /* label2: */
- *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
-#else
- {
- int32_t offset = GUEST_BASE;
- int base = args[addrlo_idx];
- int seg = 0;
+    /* Jump to the code corresponding to the next IR of qemu_st */
+ tcg_out_jmp(s, (tcg_target_long)raddr);
+}
- /* ??? We assume all operations have left us with register contents
- that are zero extended. So far this appears to be true. If we
- want to enforce this, we can either do an explicit zero-extension
- here, or (if GUEST_BASE == 0, or a segment register is in use)
- use the ADDR32 prefix. For now, do nothing. */
- if (GUEST_BASE && guest_base_flags) {
- seg = guest_base_flags;
- offset = 0;
- } else if (TCG_TARGET_REG_BITS == 64 && offset != GUEST_BASE) {
- tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L1, GUEST_BASE);
- tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L1, base);
- base = TCG_REG_L1;
- offset = 0;
+/*
+ * Generate TB finalization at the end of the block
+ */
+void tcg_out_tb_finalize(TCGContext *s)
+{
+ int i;
+ TCGLabelQemuLdst *label;
+
+ /* qemu_ld/st slow paths */
+ for (i = 0; i < s->nb_qemu_ldst_labels; i++) {
+ label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[i];
+ if (label->is_ld) {
+ tcg_out_qemu_ld_slow_path(s, label);
+ } else {
+ tcg_out_qemu_st_slow_path(s, label);
}
-
- tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, seg, opc);
}
-#endif
}
+#endif /* CONFIG_SOFTMMU */
static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
const TCGArg *args, const int *const_args)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index c3a7f19..7f50da2 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -299,6 +299,14 @@ void tcg_func_start(TCGContext *s)
gen_opc_ptr = gen_opc_buf;
gen_opparam_ptr = gen_opparam_buf;
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Initialize qemu_ld/st labels to assist code generation for TLB miss
+       cases at the end of TB */
+ s->qemu_ldst_labels = tcg_malloc(sizeof(TCGLabelQemuLdst) *
+ TCG_MAX_QEMU_LDST);
+ s->nb_qemu_ldst_labels = 0;
+#endif
}
static inline void tcg_temp_alloc(TCGContext *s, int n)
@@ -2314,6 +2322,10 @@ static inline int tcg_gen_code_common(TCGContext *s, uint8_t *gen_code_buf,
#endif
}
the_end:
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Generate TB finalization at the end of the block */
+ tcg_out_tb_finalize(s);
+#endif
return -1;
}
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a6c9256..9c9d083 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -188,6 +188,24 @@ typedef tcg_target_ulong TCGArg;
are aliases for target_ulong and host pointer sized values respectively.
*/
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Macros/structures for qemu_ld/st IR code optimization:
+   TCG_MAX_QEMU_LDST is defined to be the same as OPC_BUF_SIZE in exec-all.h. */
+#define TCG_MAX_QEMU_LDST 640
+
+typedef struct TCGLabelQemuLdst {
+ int is_ld:1; /* qemu_ld: 1, qemu_st: 0 */
+ int opc:4;
+ int addrlo_reg; /* reg index for low word of guest virtual addr */
+ int addrhi_reg; /* reg index for high word of guest virtual addr */
+ int datalo_reg; /* reg index for low word to be loaded or stored */
+ int datahi_reg; /* reg index for high word to be loaded or stored */
+ int mem_index; /* soft MMU memory index */
+ uint8_t *raddr; /* gen code addr of the next IR of qemu_ld/st IR */
+ uint8_t *label_ptr[2]; /* label pointers to be updated */
+} TCGLabelQemuLdst;
+#endif
+
#ifdef CONFIG_DEBUG_TCG
#define DEBUG_TCGV 1
#endif
@@ -431,6 +449,13 @@ struct TCGContext {
int temps_in_use;
int goto_tb_issue_mask;
#endif
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* label info for qemu_ld/st IRs
+       The labels help to generate code for TLB miss cases at the end of TB */
+ TCGLabelQemuLdst *qemu_ldst_labels;
+ int nb_qemu_ldst_labels;
+#endif
};
extern TCGContext tcg_ctx;
@@ -634,3 +659,8 @@ extern uint8_t *code_gen_prologue;
#endif
void tcg_register_jit(void *buf, size_t buf_size);
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Generate TB finalization at the end of the block */
+void tcg_out_tb_finalize(TCGContext *s);
+#endif
--
1.7.9.5
* Re: [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
2012-10-31 7:04 [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
` (2 preceding siblings ...)
2012-10-31 7:04 ` [Qemu-devel] [PATCH v8 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block Yeongkyoon Lee
@ 2012-11-02 5:35 ` malc
2012-11-03 12:52 ` Blue Swirl
4 siblings, 0 replies; 6+ messages in thread
From: malc @ 2012-11-02 5:35 UTC
To: Yeongkyoon Lee; +Cc: blauwirbel, qemu-devel, aurelien, rth
On Wed, 31 Oct 2012, Yeongkyoon Lee wrote:
> Here is the 8th version of the series optimizing TCG qemu_ld/st code generation.
>
> v8:
> - Rebase
[..snip..]
FWIW here's a ppc32 implementation of your idea; thanks for explaining
the motivation behind certain aspects in our private discussion.
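The ppc32 embedding differs from x86: the fast path pc is stored as a literal
word immediately after the helper call, skipped over by a "b +8", so
GETPC_LDST() only has to load it. A hypothetical standalone rendering of that
recovery follows (the function name is illustrative, not part of the patch):

    #include <stdint.h>

    /* Layout emitted after the bl to the MMU helper (see the slow paths in
       the patch below):
           ra + 0: b +8          branch over the literal to post-processing
           ra + 4: .long raddr   embedded fast-path return address
       so recovery is a single load from ra + 4, minus 1 by the GETPC()
       convention. */
    static uintptr_t ppc_ldst_fast_path_pc(uintptr_t ra)
    {
        return (uintptr_t)*(int32_t *)(ra + 4) - 1;
    }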
diff --git a/configure b/configure
index 4a54a8b..e42ad64 100755
--- a/configure
+++ b/configure
@@ -3844,7 +3844,7 @@ upper() {
}
case "$cpu" in
- i386|x86_64)
+ i386|x86_64|ppc)
echo "CONFIG_QEMU_LDST_OPTIMIZATION=y" >> $config_target_mak
;;
esac
diff --git a/exec-all.h b/exec-all.h
index ad6d22b..c3b3e50 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -337,6 +337,9 @@ extern uintptr_t tci_tb_ptr;
# define GETRA() ((uintptr_t)__builtin_return_address(0))
# define GETPC_LDST() ((uintptr_t)(GETRA() + 7 + \
*(int32_t *)((void *)GETRA() + 3) - 1))
+# elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
+# define GETRA() ((uintptr_t)__builtin_return_address(0))
+# define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() + 4)) - 1))
# else
# error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
# endif
diff --git a/tcg/ppc/tcg-target.c b/tcg/ppc/tcg-target.c
index 60b7b92..ec10fa8 100644
--- a/tcg/ppc/tcg-target.c
+++ b/tcg/ppc/tcg-target.c
@@ -39,8 +39,6 @@ static uint8_t *tb_ret_addr;
#define LR_OFFSET 4
#endif
-#define FAST_PATH
-
#ifndef GUEST_BASE
#define GUEST_BASE 0
#endif
@@ -520,6 +518,37 @@ static void tcg_out_call (TCGContext *s, tcg_target_long arg, int const_arg)
#if defined(CONFIG_SOFTMMU)
+static void add_qemu_ldst_label (TCGContext *s,
+ int is_ld,
+ int opc,
+ int data_reg,
+ int data_reg2,
+ int addrlo_reg,
+ int addrhi_reg,
+ int mem_index,
+ uint8_t *raddr,
+ uint8_t *label_ptr)
+{
+ int idx;
+ TCGLabelQemuLdst *label;
+
+ if (s->nb_qemu_ldst_labels >= TCG_MAX_QEMU_LDST) {
+ tcg_abort();
+ }
+
+ idx = s->nb_qemu_ldst_labels++;
+ label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[idx];
+ label->is_ld = is_ld;
+ label->opc = opc;
+ label->datalo_reg = data_reg;
+ label->datahi_reg = data_reg2;
+ label->addrlo_reg = addrlo_reg;
+ label->addrhi_reg = addrhi_reg;
+ label->mem_index = mem_index;
+ label->raddr = raddr;
+ label->label_ptr[0] = label_ptr;
+}
+
#include "../../softmmu_defs.h"
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
@@ -541,34 +570,11 @@ static const void * const qemu_st_helpers[4] = {
};
#endif
-static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
+static void tcg_out_tlb_check (TCGContext *s, int r0, int r1, int r2,
+ int addr_reg, int addr_reg2, int s_bits,
+ int offset1, int offset2, uint8_t **label_ptr)
{
- int addr_reg, data_reg, data_reg2, r0, r1, rbase, bswap;
-#ifdef CONFIG_SOFTMMU
- int mem_index, s_bits, r2, ir;
- void *label1_ptr, *label2_ptr;
-#if TARGET_LONG_BITS == 64
- int addr_reg2;
-#endif
-#endif
-
- data_reg = *args++;
- if (opc == 3)
- data_reg2 = *args++;
- else
- data_reg2 = 0;
- addr_reg = *args++;
-
-#ifdef CONFIG_SOFTMMU
-#if TARGET_LONG_BITS == 64
- addr_reg2 = *args++;
-#endif
- mem_index = *args;
- s_bits = opc & 3;
- r0 = 3;
- r1 = 4;
- r2 = 0;
- rbase = 0;
+ uint16_t retranst;
tcg_out32 (s, (RLWINM
| RA (r0)
@@ -582,7 +588,7 @@ static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
tcg_out32 (s, (LWZU
| RT (r1)
| RA (r0)
- | offsetof (CPUArchState, tlb_table[mem_index][0].addr_read)
+ | offset1
)
);
tcg_out32 (s, (RLWINM
@@ -600,77 +606,57 @@ static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
tcg_out32 (s, CMP | BF (6) | RA (addr_reg2) | RB (r1));
tcg_out32 (s, CRAND | BT (7, CR_EQ) | BA (6, CR_EQ) | BB (7, CR_EQ));
#endif
+ *label_ptr = s->code_ptr;
+ retranst = ((uint16_t *) s->code_ptr)[1] & ~3;
+ tcg_out32 (s, BC | BI (7, CR_EQ) | retranst | BO_COND_FALSE);
- label1_ptr = s->code_ptr;
-#ifdef FAST_PATH
- tcg_out32 (s, BC | BI (7, CR_EQ) | BO_COND_TRUE);
-#endif
-
- /* slow path */
- ir = 3;
- tcg_out_mov (s, TCG_TYPE_I32, ir++, TCG_AREG0);
-#if TARGET_LONG_BITS == 32
- tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
-#else
-#ifdef TCG_TARGET_CALL_ALIGN_ARGS
- ir |= 1;
-#endif
- tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg2);
- tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
-#endif
- tcg_out_movi (s, TCG_TYPE_I32, ir, mem_index);
-
- tcg_out_call (s, (tcg_target_long) qemu_ld_helpers[s_bits], 1);
- switch (opc) {
- case 0|4:
- tcg_out32 (s, EXTSB | RA (data_reg) | RS (3));
- break;
- case 1|4:
- tcg_out32 (s, EXTSH | RA (data_reg) | RS (3));
- break;
- case 0:
- case 1:
- case 2:
- if (data_reg != 3)
- tcg_out_mov (s, TCG_TYPE_I32, data_reg, 3);
- break;
- case 3:
- if (data_reg == 3) {
- if (data_reg2 == 4) {
- tcg_out_mov (s, TCG_TYPE_I32, 0, 4);
- tcg_out_mov (s, TCG_TYPE_I32, 4, 3);
- tcg_out_mov (s, TCG_TYPE_I32, 3, 0);
- }
- else {
- tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
- tcg_out_mov (s, TCG_TYPE_I32, 3, 4);
- }
- }
- else {
- if (data_reg != 4) tcg_out_mov (s, TCG_TYPE_I32, data_reg, 4);
- if (data_reg2 != 3) tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
- }
- break;
- }
- label2_ptr = s->code_ptr;
- tcg_out32 (s, B);
-
- /* label1: fast path */
-#ifdef FAST_PATH
- reloc_pc14 (label1_ptr, (tcg_target_long) s->code_ptr);
-#endif
-
- /* r0 now contains &env->tlb_table[mem_index][index].addr_read */
+ /* r0 now contains &env->tlb_table[mem_index][index].addr_x */
tcg_out32 (s, (LWZ
| RT (r0)
| RA (r0)
- | (offsetof (CPUTLBEntry, addend)
- - offsetof (CPUTLBEntry, addr_read))
- ));
+ | offset2
+ )
+ );
/* r0 = env->tlb_table[mem_index][index].addend */
tcg_out32 (s, ADD | RT (r0) | RA (r0) | RB (addr_reg));
/* r0 = env->tlb_table[mem_index][index].addend + addr */
+}
+
+static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
+{
+ int addr_reg, addr_reg2, data_reg, data_reg2, r0, r1, rbase, bswap;
+#ifdef CONFIG_SOFTMMU
+ int mem_index, s_bits, r2;
+ uint8_t *label_ptr;
+#endif
+
+ data_reg = *args++;
+ if (opc == 3)
+ data_reg2 = *args++;
+ else
+ data_reg2 = 0;
+ addr_reg = *args++;
+
+#ifdef CONFIG_SOFTMMU
+#if TARGET_LONG_BITS == 64
+ addr_reg2 = *args++;
+#else
+ addr_reg2 = 0;
+#endif
+ mem_index = *args;
+ s_bits = opc & 3;
+ r0 = 3;
+ r1 = 4;
+ r2 = 0;
+ rbase = 0;
+
+ tcg_out_tlb_check (
+ s, r0, r1, r2, addr_reg, addr_reg2, s_bits,
+ offsetof (CPUArchState, tlb_table[mem_index][0].addr_read),
+ offsetof (CPUTLBEntry, addend) - offsetof (CPUTLBEntry, addr_read),
+ &label_ptr
+ );
#else /* !CONFIG_SOFTMMU */
r0 = addr_reg;
r1 = 3;
@@ -736,21 +722,26 @@ static void tcg_out_qemu_ld (TCGContext *s, const TCGArg *args, int opc)
}
break;
}
-
#ifdef CONFIG_SOFTMMU
- reloc_pc24 (label2_ptr, (tcg_target_long) s->code_ptr);
+ add_qemu_ldst_label (s,
+ 1,
+ opc,
+ data_reg,
+ data_reg2,
+ addr_reg,
+ addr_reg2,
+ mem_index,
+ s->code_ptr,
+ label_ptr);
#endif
}
static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
{
- int addr_reg, r0, r1, data_reg, data_reg2, bswap, rbase;
+ int addr_reg, addr_reg2, r0, r1, data_reg, data_reg2, bswap, rbase;
#ifdef CONFIG_SOFTMMU
- int mem_index, r2, ir;
- void *label1_ptr, *label2_ptr;
-#if TARGET_LONG_BITS == 64
- int addr_reg2;
-#endif
+ int mem_index, r2;
+ uint8_t *label_ptr;
#endif
data_reg = *args++;
@@ -763,6 +754,8 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
#ifdef CONFIG_SOFTMMU
#if TARGET_LONG_BITS == 64
addr_reg2 = *args++;
+#else
+ addr_reg2 = 0;
#endif
mem_index = *args;
r0 = 3;
@@ -770,41 +763,89 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
r2 = 0;
rbase = 0;
- tcg_out32 (s, (RLWINM
- | RA (r0)
- | RS (addr_reg)
- | SH (32 - (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
- | MB (32 - (CPU_TLB_ENTRY_BITS + CPU_TLB_BITS))
- | ME (31 - CPU_TLB_ENTRY_BITS)
- )
- );
- tcg_out32 (s, ADD | RT (r0) | RA (r0) | RB (TCG_AREG0));
- tcg_out32 (s, (LWZU
- | RT (r1)
- | RA (r0)
- | offsetof (CPUArchState, tlb_table[mem_index][0].addr_write)
- )
- );
- tcg_out32 (s, (RLWINM
- | RA (r2)
- | RS (addr_reg)
- | SH (0)
- | MB ((32 - opc) & 31)
- | ME (31 - TARGET_PAGE_BITS)
- )
+ tcg_out_tlb_check (
+ s, r0, r1, r2, addr_reg, addr_reg2, opc & 3,
+ offsetof (CPUArchState, tlb_table[mem_index][0].addr_write),
+ offsetof (CPUTLBEntry, addend) - offsetof (CPUTLBEntry, addr_write),
+ &label_ptr
);
+#else /* !CONFIG_SOFTMMU */
+ r0 = addr_reg;
+ r1 = 3;
+ rbase = GUEST_BASE ? TCG_GUEST_BASE_REG : 0;
+#endif
- tcg_out32 (s, CMP | (7 << 23) | RA (r2) | RB (r1));
-#if TARGET_LONG_BITS == 64
- tcg_out32 (s, LWZ | RT (r1) | RA (r0) | 4);
- tcg_out32 (s, CMP | BF (6) | RA (addr_reg2) | RB (r1));
- tcg_out32 (s, CRAND | BT (7, CR_EQ) | BA (6, CR_EQ) | BB (7, CR_EQ));
+#ifdef TARGET_WORDS_BIGENDIAN
+ bswap = 0;
+#else
+ bswap = 1;
#endif
+ switch (opc) {
+ case 0:
+ tcg_out32 (s, STBX | SAB (data_reg, rbase, r0));
+ break;
+ case 1:
+ if (bswap)
+ tcg_out32 (s, STHBRX | SAB (data_reg, rbase, r0));
+ else
+ tcg_out32 (s, STHX | SAB (data_reg, rbase, r0));
+ break;
+ case 2:
+ if (bswap)
+ tcg_out32 (s, STWBRX | SAB (data_reg, rbase, r0));
+ else
+ tcg_out32 (s, STWX | SAB (data_reg, rbase, r0));
+ break;
+ case 3:
+ if (bswap) {
+ tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
+ tcg_out32 (s, STWBRX | SAB (data_reg, rbase, r0));
+ tcg_out32 (s, STWBRX | SAB (data_reg2, rbase, r1));
+ }
+ else {
+#ifdef CONFIG_USE_GUEST_BASE
+ tcg_out32 (s, STWX | SAB (data_reg2, rbase, r0));
+ tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
+ tcg_out32 (s, STWX | SAB (data_reg, rbase, r1));
+#else
+ tcg_out32 (s, STW | RS (data_reg2) | RA (r0));
+ tcg_out32 (s, STW | RS (data_reg) | RA (r0) | 4);
+#endif
+ }
+ break;
+ }
- label1_ptr = s->code_ptr;
-#ifdef FAST_PATH
- tcg_out32 (s, BC | BI (7, CR_EQ) | BO_COND_TRUE);
+#ifdef CONFIG_SOFTMMU
+ add_qemu_ldst_label (s,
+ 0,
+ opc,
+ data_reg,
+ data_reg2,
+ addr_reg,
+ addr_reg2,
+ mem_index,
+ s->code_ptr,
+ label_ptr);
#endif
+}
+
+#if defined(CONFIG_SOFTMMU)
+static void tcg_out_qemu_ld_slow_path (TCGContext *s, TCGLabelQemuLdst *label)
+{
+ int s_bits;
+ int ir;
+ int opc = label->opc;
+ int mem_index = label->mem_index;
+ int data_reg = label->datalo_reg;
+ int data_reg2 = label->datahi_reg;
+ int addr_reg = label->addrlo_reg;
+ uint8_t *raddr = label->raddr;
+ uint8_t **label_ptr = &label->label_ptr[0];
+
+ s_bits = opc & 3;
+
+ /* resolve label address */
+ reloc_pc14 (label_ptr[0], (tcg_target_long) s->code_ptr);
/* slow path */
ir = 3;
@@ -815,7 +856,75 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
#ifdef TCG_TARGET_CALL_ALIGN_ARGS
ir |= 1;
#endif
- tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg2);
+ tcg_out_mov (s, TCG_TYPE_I32, ir++, label->addrhi_reg);
+ tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
+#endif
+ tcg_out_movi (s, TCG_TYPE_I32, ir, mem_index);
+ tcg_out_call (s, (tcg_target_long) qemu_ld_helpers[s_bits], 1);
+ tcg_out32 (s, B | 8);
+ tcg_out32 (s, (tcg_target_long) raddr);
+ switch (opc) {
+ case 0|4:
+ tcg_out32 (s, EXTSB | RA (data_reg) | RS (3));
+ break;
+ case 1|4:
+ tcg_out32 (s, EXTSH | RA (data_reg) | RS (3));
+ break;
+ case 0:
+ case 1:
+ case 2:
+ if (data_reg != 3)
+ tcg_out_mov (s, TCG_TYPE_I32, data_reg, 3);
+ break;
+ case 3:
+ if (data_reg == 3) {
+ if (data_reg2 == 4) {
+ tcg_out_mov (s, TCG_TYPE_I32, 0, 4);
+ tcg_out_mov (s, TCG_TYPE_I32, 4, 3);
+ tcg_out_mov (s, TCG_TYPE_I32, 3, 0);
+ }
+ else {
+ tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
+ tcg_out_mov (s, TCG_TYPE_I32, 3, 4);
+ }
+ }
+ else {
+ if (data_reg != 4) tcg_out_mov (s, TCG_TYPE_I32, data_reg, 4);
+ if (data_reg2 != 3) tcg_out_mov (s, TCG_TYPE_I32, data_reg2, 3);
+ }
+ break;
+ }
+ /* Jump to the code corresponding to next IR of qemu_st */
+ tcg_out_b (s, 0, (tcg_target_long) raddr);
+}
+
+static void tcg_out_qemu_st_slow_path (TCGContext *s, TCGLabelQemuLdst *label)
+{
+ int s_bits;
+ int ir;
+ int opc = label->opc;
+ int mem_index = label->mem_index;
+ int data_reg = label->datalo_reg;
+ int data_reg2 = label->datahi_reg;
+ int addr_reg = label->addrlo_reg;
+ uint8_t *raddr = label->raddr;
+ uint8_t **label_ptr = &label->label_ptr[0];
+
+ s_bits = opc & 3;
+
+ /* resolve label address */
+ reloc_pc14 (label_ptr[0], (tcg_target_long) s->code_ptr);
+
+ /* slow path */
+ ir = 3;
+ tcg_out_mov (s, TCG_TYPE_I32, ir++, TCG_AREG0);
+#if TARGET_LONG_BITS == 32
+ tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
+#else
+#ifdef TCG_TARGET_CALL_ALIGN_ARGS
+ ir |= 1;
+#endif
+ tcg_out_mov (s, TCG_TYPE_I32, ir++, label->addrhi_reg);
tcg_out_mov (s, TCG_TYPE_I32, ir++, addr_reg);
#endif
@@ -851,74 +960,28 @@ static void tcg_out_qemu_st (TCGContext *s, const TCGArg *args, int opc)
tcg_out_movi (s, TCG_TYPE_I32, ir, mem_index);
tcg_out_call (s, (tcg_target_long) qemu_st_helpers[opc], 1);
- label2_ptr = s->code_ptr;
- tcg_out32 (s, B);
-
- /* label1: fast path */
-#ifdef FAST_PATH
- reloc_pc14 (label1_ptr, (tcg_target_long) s->code_ptr);
-#endif
-
- tcg_out32 (s, (LWZ
- | RT (r0)
- | RA (r0)
- | (offsetof (CPUTLBEntry, addend)
- - offsetof (CPUTLBEntry, addr_write))
- ));
- /* r0 = env->tlb_table[mem_index][index].addend */
- tcg_out32 (s, ADD | RT (r0) | RA (r0) | RB (addr_reg));
- /* r0 = env->tlb_table[mem_index][index].addend + addr */
-
-#else /* !CONFIG_SOFTMMU */
- r0 = addr_reg;
- r1 = 3;
- rbase = GUEST_BASE ? TCG_GUEST_BASE_REG : 0;
-#endif
+ tcg_out32 (s, B | 8);
+ tcg_out32 (s, (tcg_target_long) raddr);
+ tcg_out_b (s, 0, (tcg_target_long) raddr);
+}
-#ifdef TARGET_WORDS_BIGENDIAN
- bswap = 0;
-#else
- bswap = 1;
-#endif
- switch (opc) {
- case 0:
- tcg_out32 (s, STBX | SAB (data_reg, rbase, r0));
- break;
- case 1:
- if (bswap)
- tcg_out32 (s, STHBRX | SAB (data_reg, rbase, r0));
- else
- tcg_out32 (s, STHX | SAB (data_reg, rbase, r0));
- break;
- case 2:
- if (bswap)
- tcg_out32 (s, STWBRX | SAB (data_reg, rbase, r0));
- else
- tcg_out32 (s, STWX | SAB (data_reg, rbase, r0));
- break;
- case 3:
- if (bswap) {
- tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
- tcg_out32 (s, STWBRX | SAB (data_reg, rbase, r0));
- tcg_out32 (s, STWBRX | SAB (data_reg2, rbase, r1));
+void tcg_out_tb_finalize(TCGContext *s)
+{
+ int i;
+ TCGLabelQemuLdst *label;
+
+ /* qemu_ld/st slow paths */
+ for (i = 0; i < s->nb_qemu_ldst_labels; i++) {
+ label = (TCGLabelQemuLdst *) &s->qemu_ldst_labels[i];
+ if (label->is_ld) {
+ tcg_out_qemu_ld_slow_path (s, label);
}
else {
-#ifdef CONFIG_USE_GUEST_BASE
- tcg_out32 (s, STWX | SAB (data_reg2, rbase, r0));
- tcg_out32 (s, ADDI | RT (r1) | RA (r0) | 4);
- tcg_out32 (s, STWX | SAB (data_reg, rbase, r1));
-#else
- tcg_out32 (s, STW | RS (data_reg2) | RA (r0));
- tcg_out32 (s, STW | RS (data_reg) | RA (r0) | 4);
-#endif
+ tcg_out_qemu_st_slow_path (s, label);
}
- break;
}
-
-#ifdef CONFIG_SOFTMMU
- reloc_pc24 (label2_ptr, (tcg_target_long) s->code_ptr);
-#endif
}
+#endif
static void tcg_target_qemu_prologue (TCGContext *s)
{
--
mailto:av1474@comtv.ru
* Re: [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
2012-10-31 7:04 [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
` (3 preceding siblings ...)
2012-11-02 5:35 ` [Qemu-devel] [PATCH v8 0/3] tcg: enhance code generation quality for qemu_ld/st IRs malc
@ 2012-11-03 12:52 ` Blue Swirl
4 siblings, 0 replies; 6+ messages in thread
From: Blue Swirl @ 2012-11-03 12:52 UTC
To: Yeongkyoon Lee; +Cc: qemu-devel, aurelien, rth
On Wed, Oct 31, 2012 at 7:04 AM, Yeongkyoon Lee
<yeongkyoon.lee@samsung.com> wrote:
> Here is the 8th version of the series optimizing TCG qemu_ld/st code generation.
Thanks, applied all.
>
> v8:
> - Rebase
>
> v7:
> - Rebase and fix mistyping
>
> v6:
> - Remove an extra argument of return addr from MMU helpers
> Instead, embed the fast path addr to the slow path for helpers to use it
> - Change some bitwise operations to bitfields of structure
> - Change the name of function which handles finalization of TB code generation
>
> v5:
> - Remove RFC tag
>
> v4:
> - Remove CONFIG_SOFTMMU pre-condition from configure
> - Instead, add some CONFIG_SOFTMMU condition to TCG sources
> - Remove some unnecessary comments
>
> v3:
> - Support CONFIG_TCG_PASS_AREG0
> (expected to get more performance enhancement than others)
> - Remove the configure option "--enable-ldst-optimization""
> - Make the optimization as default on i386 and x86_64 hosts
> - Fix some mistyping and apply checkpatch.pl before committing
> - Test i386, arm and sparc softmmu targets on i386 and x86_64 hosts
> - Test linux-user-test-0.3
>
> v2:
> - Follow the submit rule of qemu
>
> v1:
> - Initial commit request
>
> I think the generated code from qemu_ld/st IRs is relatively heavy: up to
> 12 instructions for the TLB hit case on an i386 host.
> This patch series enhances the code quality of TCG qemu_ld/st IRs by reducing
> jumps and enhancing locality.
> The main idea is simple and has already been described in the comments in
> tcg-target.c: separate the slow path (TLB miss case) and generate it at the
> end of the TB.
>
> For example, the generated code from qemu_ld changes as follows.
> Before:
> (1) TLB check
> (2) If hit fall through, else jump to TLB miss case (5)
> (3) TLB hit case: Load value from host memory
> (4) Jump to next code (6)
> (5) TLB miss case: call MMU helper
> (6) ... (next code)
>
> After:
> (1) TLB check
> (2) If hit fall through, else jump to TLB miss case (5)
> (3) TLB hit case: Load value from host memory
> (4) ... (next code)
> ...
> (5) TLB miss case: call MMU helper
> (6) Jump to (8)
> (7) [embedded addr of (4)] <- never executed but read by MMU helpers
> (8) Return to next code (4)
>
> Following are some performance results measured based on qemu 1.0.
> Although there was some measurement error, the results were not negligible.
>
> * EEMBC CoreMark (before -> after)
> - Guest: i386, Linux (Tizen platform)
> - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
> - Results: 1135.6 -> 1179.9 (+3.9%)
>
> * nbench (before -> after)
> - Guest: i386, Linux (linux-0.2.img included in QEMU source)
> - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
> - Results
> . MEMORY INDEX: 1.6782 -> 1.6818 (+0.2%)
> . INTEGER INDEX: 1.8258 -> 1.877 (+2.8%)
> . FLOATING-POINT INDEX: 0.5944 -> 0.5954 (+0.2%)
>
> Summarized features:
> - The changes are wrapped by the macro "CONFIG_QEMU_LDST_OPTIMIZATION" and
>   are enabled by default on i386/x86_64 hosts
> - Forced removal of the macro will cause a compilation error on i386/x86_64 hosts
> - No implementations for hosts other than i386/x86_64 yet
>
> In addition, I have tried to remove the generated code that calls the MMU
> helpers for the TLB miss case from the end of the TB, but have not found a
> good solution yet.
> In my opinion, TLB hit case performance could degrade if that calling code
> were removed, because the runtime parameters, such as data, mmu index and
> return address, would have to be set up in registers or on the stack even
> though they are not used in the TLB hit case.
> This remains a further issue.
>
> Yeongkyoon Lee (3):
> configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st
> optimization
> tcg: Add extended GETPC mechanism for MMU helpers with ldst
> optimization
> tcg: Optimize qemu_ld/st by generating slow paths at the end of a
> block
>
> configure | 6 +
> exec-all.h | 36 +++++
> exec.c | 11 ++
> softmmu_template.h | 16 +-
> tcg/i386/tcg-target.c | 404 ++++++++++++++++++++++++++++++++++---------------
> tcg/tcg.c | 12 ++
> tcg/tcg.h | 30 ++++
> 7 files changed, 381 insertions(+), 134 deletions(-)
>
> --
> 1.7.9.5
>