* [Qemu-devel] [PATCH 1/7] exec: Reorganize the GETRA/GETPC macros
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 2/7] tcg-i386: Don't perform GETPC adjustment in TCG code Richard Henderson
` (5 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
Always define GETRA; use __builtin_extract_return_addr, rather than
having a special case for s390. Split GETPC_ADJ out of GETPC; use 2
universally, rather than having a special case for arm.
Rename GETPC_LDST to GETRA_LDST to indicate that it does not
contain the GETPC_ADJ value. Likewise with GETPC_EXT to GETRA_EXT.
Perform the GETPC_ADJ adjustment inside helper_ret_ld/st. This will
allow backends to pass along the "true" return address rather than
the massaged GETPC value. In the meantime, double application of
GETPC_ADJ does not hurt, since the call insn in all ISAs is at least
4 bytes long.
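To make the new layering concrete, here is a minimal sketch of how the
macros compose after this patch, for the common non-interpreter case
(helper_example and its body are hypothetical, for illustration only):

/* The true return address, as delivered by the host call insn. */
#define GETRA() \
    ((uintptr_t)__builtin_extract_return_addr(__builtin_return_address(0)))

/* Back up into the middle of the call insn; 2 works for every host. */
#define GETPC_ADJ 2
#define GETPC()   (GETRA() - GETPC_ADJ)

/* A helper that receives the true return address adjusts exactly once.
   (helper_example is a hypothetical stand-in for helper_ret_ld/st.) */
void helper_example(CPUArchState *env, uintptr_t retaddr)
{
    retaddr -= GETPC_ADJ;   /* now points within the call insn */
    /* ... use retaddr to find the guest insn, e.g. for tlb_fill ... */
}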
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
include/exec/exec-all.h | 87 +++++++++++++++++++----------------------
include/exec/softmmu_template.h | 24 ++++++++----
2 files changed, 58 insertions(+), 53 deletions(-)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b70028a..32d0204 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -295,49 +295,42 @@ static inline void tb_add_jump(TranslationBlock *tb, int n,
}
}
-/* The return address may point to the start of the next instruction.
- Subtracting one gets us the call instruction itself. */
+/* GETRA is the true target of the return instruction that we'll execute,
+ defined here for simplicity of defining the follow-up macros. */
#if defined(CONFIG_TCG_INTERPRETER)
extern uintptr_t tci_tb_ptr;
-# define GETPC() tci_tb_ptr
-#elif defined(__s390__) && !defined(__s390x__)
-# define GETPC() \
- (((uintptr_t)__builtin_return_address(0) & 0x7fffffffUL) - 1)
-#elif defined(__arm__)
-/* Thumb return addresses have the low bit set, so we need to subtract two.
- This is still safe in ARM mode because instructions are 4 bytes. */
-# define GETPC() ((uintptr_t)__builtin_return_address(0) - 2)
+# define GETRA() tci_tb_ptr
+#else
+# define GETRA() \
+ ((uintptr_t)__builtin_extract_return_addr(__builtin_return_address(0)))
+#endif
+
+/* The true return address will often point to a host insn that is part of
+ the next translated guest insn. Adjust the address backward to point to
+ the middle of the call insn. Subtracting one would do the job except for
+ several compressed mode architectures (arm, mips) which set the low bit
+ to indicate the compressed mode; subtracting two works around that. It
+ is also the case that there are no host isas that contain a call insn
+ smaller than 4 bytes, so we don't worry about special-casing this. */
+#if defined(CONFIG_TCG_INTERPRETER)
+# define GETPC_ADJ 0
#else
-# define GETPC() ((uintptr_t)__builtin_return_address(0) - 1)
+# define GETPC_ADJ 2
#endif
+#define GETPC() (GETRA() - GETPC_ADJ)
+
+/* The LDST optimization splits code generation into fast and slow paths.
+ In some implementations, we pass the "logical" return address manually;
+ in others, we must infer the logical return from the true return. */
#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
-/* qemu_ld/st optimization split code generation to fast and slow path, thus,
- it needs special handling for an MMU helper which is called from the slow
- path, to get the fast path's pc without any additional argument.
- It uses a tricky solution which embeds the fast path pc into the slow path.
-
- Code flow in slow path:
- (1) pre-process
- (2) call MMU helper
- (3) jump to (5)
- (4) fast path information (implementation specific)
- (5) post-process (e.g. stack adjust)
- (6) jump to corresponding code of the next of fast path
- */
-# if defined(__i386__) || defined(__x86_64__)
-# define GETRA() ((uintptr_t)__builtin_return_address(0))
-/* The return address argument for ldst is passed directly. */
-# define GETPC_LDST() (abort(), 0)
-# elif defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
-# define GETRA() ((uintptr_t)__builtin_return_address(0))
-# define GETPC_LDST() ((uintptr_t) ((*(int32_t *)(GETRA() - 4)) - 1))
+# if defined (_ARCH_PPC) && !defined (_ARCH_PPC64)
+# define GETRA_LDST(RA) (*(int32_t *)((RA) - 4))
# elif defined(__arm__)
/* We define two insns between the return address and the branch back to
straight-line. Find and decode that branch insn. */
-# define GETRA() ((uintptr_t)__builtin_return_address(0))
-# define GETPC_LDST() tcg_getpc_ldst(GETRA())
-static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
+# define GETRA_LDST(RA) tcg_getra_ldst(RA)
+static inline uintptr_t tcg_getra_ldst(uintptr_t ra)
{
int32_t b;
ra += 8; /* skip the two insns */
@@ -345,31 +338,33 @@ static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
b = (b << 8) >> (8 - 2); /* extract the displacement */
ra += 8; /* branches are relative to pc+8 */
ra += b; /* apply the displacement */
- ra -= 4; /* return a pointer into the current opcode,
- not the start of the next opcode */
return ra;
}
-#elif defined(__aarch64__)
-# define GETRA() ((uintptr_t)__builtin_return_address(0))
-# define GETPC_LDST() tcg_getpc_ldst(GETRA())
-static inline uintptr_t tcg_getpc_ldst(uintptr_t ra)
+# elif defined(__aarch64__)
+# define GETRA_LDST(RA) tcg_getra_ldst(RA)
+static inline uintptr_t tcg_getra_ldst(uintptr_t ra)
{
int32_t b;
ra += 4; /* skip one instruction */
b = *(int32_t *)ra; /* load the branch insn */
b = (b << 6) >> (6 - 2); /* extract the displacement */
ra += b; /* apply the displacement */
- ra -= 4; /* return a pointer into the current opcode,
- not the start of the next opcode */
return ra;
}
-# else
-# error "CONFIG_QEMU_LDST_OPTIMIZATION needs GETPC_LDST() implementation!"
# endif
+#endif /* CONFIG_QEMU_LDST_OPTIMIZATION */
+
+/* ??? This declaration really ought to be in tcg.h. */
bool is_tcg_gen_code(uintptr_t pc_ptr);
-# define GETPC_EXT() (is_tcg_gen_code(GETRA()) ? GETPC_LDST() : GETPC())
+
+#ifdef GETRA_LDST
+# define GETRA_EXT() tcg_getra_ext(GETRA())
+static inline uintptr_t tcg_getra_ext(uintptr_t ra)
+{
+ return is_tcg_gen_code(ra) ? GETRA_LDST(ra) : ra;
+}
#else
-# define GETPC_EXT() GETPC()
+# define GETRA_EXT() GETRA()
#endif
#if !defined(CONFIG_USER_ONLY)
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index eaca9e1..2fc6ea3 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -86,6 +86,9 @@ glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
uintptr_t haddr;
+ /* Adjust the given return address. */
+ retaddr -= GETPC_ADJ;
+
/* If the TLB entry is for a different page, reload and try again. */
if ((addr & TARGET_PAGE_MASK)
!= (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
@@ -121,10 +124,12 @@ glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
#endif
addr1 = addr & ~(DATA_SIZE - 1);
addr2 = addr1 + DATA_SIZE;
- res1 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr1,
- mmu_idx, retaddr);
- res2 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr2,
- mmu_idx, retaddr);
+ /* Note the adjustment at the beginning of the function.
+ Undo that for the recursion. */
+ res1 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)
+ (env, addr1, mmu_idx, retaddr + GETPC_ADJ);
+ res2 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)
+ (env, addr2, mmu_idx, retaddr + GETPC_ADJ);
shift = (addr & (DATA_SIZE - 1)) * 8;
#ifdef TARGET_WORDS_BIGENDIAN
res = (res1 << shift) | (res2 >> ((DATA_SIZE * 8) - shift));
@@ -150,7 +155,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
int mmu_idx)
{
return glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr, mmu_idx,
- GETPC_EXT());
+ GETRA_EXT());
}
#ifndef SOFTMMU_CODE_ACCESS
@@ -182,6 +187,9 @@ glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
uintptr_t haddr;
+ /* Adjust the given return address. */
+ retaddr -= GETPC_ADJ;
+
/* If the TLB entry is for a different page, reload and try again. */
if ((addr & TARGET_PAGE_MASK)
!= (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
@@ -223,8 +231,10 @@ glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
#else
uint8_t val8 = val >> (i * 8);
#endif
+ /* Note the adjustment at the beginning of the function.
+ Undo that for the recursion. */
glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
- mmu_idx, retaddr);
+ mmu_idx, retaddr + GETPC_ADJ);
}
return;
}
@@ -245,7 +255,7 @@ glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
DATA_TYPE val, int mmu_idx)
{
glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, val, mmu_idx,
- GETPC_EXT());
+ GETRA_EXT());
}
#endif /* !defined(SOFTMMU_CODE_ACCESS) */
--
1.8.1.4
* [Qemu-devel] [PATCH 2/7] tcg-i386: Don't perform GETPC adjustment in TCG code
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 1/7] exec: Reorganize the GETRA/GETPC macros Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 3/7] exec: Rename USUFFIX to LSUFFIX Richard Henderson
` (4 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
Since we now perform it inside the helper, no need to do it here.
This also lets us perform a tail-call from the store slow path to
the helper.
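Schematically, the end of the store slow path changes like this (a
sketch, not literal emitted code; helper, raddr and %reg are placeholders):

# before: call the helper, then branch back past the fast path
call  helper
jmp   raddr

# after: materialize raddr, push it, and tail-jump; the helper's
# own ret then lands directly on raddr
mov   $raddr, %reg
push  %reg
jmp   helper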
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
tcg/i386/tcg-target.c | 34 ++++++++++++++--------------------
1 file changed, 14 insertions(+), 20 deletions(-)
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 12a7ca3..84f17fe 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1470,12 +1470,6 @@ static void add_qemu_ldst_label(TCGContext *s,
}
}
-/* See the GETPC definition in include/exec/exec-all.h. */
-static inline uintptr_t do_getpc(uint8_t *raddr)
-{
- return (uintptr_t)raddr - 1;
-}
-
/*
* Generate code for the slow path for a load at the end of block
*/
@@ -1509,14 +1503,14 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
ofs += 4;
- tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, do_getpc(l->raddr));
+ tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, (uintptr_t)l->raddr);
} else {
tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
/* The second argument is already loaded with addrlo. */
tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
l->mem_index);
tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
- do_getpc(l->raddr));
+ (uintptr_t)l->raddr);
}
tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
@@ -1571,6 +1565,7 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
int opc = l->opc;
int s_bits = opc & 3;
uint8_t **label_ptr = &l->label_ptr[0];
+ TCGReg retaddr;
/* resolve label address */
*(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
@@ -1603,10 +1598,10 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
ofs += 4;
- tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, do_getpc(l->raddr));
+ retaddr = TCG_REG_EAX;
+ tcg_out_movi(s, TCG_TYPE_I32, retaddr, (uintptr_t)l->raddr);
+ tcg_out_st(s, TCG_TYPE_I32, retaddr, TCG_REG_ESP, ofs);
} else {
- uintptr_t pc;
-
tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
/* The second argument is already loaded with addrlo. */
tcg_out_mov(s, (opc == 3 ? TCG_TYPE_I64 : TCG_TYPE_I32),
@@ -1614,20 +1609,19 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3],
l->mem_index);
- pc = do_getpc(l->raddr);
if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
- tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[4], pc);
- } else if (pc == (int32_t)pc) {
- tcg_out_sti(s, TCG_TYPE_PTR, TCG_REG_ESP, 0, pc);
+ retaddr = tcg_target_call_iarg_regs[4];
+ tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
} else {
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RAX, pc);
- tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_RAX, TCG_REG_ESP, 0);
+ retaddr = TCG_REG_RAX;
+ tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
+ tcg_out_st(s, TCG_TYPE_PTR, retaddr, TCG_REG_ESP, 0);
}
}
- tcg_out_calli(s, (tcg_target_long)qemu_st_helpers[s_bits]);
-
- tcg_out_jmp(s, (tcg_target_long)l->raddr);
+ /* "Tail call" to the helper, with the return address back inline. */
+ tcg_out_push(s, retaddr);
+ tcg_out_jmp(s, (tcg_target_long)qemu_st_helpers[s_bits]);
}
/*
--
1.8.1.4
* [Qemu-devel] [PATCH 3/7] exec: Rename USUFFIX to LSUFFIX
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 1/7] exec: Reorganize the GETRA/GETPC macros Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 2/7] tcg-i386: Don't perform GETPC adjustment in TCG code Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 4/7] target: Include softmmu_exec.h where forgotten Richard Henderson
` (3 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
In a following patch, there will be confusion between multiple "unsigned"
suffixes; rename this one so as to imply "load".
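For instance, with DATA_SIZE == 2 the expansion is unchanged by the
rename (a sketch of the preprocessor output):

/* before: "U" misleadingly suggests "unsigned" */
glue(glue(ld, USUFFIX), _raw)(haddr)   /* USUFFIX == uw => lduw_raw() */

/* after: "L" says "load suffix"; identical expansion */
glue(glue(ld, LSUFFIX), _raw)(haddr)   /* LSUFFIX == uw => lduw_raw() */

This frees the "u"/"s" letters for the genuinely unsigned and signed
helper suffixes introduced later in the series.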
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
include/exec/softmmu_template.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index 2fc6ea3..f9922e2 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -28,19 +28,19 @@
#if DATA_SIZE == 8
#define SUFFIX q
-#define USUFFIX q
+#define LSUFFIX q
#define DATA_TYPE uint64_t
#elif DATA_SIZE == 4
#define SUFFIX l
-#define USUFFIX l
+#define LSUFFIX l
#define DATA_TYPE uint32_t
#elif DATA_SIZE == 2
#define SUFFIX w
-#define USUFFIX uw
+#define LSUFFIX uw
#define DATA_TYPE uint16_t
#elif DATA_SIZE == 1
#define SUFFIX b
-#define USUFFIX ub
+#define LSUFFIX ub
#define DATA_TYPE uint8_t
#else
#error unsupported data size
@@ -147,7 +147,7 @@ glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
#endif
haddr = addr + env->tlb_table[mmu_idx][index].addend;
- return glue(glue(ld, USUFFIX), _raw)((uint8_t *)haddr);
+ return glue(glue(ld, LSUFFIX), _raw)((uint8_t *)haddr);
}
DATA_TYPE
@@ -264,6 +264,6 @@ glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
#undef SHIFT
#undef DATA_TYPE
#undef SUFFIX
-#undef USUFFIX
+#undef LSUFFIX
#undef DATA_SIZE
#undef ADDR_READ
--
1.8.1.4
* [Qemu-devel] [PATCH 4/7] target: Include softmmu_exec.h where forgotten
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
` (2 preceding siblings ...)
2013-08-27 21:46 ` [Qemu-devel] [PATCH 3/7] exec: Rename USUFFIX to LSUFFIX Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 5/7] exec: Split softmmu_defs.h Richard Henderson
` (2 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
Several targets forgot to include softmmu_exec.h, and would therefore
break with a header cleanup to follow.
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
target-lm32/op_helper.c | 2 ++
target-moxie/helper.c | 1 +
target-ppc/mmu_helper.c | 2 ++
target-unicore32/op_helper.c | 2 ++
target-xtensa/op_helper.c | 1 +
5 files changed, 8 insertions(+)
diff --git a/target-lm32/op_helper.c b/target-lm32/op_helper.c
index 2dab9f2..8f5ef55 100644
--- a/target-lm32/op_helper.c
+++ b/target-lm32/op_helper.c
@@ -6,6 +6,8 @@
#include "hw/lm32/lm32_pic.h"
#include "hw/char/lm32_juart.h"
+#include "exec/softmmu_exec.h"
+
#if !defined(CONFIG_USER_ONLY)
#define MMUSUFFIX _mmu
#define SHIFT 0
diff --git a/target-moxie/helper.c b/target-moxie/helper.c
index b12e4ff..7859102 100644
--- a/target-moxie/helper.c
+++ b/target-moxie/helper.c
@@ -25,6 +25,7 @@
#include "cpu.h"
#include "mmu.h"
#include "exec/exec-all.h"
+#include "exec/softmmu_exec.h"
#include "qemu/host-utils.h"
#include "helper.h"
diff --git a/target-ppc/mmu_helper.c b/target-ppc/mmu_helper.c
index 5dd4e05..44f04e5 100644
--- a/target-ppc/mmu_helper.c
+++ b/target-ppc/mmu_helper.c
@@ -2871,6 +2871,8 @@ void helper_booke206_tlbflush(CPUPPCState *env, uint32_t type)
/*****************************************************************************/
+#include "exec/softmmu_exec.h"
+
#define MMUSUFFIX _mmu
#define SHIFT 0
diff --git a/target-unicore32/op_helper.c b/target-unicore32/op_helper.c
index 6443ffe..4f9f41e 100644
--- a/target-unicore32/op_helper.c
+++ b/target-unicore32/op_helper.c
@@ -239,6 +239,8 @@ uint32_t HELPER(ror_cc)(CPUUniCore32State *env, uint32_t x, uint32_t i)
}
#ifndef CONFIG_USER_ONLY
+#include "exec/softmmu_exec.h"
+
#define MMUSUFFIX _mmu
#define SHIFT 0
diff --git a/target-xtensa/op_helper.c b/target-xtensa/op_helper.c
index 01123af..cf97025 100644
--- a/target-xtensa/op_helper.c
+++ b/target-xtensa/op_helper.c
@@ -28,6 +28,7 @@
#include "cpu.h"
#include "helper.h"
#include "qemu/host-utils.h"
+#include "exec/softmmu_exec.h"
static void do_unaligned_access(CPUXtensaState *env,
target_ulong addr, int is_write, int is_user, uintptr_t retaddr);
--
1.8.1.4
* [Qemu-devel] [PATCH 5/7] exec: Split softmmu_defs.h
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
` (3 preceding siblings ...)
2013-08-27 21:46 ` [Qemu-devel] [PATCH 4/7] target: Include softmmu_exec.h where forgotten Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 6/7] tcg: Introduce zero and sign-extended versions of load helpers Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu Richard Henderson
6 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
The _cmmu helpers can be moved to exec-all.h. The helpers that are
used from TCG will shortly need access to tcg_target_long, so move
their declarations into tcg.h.
This requires minor include adjustments to all TCG backends.
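The tcg_target_long dependency becomes visible in the next patch, where
the moved load helpers grow signatures such as (anticipating patch 6):

#ifdef CONFIG_SOFTMMU
/* Value zero-extended to the TCG register size. */
tcg_target_ulong helper_ret_lduw_mmu(CPUArchState *env, target_ulong addr,
                                     int mmu_idx, uintptr_t retaddr);
#endif

A declaration like that cannot live in a header included before
tcg_target_ulong is defined, hence the move into tcg.h.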
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
include/exec/exec-all.h | 5 ++++-
include/exec/softmmu_defs.h | 49 ---------------------------------------------
include/exec/softmmu_exec.h | 3 ++-
tcg/aarch64/tcg-target.c | 2 --
tcg/arm/tcg-target.c | 2 --
tcg/hppa/tcg-target.c | 2 --
tcg/i386/tcg-target.c | 3 ---
tcg/ia64/tcg-target.c | 3 ---
tcg/mips/tcg-target.c | 3 ---
tcg/ppc/tcg-target.c | 2 --
tcg/ppc64/tcg-target.c | 3 ---
tcg/s390/tcg-target.c | 3 ---
tcg/sparc/tcg-target.c | 2 --
tcg/tcg.h | 43 +++++++++++++++++++++++++++++++++++++++
14 files changed, 49 insertions(+), 76 deletions(-)
delete mode 100644 include/exec/softmmu_defs.h
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 32d0204..ffee83f 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -378,7 +378,10 @@ bool io_mem_write(struct MemoryRegion *mr, hwaddr addr,
void tlb_fill(CPUArchState *env1, target_ulong addr, int is_write, int mmu_idx,
uintptr_t retaddr);
-#include "exec/softmmu_defs.h"
+uint8_t helper_ldb_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+uint16_t helper_ldw_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+uint32_t helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
#define ACCESS_TYPE (NB_MMU_MODES + 1)
#define MEMSUFFIX _code
diff --git a/include/exec/softmmu_defs.h b/include/exec/softmmu_defs.h
deleted file mode 100644
index e55e717..0000000
--- a/include/exec/softmmu_defs.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- * Software MMU support
- *
- * Declare helpers used by TCG for qemu_ld/st ops.
- *
- * Used by softmmu_exec.h, TCG targets and exec-all.h.
- *
- */
-#ifndef SOFTMMU_DEFS_H
-#define SOFTMMU_DEFS_H
-
-uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
-uint16_t helper_ret_ldw_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
-uint32_t helper_ret_ldl_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
-uint64_t helper_ret_ldq_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
-
-void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
- int mmu_idx, uintptr_t retaddr);
-void helper_ret_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
- int mmu_idx, uintptr_t retaddr);
-void helper_ret_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
- int mmu_idx, uintptr_t retaddr);
-void helper_ret_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
- int mmu_idx, uintptr_t retaddr);
-
-uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-uint16_t helper_ldw_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-uint32_t helper_ldl_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-
-void helper_stb_mmu(CPUArchState *env, target_ulong addr,
- uint8_t val, int mmu_idx);
-void helper_stw_mmu(CPUArchState *env, target_ulong addr,
- uint16_t val, int mmu_idx);
-void helper_stl_mmu(CPUArchState *env, target_ulong addr,
- uint32_t val, int mmu_idx);
-void helper_stq_mmu(CPUArchState *env, target_ulong addr,
- uint64_t val, int mmu_idx);
-
-uint8_t helper_ldb_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-uint16_t helper_ldw_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-uint32_t helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-
-#endif /* SOFTMMU_DEFS_H */
diff --git a/include/exec/softmmu_exec.h b/include/exec/softmmu_exec.h
index 3e4e886..6fde154 100644
--- a/include/exec/softmmu_exec.h
+++ b/include/exec/softmmu_exec.h
@@ -19,7 +19,8 @@
#define ldul_executive ldl_executive
#define ldul_supervisor ldl_supervisor
-#include "exec/softmmu_defs.h"
+/* The memory helpers for tcg-generated code need tcg_target_long etc. */
+#include "tcg.h"
#define ACCESS_TYPE 0
#define MEMSUFFIX MMU_MODE0_SUFFIX
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 41a17f8..55ff700 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -778,8 +778,6 @@ static inline void tcg_out_nop(TCGContext *s)
}
#ifdef CONFIG_SOFTMMU
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 6c4854d..6d084b3 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -1058,8 +1058,6 @@ static inline void tcg_out_goto_label(TCGContext *s, int cond, int label_index)
#ifdef CONFIG_SOFTMMU
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/hppa/tcg-target.c b/tcg/hppa/tcg-target.c
index 68f77ba..bac1f12 100644
--- a/tcg/hppa/tcg-target.c
+++ b/tcg/hppa/tcg-target.c
@@ -906,8 +906,6 @@ static void tcg_out_movcond(TCGContext *s, int cond, TCGArg ret,
}
#if defined(CONFIG_SOFTMMU)
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 84f17fe..4c98cc4 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1022,9 +1022,6 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)
}
#if defined(CONFIG_SOFTMMU)
-
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
* int mmu_idx, uintptr_t ra)
*/
diff --git a/tcg/ia64/tcg-target.c b/tcg/ia64/tcg-target.c
index 2373d9e..56a1b3a 100644
--- a/tcg/ia64/tcg-target.c
+++ b/tcg/ia64/tcg-target.c
@@ -1490,9 +1490,6 @@ static inline void tcg_out_movcond(TCGContext *s, TCGCond cond, TCGArg ret,
}
#if defined(CONFIG_SOFTMMU)
-
-#include "exec/softmmu_defs.h"
-
/* Load and compare a TLB entry, and return the result in (p6, p7).
R2 is loaded with the address of the addend TLB entry.
R57 is loaded with the address, zero extended on 32-bit targets. */
diff --git a/tcg/mips/tcg-target.c b/tcg/mips/tcg-target.c
index 793532e..a750d42 100644
--- a/tcg/mips/tcg-target.c
+++ b/tcg/mips/tcg-target.c
@@ -919,9 +919,6 @@ static void tcg_out_setcond2(TCGContext *s, TCGCond cond, TCGReg ret,
}
#if defined(CONFIG_SOFTMMU)
-
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/ppc/tcg-target.c b/tcg/ppc/tcg-target.c
index 453ab6b..9a73d06 100644
--- a/tcg/ppc/tcg-target.c
+++ b/tcg/ppc/tcg-target.c
@@ -549,8 +549,6 @@ static void add_qemu_ldst_label (TCGContext *s,
label->label_ptr[0] = label_ptr;
}
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/ppc64/tcg-target.c b/tcg/ppc64/tcg-target.c
index 0678de2..ec067e6 100644
--- a/tcg/ppc64/tcg-target.c
+++ b/tcg/ppc64/tcg-target.c
@@ -750,9 +750,6 @@ static void tcg_out_ldsta(TCGContext *s, TCGReg ret, TCGReg addr,
}
#if defined (CONFIG_SOFTMMU)
-
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/s390/tcg-target.c b/tcg/s390/tcg-target.c
index f229f1c..5b556fe 100644
--- a/tcg/s390/tcg-target.c
+++ b/tcg/s390/tcg-target.c
@@ -315,9 +315,6 @@ static const uint8_t tcg_cond_to_ltr_cond[] = {
};
#ifdef CONFIG_SOFTMMU
-
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/sparc/tcg-target.c b/tcg/sparc/tcg-target.c
index 5bfd29c..b1e3fc1 100644
--- a/tcg/sparc/tcg-target.c
+++ b/tcg/sparc/tcg-target.c
@@ -831,8 +831,6 @@ static void tcg_target_qemu_prologue(TCGContext *s)
#if defined(CONFIG_SOFTMMU)
-#include "exec/softmmu_defs.h"
-
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
diff --git a/tcg/tcg.h b/tcg/tcg.h
index f3f9889..eeff6b7 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -21,6 +21,10 @@
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
+
+#ifndef TCG_H
+#define TCG_H
+
#include "qemu-common.h"
/* Target word size (must be identical to pointer size). */
@@ -741,3 +745,42 @@ void tcg_register_jit(void *buf, size_t buf_size);
/* Generate TB finalization at the end of block */
void tcg_out_tb_finalize(TCGContext *s);
#endif
+
+/*
+ * Memory helpers that will be used by TCG generated code.
+ */
+#ifdef CONFIG_SOFTMMU
+uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+uint16_t helper_ret_ldw_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+uint32_t helper_ret_ldl_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+uint64_t helper_ret_ldq_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+
+void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
+ int mmu_idx, uintptr_t retaddr);
+void helper_ret_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
+ int mmu_idx, uintptr_t retaddr);
+void helper_ret_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+ int mmu_idx, uintptr_t retaddr);
+void helper_ret_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+ int mmu_idx, uintptr_t retaddr);
+
+uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+uint16_t helper_ldw_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+uint32_t helper_ldl_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
+
+void helper_stb_mmu(CPUArchState *env, target_ulong addr,
+ uint8_t val, int mmu_idx);
+void helper_stw_mmu(CPUArchState *env, target_ulong addr,
+ uint16_t val, int mmu_idx);
+void helper_stl_mmu(CPUArchState *env, target_ulong addr,
+ uint32_t val, int mmu_idx);
+void helper_stq_mmu(CPUArchState *env, target_ulong addr,
+ uint64_t val, int mmu_idx);
+#endif /* CONFIG_SOFTMMU */
+
+#endif /* TCG_H */
--
1.8.1.4
* [Qemu-devel] [PATCH 6/7] tcg: Introduce zero and sign-extended versions of load helpers
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
` (4 preceding siblings ...)
2013-08-27 21:46 ` [Qemu-devel] [PATCH 5/7] exec: Split softmmu_defs.h Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-27 21:46 ` [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu Richard Henderson
6 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
include/exec/softmmu_template.h | 58 ++++++++++++++++++++++++++++++++---------
tcg/i386/tcg-target.c | 6 ++---
tcg/tcg.h | 21 ++++++++++-----
3 files changed, 64 insertions(+), 21 deletions(-)
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index f9922e2..5bbc56a 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -29,23 +29,39 @@
#if DATA_SIZE == 8
#define SUFFIX q
#define LSUFFIX q
-#define DATA_TYPE uint64_t
+#define SDATA_TYPE int64_t
#elif DATA_SIZE == 4
#define SUFFIX l
#define LSUFFIX l
-#define DATA_TYPE uint32_t
+#define SDATA_TYPE int32_t
#elif DATA_SIZE == 2
#define SUFFIX w
#define LSUFFIX uw
-#define DATA_TYPE uint16_t
+#define SDATA_TYPE int16_t
#elif DATA_SIZE == 1
#define SUFFIX b
#define LSUFFIX ub
-#define DATA_TYPE uint8_t
+#define SDATA_TYPE int8_t
#else
#error unsupported data size
#endif
+#define DATA_TYPE glue(u, SDATA_TYPE)
+
+/* For the benefit of TCG generated code, we want to avoid the complication
+ of ABI-specific return type promotion and always return a value extended
+ to the register size of the host. This is tcg_target_long, except in the
+ case of a 32-bit host and 64-bit data, and for that we always have
+ uint64_t. Don't bother with this widened value for SOFTMMU_CODE_ACCESS. */
+#if defined(SOFTMMU_CODE_ACCESS) || DATA_SIZE == 8
+# define WORD_TYPE DATA_TYPE
+# define USUFFIX SUFFIX
+#else
+# define WORD_TYPE tcg_target_ulong
+# define USUFFIX glue(u, SUFFIX)
+# define SSUFFIX glue(s, SUFFIX)
+#endif
+
#ifdef SOFTMMU_CODE_ACCESS
#define READ_ACCESS_TYPE 2
#define ADDR_READ addr_code
@@ -77,10 +93,10 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
#ifdef SOFTMMU_CODE_ACCESS
static
#endif
-DATA_TYPE
-glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
- target_ulong addr, int mmu_idx,
- uintptr_t retaddr)
+WORD_TYPE
+glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)(CPUArchState *env,
+ target_ulong addr, int mmu_idx,
+ uintptr_t retaddr)
{
int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
@@ -126,9 +142,9 @@ glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
addr2 = addr1 + DATA_SIZE;
/* Note the adjustment at the beginning of the function.
Undo that for the recursion. */
- res1 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)
+ res1 = glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
(env, addr1, mmu_idx, retaddr + GETPC_ADJ);
- res2 = glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)
+ res2 = glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
(env, addr2, mmu_idx, retaddr + GETPC_ADJ);
shift = (addr & (DATA_SIZE - 1)) * 8;
#ifdef TARGET_WORDS_BIGENDIAN
@@ -147,19 +163,33 @@ glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
#endif
haddr = addr + env->tlb_table[mmu_idx][index].addend;
- return glue(glue(ld, LSUFFIX), _raw)((uint8_t *)haddr);
+ /* Note that ldl_raw is defined with type "int". */
+ return (DATA_TYPE) glue(glue(ld, LSUFFIX), _raw)((uint8_t *)haddr);
}
DATA_TYPE
glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
int mmu_idx)
{
- return glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr, mmu_idx,
+ return glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)(env, addr, mmu_idx,
GETRA_EXT());
}
#ifndef SOFTMMU_CODE_ACCESS
+/* Provide signed versions of the load routines as well. We can of course
+ avoid this for 64-bit data, or for 32-bit data on 32-bit host. */
+#if DATA_SIZE * 8 < TCG_TARGET_REG_BITS
+WORD_TYPE
+glue(glue(helper_ret_ld, SSUFFIX), MMUSUFFIX)(CPUArchState *env,
+ target_ulong addr, int mmu_idx,
+ uintptr_t retaddr)
+{
+ return (SDATA_TYPE) glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+ (env, addr, mmu_idx, retaddr);
+}
+#endif
+
static inline void glue(io_write, SUFFIX)(CPUArchState *env,
hwaddr physaddr,
DATA_TYPE val,
@@ -267,3 +297,7 @@ glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
#undef LSUFFIX
#undef DATA_SIZE
#undef ADDR_READ
+#undef WORD_TYPE
+#undef SDATA_TYPE
+#undef USUFFIX
+#undef SSUFFIX
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 4c98cc4..5aee0fa 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1026,9 +1026,9 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)
* int mmu_idx, uintptr_t ra)
*/
static const void * const qemu_ld_helpers[4] = {
- helper_ret_ldb_mmu,
- helper_ret_ldw_mmu,
- helper_ret_ldl_mmu,
+ helper_ret_ldub_mmu,
+ helper_ret_lduw_mmu,
+ helper_ret_ldul_mmu,
helper_ret_ldq_mmu,
};
diff --git a/tcg/tcg.h b/tcg/tcg.h
index eeff6b7..27f6ee5 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -750,15 +750,24 @@ void tcg_out_tb_finalize(TCGContext *s);
* Memory helpers that will be used by TCG generated code.
*/
#ifdef CONFIG_SOFTMMU
-uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
-uint16_t helper_ret_ldw_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
-uint32_t helper_ret_ldl_mmu(CPUArchState *env, target_ulong addr,
- int mmu_idx, uintptr_t retaddr);
+/* Value zero-extended to tcg register size. */
+tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_ret_lduw_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_ret_ldul_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
uint64_t helper_ret_ldq_mmu(CPUArchState *env, target_ulong addr,
int mmu_idx, uintptr_t retaddr);
+/* Value sign-extended to tcg register size. */
+tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_ret_ldsw_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+tcg_target_ulong helper_ret_ldsl_mmu(CPUArchState *env, target_ulong addr,
+ int mmu_idx, uintptr_t retaddr);
+
void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
int mmu_idx, uintptr_t retaddr);
void helper_ret_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
--
1.8.1.4
* [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-27 21:46 [Qemu-devel] [PATCH 0/7] Further tcg ldst improvements Richard Henderson
` (5 preceding siblings ...)
2013-08-27 21:46 ` [Qemu-devel] [PATCH 6/7] tcg: Introduce zero and sign-extended versions of load helpers Richard Henderson
@ 2013-08-27 21:46 ` Richard Henderson
2013-08-28 16:34 ` Richard Henderson
` (2 more replies)
6 siblings, 3 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-27 21:46 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
This does require the fast path always load to the function return
value register, but apparently the loaded value usually needs to be
spilled back to its memory slot anyway so the change in register
does not really change much.
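In outline, the load slow path loses the call-extend-jump sequence (a
sketch; ldX, raddr and %reg are placeholders, compare the hunks below):

# before: call, extend %eax into data_reg per opc, branch back
call  helper_ret_ldX_mmu      # result in %eax/%rax
# ...sign- or zero-extend %eax into data_reg...
jmp   raddr

# after: the signed/unsigned helper extends for us, and its own
# ret returns straight to raddr
mov   $raddr, %reg
push  %reg
jmp   helper_ret_ldX_mmu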
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
tcg/i386/tcg-target.c | 107 ++++++++++++++++++--------------------------------
1 file changed, 39 insertions(+), 68 deletions(-)
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 5aee0fa..b1d05b8 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1025,11 +1025,20 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)
/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
* int mmu_idx, uintptr_t ra)
*/
-static const void * const qemu_ld_helpers[4] = {
+static const void * const qemu_ld_helpers[8] = {
helper_ret_ldub_mmu,
helper_ret_lduw_mmu,
helper_ret_ldul_mmu,
helper_ret_ldq_mmu,
+
+ helper_ret_ldsb_mmu,
+ helper_ret_ldsw_mmu,
+#if TCG_TARGET_REG_BITS == 64
+ helper_ret_ldsl_mmu,
+#else
+ helper_ret_ldul_mmu,
+#endif
+ helper_ret_ldq_mmu,
};
/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
@@ -1473,9 +1482,8 @@ static void add_qemu_ldst_label(TCGContext *s,
static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
{
int opc = l->opc;
- int s_bits = opc & 3;
- TCGReg data_reg;
uint8_t **label_ptr = &l->label_ptr[0];
+ TCGReg retaddr;
/* resolve label address */
*(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
@@ -1500,58 +1508,21 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
ofs += 4;
- tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, (uintptr_t)l->raddr);
+ retaddr = TCG_REG_EAX;
+ tcg_out_movi(s, TCG_TYPE_I32, retaddr, (uintptr_t)l->raddr);
+ tcg_out_st(s, TCG_TYPE_I32, retaddr, TCG_REG_ESP, ofs);
} else {
tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
/* The second argument is already loaded with addrlo. */
tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
l->mem_index);
- tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
- (uintptr_t)l->raddr);
- }
-
- tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
-
- data_reg = l->datalo_reg;
- switch(opc) {
- case 0 | 4:
- tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
- break;
- case 1 | 4:
- tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
- break;
- case 0:
- tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
- break;
- case 1:
- tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
- break;
- case 2:
- tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
- break;
-#if TCG_TARGET_REG_BITS == 64
- case 2 | 4:
- tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
- break;
-#endif
- case 3:
- if (TCG_TARGET_REG_BITS == 64) {
- tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
- } else if (data_reg == TCG_REG_EDX) {
- /* xchg %edx, %eax */
- tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
- tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EAX);
- } else {
- tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
- tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EDX);
- }
- break;
- default:
- tcg_abort();
+ retaddr = tcg_target_call_iarg_regs[3];
+ tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
}
- /* Jump to the code corresponding to next IR of qemu_st */
- tcg_out_jmp(s, (tcg_target_long)l->raddr);
+ /* "Tail call" to the helper, with the return address back inline. */
+ tcg_out_push(s, retaddr);
+ tcg_out_jmp(s, (tcg_target_long)qemu_ld_helpers[opc]);
}
/*
@@ -2125,38 +2096,38 @@ static const TCGTargetOpDef x86_op_defs[] = {
#endif
#if TCG_TARGET_REG_BITS == 64
- { INDEX_op_qemu_ld8u, { "r", "L" } },
- { INDEX_op_qemu_ld8s, { "r", "L" } },
- { INDEX_op_qemu_ld16u, { "r", "L" } },
- { INDEX_op_qemu_ld16s, { "r", "L" } },
- { INDEX_op_qemu_ld32, { "r", "L" } },
- { INDEX_op_qemu_ld32u, { "r", "L" } },
- { INDEX_op_qemu_ld32s, { "r", "L" } },
- { INDEX_op_qemu_ld64, { "r", "L" } },
+ { INDEX_op_qemu_ld8u, { "a", "L" } },
+ { INDEX_op_qemu_ld8s, { "a", "L" } },
+ { INDEX_op_qemu_ld16u, { "a", "L" } },
+ { INDEX_op_qemu_ld16s, { "a", "L" } },
+ { INDEX_op_qemu_ld32, { "a", "L" } },
+ { INDEX_op_qemu_ld32u, { "a", "L" } },
+ { INDEX_op_qemu_ld32s, { "a", "L" } },
+ { INDEX_op_qemu_ld64, { "a", "L" } },
{ INDEX_op_qemu_st8, { "L", "L" } },
{ INDEX_op_qemu_st16, { "L", "L" } },
{ INDEX_op_qemu_st32, { "L", "L" } },
{ INDEX_op_qemu_st64, { "L", "L" } },
#elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
- { INDEX_op_qemu_ld8u, { "r", "L" } },
- { INDEX_op_qemu_ld8s, { "r", "L" } },
- { INDEX_op_qemu_ld16u, { "r", "L" } },
- { INDEX_op_qemu_ld16s, { "r", "L" } },
- { INDEX_op_qemu_ld32, { "r", "L" } },
- { INDEX_op_qemu_ld64, { "r", "r", "L" } },
+ { INDEX_op_qemu_ld8u, { "a", "L" } },
+ { INDEX_op_qemu_ld8s, { "a", "L" } },
+ { INDEX_op_qemu_ld16u, { "a", "L" } },
+ { INDEX_op_qemu_ld16s, { "a", "L" } },
+ { INDEX_op_qemu_ld32, { "a", "L" } },
+ { INDEX_op_qemu_ld64, { "a", "d", "L" } },
{ INDEX_op_qemu_st8, { "cb", "L" } },
{ INDEX_op_qemu_st16, { "L", "L" } },
{ INDEX_op_qemu_st32, { "L", "L" } },
{ INDEX_op_qemu_st64, { "L", "L", "L" } },
#else
- { INDEX_op_qemu_ld8u, { "r", "L", "L" } },
- { INDEX_op_qemu_ld8s, { "r", "L", "L" } },
- { INDEX_op_qemu_ld16u, { "r", "L", "L" } },
- { INDEX_op_qemu_ld16s, { "r", "L", "L" } },
- { INDEX_op_qemu_ld32, { "r", "L", "L" } },
- { INDEX_op_qemu_ld64, { "r", "r", "L", "L" } },
+ { INDEX_op_qemu_ld8u, { "a", "L", "L" } },
+ { INDEX_op_qemu_ld8s, { "a", "L", "L" } },
+ { INDEX_op_qemu_ld16u, { "a", "L", "L" } },
+ { INDEX_op_qemu_ld16s, { "a", "L", "L" } },
+ { INDEX_op_qemu_ld32, { "a", "L", "L" } },
+ { INDEX_op_qemu_ld64, { "a", "d", "L", "L" } },
{ INDEX_op_qemu_st8, { "cb", "L", "L" } },
{ INDEX_op_qemu_st16, { "L", "L", "L" } },
--
1.8.1.4
* Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-27 21:46 ` [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu Richard Henderson
@ 2013-08-28 16:34 ` Richard Henderson
2013-08-29 15:31 ` Paolo Bonzini
2013-08-29 17:06 ` Aurelien Jarno
2 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-28 16:34 UTC (permalink / raw)
To: qemu-devel; +Cc: aurelien
On 08/27/2013 02:46 PM, Richard Henderson wrote:
> - { INDEX_op_qemu_ld8u, { "r", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L" } },
> - { INDEX_op_qemu_ld32u, { "r", "L" } },
> - { INDEX_op_qemu_ld32s, { "r", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L" } },
> + { INDEX_op_qemu_ld32u, { "a", "L" } },
> + { INDEX_op_qemu_ld32s, { "a", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "L" } },
I should have invented a new constraint letter here.
While this works, and is ideal for softmmu, it
unnecessarily penalizes non-softmmu.
r~
* Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-27 21:46 ` [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu Richard Henderson
2013-08-28 16:34 ` Richard Henderson
@ 2013-08-29 15:31 ` Paolo Bonzini
2013-08-29 16:08 ` Richard Henderson
2013-08-29 17:06 ` Aurelien Jarno
2 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-08-29 15:31 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, aurelien
On 27/08/2013 23:46, Richard Henderson wrote:
> This does require the fast path always load to the function return
> value register, but apparently the loaded value usually needs to be
> spilled back to its memory slot anyway so the change in register
> does not really change much.
Even for something like
mov (%rdi), %rax
add (%r8), %rax
? Memory operands should avoid the need to spill anything.
Is this change really an advantage considering the additional icache
footprint of the new helpers?
Paolo
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
> tcg/i386/tcg-target.c | 107 ++++++++++++++++++--------------------------------
> 1 file changed, 39 insertions(+), 68 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index 5aee0fa..b1d05b8 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -1025,11 +1025,20 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)
> /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
> * int mmu_idx, uintptr_t ra)
> */
> -static const void * const qemu_ld_helpers[4] = {
> +static const void * const qemu_ld_helpers[8] = {
> helper_ret_ldub_mmu,
> helper_ret_lduw_mmu,
> helper_ret_ldul_mmu,
> helper_ret_ldq_mmu,
> +
> + helper_ret_ldsb_mmu,
> + helper_ret_ldsw_mmu,
> +#if TCG_TARGET_REG_BITS == 64
> + helper_ret_ldsl_mmu,
> +#else
> + helper_ret_ldul_mmu,
> +#endif
> + helper_ret_ldq_mmu,
> };
>
> /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
> @@ -1473,9 +1482,8 @@ static void add_qemu_ldst_label(TCGContext *s,
> static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
> {
> int opc = l->opc;
> - int s_bits = opc & 3;
> - TCGReg data_reg;
> uint8_t **label_ptr = &l->label_ptr[0];
> + TCGReg retaddr;
>
> /* resolve label address */
> *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
> @@ -1500,58 +1508,21 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
> tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
> ofs += 4;
>
> - tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, (uintptr_t)l->raddr);
> + retaddr = TCG_REG_EAX;
> + tcg_out_movi(s, TCG_TYPE_I32, retaddr, (uintptr_t)l->raddr);
> + tcg_out_st(s, TCG_TYPE_I32, retaddr, TCG_REG_ESP, ofs);
> } else {
> tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
> /* The second argument is already loaded with addrlo. */
> tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
> l->mem_index);
> - tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
> - (uintptr_t)l->raddr);
> - }
> -
> - tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
> -
> - data_reg = l->datalo_reg;
> - switch(opc) {
> - case 0 | 4:
> - tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
> - break;
> - case 1 | 4:
> - tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
> - break;
> - case 0:
> - tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
> - break;
> - case 1:
> - tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
> - break;
> - case 2:
> - tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
> - break;
> -#if TCG_TARGET_REG_BITS == 64
> - case 2 | 4:
> - tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
> - break;
> -#endif
> - case 3:
> - if (TCG_TARGET_REG_BITS == 64) {
> - tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
> - } else if (data_reg == TCG_REG_EDX) {
> - /* xchg %edx, %eax */
> - tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
> - tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EAX);
> - } else {
> - tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
> - tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EDX);
> - }
> - break;
> - default:
> - tcg_abort();
> + retaddr = tcg_target_call_iarg_regs[3];
> + tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
> }
>
> - /* Jump to the code corresponding to next IR of qemu_st */
> - tcg_out_jmp(s, (tcg_target_long)l->raddr);
> + /* "Tail call" to the helper, with the return address back inline. */
> + tcg_out_push(s, retaddr);
> + tcg_out_jmp(s, (tcg_target_long)qemu_ld_helpers[opc]);
> }
>
> /*
> @@ -2125,38 +2096,38 @@ static const TCGTargetOpDef x86_op_defs[] = {
> #endif
>
> #if TCG_TARGET_REG_BITS == 64
> - { INDEX_op_qemu_ld8u, { "r", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L" } },
> - { INDEX_op_qemu_ld32u, { "r", "L" } },
> - { INDEX_op_qemu_ld32s, { "r", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L" } },
> + { INDEX_op_qemu_ld32u, { "a", "L" } },
> + { INDEX_op_qemu_ld32s, { "a", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "L" } },
>
> { INDEX_op_qemu_st8, { "L", "L" } },
> { INDEX_op_qemu_st16, { "L", "L" } },
> { INDEX_op_qemu_st32, { "L", "L" } },
> { INDEX_op_qemu_st64, { "L", "L" } },
> #elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
> - { INDEX_op_qemu_ld8u, { "r", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "r", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "d", "L" } },
>
> { INDEX_op_qemu_st8, { "cb", "L" } },
> { INDEX_op_qemu_st16, { "L", "L" } },
> { INDEX_op_qemu_st32, { "L", "L" } },
> { INDEX_op_qemu_st64, { "L", "L", "L" } },
> #else
> - { INDEX_op_qemu_ld8u, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "r", "L", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "d", "L", "L" } },
>
> { INDEX_op_qemu_st8, { "cb", "L", "L" } },
> { INDEX_op_qemu_st16, { "L", "L", "L" } },
>
* Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-29 15:31 ` Paolo Bonzini
@ 2013-08-29 16:08 ` Richard Henderson
2013-08-29 16:36 ` Paolo Bonzini
0 siblings, 1 reply; 14+ messages in thread
From: Richard Henderson @ 2013-08-29 16:08 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, aurelien
On 08/29/2013 08:31 AM, Paolo Bonzini wrote:
> On 27/08/2013 23:46, Richard Henderson wrote:
>> This does require the fast path always load to the function return
>> value register, but apparently the loaded value usually needs to be
>> spilled back to its memory slot anyway so the change in register
>> does not really change much.
>
> Even for something like
>
> mov (%rdi), %rax
> add (%r8), %rax
>
> ? Memory operands should avoid the need to spill anything.
No, not that kind of spilling. The kind in which subsequent operations
in the TB perform something that requires the register's backing store
within env to be up to date. Thus a "spill" from the register to its memory slot.
I could have used better verbiage, I guess...
In theory, having the load go to a call-saved register instead of the
call-clobbered return value register would mean that (even with updating the
memory backing store) if we had a future read-only use of the register within
the basic block we could re-use the value in the register.
It's just that I never saw that happen in the couple of MB of dumps
I examined. Thus my conclusion that it's rare enough not to worry about it.
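For the record, the re-use being described would look something like
this (a hypothetical sketch; %ebx stands for any call-saved register):

mov   (%rsi), %ebx        # load into a call-saved register
mov   %ebx, 0xd8(%r14)    # update the backing store within env
# ... intervening ops, possibly clobbering %eax ...
mov   %ebx, %edi          # later read-only use: no reload needed

With the value forced into %eax instead, that later use would have to
reload from 0xd8(%r14).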
> Is this change really an advantage considering the additional icache
> footprint of the new helpers?
We *are* getting fractionally smaller TBs, for the low low price of three
functions like
> 00000000005f2df0 <helper_ret_ldsw_mmu>:
> 5f2df0: 48 83 ec 18 sub $0x18,%rsp
> 5f2df4: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
> 5f2dfb: 00 00
> 5f2dfd: 48 89 44 24 08 mov %rax,0x8(%rsp)
> 5f2e02: 31 c0 xor %eax,%eax
> 5f2e04: e8 27 fe ff ff callq 5f2c30 <helper_ret_lduw_mmu>
> 5f2e09: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
> 5f2e0e: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx
> 5f2e15: 00 00
> 5f2e17: 48 0f bf c0 movswq %ax,%rax
> 5f2e1b: 75 05 jne 5f2e22 <helper_ret_ldsw_mmu+0x32>
> 5f2e1d: 48 83 c4 18 add $0x18,%rsp
> 5f2e21: c3 retq
> 5f2e22: e8 79 a7 e1 ff callq 40d5a0 <__stack_chk_fail@plt>
> 5f2e27: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
> 5f2e2e: 00 00
Spend 64 * 3 bytes on new helpers, and save 10 * zillion bytes on the TBs. In
all likelihood we're saving icache rather than using additional.
Where I do think there's cause for treading carefully is wrt Aurelien's
statement "it's the slow path exception, the call-return stack doesn't matter".
Alternatively, given that it *is* the slow path, who cares if the return from
the helper immediately hits a branch, rather than tail-calling back into the
fast path, if the benefit is that the call-return stack is still valid above
the code_gen_buffer after a simple tlb miss?
As an aside, why oh why do we default to -fstack-protector-all? Do we
really need checks in every single function, as opposed to those that actually
do something with arrays? Switch to plain -fstack-protector so we have
> 00000000005a1fd0 <helper_ret_ldsw_mmu>:
> 5a1fd0: 48 83 ec 08 sub $0x8,%rsp
> 5a1fd4: e8 57 fe ff ff callq 5a1e30 <helper_ret_lduw_mmu>
> 5a1fd9: 48 83 c4 08 add $0x8,%rsp
> 5a1fdd: 48 0f bf c0 movswq %ax,%rax
> 5a1fe1: c3 retq
> 5a1fe2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
> 5a1fe9: 1f 84 00 00 00 00 00
and then let's talk about icache savings...
r~
* Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-29 16:08 ` Richard Henderson
@ 2013-08-29 16:36 ` Paolo Bonzini
0 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-08-29 16:36 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, aurelien
On 29/08/2013 18:08, Richard Henderson wrote:
> Where I do think there's cause for treading carefully is wrt Aurelien's
> statement "it's the slow path exception, the call-return stack doesn't matter".
> Alternately, given that it *is* the slow path, who cares if the return from
> the helper immediately hits a branch, rather than tail-calling back into the
> fast path, if the benefit is that the call-return stack is still valid above
> the code_gen_buffer after a simple tlb miss?
Aurelien's comment was that lea+push+jmp is smaller than lea+call+ret,
which I can buy.
I guess it depends more than anything on the hardware implementation
of return branch prediction, and on _how much_ the call-return stack is broken.
PPC's mtlr+b+...+blr and x86's push+jmp+...+ret are quite similar in
this respect, and they beg the same question. After the blr/ret, is the
entire predictor state broken or will the processor simply take a miss
and still keep the remainder of the stack valid? (For x86 it could in
principle see that the stack pointer is lower and thus keep the entries
above it. For PPC it's not that simple since LR is a callee-save
register, but there's probably plenty of tricks and heuristics that can
be employed).
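Spelled out, the two shapes being compared are roughly (a sketch):

# x86, this series:
push  $raddr
jmp   helper
# ... helper body ...
ret             # returns to raddr

# PPC equivalent:
mtlr  r0        # r0 holds raddr
b     helper
# ... helper body ...
blr             # returns to raddr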
>
> As an aside, why why o why do we default to -fstack-protector-all? Do we
> really need checks in every single function, as opposed to those that actually
> do something with arrays? Switch to plain -fstack-protector so we have
>
>> 00000000005a1fd0 <helper_ret_ldsw_mmu>:
>> 5a1fd0: 48 83 ec 08 sub $0x8,%rsp
>> 5a1fd4: e8 57 fe ff ff callq 5a1e30 <helper_ret_lduw_mmu>
>> 5a1fd9: 48 83 c4 08 add $0x8,%rsp
>> 5a1fdd: 48 0f bf c0 movswq %ax,%rax
>> 5a1fe1: c3 retq
>> 5a1fe2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
>> 5a1fe9: 1f 84 00 00 00 00 00
>
> and then let's talk about icache savings...
I think it was simply paranoia + not knowing the difference. Patch
welcome I guess. (And I admit I only skimmed the patches so I didn't
know how small the wrappers were).
Paolo
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-27 21:46 ` [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu Richard Henderson
2013-08-28 16:34 ` Richard Henderson
2013-08-29 15:31 ` Paolo Bonzini
@ 2013-08-29 17:06 ` Aurelien Jarno
2013-08-29 17:43 ` Richard Henderson
2 siblings, 1 reply; 14+ messages in thread
From: Aurelien Jarno @ 2013-08-29 17:06 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel
On Tue, Aug 27, 2013 at 02:46:31PM -0700, Richard Henderson wrote:
> This does require the fast path always load to the function return
> value register, but apparently the loaded value usually needs to be
> spilled back to its memory slot anyway so the change in register
> does not really change much.
My suggestion was to do that only for the stores, not for the loads.
Forcing a single call-clobbered register for the load functions means we
are going to generate a lot more code, both to spill the value currently
in that register and to reload it later.
You might argue that if the value is a global, it has to be saved back
to memory anyway, but spilling it instead of saving it back means it has
to be reloaded later. Here is an example, from the first few TBs of seabios:
0x7fd21c18d241: mov %ebp,%edi 0x7fcade58f241: mov %ebp,%edi
0x7fd21c18d243: mov %ebp,%esi 0x7fcade58f243: mov %ebp,%esi
0x7fd21c18d245: shr $0x7,%edi 0x7fcade58f245: shr $0x7,%edi
0x7fd21c18d248: and $0xfffff001,%esi 0x7fcade58f248: and $0xfffff001,%esi
0x7fd21c18d24e: and $0x1fe0,%edi 0x7fcade58f24e: and $0x1fe0,%edi
0x7fd21c18d254: lea 0x380(%r14,%rdi,1),%rdi 0x7fcade58f254: lea 0x380(%r14,%rdi,1),%rdi
0x7fd21c18d25c: cmp (%rdi),%esi 0x7fcade58f25c: cmp (%rdi),%esi
0x7fd21c18d25e: mov %ebp,%esi 0x7fcade58f25e: mov %ebp,%esi
0x7fd21c18d260: jne 0x7fd21c18d384 0x7fcade58f260: jne 0x7fcade58f3a0
0x7fd21c18d266: add 0x10(%rdi),%rsi 0x7fcade58f266: add 0x10(%rdi),%rsi
0x7fd21c18d26a: movzwl (%rsi),%ebx 0x7fcade58f26a: movzwl (%rsi),%eax
0x7fd21c18d26d: add $0x2,%ebp 0x7fcade58f26d: add $0x2,%ebp
0x7fcade58f270: mov %eax,0x80(%rsp)
Here %eax needs to be spilled as the value is needed for the next
qemu_ld operation.
0x7fd21c18d270: mov %ebp,%edi 0x7fcade58f277: mov %ebp,%edi
0x7fd21c18d272: mov %ebp,%esi 0x7fcade58f279: mov %ebp,%esi
0x7fd21c18d274: shr $0x7,%edi 0x7fcade58f27b: shr $0x7,%edi
0x7fd21c18d277: and $0xfffff003,%esi 0x7fcade58f27e: and $0xfffff003,%esi
0x7fd21c18d27d: and $0x1fe0,%edi 0x7fcade58f284: and $0x1fe0,%edi
0x7fd21c18d283: lea 0x380(%r14,%rdi,1),%rdi 0x7fcade58f28a: lea 0x380(%r14,%rdi,1),%rdi
0x7fd21c18d28b: cmp (%rdi),%esi 0x7fcade58f292: cmp (%rdi),%esi
0x7fd21c18d28d: mov %ebp,%esi 0x7fcade58f294: mov %ebp,%esi
0x7fd21c18d28f: jne 0x7fd21c18d39d 0x7fcade58f296: jne 0x7fcade58f3b2
0x7fd21c18d295: add 0x10(%rdi),%rsi 0x7fcade58f29c: add 0x10(%rdi),%rsi
0x7fd21c18d299: mov (%rsi),%ebp 0x7fcade58f2a0: mov (%rsi),%eax
0x7fd21c18d29b: and $0xffffff,%ebp 0x7fcade58f2a2: and $0xffffff,%eax
0x7fd21c18d2a1: mov %ebp,0xd8(%r14) 0x7fcade58f2a8: mov %eax,0xd8(%r14)
0x7fcade58f2b6: mov %ebp,0xdc(%r14)
And here it has to be reloaded from memory.
0x7fd21c18d2af: mov 0x58(%r14),%ebp 0x7fcade58f2af: mov 0x80(%rsp),%ebp
0x7fd21c18d2a8: mov %ebx,0xdc(%r14) 0x7fcade58f2bd: mov 0x58(%r14),%ebp
Overall it even makes the TB bigger. Since the fast path is what really
matters and is executed far more often than the slow path, we should not
deoptimize the fast path just to get a smaller slow path.
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
> tcg/i386/tcg-target.c | 107 ++++++++++++++++++--------------------------------
> 1 file changed, 39 insertions(+), 68 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index 5aee0fa..b1d05b8 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -1025,11 +1025,20 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)
> /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
> * int mmu_idx, uintptr_t ra)
> */
> -static const void * const qemu_ld_helpers[4] = {
> +static const void * const qemu_ld_helpers[8] = {
> helper_ret_ldub_mmu,
> helper_ret_lduw_mmu,
> helper_ret_ldul_mmu,
> helper_ret_ldq_mmu,
> +
> + helper_ret_ldsb_mmu,
> + helper_ret_ldsw_mmu,
> +#if TCG_TARGET_REG_BITS == 64
> + helper_ret_ldsl_mmu,
> +#else
> + helper_ret_ldul_mmu,
> +#endif
> + helper_ret_ldq_mmu,
> };
>
> /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
> @@ -1473,9 +1482,8 @@ static void add_qemu_ldst_label(TCGContext *s,
> static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
> {
> int opc = l->opc;
> - int s_bits = opc & 3;
> - TCGReg data_reg;
> uint8_t **label_ptr = &l->label_ptr[0];
> + TCGReg retaddr;
>
> /* resolve label address */
> *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
> @@ -1500,58 +1508,21 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
> tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, l->mem_index);
> ofs += 4;
>
> - tcg_out_sti(s, TCG_TYPE_I32, TCG_REG_ESP, ofs, (uintptr_t)l->raddr);
> + retaddr = TCG_REG_EAX;
> + tcg_out_movi(s, TCG_TYPE_I32, retaddr, (uintptr_t)l->raddr);
> + tcg_out_st(s, TCG_TYPE_I32, retaddr, TCG_REG_ESP, ofs);
> } else {
> tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
> /* The second argument is already loaded with addrlo. */
> tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
> l->mem_index);
> - tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
> - (uintptr_t)l->raddr);
> - }
> -
> - tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
> -
> - data_reg = l->datalo_reg;
> - switch(opc) {
> - case 0 | 4:
> - tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
> - break;
> - case 1 | 4:
> - tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
> - break;
> - case 0:
> - tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
> - break;
> - case 1:
> - tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
> - break;
> - case 2:
> - tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
> - break;
> -#if TCG_TARGET_REG_BITS == 64
> - case 2 | 4:
> - tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
> - break;
> -#endif
> - case 3:
> - if (TCG_TARGET_REG_BITS == 64) {
> - tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
> - } else if (data_reg == TCG_REG_EDX) {
> - /* xchg %edx, %eax */
> - tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
> - tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EAX);
> - } else {
> - tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
> - tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EDX);
> - }
> - break;
> - default:
> - tcg_abort();
> + retaddr = tcg_target_call_iarg_regs[3];
> + tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
> }
>
> - /* Jump to the code corresponding to next IR of qemu_st */
> - tcg_out_jmp(s, (tcg_target_long)l->raddr);
> + /* "Tail call" to the helper, with the return address back inline. */
> + tcg_out_push(s, retaddr);
> + tcg_out_jmp(s, (tcg_target_long)qemu_ld_helpers[opc]);
> }
>
> /*
> @@ -2125,38 +2096,38 @@ static const TCGTargetOpDef x86_op_defs[] = {
> #endif
>
> #if TCG_TARGET_REG_BITS == 64
> - { INDEX_op_qemu_ld8u, { "r", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L" } },
> - { INDEX_op_qemu_ld32u, { "r", "L" } },
> - { INDEX_op_qemu_ld32s, { "r", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L" } },
> + { INDEX_op_qemu_ld32u, { "a", "L" } },
> + { INDEX_op_qemu_ld32s, { "a", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "L" } },
>
> { INDEX_op_qemu_st8, { "L", "L" } },
> { INDEX_op_qemu_st16, { "L", "L" } },
> { INDEX_op_qemu_st32, { "L", "L" } },
> { INDEX_op_qemu_st64, { "L", "L" } },
> #elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
> - { INDEX_op_qemu_ld8u, { "r", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "r", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "d", "L" } },
>
> { INDEX_op_qemu_st8, { "cb", "L" } },
> { INDEX_op_qemu_st16, { "L", "L" } },
> { INDEX_op_qemu_st32, { "L", "L" } },
> { INDEX_op_qemu_st64, { "L", "L", "L" } },
> #else
> - { INDEX_op_qemu_ld8u, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld8s, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld16u, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld16s, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld32, { "r", "L", "L" } },
> - { INDEX_op_qemu_ld64, { "r", "r", "L", "L" } },
> + { INDEX_op_qemu_ld8u, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld8s, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld16u, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld16s, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld32, { "a", "L", "L" } },
> + { INDEX_op_qemu_ld64, { "a", "d", "L", "L" } },
>
> { INDEX_op_qemu_st8, { "cb", "L", "L" } },
> { INDEX_op_qemu_st16, { "L", "L", "L" } },
> --
> 1.8.1.4
>
>
>
--
Aurelien Jarno GPG: 1024D/F1BCDB73
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
2013-08-29 17:06 ` Aurelien Jarno
@ 2013-08-29 17:43 ` Richard Henderson
0 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-08-29 17:43 UTC (permalink / raw)
To: Aurelien Jarno; +Cc: qemu-devel
On 08/29/2013 10:06 AM, Aurelien Jarno wrote:
> On Tue, Aug 27, 2013 at 02:46:31PM -0700, Richard Henderson wrote:
>> This does require the fast path always load to the function return
>> value register, but apparently the loaded value usually needs to be
>> spilled back to its memory slot anyway so the change in register
>> does not really change much.
>
> My suggestion was to only do that for the store, not for the load.
Ah, ok. Fair enough.
Shall I include the tail call for stores only if it saves code space,
given any possible concern about the call-return stack?
I'm thinking here of PPC, which is 100% space neutral on this,
exchanging one B for one MTLR in order to implement the tail call.
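(A sketch of that PPC trade-off, with register names illustrative only:

    # branch back inline:          # tail call:
    bl   helper_ret_st_mmu         mtlr r_raddr
    b    raddr                     b    helper_ret_st_mmu

Either way it's two instructions; the tail call merely swaps the B back to
the fast path for an MTLR, so the helper's blr returns straight to raddr.)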
r~
^ permalink raw reply [flat|nested] 14+ messages in thread