From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vladimir Prus
Date: Thu, 11 Dec 2008 21:34:30 +0300
MIME-Version: 1.0
Content-Type: Multipart/Mixed;
  boundary="Boundary-00=_20VQJxpzlItFJRx"
Message-Id: <200812112134.30560.vladimir@codesourcery.com>
Subject: [Qemu-devel] SH: Fix movca.l/ocbi emulation.
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org

--Boundary-00=_20VQJxpzlItFJRx
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

This patch fixes the emulation of the movca.l and ocbi instructions.
movca.l is documented to allocate a cache line and write into it.
ocbi invalidates a cache line. So, given code like:

    asm volatile("movca.l r0, @%0\n\t"
                 "movca.l r0, @%1\n\t"
                 "ocbi @%0\n\t"
                 "ocbi @%1" : :
                 "r" (a0), "r" (a1));

nothing is actually written to memory. Code like this can be found in
arch/sh/mm/cache-sh4.c and is used to flush the cache.

Current QEMU implements movca.l the same way as an ordinary move, so
the code above just corrupts memory. Doing full cache emulation is out
of the question, so this patch implements a hack. Stores done by
movca.l are delayed. If we execute an instruction that is neither
movca.l nor ocbi, we flush all pending stores. If we execute ocbi, we
look at the pending stores and delete a store to the invalidated
address. This appears to work fairly well in practice.

- Volodya

--Boundary-00=_20VQJxpzlItFJRx
Content-Type: text/x-diff;
  charset="iso 8859-15";
  name="0004-Fix-movcal.l-ocbi-emulation.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
  filename="0004-Fix-movcal.l-ocbi-emulation.patch"

From 2b28d4213e3b16a93d53eb2e4d522c5824de1647 Mon Sep 17 00:00:00 2001
From: Vladimir Prus
Date: Tue, 11 Nov 2008 12:29:02 +0300
Subject: [PATCH] Fix movcal.l/ocbi emulation.
To: qemu-devel@nongnu.org
X-KMail-Transport: CodeSourcery
X-KMail-Identity: 901867920

    * target-sh4/cpu.h (store_request_t): New.
    (CPUSH4State): New fields store_requests and store_request_tail.
    * target-sh4/helper.h (helper_movcal, helper_do_stores, helper_ocbi): New.
    * target-sh4/op_helper.c (helper_movcal, helper_do_stores)
    (helper_ocbi): New.
    * target-sh4/translate.c (DisasContext): New field has_movcal.
    (sh4_defs): Update CVS for SH7785.
    (cpu_sh4_init): Initialize env->store_request_tail.
    (_decode_opc): Flush pending movca.l-originated stores.
    Make use of helper_movcal and helper_ocbi.
    (gen_intermediate_code_internal): Initialize has_movcal to 1.
---
 cpu-exec.c             |    2 +-
 target-sh4/cpu.h       |   17 +++++++++++++--
 target-sh4/helper.h    |    4 +++
 target-sh4/op_helper.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++
 target-sh4/translate.c |   44 +++++++++++++++++++++++++++++++++++++++---
 5 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 9a35a59..64b0845 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -174,7 +174,7 @@ static inline TranslationBlock *tb_find_fast(void)
     /* we record a subset of the CPU state. It will
        always be the same before a given translated block
        is executed. */
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     tb = env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
diff --git a/target-sh4/cpu.h b/target-sh4/cpu.h
index ae434d1..eed3b1b 100644
--- a/target-sh4/cpu.h
+++ b/target-sh4/cpu.h
@@ -93,6 +93,12 @@ enum sh_features {
     SH_FEATURE_SH4A = 1,
 };
 
+typedef struct store_request_t {
+    uint32_t address;
+    uint32_t value;
+    struct store_request_t *next;
+} store_request_t;
+
 typedef struct CPUSH4State {
     int id;                     /* CPU model */
 
@@ -141,6 +147,8 @@ typedef struct CPUSH4State {
     tlb_t itlb[ITLB_SIZE];      /* instruction translation table */
     void *intc_handle;
     int intr_at_halt;           /* SR_BL ignored during sleep */
+    store_request_t *store_requests;
+    store_request_t **store_request_tail;
 } CPUSH4State;
 
 CPUSH4State *cpu_sh4_init(const char *cpu_model);
@@ -281,16 +289,19 @@ static inline void cpu_pc_from_tb(CPUState *env, TranslationBlock *tb)
     env->flags = tb->flags;
 }
 
+#define TB_FLAG_PENDING_MOVCA (1 << 4)
+
 static inline void cpu_get_tb_cpu_state(CPUState *env, target_ulong *pc,
                                         target_ulong *cs_base, int *flags)
 {
     *pc = env->pc;
     *cs_base = 0;
     *flags = (env->flags & (DELAY_SLOT | DELAY_SLOT_CONDITIONAL
-                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
-            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
+                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
+            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
             | (env->sr & (SR_MD | SR_RB))                      /* Bits 29-30 */
-            | (env->sr & SR_FD);                               /* Bit 15 */
+            | (env->sr & SR_FD)                                /* Bit 15 */
+            | (env->store_requests ? TB_FLAG_PENDING_MOVCA : 0); /* Bit 4 */
 }
 
 #endif                          /* _CPU_SH4_H */
diff --git a/target-sh4/helper.h b/target-sh4/helper.h
index 631e7e1..d995688 100644
--- a/target-sh4/helper.h
+++ b/target-sh4/helper.h
@@ -9,6 +9,10 @@ DEF_HELPER_0(debug, void)
 DEF_HELPER_1(sleep, void, i32)
 DEF_HELPER_1(trapa, void, i32)
 
+DEF_HELPER_2(movcal, void, i32, i32)
+DEF_HELPER_0(do_stores, void)
+DEF_HELPER_1(ocbi, void, i32)
+
 DEF_HELPER_2(addv, i32, i32, i32)
 DEF_HELPER_2(addc, i32, i32, i32)
 DEF_HELPER_2(subv, i32, i32, i32)
diff --git a/target-sh4/op_helper.c b/target-sh4/op_helper.c
index 6352219..b4982b0 100644
--- a/target-sh4/op_helper.c
+++ b/target-sh4/op_helper.c
@@ -122,6 +122,55 @@ void helper_trapa(uint32_t tra)
     cpu_loop_exit();
 }
 
+void helper_movcal(uint32_t address, uint32_t value)
+{
+  store_request_t *r = (store_request_t *)malloc (sizeof(store_request_t));
+  r->address = address;
+  r->value = value;
+  r->next = NULL;
+
+  *(env->store_request_tail) = r;
+  env->store_request_tail = &(r->next);
+}
+
+void helper_do_stores(void)
+{
+  store_request_t *current = env->store_requests;
+
+  while(current)
+    {
+      uint32_t a = current->address, v = current->value;
+      store_request_t *next = current->next;
+      free (current);
+      env->store_requests = current = next;
+      if (current == 0)
+        env->store_request_tail = &(env->store_requests);
+
+      stl_data(a, v);
+    }
+}
+
+void helper_ocbi(uint32_t address)
+{
+  store_request_t **current = &(env->store_requests);
+  while (*current)
+    {
+      if ((*current)->address == address)
+        {
+          store_request_t *next = (*current)->next;
+
+          if (next == 0)
+            {
+              env->store_request_tail = current;
+            }
+
+          free (*current);
+          *current = next;
+          break;
+        }
+    }
+}
+
 uint32_t helper_addc(uint32_t arg0, uint32_t arg1)
 {
     uint32_t tmp0, tmp1;
diff --git a/target-sh4/translate.c b/target-sh4/translate.c
index ba9db14..949cb06 100644
--- a/target-sh4/translate.c
+++ b/target-sh4/translate.c
@@ -50,6 +50,7 @@ typedef struct DisasContext {
     uint32_t delayed_pc;
     int singlestep_enabled;
     uint32_t features;
+    int has_movcal;
 } DisasContext;
 
 #if defined(CONFIG_USER_ONLY)
@@ -278,6 +279,7 @@ CPUSH4State *cpu_sh4_init(const char *cpu_model)
         return NULL;
     env->features = def->features;
     cpu_exec_init(env);
+    env->store_request_tail = &(env->store_requests);
     sh4_translate_init();
     env->cpu_model_str = cpu_model;
     cpu_sh4_reset(env);
@@ -490,6 +492,40 @@ static inline void gen_store_fpr64 (TCGv_i64 t, int reg)
 
 static void _decode_opc(DisasContext * ctx)
 {
+    /* This code tries to make movcal emulation sufficiently
+       accurate for Linux purposes.  This instruction writes
+       memory, and prior to that, always allocates a cache line.
+       It is used in two contexts:
+       - in memcpy, where data is copied in blocks, the first write
+       of to a block uses movca.l. I presume this is because writing
+       all data into cache, and then having the data sent into memory
+       later, via store buffer, is faster than, in case of write-through
+       cache configuration, to wait for memory write on each store.
+       - in arch/sh/mm/cache-sh4.c, movcal.l + ocbi combination is used
+       to flush the cache. Here, the data written by movcal.l is never
+       written to memory, and the data written is just bogus.
+
+       To simulate this, we keep a list of store requests initiated
+       by movcal.l, see env->store_requests. movcal.l only adds new entry
+       to this list. When we see an instruction that is neither movca.l
+       nor ocbi, we perform the stores recorded in this list. When we see
+       ocbi, we check if the stores list has the address being invalidated.
+       If so, we remove the address from the list.
+
+       To optimize, we only try to flush stores when we're at the start of
+       TB, or if we already saw movca.l in this TB and did not flush stores
+       yet.  */
+    if (ctx->has_movcal)
+      {
+        int opcode = ctx->opcode & 0xf0ff;
+        if (opcode != 0x0093 /* ocbi */
+            && opcode != 0x00c3 /* movca.l */)
+          {
+            gen_helper_do_stores ();
+            ctx->has_movcal = 0;
+          }
+      }
+
 #if 0
     fprintf(stderr, "Translating opcode 0x%04x\n", ctx->opcode);
 #endif
@@ -1529,7 +1565,8 @@ static void _decode_opc(DisasContext * ctx)
         }
         return;
     case 0x00c3:                /* movca.l R0,@Rm */
-        tcg_gen_qemu_st32(REG(0), REG(B11_8), ctx->memidx);
+        gen_helper_movcal (REG(B11_8), REG(0));
+        ctx->has_movcal = 1;
         return;
     case 0x40a9:
         /* MOVUA.L @Rm,R0 (Rm) -> R0
@@ -1578,9 +1615,7 @@ static void _decode_opc(DisasContext * ctx)
         break;
     case 0x0093:                /* ocbi @Rn */
         {
-            TCGv dummy = tcg_temp_new();
-            tcg_gen_qemu_ld32s(dummy, REG(B11_8), ctx->memidx);
-            tcg_temp_free(dummy);
+            gen_helper_ocbi (REG(B11_8));
         }
         return;
     case 0x00a3:                /* ocbp @Rn */
@@ -1858,6 +1893,7 @@ gen_intermediate_code_internal(CPUState * env, TranslationBlock * tb,
     ctx.tb = tb;
     ctx.singlestep_enabled = env->singlestep_enabled;
     ctx.features = env->features;
+    ctx.has_movcal = (tb->flags & TB_FLAG_PENDING_MOVCA);
 
 #ifdef DEBUG_DISAS
     if (loglevel & CPU_LOG_TB_CPU) {
-- 
1.5.3.5

--Boundary-00=_20VQJxpzlItFJRx--
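
For readers who want to experiment with the delayed-store idea outside QEMU, the following is a minimal standalone sketch of the same scheme: movca.l queues a store, ocbi drops a queued store to the invalidated address, and any other instruction flushes the queue. It is not the code from the patch. The names toy_cpu, toy_movcal, toy_do_stores and toy_ocbi are made up for illustration, and a small array stands in for guest memory in place of stl_data(). One deliberate difference: the list walk in toy_ocbi advances to the next entry when the address does not match, since the loop in helper_ocbi as posted appears not to advance and so would not terminate when the first pending store is to a different address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MEM_WORDS 16        /* hypothetical toy memory size */

typedef struct pending_store {
    uint32_t address;
    uint32_t value;
    struct pending_store *next;
} pending_store;

typedef struct toy_cpu {
    pending_store *head;            /* oldest pending store */
    pending_store **tail;           /* where the next store gets linked */
    uint32_t mem[TOY_MEM_WORDS];    /* stand-in for guest memory */
} toy_cpu;

static void toy_init(toy_cpu *cpu)
{
    int i;

    cpu->head = NULL;
    cpu->tail = &cpu->head;
    for (i = 0; i < TOY_MEM_WORDS; i++)
        cpu->mem[i] = 0;
}

/* movca.l: remember the store instead of performing it.
   Returns 0 on success, -1 if the allocation fails. */
static int toy_movcal(toy_cpu *cpu, uint32_t address, uint32_t value)
{
    pending_store *r = malloc(sizeof(*r));

    if (r == NULL)
        return -1;
    r->address = address;
    r->value = value;
    r->next = NULL;
    *cpu->tail = r;
    cpu->tail = &r->next;
    return 0;
}

/* Any instruction other than movca.l/ocbi: perform the pending
   stores in program order and empty the list. */
static void toy_do_stores(toy_cpu *cpu)
{
    pending_store *cur = cpu->head;

    while (cur) {
        pending_store *next = cur->next;

        cpu->mem[(cur->address / 4) % TOY_MEM_WORDS] = cur->value;
        free(cur);
        cur = next;
    }
    cpu->head = NULL;
    cpu->tail = &cpu->head;
}

/* ocbi: drop the first pending store to 'address', if there is one. */
static void toy_ocbi(toy_cpu *cpu, uint32_t address)
{
    pending_store **link = &cpu->head;

    while (*link) {
        if ((*link)->address == address) {
            pending_store *victim = *link;

            *link = victim->next;
            if (*link == NULL)
                cpu->tail = link;   /* removed the last entry */
            free(victim);
            return;
        }
        link = &(*link)->next;      /* advance to the next entry */
    }
}
```

With this sketch, a movca.l followed by an ocbi to the same address leaves the toy memory untouched, while a movca.l followed by any other instruction (modelled by calling toy_do_stores) writes the value through.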