From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vladimir Prus
Subject: Re: [Qemu-devel] SH: Fix movca.l/ocbi emulation.
Date: Wed, 1 Apr 2009 00:58:06 +0400
References: <200812112134.30560.vladimir@codesourcery.com>
 <20090114204511.GA1973@edgar.se.axis.com>
In-Reply-To: <20090114204511.GA1973@edgar.se.axis.com>
MIME-Version: 1.0
Content-Type: Multipart/Mixed; boundary="Boundary-00=_fPo0JvWEeJSFcBl"
Message-Id: <200904010058.07573.vladimir@codesourcery.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: "Edgar E. Iglesias"
Cc: Shin-ichiro KAWASAKI, qemu-devel@nongnu.org

--Boundary-00=_fPo0JvWEeJSFcBl
Content-Type: Text/Plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Wednesday 14 January 2009 23:45:11 Edgar E. Iglesias wrote:
> On Thu, Dec 11, 2008 at 09:34:30PM +0300, Vladimir Prus wrote:
> >
> > This patch fixes the emulation of the movca.l and ocbi
> > instructions. movca.l is documented to allocate a cache
> > line, and write into it. ocbi invalidates a cache line.
> > So, given code like:
> >
> >     asm volatile("movca.l r0, @%0\n\t"
> >                  "movca.l r0, @%1\n\t"
> >                  "ocbi @%0\n\t"
> >                  "ocbi @%1" : :
> >                  "r" (a0), "r" (a1));
> >
> > nothing is actually written to memory. Code like this can be
> > found in arch/sh/mm/cache-sh4.c and is used to flush the cache.
> >
> > Current QEMU implements movca.l the same way as an ordinary move,
> > and the code above just corrupts memory. Doing full cache emulation
> > is out of the question, so this patch implements a hack. Stores
> > done by movca.l are delayed. If we execute an instruction that is
> > neither movca.l nor ocbi, we flush all pending stores. If we execute
> > ocbi, we look at the pending stores and delete any store to the
> > invalidated address. This appears to work fairly well in practice.
> >
> > - Volodya
>
> Hello and thanks for the patch.
>
> If you wonder why the late sudden interest in this, see the following
> thread :)
> http://lists.gnu.org/archive/html/qemu-devel/2009-01/msg00460.html

Hi Edgar,
apologies for the belated reply.

> I've got a few comments on your patch. Let's start with the general ones.
>
> It would be good if the patch handled the cases where the data cache gets
> disabled or put into writethrough mode. There are also memory areas that
> are not cacheable (a TLB feature). Supporting the latter can get messy
> for very little win; IMO it is not needed.

I am afraid I might not be the best person to accurately catch all cases
where the cache might be enabled or disabled. KAWASAKI-san has posted a
function that can be used to check if caching is enabled, in:

    http://article.gmane.org/gmane.comp.emulators.qemu/36348

Does that one address your suggestion? If so, I can incorporate that
function, and the check for it.

> In the other thread discussing this issue, I raised concerns about the
> delayed stores being issued on insns where they normally wouldn't be.
> Let me give you an example of the kind of thing I'm thinking of.
>
> movca
> ---     <-- Page boundary, broken TB.
> mov     <-- I-TLB happens to miss. SH jumps to the TLB miss handler.
> xxx     <-- The I-TLB refill handler's first insn issues the delayed
>             store, and the delayed store happens to fault!
>
> That scenario could cause a D-TLB fault inside an I-TLB refill handler.
> Some archs don't handle that. If it's a real problem for sh (which I
> believe it is), you could turn the tables and make the movca into a
> load + store, saving the overwritten memory value. ocbi could then just
> bring back the old state. I believe that approach would not break the
> exception rules of sh.

I think your approach will indeed be more reliable, so I've reworked my
patch accordingly.

> > +void helper_do_stores(void)
> > +{
> > +    store_request_t *current = env->store_requests;
> > +
> > +    while (current)
> > +    {
> > +        uint32_t a = current->address, v = current->value;
> > +        store_request_t *next = current->next;
> > +        free(current);
> > +        env->store_requests = current = next;
> > +        if (current == 0)
> > +            env->store_request_tail = &(env->store_requests);
> > +
> > +        stl_data(a, v);
>
> Shouldn't you remove the delayed store record _after_ stl() returns?
> Otherwise you might lose delayed stores in case of TLB exceptions, or?

I recall that I used this ordering deliberately, but I no longer remember
exactly why. Anyway, this is not an issue after the patch is revised as
you suggested above.

> >
> > +    }
> > +}
> > +
> > +void helper_ocbi(uint32_t address)
> > +{
> > +    store_request_t **current = &(env->store_requests);
> > +    while (*current)
> > +    {
> > +        if ((*current)->address == address)
> > +        {
>
> Nitpick:
> It may not be very significant, but you can easily catch a few more
> movca+ocbi use cases by not comparing line offsets. I.e., just mask off
> the last five bits of the compared addresses to zap all recorded movca's
> made to the same cache line the ocbi is invalidating.

I did so.
> > static void _decode_opc(DisasContext * ctx)
> > {
> > +    /* This code tries to make movcal emulation sufficiently
> > +       accurate for Linux purposes. This instruction writes
> > +       memory, and prior to that, always allocates a cache line.
> > +       It is used in two contexts:
> > +       - in memcpy, where data is copied in blocks; the first write
> > +         to a block uses movca.l. I presume this is because writing
> > +         all data into cache, and then having the data sent into
> > +         memory later, via the store buffer, is faster than, in the
> > +         case of a write-through cache configuration, waiting for
> > +         the memory write on each store.
>
> I don't think this is the reason for using movca. In fact, according to
> the documentation I've got, movca behaves like a normal mov for caches in
> writethrough mode. I believe the reason is to avoid the line fetch in case
> the store from a normal mov misses the cache in writeback mode. Anyway,
> there's no reason to speculate; IMO we can just remove the comment on why
> and just leave the note about it being used for fast block copies.

OK.

I am attaching a revised patch -- what do you think?

- Volodya

--Boundary-00=_fPo0JvWEeJSFcBl
Content-Type: text/x-patch; charset="UTF-8"; name="movcal-ocbi.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="movcal-ocbi.diff"

commit 1902e15049a9179ced0c924c5ce7c43f69e74001
Author: Vladimir Prus
Date:   Tue Nov 11 12:29:02 2008 +0300

    Fix movcal.l/ocbi emulation.

    * target-sh4/cpu.h (memory_content_t): New.
      (CPUSH4State): New fields movcal_backup and movcal_backup_tail.
    * target-sh4/helper.h (helper_movcal)
      (helper_discard_movcal_backup, helper_ocbi): New.
    * target-sh4/op_helper.c (helper_movcal)
      (helper_discard_movcal_backup, helper_ocbi): New.
    * target-sh4/translate.c (DisasContext): New field has_movcal.
      (sh4_defs): Update CVR for SH7785.
      (cpu_sh4_init): Initialize env->movcal_backup_tail.
      (_decode_opc): Discard movca.l backup. Make use of helper_movcal
      and helper_ocbi.
      (gen_intermediate_code_internal): Initialize has_movcal to 1.

diff --git a/cpu-exec.c b/cpu-exec.c
index cf7c1fb..b0f0b5e 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -174,7 +174,7 @@ static inline TranslationBlock *tb_find_fast(void)
 
     /* we record a subset of the CPU state. It will always be the
        same before a given translated block is executed. */
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     tb = env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
diff --git a/target-sh4/cpu.h b/target-sh4/cpu.h
index e9eee2c..67741f6 100644
--- a/target-sh4/cpu.h
+++ b/target-sh4/cpu.h
@@ -100,6 +100,12 @@ enum sh_features {
     SH_FEATURE_BCR3_AND_BCR4 = 2,
 };
 
+typedef struct memory_content_t {
+    uint32_t address;
+    uint32_t value;
+    struct memory_content_t *next;
+} memory_content_t;
+
 typedef struct CPUSH4State {
     int id;                     /* CPU model */
@@ -149,6 +155,8 @@ typedef struct CPUSH4State {
     tlb_t itlb[ITLB_SIZE];      /* instruction translation table */
     void *intc_handle;
     int intr_at_halt;           /* SR_BL ignored during sleep */
+    memory_content_t *movcal_backup;
+    memory_content_t **movcal_backup_tail;
 } CPUSH4State;
 
 CPUSH4State *cpu_sh4_init(const char *cpu_model);
@@ -294,16 +302,19 @@ static inline void cpu_pc_from_tb(CPUState *env, TranslationBlock *tb)
     env->flags = tb->flags;
 }
 
+#define TB_FLAG_PENDING_MOVCA  (1 << 4)
+
 static inline void cpu_get_tb_cpu_state(CPUState *env, target_ulong *pc,
                                         target_ulong *cs_base, int *flags)
 {
     *pc = env->pc;
     *cs_base = 0;
     *flags = (env->flags & (DELAY_SLOT | DELAY_SLOT_CONDITIONAL
-                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
-            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
+                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
+            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
             | (env->sr & (SR_MD | SR_RB))                      /* Bits 29-30 */
-            | (env->sr & SR_FD);                               /* Bit 15 */
+            | (env->sr & SR_FD)                                /* Bit 15 */
+            | (env->movcal_backup ? TB_FLAG_PENDING_MOVCA : 0); /* Bit 4 */
 }
 
 #endif                          /* _CPU_SH4_H */
diff --git a/target-sh4/helper.h b/target-sh4/helper.h
index e665185..4b2fcdd 100644
--- a/target-sh4/helper.h
+++ b/target-sh4/helper.h
@@ -9,6 +9,10 @@ DEF_HELPER_0(debug, void)
 DEF_HELPER_1(sleep, void, i32)
 DEF_HELPER_1(trapa, void, i32)
 
+DEF_HELPER_2(movcal, void, i32, i32)
+DEF_HELPER_0(discard_movcal_backup, void)
+DEF_HELPER_1(ocbi, void, i32)
+
 DEF_HELPER_2(addv, i32, i32, i32)
 DEF_HELPER_2(addc, i32, i32, i32)
 DEF_HELPER_2(subv, i32, i32, i32)
diff --git a/target-sh4/op_helper.c b/target-sh4/op_helper.c
index 84e1ad3..e6b71a0 100644
--- a/target-sh4/op_helper.c
+++ b/target-sh4/op_helper.c
@@ -18,6 +18,7 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA 02110-1301 USA
  */
 #include <assert.h>
+#include <stdlib.h>
 #include "exec.h"
 #include "helper.h"
@@ -122,6 +123,55 @@ void helper_trapa(uint32_t tra)
     cpu_loop_exit();
 }
 
+void helper_movcal(uint32_t address, uint32_t value)
+{
+    memory_content_t *r = (memory_content_t *) malloc(sizeof(memory_content_t));
+    r->address = address;
+    r->value = value;
+    r->next = NULL;
+
+    *(env->movcal_backup_tail) = r;
+    env->movcal_backup_tail = &(r->next);
+}
+
+void helper_discard_movcal_backup(void)
+{
+    memory_content_t *current = env->movcal_backup;
+
+    while (current)
+    {
+        memory_content_t *next = current->next;
+        free(current);
+        env->movcal_backup = current = next;
+        if (current == 0)
+            env->movcal_backup_tail = &(env->movcal_backup);
+    }
+}
+
+void helper_ocbi(uint32_t address)
+{
+    memory_content_t **current = &(env->movcal_backup);
+    while (*current)
+    {
+        uint32_t a = (*current)->address;
+        if ((a & ~0x1F) == (address & ~0x1F))
+        {
+            memory_content_t *next = (*current)->next;
+
+            stl_data(a, (*current)->value);
+
+            if (next == 0)
+            {
+                env->movcal_backup_tail = current;
+            }
+
+            free(*current);
+            *current = next;
+            break;
+        }
+    }
+}
+
 uint32_t helper_addc(uint32_t arg0, uint32_t arg1)
 {
     uint32_t tmp0, tmp1;
diff --git a/target-sh4/translate.c b/target-sh4/translate.c
index cc9f886..4ced176 100644
--- a/target-sh4/translate.c
+++ b/target-sh4/translate.c
@@ -50,6 +50,7 @@ typedef struct DisasContext {
     uint32_t delayed_pc;
     int singlestep_enabled;
     uint32_t features;
+    int has_movcal;
 } DisasContext;
 
 #if defined(CONFIG_USER_ONLY)
@@ -283,6 +284,7 @@ CPUSH4State *cpu_sh4_init(const char *cpu_model)
     env = qemu_mallocz(sizeof(CPUSH4State));
     env->features = def->features;
     cpu_exec_init(env);
+    env->movcal_backup_tail = &(env->movcal_backup);
     sh4_translate_init();
     env->cpu_model_str = cpu_model;
     cpu_sh4_reset(env);
@@ -495,6 +497,37 @@ static inline void gen_store_fpr64 (TCGv_i64 t, int reg)
 
 static void _decode_opc(DisasContext * ctx)
 {
+    /* This code tries to make movcal emulation sufficiently
+       accurate for Linux purposes.  This instruction writes
+       memory, and prior to that, always allocates a cache line.
+       It is used in two contexts:
+       - in memcpy, where data is copied in blocks; the first write
+         to a block uses movca.l for performance.
+       - in arch/sh/mm/cache-sh4.c, the movca.l + ocbi combination is
+         used to flush the cache.  Here, the data written by movca.l is
+         never written to memory, and the data written is just bogus.
+
+       To simulate this, when we see movca.l we store the value to memory,
+       but we also remember the previous content.  If we see ocbi, we check
+       whether movca.l for that address was done previously.  If so, the
+       write should not have hit the memory, so we restore the previous
+       content.  When we see an instruction that is neither movca.l
+       nor ocbi, the previous content is discarded.
+
+       To optimize, we only try to flush stores when we're at the start of
+       TB, or if we already saw movca.l in this TB and did not flush stores
+       yet.  */
+    if (ctx->has_movcal)
+    {
+        int opcode = ctx->opcode & 0xf0ff;
+        if (opcode != 0x0093 /* ocbi */
+            && opcode != 0x00c3 /* movca.l */)
+        {
+            gen_helper_discard_movcal_backup();
+            ctx->has_movcal = 0;
+        }
+    }
+
 #if 0
     fprintf(stderr, "Translating opcode 0x%04x\n", ctx->opcode);
 #endif
@@ -1545,7 +1578,13 @@ static void _decode_opc(DisasContext * ctx)
         }
         return;
     case 0x00c3:                /* movca.l R0,@Rm */
-        tcg_gen_qemu_st32(REG(0), REG(B11_8), ctx->memidx);
+        {
+            TCGv val = tcg_temp_new();
+            tcg_gen_qemu_ld32u(val, REG(B11_8), ctx->memidx);
+            gen_helper_movcal(REG(B11_8), val);
+            tcg_gen_qemu_st32(REG(0), REG(B11_8), ctx->memidx);
+        }
+        ctx->has_movcal = 1;
         return;
     case 0x40a9:
         /* MOVUA.L @Rm,R0
           (Rm) -> R0
@@ -1594,9 +1633,7 @@ static void _decode_opc(DisasContext * ctx)
         break;
     case 0x0093:                /* ocbi @Rn */
         {
-            TCGv dummy = tcg_temp_new();
-            tcg_gen_qemu_ld32s(dummy, REG(B11_8), ctx->memidx);
-            tcg_temp_free(dummy);
+            gen_helper_ocbi(REG(B11_8));
         }
         return;
     case 0x00a3:                /* ocbp @Rn */
@@ -1876,6 +1913,7 @@ gen_intermediate_code_internal(CPUState * env, TranslationBlock * tb,
     ctx.tb = tb;
     ctx.singlestep_enabled = env->singlestep_enabled;
     ctx.features = env->features;
+    ctx.has_movcal = (tb->flags & TB_FLAG_PENDING_MOVCA);
 
 #ifdef DEBUG_DISAS
     qemu_log_mask(CPU_LOG_TB_CPU,

--Boundary-00=_fPo0JvWEeJSFcBl--