From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vladimir Prus
Subject: Re: [Qemu-devel] SH: Fix movca.l/ocbi emulation.
Date: Wed, 1 Apr 2009 00:58:06 +0400
References: <200812112134.30560.vladimir@codesourcery.com>
 <20090114204511.GA1973@edgar.se.axis.com>
In-Reply-To: <20090114204511.GA1973@edgar.se.axis.com>
MIME-Version: 1.0
Content-Type: Multipart/Mixed; boundary="Boundary-00=_fPo0JvWEeJSFcBl"
Message-Id: <200904010058.07573.vladimir@codesourcery.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: "Edgar E. Iglesias"
Cc: Shin-ichiro KAWASAKI, qemu-devel@nongnu.org

--Boundary-00=_fPo0JvWEeJSFcBl
Content-Type: Text/Plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Wednesday 14 January 2009 23:45:11 Edgar E. Iglesias wrote:
> On Thu, Dec 11, 2008 at 09:34:30PM +0300, Vladimir Prus wrote:
> >
> > This patch fixes the emulation of the movca.l and ocbi
> > instructions. movca.l is documented to allocate a cache
> > line, and write into it. ocbi invalidates a cache line.
> > So, given code like:
> >
> >     asm volatile("movca.l r0, @%0\n\t"
> >                  "movca.l r0, @%1\n\t"
> >                  "ocbi @%0\n\t"
> >                  "ocbi @%1" : :
> >                  "r" (a0), "r" (a1));
> >
> > nothing is actually written to memory. Code like this can be
> > found in arch/sh/mm/cache-sh4.c and is used to flush the cache.
> >
> > Current QEMU implements movca.l the same way as an ordinary move,
> > and the code above just corrupts memory. Doing full cache emulation
> > is out of the question, so this patch implements a hack. Stores
> > done by movca.l are delayed. If we execute an instruction that is
> > neither movca.l nor ocbi, we flush all pending stores. If we execute
> > ocbi, we look at the pending stores and delete any store to the
> > invalidated address. This appears to work fairly well in practice.
> >
> > - Volodya
>
> Hello and thanks for the patch.
>
> If you wonder why the late sudden interest in this, see the following
> thread :)
> http://lists.gnu.org/archive/html/qemu-devel/2009-01/msg00460.html

Hi Edgar,
apologies for the belated reply.

> I've got a few comments on your patch. Let's start with the general ones.
>
> It would be good if the patch handled the cases where the data cache gets
> disabled or put into writethrough mode. There are also memory areas that
> are not cacheable (a TLB feature). Supporting the latter can get messy
> for very little win; IMO it is not needed.

I am afraid I might not be the best person to accurately catch all cases
where the cache might be enabled or disabled. KAWASAKI-san has posted a
function that can be used to check if caching is enabled, in:

    http://article.gmane.org/gmane.comp.emulators.qemu/36348

Does that one address your suggestion? If so, I can incorporate that
function, and the check for it.

> In the other thread discussing this issue, I raised concerns about the
> delayed stores being issued on insns where they normally wouldn't be.
> Let me give you an example of the kind of thing I'm thinking of.
>
> movca
> ---     <-- Page boundary, broken TB.
> mov     <-- I-TLB happens to miss. SH jumps to the TLB miss handler.
> xxx     <-- The I-TLB refill handler's first insn issues the delayed
>             store, and the delayed store happens to fault!
>
> That scenario could cause a D-TLB fault inside an I-TLB refill handler.
> Some archs don't handle that. If it's a real problem for sh (which I
> believe it is), you could turn the tables and make the movca into a
> load + store, saving the overwritten memory value. ocbi could then just
> bring back the old state. I believe that approach would not break the
> exception rules of sh.

I think your approach will indeed be more reliable, so I've reworked my
patch accordingly.

> > +void helper_do_stores(void)
> > +{
> > +    store_request_t *current = env->store_requests;
> > +
> > +    while (current)
> > +    {
> > +        uint32_t a = current->address, v = current->value;
> > +        store_request_t *next = current->next;
> > +        free(current);
> > +        env->store_requests = current = next;
> > +        if (current == 0)
> > +            env->store_request_tail = &(env->store_requests);
> > +
> > +        stl_data(a, v);
>
> Shouldn't you remove the delayed store record _after_ stl() returns?
> Otherwise you might lose delayed stores in case of TLB exceptions, or?

I recall that I used this ordering deliberately, but I no longer remember
exactly why. Anyway, this is not an issue after the patch is revised as
you suggested above.

> >
> > +    }
> > +}
> > +
> > +void helper_ocbi(uint32_t address)
> > +{
> > +    store_request_t **current = &(env->store_requests);
> > +    while (*current)
> > +    {
> > +        if ((*current)->address == address)
> > +        {
>
> Nitpick:
> It may not be very significant, but you can easily catch a few more
> movca+ocbi use cases by not comparing line offsets. I.e., just mask off
> the last five bits of the compared addresses to zap all recorded movca's
> made to the same cache line the ocbi is invalidating.

I did so.
> > static void _decode_opc(DisasContext * ctx)
> > {
> > +    /* This code tries to make movcal emulation sufficiently
> > +       accurate for Linux purposes. This instruction writes
> > +       memory, and prior to that, always allocates a cache line.
> > +       It is used in two contexts:
> > +       - in memcpy, where data is copied in blocks; the first write
> > +         to a block uses movca.l. I presume this is because writing
> > +         all data into cache, and then having the data sent into
> > +         memory later, via the store buffer, is faster than, in the
> > +         case of a write-through cache configuration, waiting for
> > +         the memory write on each store.
>
> I don't think this is the reason for using movca. In fact, according to
> the documentation I've got, movca behaves like a normal mov for caches in
> writethrough mode. I believe the reason is to avoid the line fetch in case
> the store from a normal mov misses the cache in writeback mode. Anyway,
> there's no reason to speculate; IMO we can just remove the comment on why
> and just leave the note about it being used for fast block copies.

OK.

I am attaching a revised patch -- what do you think?

- Volodya

--Boundary-00=_fPo0JvWEeJSFcBl
Content-Type: text/x-patch; charset="UTF-8"; name="movcal-ocbi.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="movcal-ocbi.diff"

commit 1902e15049a9179ced0c924c5ce7c43f69e74001
Author: Vladimir Prus
Date:   Tue Nov 11 12:29:02 2008 +0300

    Fix movcal.l/ocbi emulation.

    * target-sh4/cpu.h (memory_content_t): New.
      (CPUSH4State): New fields movcal_backup and movcal_backup_tail.
    * target-sh4/helper.h (helper_movcal)
      (helper_discard_movcal_backup, helper_ocbi): New.
    * target-sh4/op_helper.c (helper_movcal)
      (helper_discard_movcal_backup, helper_ocbi): New.
    * target-sh4/translate.c (DisasContext): New field has_movcal.
      (sh4_defs): Update CVR for SH7785.
      (cpu_sh4_init): Initialize env->movcal_backup_tail.
      (_decode_opc): Discard movca.l backup. Make use of helper_movcal
      and helper_ocbi.
      (gen_intermediate_code_internal): Initialize has_movcal to 1.

diff --git a/cpu-exec.c b/cpu-exec.c
index cf7c1fb..b0f0b5e 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -174,7 +174,7 @@ static inline TranslationBlock *tb_find_fast(void)
 
     /* we record a subset of the CPU state. It will always be the
        same before a given translated block is executed. */
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     tb = env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
diff --git a/target-sh4/cpu.h b/target-sh4/cpu.h
index e9eee2c..67741f6 100644
--- a/target-sh4/cpu.h
+++ b/target-sh4/cpu.h
@@ -100,6 +100,12 @@ enum sh_features {
     SH_FEATURE_BCR3_AND_BCR4 = 2,
 };
 
+typedef struct memory_content_t {
+    uint32_t address;
+    uint32_t value;
+    struct memory_content_t *next;
+} memory_content_t;
+
 typedef struct CPUSH4State {
     int id;                     /* CPU model */
@@ -149,6 +155,8 @@ typedef struct CPUSH4State {
     tlb_t itlb[ITLB_SIZE];      /* instruction translation table */
     void *intc_handle;
     int intr_at_halt;           /* SR_BL ignored during sleep */
+    memory_content_t *movcal_backup;
+    memory_content_t **movcal_backup_tail;
 } CPUSH4State;
 
 CPUSH4State *cpu_sh4_init(const char *cpu_model);
@@ -294,16 +302,19 @@ static inline void cpu_pc_from_tb(CPUState *env, TranslationBlock *tb)
     env->flags = tb->flags;
 }
 
+#define TB_FLAG_PENDING_MOVCA  (1 << 4)
+
 static inline void cpu_get_tb_cpu_state(CPUState *env, target_ulong *pc,
                                         target_ulong *cs_base, int *flags)
 {
     *pc = env->pc;
     *cs_base = 0;
     *flags = (env->flags & (DELAY_SLOT | DELAY_SLOT_CONDITIONAL
-                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
-            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
+                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
+            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
             | (env->sr & (SR_MD | SR_RB))                      /* Bits 29-30 */
-            | (env->sr & SR_FD);                               /* Bit 15 */
+            | (env->sr & SR_FD)                                /* Bit 15 */
+            | (env->movcal_backup ? TB_FLAG_PENDING_MOVCA : 0); /* Bit 4 */
 }
 
 #endif                          /* _CPU_SH4_H */
diff --git a/target-sh4/helper.h b/target-sh4/helper.h
index e665185..4b2fcdd 100644
--- a/target-sh4/helper.h
+++ b/target-sh4/helper.h
@@ -9,6 +9,10 @@ DEF_HELPER_0(debug, void)
 DEF_HELPER_1(sleep, void, i32)
 DEF_HELPER_1(trapa, void, i32)
 
+DEF_HELPER_2(movcal, void, i32, i32)
+DEF_HELPER_0(discard_movcal_backup, void)
+DEF_HELPER_1(ocbi, void, i32)
+
 DEF_HELPER_2(addv, i32, i32, i32)
 DEF_HELPER_2(addc, i32, i32, i32)
 DEF_HELPER_2(subv, i32, i32, i32)
diff --git a/target-sh4/op_helper.c b/target-sh4/op_helper.c
index 84e1ad3..e6b71a0 100644
--- a/target-sh4/op_helper.c
+++ b/target-sh4/op_helper.c
@@ -18,6 +18,7 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA 02110-1301 USA
  */
 #include <assert.h>
+#include <stdlib.h>
 #include "exec.h"
 #include "helper.h"
@@ -122,6 +123,55 @@ void helper_trapa(uint32_t tra)
     cpu_loop_exit();
 }
 
+void helper_movcal(uint32_t address, uint32_t value)
+{
+    memory_content_t *r = (memory_content_t *) malloc(sizeof(memory_content_t));
+    r->address = address;
+    r->value = value;
+    r->next = NULL;
+
+    *(env->movcal_backup_tail) = r;
+    env->movcal_backup_tail = &(r->next);
+}
+
+void helper_discard_movcal_backup(void)
+{
+    memory_content_t *current = env->movcal_backup;
+
+    while (current)
+    {
+        memory_content_t *next = current->next;
+        free(current);
+        env->movcal_backup = current = next;
+        if (current == 0)
+            env->movcal_backup_tail = &(env->movcal_backup);
+    }
+}
+
+void helper_ocbi(uint32_t address)
+{
+    memory_content_t **current = &(env->movcal_backup);
+    while (*current)
+    {
+        uint32_t a = (*current)->address;
+        if ((a & ~0x1F) == (address & ~0x1F))
+        {
+            memory_content_t *next = (*current)->next;
+
+            stl_data(a, (*current)->value);
+
+            if (next == 0)
+            {
+                env->movcal_backup_tail = current;
+            }
+
+            free(*current);
+            *current = next;
+            break;
+        }
+    }
+}
+
 uint32_t helper_addc(uint32_t arg0, uint32_t arg1)
 {
     uint32_t tmp0, tmp1;
diff --git a/target-sh4/translate.c b/target-sh4/translate.c
index cc9f886..4ced176 100644
--- a/target-sh4/translate.c
+++ b/target-sh4/translate.c
@@ -50,6 +50,7 @@ typedef struct DisasContext {
     uint32_t delayed_pc;
     int singlestep_enabled;
     uint32_t features;
+    int has_movcal;
 } DisasContext;
 
 #if defined(CONFIG_USER_ONLY)
@@ -283,6 +284,7 @@ CPUSH4State *cpu_sh4_init(const char *cpu_model)
     env = qemu_mallocz(sizeof(CPUSH4State));
     env->features = def->features;
     cpu_exec_init(env);
+    env->movcal_backup_tail = &(env->movcal_backup);
     sh4_translate_init();
     env->cpu_model_str = cpu_model;
     cpu_sh4_reset(env);
@@ -495,6 +497,37 @@ static inline void gen_store_fpr64 (TCGv_i64 t, int reg)
 
 static void _decode_opc(DisasContext * ctx)
 {
+    /* This code tries to make movcal emulation sufficiently
+       accurate for Linux purposes.  This instruction writes
+       memory, and prior to that, always allocates a cache line.
+       It is used in two contexts:
+       - in memcpy, where data is copied in blocks; the first write
+         to a block uses movca.l for performance.
+       - in arch/sh/mm/cache-sh4.c, the movca.l + ocbi combination is
+         used to flush the cache.  Here, the data written by movca.l is
+         never written to memory, and the data written is just bogus.
+
+       To simulate this, when we see movca.l we store the value to memory,
+       but we also remember the previous content.  If we see ocbi, we check
+       whether movca.l for that address was done previously.  If so, the
+       write should not have hit the memory, so we restore the previous
+       content.  When we see an instruction that is neither movca.l
+       nor ocbi, the previous content is discarded.
+
+       To optimize, we only try to flush stores when we're at the start of
+       TB, or if we already saw movca.l in this TB and did not flush stores
+       yet.  */
+    if (ctx->has_movcal)
+    {
+        int opcode = ctx->opcode & 0xf0ff;
+        if (opcode != 0x0093 /* ocbi */
+            && opcode != 0x00c3 /* movca.l */)
+        {
+            gen_helper_discard_movcal_backup();
+            ctx->has_movcal = 0;
+        }
+    }
+
 #if 0
     fprintf(stderr, "Translating opcode 0x%04x\n", ctx->opcode);
 #endif
@@ -1545,7 +1578,13 @@ static void _decode_opc(DisasContext * ctx)
         }
         return;
     case 0x00c3:                /* movca.l R0,@Rm */
-        tcg_gen_qemu_st32(REG(0), REG(B11_8), ctx->memidx);
+        {
+            TCGv val = tcg_temp_new();
+            tcg_gen_qemu_ld32u(val, REG(B11_8), ctx->memidx);
+            gen_helper_movcal(REG(B11_8), val);
+            tcg_gen_qemu_st32(REG(0), REG(B11_8), ctx->memidx);
+        }
+        ctx->has_movcal = 1;
         return;
     case 0x40a9:
         /* MOVUA.L @Rm,R0
           (Rm) -> R0
@@ -1594,9 +1633,7 @@ static void _decode_opc(DisasContext * ctx)
         break;
     case 0x0093:                /* ocbi @Rn */
         {
-            TCGv dummy = tcg_temp_new();
-            tcg_gen_qemu_ld32s(dummy, REG(B11_8), ctx->memidx);
-            tcg_temp_free(dummy);
+            gen_helper_ocbi(REG(B11_8));
         }
         return;
     case 0x00a3:                /* ocbp @Rn */
@@ -1876,6 +1913,7 @@ gen_intermediate_code_internal(CPUState * env, TranslationBlock * tb,
     ctx.tb = tb;
     ctx.singlestep_enabled = env->singlestep_enabled;
     ctx.features = env->features;
+    ctx.has_movcal = (tb->flags & TB_FLAG_PENDING_MOVCA);
 
 #ifdef DEBUG_DISAS
     qemu_log_mask(CPU_LOG_TB_CPU,

--Boundary-00=_fPo0JvWEeJSFcBl--