From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1LNCc1-0003ZW-SL for qemu-devel@nongnu.org; Wed, 14 Jan 2009 15:45:17 -0500
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1LNCbx-0003Xm-9A for qemu-devel@nongnu.org; Wed, 14 Jan 2009 15:45:17 -0500
Received: from [199.232.76.173] (port=55450 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LNCbw-0003Xd-Pr for qemu-devel@nongnu.org; Wed, 14 Jan 2009 15:45:13 -0500
Received: from bart.se.axis.com ([195.60.68.10]:50242) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1LNCbw-00088J-4A for qemu-devel@nongnu.org; Wed, 14 Jan 2009 15:45:12 -0500
Received: from bart.se.axis.com (bart.se.axis.com [127.0.0.1]) by bart.se.axis.com (Postfix) with ESMTP id 26FDA640F2 for ; Wed, 14 Jan 2009 21:45:11 +0100 (CET)
Received: from axis.com (edgar.se.axis.com [10.93.151.1]) by bart.se.axis.com (Postfix) with ESMTP id 1801D64044 for ; Wed, 14 Jan 2009 21:45:11 +0100 (CET)
Date: Wed, 14 Jan 2009 21:45:11 +0100
From: "Edgar E. Iglesias"
Subject: Re: [Qemu-devel] SH: Fix movca.l/ocbi emulation.
Message-ID: <20090114204511.GA1973@edgar.se.axis.com>
References: <200812112134.30560.vladimir@codesourcery.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200812112134.30560.vladimir@codesourcery.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Vladimir Prus
Cc: Shin-ichiro KAWASAKI , qemu-devel@nongnu.org

On Thu, Dec 11, 2008 at 09:34:30PM +0300, Vladimir Prus wrote:
> 
> This patch fixes the emulation of the movca.l and ocbi
> instructions.  movca.l is documented to allocate a cache
> line and write into it.  ocbi invalidates a cache line.
> So, given code like:
> 
>     asm volatile("movca.l r0, @%0\n\t"
>                  "movca.l r0, @%1\n\t"
>                  "ocbi @%0\n\t"
>                  "ocbi @%1" : :
>                  "r" (a0), "r" (a1));
> 
> nothing is actually written to memory.  Code like this can be
> found in arch/sh/mm/cache-sh4.c and is used to flush the cache.
> 
> Current QEMU implements movca.l the same way as an ordinary move,
> and the code above just corrupts memory.  Doing full cache emulation
> is out of the question, so this patch implements a hack.  Stores
> done by movca.l are delayed.  If we execute an instruction that is
> neither movca.l nor ocbi, we flush all pending stores.  If we execute
> ocbi, we look at the pending stores and delete the store to the
> invalidated address.  This appears to work fairly well in practice.
> 
> - Volodya

Hello and thanks for the patch.

If you wonder why the sudden late interest in this, see the following
thread :)
http://lists.gnu.org/archive/html/qemu-devel/2009-01/msg00460.html

I've got a few comments on your patch.  Let's start with the general
ones.

It would be good if the patch handled the cases where the data cache
gets disabled or put into write-through mode.  There are also memory
areas that are not cacheable (a TLB feature).  Supporting the latter
can get messy for very little win; IMO it's not needed.

In the other thread discussing this issue, I raised concerns about the
delayed stores being issued on insns where they normally wouldn't be.
Let me give you an example of the kind of thing I'm thinking of.

movca
---      <-- Page boundary, broken TB.
mov      <-- I-TLB happens to miss.  SH jumps to the TLB miss handler.
xxx      <-- The I-TLB refill handler's first insn issues the delayed
             store, and the delayed store happens to fault!
That scenario could cause a D-TLB fault inside an I-TLB refill handler.
Some archs don't handle that.  If it's a real problem for sh (which I
believe it is), you could turn things around and make the movca into a
load + store, saving the memory value that gets overwritten.  ocbi could
then just bring back the old state.  I believe that approach would not
break the exception rules of sh.
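Something along these lines is what I have in mind -- only an untested
sketch, and the names (movcal_backup_t, env->movcal_backup, ldl_data)
are made up here for illustration, they are not part of your patch:

typedef struct movcal_backup_t {
    uint32_t address;               /* target of the movca.l store      */
    uint32_t value;                 /* memory contents before the store */
    struct movcal_backup_t *next;
} movcal_backup_t;

void helper_movcal(uint32_t address, uint32_t value)
{
    movcal_backup_t *b = malloc(sizeof(*b));

    b->address = address;
    /* Any fault happens here, before we change any state, so the
       ordinary exception rules still hold.  */
    b->value = ldl_data(address);
    b->next = env->movcal_backup;
    env->movcal_backup = b;

    stl_data(address, value);       /* the store itself is not delayed  */
}

void helper_ocbi(uint32_t address)
{
    movcal_backup_t **p = &env->movcal_backup;

    while (*p) {
        if ((*p)->address == address) {
            movcal_backup_t *old = *p;

            /* Undo the movca.l store: put the saved value back.  */
            stl_data(old->address, old->value);
            *p = old->next;
            free(old);
        } else {
            p = &(*p)->next;
        }
    }
}

The backup list would also have to be dropped (without restoring
anything) whenever a non-movca/ocbi insn executes, just like your
delayed stores are flushed.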
> From 2b28d4213e3b16a93d53eb2e4d522c5824de1647 Mon Sep 17 00:00:00 2001
> From: Vladimir Prus
> Date: Tue, 11 Nov 2008 12:29:02 +0300
> Subject: [PATCH] Fix movcal.l/ocbi emulation.
> To: qemu-devel@nongnu.org
> X-KMail-Transport: CodeSourcery
> X-KMail-Identity: 901867920
> 
>         * target-sh4/cpu.h (store_request_t): New.
>         (CPUSH4State): New fields store_requests and store_request_tail.
>         * target-sh4/helper.h (helper_movcal, helper_do_stores, helper_ocbi): New.
>         * target-sh4/op_helper.c (helper_movcal, helper_do_stores)
>         (helper_ocbi): New.
>         * target-sh4/translate.c (DisasContext): New field has_movcal.
>         (sh4_defs): Update CVS for SH7785.
>         (cpu_sh4_init): Initialize env->store_request_tail.
>         (_decode_opc): Flush pending movca.l-originated stores.
>         Make use of helper_movcal and helper_ocbi.
>         (gen_intermediate_code_internal): Initialize has_movcal to 1.
> ---
>  cpu-exec.c             |    2 +-
>  target-sh4/cpu.h       |   17 +++++++++++++--
>  target-sh4/helper.h    |    4 +++
>  target-sh4/op_helper.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++
>  target-sh4/translate.c |   44 +++++++++++++++++++++++++++++++++++++++---
>  5 files changed, 108 insertions(+), 8 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 9a35a59..64b0845 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -174,7 +174,7 @@ static inline TranslationBlock *tb_find_fast(void)
>      /* we record a subset of the CPU state. It will
>         always be the same before a given translated block
>         is executed. */
> -    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> +    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>      tb = env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
>      if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
>                   tb->flags != flags)) {
> diff --git a/target-sh4/cpu.h b/target-sh4/cpu.h
> index ae434d1..eed3b1b 100644
> --- a/target-sh4/cpu.h
> +++ b/target-sh4/cpu.h
> @@ -93,6 +93,12 @@ enum sh_features {
>      SH_FEATURE_SH4A = 1,
>  };
>  
> +typedef struct store_request_t {
> +    uint32_t address;
> +    uint32_t value;
> +    struct store_request_t *next;
> +} store_request_t;
> +
>  typedef struct CPUSH4State {
>      int id;                     /* CPU model */
>  
> @@ -141,6 +147,8 @@ typedef struct CPUSH4State {
>      tlb_t itlb[ITLB_SIZE];      /* instruction translation table */
>      void *intc_handle;
>      int intr_at_halt;           /* SR_BL ignored during sleep */
> +    store_request_t *store_requests;
> +    store_request_t **store_request_tail;
>  } CPUSH4State;
>  
>  CPUSH4State *cpu_sh4_init(const char *cpu_model);
> @@ -281,16 +289,19 @@ static inline void cpu_pc_from_tb(CPUState *env, TranslationBlock *tb)
>      env->flags = tb->flags;
>  }
>  
> +#define TB_FLAG_PENDING_MOVCA (1 << 4)
> +
>  static inline void cpu_get_tb_cpu_state(CPUState *env, target_ulong *pc,
>                                          target_ulong *cs_base, int *flags)
>  {
>      *pc = env->pc;
>      *cs_base = 0;
>      *flags = (env->flags & (DELAY_SLOT | DELAY_SLOT_CONDITIONAL
> -                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
> -            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
> +                    | DELAY_SLOT_TRUE | DELAY_SLOT_CLEARME))   /* Bits  0- 3 */
> +            | (env->fpscr & (FPSCR_FR | FPSCR_SZ | FPSCR_PR))  /* Bits 19-21 */
>              | (env->sr & (SR_MD | SR_RB))                      /* Bits 29-30 */
> -            | (env->sr & SR_FD);                               /* Bit 15 */
> +            | (env->sr & SR_FD)                                /* Bit 15 */
> +            | (env->store_requests ? TB_FLAG_PENDING_MOVCA : 0); /* Bit 4 */
>  }
>  
>  #endif /* _CPU_SH4_H */
> diff --git a/target-sh4/helper.h b/target-sh4/helper.h
> index 631e7e1..d995688 100644
> --- a/target-sh4/helper.h
> +++ b/target-sh4/helper.h
> @@ -9,6 +9,10 @@ DEF_HELPER_0(debug, void)
>  DEF_HELPER_1(sleep, void, i32)
>  DEF_HELPER_1(trapa, void, i32)
>  
> +DEF_HELPER_2(movcal, void, i32, i32)
> +DEF_HELPER_0(do_stores, void)
> +DEF_HELPER_1(ocbi, void, i32)
> +
>  DEF_HELPER_2(addv, i32, i32, i32)
>  DEF_HELPER_2(addc, i32, i32, i32)
>  DEF_HELPER_2(subv, i32, i32, i32)
> diff --git a/target-sh4/op_helper.c b/target-sh4/op_helper.c
> index 6352219..b4982b0 100644
> --- a/target-sh4/op_helper.c
> +++ b/target-sh4/op_helper.c
> @@ -122,6 +122,55 @@ void helper_trapa(uint32_t tra)
>      cpu_loop_exit();
>  }
>  
> +void helper_movcal(uint32_t address, uint32_t value)
> +{
> +    store_request_t *r = (store_request_t *)malloc (sizeof(store_request_t));
> +    r->address = address;
> +    r->value = value;
> +    r->next = NULL;
> +
> +    *(env->store_request_tail) = r;
> +    env->store_request_tail = &(r->next);
> +}
> +
> +void helper_do_stores(void)
> +{
> +    store_request_t *current = env->store_requests;
> +
> +    while(current)
> +    {
> +        uint32_t a = current->address, v = current->value;
> +        store_request_t *next = current->next;
> +        free (current);
> +        env->store_requests = current = next;
> +        if (current == 0)
> +            env->store_request_tail = &(env->store_requests);
> +
> +        stl_data(a, v);

Shouldn't you remove the delayed store record _after_ stl() returns?
Otherwise you might lose delayed stores in the case of TLB exceptions,
no?
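Something like this (just a sketch on top of your patch, untested)
would keep the record alive until the store has actually gone through:

void helper_do_stores(void)
{
    store_request_t *current;

    while ((current = env->store_requests) != NULL) {
        /* stl_data() may longjmp out on a TLB miss.  The request is
           still linked in, so the store gets retried once the guest
           has handled the exception.  */
        stl_data(current->address, current->value);

        /* Only now is it safe to drop the record.  */
        env->store_requests = current->next;
        if (env->store_requests == NULL)
            env->store_request_tail = &(env->store_requests);
        free(current);
    }
}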
> +    }
> +}
> +
> +void helper_ocbi(uint32_t address)
> +{
> +    store_request_t **current = &(env->store_requests);
> +    while (*current)
> +    {
> +        if ((*current)->address == address)
> +        {

Nitpick: it may not be very significant, but you can easily catch a few
more movca+ocbi use cases by not comparing the line offsets.  I.e., just
mask off the low five bits of the compared addresses, so that every
recorded movca made to the cache line that ocbi is invalidating gets
zapped.

> +            store_request_t *next = (*current)->next;
> +
> +            if (next == 0)
> +            {
> +                env->store_request_tail = current;
> +            }
> +
> +            free (*current);
> +            *current = next;
> +            break;
> +        }
> +    }
> +}
> +
>  uint32_t helper_addc(uint32_t arg0, uint32_t arg1)
>  {
>      uint32_t tmp0, tmp1;
> diff --git a/target-sh4/translate.c b/target-sh4/translate.c
> index ba9db14..949cb06 100644
> --- a/target-sh4/translate.c
> +++ b/target-sh4/translate.c
> @@ -50,6 +50,7 @@ typedef struct DisasContext {
>      uint32_t delayed_pc;
>      int singlestep_enabled;
>      uint32_t features;
> +    int has_movcal;
>  } DisasContext;
>  
>  #if defined(CONFIG_USER_ONLY)
> @@ -278,6 +279,7 @@ CPUSH4State *cpu_sh4_init(const char *cpu_model)
>          return NULL;
>      env->features = def->features;
>      cpu_exec_init(env);
> +    env->store_request_tail = &(env->store_requests);
>      sh4_translate_init();
>      env->cpu_model_str = cpu_model;
>      cpu_sh4_reset(env);
> @@ -490,6 +492,40 @@ static inline void gen_store_fpr64 (TCGv_i64 t, int reg)
>  
>  static void _decode_opc(DisasContext * ctx)
>  {
> +    /* This code tries to make movcal emulation sufficiently
> +       accurate for Linux purposes.  This instruction writes
> +       memory, and prior to that, always allocates a cache line.
> +       It is used in two contexts:
> +       - in memcpy, where data is copied in blocks, the first write
> +         to a block uses movca.l.  I presume this is because writing
> +         all data into cache, and then having the data sent into memory
> +         later, via store buffer, is faster than, in case of write-through
> +         cache configuration, to wait for memory write on each store.

I don't think this is the reason for using movca.  In fact, according to
the documentation I've got, movca behaves like a normal mov for caches
in write-through mode.  I believe the reason is to avoid the line fetch
when the store from a normal mov misses the cache in write-back mode.

Anyway, there's no reason to speculate.  IMO we can just remove the
comment on why and leave the note about it being used for fast block
copies.

> +       - in arch/sh/mm/cache-sh4.c, the movcal.l + ocbi combination is
> +         used to flush the cache.  Here, the data written by movcal.l is
> +         never written to memory, and the data written is just bogus.

Agreed.

Cheers