On Wednesday 14 January 2009 23:45:11 Edgar E. Iglesias wrote:
> On Thu, Dec 11, 2008 at 09:34:30PM +0300, Vladimir Prus wrote:
> > 
> > This patch fixes the emulation of movca.l and ocbi
> > instructions. movca.l is documented to allocate a cache
> > like, and write into it. ocbi invalidates a cache line.
> > So, given code like:
> > 
> > 			asm volatile("movca.l r0, @%0\n\t"
> > 			     "movca.l r0, @%1\n\t"
> > 			     "ocbi @%0\n\t"
> > 			     "ocbi @%1" : :
> > 			     "r" (a0), "r" (a1));
> > 
> > Nothing is actually written to memory. Code like this can be
> > found in arch/sh/mm/cache-sh4.c and is used to flush the cache.
> > 
> > Current QEMU implements movca.l the same way as ordinary move,
> > and the code above just corrupts memory. Doing full cache emulation
> > is out of question, so this patch implements a hack. Stores
> > done by movca.l are delayed. If we execute an instructions that is
> > neither movca.l nor ocbi, we flush all pending stores. If we execute
> > ocbi, we look at pending stores and delete a store to the invalidated
> > address. This appears to work fairly well in practice.
> > 
> > - Volodya
> 
> Hello and thanks for the patch.
> 
> If you wonder why the late sudden interest in this, see the following
> thread :)
> http://lists.gnu.org/archive/html/qemu-devel/2009-01/msg00460.html

Hi Edgar,

apologies for belated reply.

> I've got a few comments on your patch. Lets start with the general ones.
> 
> It would be good if the patch handled when the data cache gets disabled or
> put into writethrough mode. There are also memory areas that are not
> cache:able (TLB feature). Supporting the latter can get messy for very
> little win, IMO not needed.

I am afraid I might not be the best person to accurately catch all cases
where cache might be enabled or disabled. KAWASAKI-san has posted a function
that can be used to check if caching is enabled, in:

	http://article.gmane.org/gmane.comp.emulators.qemu/36348

Does that one address your suggestion? If so, I can incorporate that function,
and checking for it.

> In the other thread discussing this issue, I raised conerns about the
> delayed stores beeing issued on insns where they normaly wouldn't. Let me
> give you an example of the kind of thing I'm thinking of.
> 
> movca
> ---             <-- Page boundary, broken TB.
> mov             <-- I TLB happens to miss. SH jumps to TLB miss handler.
> xxx             <-  I TLB refill handlers first insn issues delayed store
>                     and the delayed store happens to fault!
> 
> That scenario could cause a D-TLB fault inside a I-TLB refill handler. Some
> archs don't handle that. If it's a real problem for sh (which I beleive it
> is) you could turn the stake and make the movca into a load + store, saving
> the overwritten memory value. ocbi could then just bring back the old state.
> I beleive that approach would not break the exception rules of sh.

I think your approach will indeed be more reliable, so I've reworked my patch
accordingly.

> > +void helper_do_stores(void)
> > +{
> > +  store_request_t *current = env->store_requests;
> > +
> > +  while(current)
> > +    {
> > +      uint32_t a = current->address, v = current->value;
> > +      store_request_t *next = current->next;
> > +      free (current);
> > +      env->store_requests = current = next;
> > +      if (current == 0)
> > +	env->store_request_tail = &(env->store_requests);
> > +
> > +      stl_data(a, v);
> 
> Shouldn't you remove the delayed store record _after_ stl() returns?
> Otherwise you might loose delayed stores in case of TLB exceptions or?

I recall that I have explicitly used this ordering, but I no longer 
remember exactly why. Anyway, this is not an issue after patch is
revised as you suggested above.
> 
> 
> 
> > +    } 
> > +}
> > +
> > +void helper_ocbi(uint32_t address)
> > +{
> > +  store_request_t **current = &(env->store_requests);
> > +  while (*current)
> > +    {
> > +      if ((*current)->address == address)
> > +	{
> 
> 
> Nitpick:
> It may not be very significant but you can in an easy way catch a few more
> movca+ocbi use cases by not comparing line offsets. I.e, just mask off
> the last five bits from the compared addresses to zap all recorded movca's
> made to the same cache-line ocbi is invalidating.

I did so.


> >  static void _decode_opc(DisasContext * ctx)
> >  {
> > +    /* This code tries to make movcal emulation sufficiently
> > +       accurate for Linux purposes.  This instruction writes
> > +       memory, and prior to that, always allocates a cache line.
> > +       It is used in two contexts:
> > +       - in memcpy, where data is copied in blocks, the first write
> > +       of to a block uses movca.l. I presume this is because writing
> > +       all data into cache, and then having the data sent into memory
> > +       later, via store buffer, is faster than, in case of write-through
> > +       cache configuration, to wait for memory write on each store.
> 
> I dont think this is the reason for using movca. In fact, according to the
> documentation I've got, movca behaves like a normal mov for caches in
> writethrough mode. I beleive the reason is to avoid the line fetch in case
> the store from a normal mov misses the cache in writeback mode. Anyway,
> there's no reason to speculate, IMO we can just remove the comment on why
> and just leave the note about it beeing used for fast block copies.

OK.

I am attaching a revised patch -- what do you think?

- Volodya