From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1LMX2A-0004ID-SK
	for qemu-devel@nongnu.org; Mon, 12 Jan 2009 19:21:30 -0500
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1LMX27-0004Fb-UO
	for qemu-devel@nongnu.org; Mon, 12 Jan 2009 19:21:29 -0500
Received: from [199.232.76.173] (port=36565 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1LMX27-0004FA-LQ
	for qemu-devel@nongnu.org; Mon, 12 Jan 2009 19:21:27 -0500
Received: from vsmtp04.dti.ne.jp ([202.216.231.139]:56709)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <kawasaki@juno.dti.ne.jp>) id 1LMX27-0002gW-0j
	for qemu-devel@nongnu.org; Mon, 12 Jan 2009 19:21:27 -0500
Message-ID: <496BDE7E.4040002@juno.dti.ne.jp>
Date: Tue, 13 Jan 2009 09:21:18 +0900
From: Shin-ichiro KAWASAKI <kawasaki@juno.dti.ne.jp>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [PATCH 2/3] sh: movca.l cancel by ocbi
References: <4969B880.10600@juno.dti.ne.jp>
	<20090111142221.GG26952@edgar.se.axis.com>
	<496AD162.1040808@juno.dti.ne.jp>
In-Reply-To: <496AD162.1040808@juno.dti.ne.jp>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Shin-ichiro KAWASAKI <kawasaki@juno.dti.ne.jp>
Cc: qemu-devel@nongnu.org, "Edgar E. Iglesias" <edgar.iglesias@axis.com>

I've completely passed over Vladimir-san's nice patch on movca-ocbi,
even though it is already in qemu-sh staging repository.

http://lists.gnu.org/archive/html/qemu-devel/2008-12/msg00605.html

But now, I can review the patch.  I'll do it later.

Regards,
Shin-ichiro KAWASAKI


Shin-ichiro KAWASAKI wrote:
> Thank you for your commmnts!
> 
> Edgar E. Iglesias wrote:
>> On Sun, Jan 11, 2009 at 06:14:40PM +0900, Shin-ichiro KAWASAKI wrote:
>>> Current sh4's "movca.l" instruction is implemented in a same way
>>> as "mov.l". Then, write to memory cannot be canceled by "ocbi"
>>> cache control instrunction.  This makes text area broken on
>>> cache flush by linux kernel.
>>>
>>> This patch delays "movca.l" execution and provide the chance to 
>>> cancel it for "ocbi".
>>> # Thank you Edgar, for your advice!
>>>
>>> This patch does,
>>> - on executing "movca.l", just records what and where movca   should 
>>> store data.
>>> - on executing "ocbi", find the corresponding record by movca,
>>>   and delete it to cancel.
>>> - lets TCG produce "delayed_movca" instruction at the first
>>>   instruction which is neither "movca.l" nor "ocbi".
>>> - on executing "delayed_movca", does the data store task,
>>>   according to the record.
>>
>> Hello,
>>
>> I think your patch will catch the linux dflush sequence but I've got a
>> few comments.
>>
>> There is a complication with the delayed stores and potential mmu
>> exceptions that they might generate. I think with this approach
>> you'll take those faults at the next instruction which might not
>> even be a load/store. Not sure if that can cause problems for sh.
>> Maybe you could avoid that by making movca a load + store and the
>> ocbi a store with the original value loaded at movca. But you'll
>> still get into trouble with exceptions that might break the
>> movca/ocbi sequence.
> That's an idea. Current implementation is "delay and do-or-cancel" way.
> But your idea is "do and commit-or-rollback" way.
> I'll try to implement it to avoid mmu fault by movca.
> The mmu fault by movca will cause invoke exception handler, and break
> movca/ocbi sequence.  It will be a quite rare case, however, I'll consider
> it for next version.
> 
>> I don't it's worth having a dynamically sized movca delay buffer.
>> The movca insn has a restricted addressing mode making it a bit hard
>> to use back-to-back without alu insns in between. IIU your code
>> correctly you flush the buffer before every non movca/ocbi insn so I
>> guess that for most realistic use-cases you'll only need very few
>> (maybe only 1) entries in the delay buffer.
> That is a point I wavered.
> I investigated linux kernel and found following lines in
> "__flush_dcache_segment_4way()" in "arch/sh/mm/cache-sh4.c".
> 
>     a0 = base_addr;
>     a1 = a0 + way_incr;
>     a2 = a1 + way_incr;
>     a3 = a2 + way_incr;
>     a0e = base_addr + extent_per_way;
>     do {
>         asm volatile("ldc %0, sr" : : "r" (sr_with_bl));
>         asm volatile("movca.l r0, @%0\n\t"
>                  "movca.l r0, @%1\n\t"
>                  "movca.l r0, @%2\n\t"
>                  "movca.l r0, @%3\n\t"
>                  "ocbi @%0\n\t"
>                  "ocbi @%1\n\t"
>                  "ocbi @%2\n\t"
>                  "ocbi @%3\n\t" : :
>                  "r" (a0), "r" (a1), "r" (a2), "r" (a3));
> 
> Multiple movca.l insns executed consequently, and they access different 
> lines.
> At least, 4 entries are needed for it.  This code shows kernel 
> implementation
> decides how many number of movcas are invoked consequently.
> That's the reason why I introudcued dynamic size change. 
> 
>>> Signed-off-by: Shin-ichiro KAWASAKI <kawasaki@juno.dti.ne.jp>
>>>
>>> Index: trunk/target-sh4/op_helper.c
>>> ===================================================================
>>> --- trunk/target-sh4/op_helper.c    (revision 6133)
>>> +++ trunk/target-sh4/op_helper.c    (working copy)
>>> @@ -20,6 +20,7 @@
>>>  #include <assert.h>
>>>  #include "exec.h"
>>>  #include "helper.h"
>>> +void *qemu_mallocz(size_t size);
>>>  
>>>  #ifndef CONFIG_USER_ONLY
>>>  
>>> @@ -604,3 +605,57 @@
>>>      d.ll = t0;
>>>      return float64_to_int32_round_to_zero(d.d, &env->fp_status);
>>>  }
>>> +
>>> +void helper_movca(uint32_t val, uint32_t addr)
>>> +{
>>> +    delayed_movca_t *cur = &env->movca_list;
>>> +    delayed_movca_t *prev = NULL;
>>> +    while (cur) {
>>> +    if (!cur->valid) {
>>
>> Might be overkill but you can merge here, i.e:
>>
>>     if (!cur->valid || cur->addr == addr) {
>>
>>
>>> +        cur->valid = 1;
>>> +        cur->value = val;
>>> +        cur->addr = addr;
>>> +        return;
>>> +    }
>>> +    prev = cur;
>>> +    cur = cur->next;
>>> +    }
>>> +
>>> +    /* movca entry shortage. allocate it. */
>>> +    prev->next = cur = qemu_mallocz(sizeof(delayed_movca_t));
>>> +    if (cur == NULL) {
>>> +    printf("out of memory for delayed movca. @%08x\n", addr);
>>> +    return;
>>> +    }
>>> +    cur->valid = 1;
>>> +    cur->value = val;
>>> +    cur->addr = addr;
>>> +}
>>> +
>>> +void helper_ocbi(uint32_t addr)
>>> +{
>>> +    delayed_movca_t *cur = &env->movca_list;
>>> +
>>> +    while (cur) {
>>> +    if (cur->valid && cur->addr == addr) { /* found! */
>>> +        cur->valid = 0;
>>> +        return;
>>> +    }
>>> +    cur = cur->next;
>>> +    }
>>
>> I think you can quite easy catch a few more movca use cases if you
>> dont compare line offsets, i.e for sh4 you'd mask the addr & ~(32 - 1).
>> Also, you should not return until you've scanned the entire movca delay
>> buffer as there might be multiple stores to the same line that need to be
>> ignored.
>>
>> example:
>>     addr &= ~31;
>>     while (cur) {
>>        if (cur->valid && (cur->addr & ~31) == addr)
>>            cur->valid = 0;
>>        cur = cur->next;
>>     }
> 
> That's a good advice.  I'll try to implement it for next version.
> 
> Thanks again for your review!
> 
> Regards,
> Shin-ichiro KAWASAKI
>