qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] sh : performance problem
@ 2009-02-26 16:28 Shin-ichiro KAWASAKI
  2009-02-28 19:15 ` Lionel Landwerlin
  2009-03-02 23:58 ` Lionel Landwerlin
  0 siblings, 2 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-02-26 16:28 UTC (permalink / raw)
  To: qemu-devel

Hi, all.

One of the current problems of qemu-sh4 system emulation is performance.
Kernel boot process seems not so slow, but userland process performance
is too bad.  This problem can be seen numerically when you compile simple '.c'
source with empty main().  The result is as follows.

  sh4 : 5.8 [seconds]     (-M r2d + Fedora 6)
  arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)

sh4 is 7 times slower than arm...

 /* I repeated the compile work until the measured time converges. */


Using Oprofile, we can see what qemu doing while the compile work on sh4
system emulation.  Top 20 time consuming functions are as follows.

---------------------------------------------------------------------------
CPU: Core 2, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               app name                 symbol name
38481    25.0550  qemu-system-sh4          qemu-system-sh4          find_tlb_entry
19836    12.9152  anon (tgid:2483 range:0xb2181000-0xb3381000) qemu-system-sh4          (no symbols)
9333      6.0767  qemu-system-sh4          qemu-system-sh4          cpu_sh4_exec
8180      5.3260  qemu-system-sh4          qemu-system-sh4          tcg_reg_alloc_op
7236      4.7114  qemu-system-sh4          qemu-system-sh4          temp_save
7051      4.5909  qemu-system-sh4          qemu-system-sh4          tcg_liveness_analysis
4339      2.8251  libc-2.8.90.so           libc-2.8.90.so           (no symbols)
3414      2.2229  qemu-system-sh4          qemu-system-sh4          tlb_set_page_exec
3017      1.9644  qemu-system-sh4          qemu-system-sh4          decode_opc
2802      1.8244  qemu-system-sh4          qemu-system-sh4          get_physical_address
sa2358      1.5353  qemu-system-sh4          qemu-system-sh4          find_itlb_entry
2215      1.4422  qemu-system-sh4          qemu-system-sh4          cpu_sh4_write_mmaped_utlb_addr
1771      1.1531  qemu-system-sh4          qemu-system-sh4          tcg_gen_code_search_pc
1746      1.1368  qemu-system-sh4          qemu-system-sh4          __ldl_mmu
1663      1.0828  qemu-system-sh4          qemu-system-sh4          cpu_sh4_handle_mmu_fault
1557      1.0138  qemu-system-sh4          qemu-system-sh4          gen_intermediate_code_pc
1328      0.8647  qemu-system-sh4          qemu-system-sh4          tcg_temp_new_internal_i32
1233      0.8028  qemu-system-arm          qemu-system-arm          cpu_arm_exec
1229      0.8002  vmlinux                  vmlinux                  vmi_activate
1046      0.6811  Xvnc4                    Xvnc4                    (no symbols)
---------------------------------------------------------------------------

Most time consuming function is 'find_tlb_entry()', which search the sh4's TLB
entries to find an entry which matches with given address.
If my understanding is right, this search happens when TLB miss happens.
Too many TLB misses causes bad perfomance, I guess.

The actions to solve this problem will be,

 (i)   tune up 'find_tlb_entry()'
 (ii)  reduce TLB miss by expanding page size
 (iii) reduce TLB miss by increase the number of TLB entries virtually,
       more than the real cpu has.

First, I tried (i).  The attached patch introduces binary search.
It shortens the gcc compile time from 5.8 seconds to 4.6 seconds :
it make sh4 system emulation 20% faster.

'find_tlb_entry()' searches unified tlb array with 64 entries, and
instruction tlb array with 4 entries.  This patch focus only on unified
tlb array search.

This patch is rather rough one.  Any advise to brush it up will be appreciated.
I'm going to work on approaches (ii) and (iii).  Advises for them will be
thanked too.

Regards,
Shin-ichiro KAWASAKI


Index: trunk/target-sh4/cpu.h
===================================================================
--- trunk/target-sh4/cpu.h	(revision 6628)
+++ trunk/target-sh4/cpu.h	(working copy)
@@ -137,13 +137,16 @@
     uint32_t intevt;		/* interrupt event register */
 
     uint32_t pvr;		/* Processor Version Register */
     uint32_t prr;		/* Processor Revision Register */
     uint32_t cvr;		/* Cache Version Register */
 
-     CPU_COMMON tlb_t utlb[UTLB_SIZE];	/* unified translation table */
+    CPU_COMMON
+    tlb_t utlb[UTLB_SIZE];	/* unified translation table */
+    tlb_t * sorted_utlb[UTLB_SIZE];
+    uint32_t sorted_utlb_num;
     tlb_t itlb[ITLB_SIZE];	/* instruction translation table */
     void *intc_handle;
     int intr_at_halt;		/* SR_BL ignored during sleep */
 } CPUSH4State;
 
 CPUSH4State *cpu_sh4_init(const char *cpu_model);
Index: trunk/target-sh4/helper.c
===================================================================
--- trunk/target-sh4/helper.c	(revision 6628)
+++ trunk/target-sh4/helper.c	(working copy)
@@ -322,7 +322,7 @@
     if (e == MMU_DTLB_MULTIPLE)
 	e = MMU_ITLB_MULTIPLE;
     else if (e == MMU_DTLB_MISS && update) {
-	e = find_tlb_entry(env, address, env->utlb, UTLB_SIZE, use_asid);
+	e = find_utlb_entry(env, address, use_asid);
 	if (e >= 0) {
 	    tlb_t * ientry;
 	    n = itlb_replacement(env);
@@ -342,15 +342,69 @@
     return e;
 }
 
+static inline int is_utlb_match(tlb_t * e, uint32_t addr)
+{
+    return (e->vpn << 10 & ~(e->size - 1)) == (addr & ~(e->size - 1));
+}
+
 /* Find utlb entry
    Return entry, MMU_DTLB_MISS, MMU_DTLB_MULTIPLE */
 int find_utlb_entry(CPUState * env, target_ulong address, int use_asid)
 {
+    int min = 0;
+    int max = env->sorted_utlb_num - 1;
+    int cur = (min + max) / 2 ;
+    int save;
+    uint8_t asid = env->pteh & 0xff;
+    int ret = MMU_DTLB_MISS;
+
     /* per utlb access */
     increment_urc(env);
 
-    /* Return entry */
-    return find_tlb_entry(env, address, env->utlb, UTLB_SIZE, use_asid);
+    if (env->sorted_utlb_num <= 0)
+	return MMU_DTLB_MISS;
+
+    /* binary search */
+    while (!is_utlb_match(env->sorted_utlb[cur], address)) {
+	if (min >= max)
+	    return MMU_DTLB_MISS;
+	if (((env->sorted_utlb[cur]->vpn << 10)
+	    & ~(env->sorted_utlb[cur]->size - 1)) <
+	    (address & ~(env->sorted_utlb[cur]->size - 1))) {
+	    min = cur + 1;
+	} else {
+	    max = cur - 1;
+	}
+	cur = (min + max) / 2;
+    }
+
+    save = cur;
+
+    /* minus search */
+    do {
+	tlb_t * e = env->sorted_utlb[cur];
+	if (e->sh || (!use_asid) || e->asid == asid) {
+	    if (ret != MMU_DTLB_MISS)
+		return MMU_DTLB_MULTIPLE;
+	    ret = e - &env->utlb[0];
+	}
+	cur--;
+    } while(cur >= 0 && is_utlb_match(env->sorted_utlb[cur], address));
+
+    /* plus search */
+    cur = save + 1;
+    while (cur < env->sorted_utlb_num &&
+	   is_utlb_match(env->sorted_utlb[cur], address)) {
+	tlb_t * e = env->sorted_utlb[cur];
+	if (e->sh || (!use_asid) || e->asid == asid) {
+	    if (ret != MMU_DTLB_MISS)
+		return MMU_DTLB_MULTIPLE;
+	    ret = e - &env->utlb[0];
+	}
+	cur++;
+    } 
+
+    return ret;
 }
 
 /* Match address against MMU
@@ -525,6 +579,63 @@
     return physical;
 }
 
+static void add_entry_to_sorted_utlb(CPUState * env, tlb_t * entry)
+{
+    int i;
+
+    if (env->sorted_utlb_num == 0) {
+	env->sorted_utlb[0] = entry;
+	env->sorted_utlb_num = 1;
+	return;
+    }
+
+    if (entry->vpn <= env->sorted_utlb[0]->vpn) {
+	memmove(&env->sorted_utlb[1],
+		&env->sorted_utlb[0],
+		sizeof(tlb_t *) * env->sorted_utlb_num);
+	env->sorted_utlb[0] = entry;
+	env->sorted_utlb_num++;
+	return;
+    }
+
+    for (i = 0; i < env->sorted_utlb_num - 1; i++) {
+	tlb_t * e1 = env->sorted_utlb[i];
+	tlb_t * e2 = env->sorted_utlb[i + 1];
+	if (e1->vpn <= entry->vpn && entry->vpn <= e2->vpn) {
+	    memmove(&env->sorted_utlb[i + 2],
+		    &env->sorted_utlb[i + 1],
+		    sizeof(tlb_t *) * (env->sorted_utlb_num - i - 1));
+	    env->sorted_utlb[i + 1] = entry;
+	    env->sorted_utlb_num++;
+	    return;
+	}
+    }
+
+#if 0
+    assert(env->sorted_utlb_num < UTLB_SIZE);
+#endif
+    env->sorted_utlb[env->sorted_utlb_num] = entry;
+    env->sorted_utlb_num++;
+}
+
+static void remove_entry_from_sorted_utlb(CPUState * env, tlb_t * entry)
+{
+    int i;
+#if 0
+    assert(env->sorted_utlb_num > 0);
+#endif
+
+    for (i = 0; i < env->sorted_utlb_num; i++) {
+	if (env->sorted_utlb[i] == entry) {
+	    env->sorted_utlb_num--;
+	    memmove(&env->sorted_utlb[i],
+		    &env->sorted_utlb[i + 1],
+		    sizeof(tlb_t *) * (env->sorted_utlb_num - i));
+	    return;
+	}
+    }
+}
+
 void cpu_load_tlb(CPUState * env)
 {
     int n = cpu_mmucr_urc(env->mmucr);
@@ -536,6 +647,8 @@
 	if (!same_tlb_entry_exists(env->itlb, ITLB_SIZE, entry)) {
 	    tlb_flush_page(env, address);
 	}
+
+	remove_entry_from_sorted_utlb(env, entry);
     }
 
     /* Take values into cpu status from registers. */
@@ -568,6 +681,9 @@
     entry->wt   = (uint8_t)cpu_ptel_wt(env->ptel);
     entry->sa   = (uint8_t)cpu_ptea_sa(env->ptea);
     entry->tc   = (uint8_t)cpu_ptea_tc(env->ptea);
+
+    /* add to sorted list */
+    add_entry_to_sorted_utlb(env, entry);
 }
 
 void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
@@ -599,8 +715,12 @@
 		    s->tea = addr;
 		    break;
 	        }
-		if (entry->v && !v)
+		if (entry->v && !v) {
 		    needs_tlb_flush = 1;
+		    remove_entry_from_sorted_utlb(s, entry);
+		} else if (!entry->v && v){
+		    add_entry_to_sorted_utlb(s, entry);
+		}
 		entry->v = v;
 		entry->d = d;
 	        utlb_match_entry = entry;
@@ -635,6 +755,7 @@
 	    if (!same_tlb_entry_exists(s->itlb, ITLB_SIZE, entry)) {
 	        tlb_flush_page(s, address);
 	    }
+	    remove_entry_from_sorted_utlb(s, entry);
 	}
 	entry->asid = asid;
 	entry->vpn = vpn;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-02-26 16:28 [Qemu-devel] sh : performance problem Shin-ichiro KAWASAKI
@ 2009-02-28 19:15 ` Lionel Landwerlin
  2009-03-01  6:05   ` Shin-ichiro KAWASAKI
  2009-03-02 23:58 ` Lionel Landwerlin
  1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-02-28 19:15 UTC (permalink / raw)
  To: qemu-devel

Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
écrit :

Hi Shin-ichiro,

I'm interested by your patch, but it does not apply on top of the r6215
(qemu svn) + qemu-sh patches
( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 

Should I try on the last qemu svn ?

Regards,

-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-02-28 19:15 ` Lionel Landwerlin
@ 2009-03-01  6:05   ` Shin-ichiro KAWASAKI
  2009-03-01 12:46     ` Lionel Landwerlin
  2009-03-02 22:30     ` Lionel Landwerlin
  0 siblings, 2 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-01  6:05 UTC (permalink / raw)
  To: qemu-devel

Hi, Lionel.
Thank you for the response!

Lionel Landwerlin wrote:
> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> écrit :
> 
> Hi Shin-ichiro,
> 
> I'm interested by your patch, but it does not apply on top of the r6215
> (qemu svn) + qemu-sh patches
> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
> 
> Should I try on the last qemu svn ?

Yes.  When I send patch to qemu-devel ml, I make diff against
qemu main repository.  Could you try on it?


Regards,
Shin-ichiro KAWASAKI

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-01  6:05   ` Shin-ichiro KAWASAKI
@ 2009-03-01 12:46     ` Lionel Landwerlin
  2009-03-01 16:10       ` Shin-ichiro KAWASAKI
  2009-03-02 22:30     ` Lionel Landwerlin
  1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-01 12:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Shin-ichiro KAWASAKI

Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
> Hi, Lionel.
> Thank you for the response!
> 
> Lionel Landwerlin wrote:
> > Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> > écrit :
> > 
> > Hi Shin-ichiro,
> > 
> > I'm interested by your patch, but it does not apply on top of the r6215
> > (qemu svn) + qemu-sh patches
> > ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
> > 
> > Should I try on the last qemu svn ?
> 
> Yes.  When I send patch to qemu-devel ml, I make diff against
> qemu main repository.  Could you try on it?

Yes, I will.

Do you know when the qemu-sh patches will be integrated to qemu ?


-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-01 12:46     ` Lionel Landwerlin
@ 2009-03-01 16:10       ` Shin-ichiro KAWASAKI
  2009-03-01 17:39         ` Lionel Landwerlin
  0 siblings, 1 reply; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-01 16:10 UTC (permalink / raw)
  To: Lionel Landwerlin; +Cc: qemu-devel

Lionel Landwerlin wrote:
> Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
>> Hi, Lionel.
>> Thank you for the response!
>>
>> Lionel Landwerlin wrote:
>>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
>>> écrit :
>>>
>>> Hi Shin-ichiro,
>>>
>>> I'm interested by your patch, but it does not apply on top of the r6215
>>> (qemu svn) + qemu-sh patches
>>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
>>>
>>> Should I try on the last qemu svn ?
>> Yes.  When I send patch to qemu-devel ml, I make diff against
>> qemu main repository.  Could you try on it?
> 
> Yes, I will.
> 
> Do you know when the qemu-sh patches will be integrated to qemu ?

Nope.

Sometimes the main developers afford some time for sh kindly, then
some of the patches in the qemu-sh staging repository are already
integrated.  But some patches are left. I guess they are busy on
other CPU archs.

Especially, it's not a good time to apply patches which is not so
serious, because version cutting work is going now.

Regards,
Shin-ichiro KAWASAKI

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-01 16:10       ` Shin-ichiro KAWASAKI
@ 2009-03-01 17:39         ` Lionel Landwerlin
  2009-03-02 14:46           ` Shin-ichiro KAWASAKI
  0 siblings, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-01 17:39 UTC (permalink / raw)
  To: Shin-ichiro KAWASAKI; +Cc: qemu-devel

Le lundi 02 mars 2009 à 01:10 +0900, Shin-ichiro KAWASAKI a écrit :
> Lionel Landwerlin wrote:
> > Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
> >> Hi, Lionel.
> >> Thank you for the response!
> >>
> >> Lionel Landwerlin wrote:
> >>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> >>> écrit :
> >>>
> >>> Hi Shin-ichiro,
> >>>
> >>> I'm interested by your patch, but it does not apply on top of the r6215
> >>> (qemu svn) + qemu-sh patches
> >>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
> >>>
> >>> Should I try on the last qemu svn ?
> >> Yes.  When I send patch to qemu-devel ml, I make diff against
> >> qemu main repository.  Could you try on it?
> > 
> > Yes, I will.
> > 
> > Do you know when the qemu-sh patches will be integrated to qemu ?
> 
> Nope.
> 
> Sometimes the main developers afford some time for sh kindly, then
> some of the patches in the qemu-sh staging repository are already
> integrated.  But some patches are left. I guess they are busy on
> other CPU archs.
> 
> Especially, it's not a good time to apply patches which is not so
> serious, because version cutting work is going now.
> 
> Regards,
> Shin-ichiro KAWASAKI
> 
> 

There is also one thing that is annoying me.

With the algorithm you proposed in your patch, you're not able anymore
to raise the MMU_ITLB_MULTIPLE and MMU_DTLB_MULTIPLE exceptions. This
makes the MMU emulation different from the real processor (even if
thoses cases aren't even handled in Linux and probably others OS).

For example you can probably not map the same physical area with
differents virtual page sizes.

-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-01 17:39         ` Lionel Landwerlin
@ 2009-03-02 14:46           ` Shin-ichiro KAWASAKI
  0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-02 14:46 UTC (permalink / raw)
  To: qemu-devel

Lionel Landwerlin wrote:
> Le lundi 02 mars 2009 à 01:10 +0900, Shin-ichiro KAWASAKI a écrit :
>> Lionel Landwerlin wrote:
>>> Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
>>>> Hi, Lionel.
>>>> Thank you for the response!
>>>>
>>>> Lionel Landwerlin wrote:
>>>>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
>>>>> écrit :
>>>>>
>>>>> Hi Shin-ichiro,
>>>>>
>>>>> I'm interested by your patch, but it does not apply on top of the r6215
>>>>> (qemu svn) + qemu-sh patches
>>>>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
>>>>>
>>>>> Should I try on the last qemu svn ?
>>>> Yes.  When I send patch to qemu-devel ml, I make diff against
>>>> qemu main repository.  Could you try on it?
>>> Yes, I will.
>>>
>>> Do you know when the qemu-sh patches will be integrated to qemu ?
>> Nope.
>>
>> Sometimes the main developers afford some time for sh kindly, then
>> some of the patches in the qemu-sh staging repository are already
>> integrated.  But some patches are left. I guess they are busy on
>> other CPU archs.
>>
>> Especially, it's not a good time to apply patches which is not so
>> serious, because version cutting work is going now.
>>
>> Regards,
>> Shin-ichiro KAWASAKI
>>
>>
> 
> There is also one thing that is annoying me.
> 
> With the algorithm you proposed in your patch, you're not able anymore
> to raise the MMU_ITLB_MULTIPLE and MMU_DTLB_MULTIPLE exceptions. This
> makes the MMU emulation different from the real processor (even if
> thoses cases aren't even handled in Linux and probably others OS).
> 
> For example you can probably not map the same physical area with
> differents virtual page sizes.

Right.  I'll consider on it for next version of the patch.
Thank you!

Regards,
Shin-ichiro KAWASAKI

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-01  6:05   ` Shin-ichiro KAWASAKI
  2009-03-01 12:46     ` Lionel Landwerlin
@ 2009-03-02 22:30     ` Lionel Landwerlin
  2009-03-03 15:32       ` Shin-ichiro KAWASAKI
  1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-02 22:30 UTC (permalink / raw)
  To: qemu-devel

Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
> Hi, Lionel.
> Thank you for the response!
> 
> Lionel Landwerlin wrote:
> > Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> > écrit :
> > 
> > Hi Shin-ichiro,
> > 
> > I'm interested by your patch, but it does not apply on top of the r6215
> > (qemu svn) + qemu-sh patches
> > ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
> > 
> > Should I try on the last qemu svn ?
> 
> Yes.  When I send patch to qemu-devel ml, I make diff against
> qemu main repository.  Could you try on it?
> 
> 

Are you working on a 32 or 64bits host system ?

I experiencing a lot of random behavior with qemu...

>From now I don't a have video anymore...

-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-02-26 16:28 [Qemu-devel] sh : performance problem Shin-ichiro KAWASAKI
  2009-02-28 19:15 ` Lionel Landwerlin
@ 2009-03-02 23:58 ` Lionel Landwerlin
  2009-03-03  0:09   ` Lionel Landwerlin
  2009-03-03 15:46   ` Shin-ichiro KAWASAKI
  1 sibling, 2 replies; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-02 23:58 UTC (permalink / raw)
  To: qemu-devel; +Cc: Shin-ichiro KAWASAKI

Shin-ichiro,

Sorry, but I cannot apply your patch cleanly on the last qemu-svn.

Instead, I would like to try another approach. The patch you proposed to
find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
something like that if I remember) instead here is a patch with a
complexity of O(1).

This patch has the same problem concerning the multiple hit exception
that is not raised anymore, but I think we can handle it with some more
variables in tlb entries.

The patch apply on 

qemu-svn r6215

+

git://git.kernel.org/pub/scm/virt/qemu/lethal/qemu-sh.git patches 

Regards,


-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

[PATCH] SH4: Use O(1) algorithm instead of O(n) for utlb lookup

The current algorithm to know whether a virtual address belongs to the
TLB cache is to parse all the TLB one by one. The complexity of this
algorithm is O(n).

The following patch implements a TLB entry cache, by keeping for each
virtual page number the associated TLB entry. As usual the complexity
gain is balanced by an overall memory comsumption that is about 5Mb.

This patches also introduce a behavior divergence with the real SH4
processor, because no more 'multiple hit exception' is raised, when 2
TLB map the same physical area. This will be fixed in another
iteration of this patch.

Signed-off-by: Lionel Landwerlin <lionel.landwerlin@openwide.fr>
---
 target-sh4/cpu.h       |    8 +++
 target-sh4/helper.c    |  156 ++++++++++++++++++++++++++++++++++--------------
 target-sh4/translate.c |    9 ++-
 3 files changed, 127 insertions(+), 46 deletions(-)

diff --git a/target-sh4/cpu.h b/target-sh4/cpu.h
index 4a9ddea..2b17a93 100644
--- a/target-sh4/cpu.h
+++ b/target-sh4/cpu.h
@@ -153,6 +153,14 @@ typedef struct CPUSH4State {
     int intr_at_halt;		/* SR_BL ignored during sleep */
     store_request_t *store_requests;
     store_request_t **store_request_tail;
+
+#if !defined(CONFIG_USER_ONLY)
+    /* vpn to utlb entry caches (too much space for user emulation) */
+    uint8_t utlbs_1k[4194304]; /* 2^22 => 4 Mb */
+    uint8_t utlbs_4k[1048576]; /* 2^20 => 1 Mb */
+    uint8_t utlbs_64k[65536]; /* 2^16 => 64 Kb */
+    uint8_t utlbs_1m[4096]; /* 2^12 => 4 Kb */
+#endif
 } CPUSH4State;
 
 CPUSH4State *cpu_sh4_init(const char *cpu_model);
diff --git a/target-sh4/helper.c b/target-sh4/helper.c
index 54a3f1f..c9b2ef0 100644
--- a/target-sh4/helper.c
+++ b/target-sh4/helper.c
@@ -240,6 +240,29 @@ static int itlb_replacement(CPUState * env)
     assert(0);
 }
 
+static inline int is_tlb_matching (tlb_t *entry, int use_asid,
+                                   uint8_t asid, target_ulong address)
+{
+    target_ulong start, end;
+
+    /* Invalid entry */
+    if (unlikely(!entry->v))
+        return 0;
+
+    /* Bad ASID */
+    if (unlikely (!entry->sh && use_asid && entry->asid != asid))
+        return 0;
+
+    start = (entry->vpn << 10) & ~(entry->size - 1);
+    end = start + entry->size - 1;
+
+    /* Match */
+    if (likely (address >= start && address <= end))
+        return 1;
+
+    return 0;
+}
+
 /* Find the corresponding entry in the right TLB
    Return entry, MMU_DTLB_MISS or MMU_DTLB_MULTIPLE
 */
@@ -253,37 +276,59 @@ static int find_tlb_entry(CPUState * env, target_ulong address,
 
     asid = env->pteh & 0xff;
 
-    for (i = 0; i < nbtlb; i++) {
-	if (!entries[i].v)
-	    continue;		/* Invalid entry */
-	if (!entries[i].sh && use_asid && entries[i].asid != asid)
-	    continue;		/* Bad ASID */
-#if 0
-	switch (entries[i].sz) {
-	case 0:
-	    size = 1024;	/* 1kB */
-	    break;
-	case 1:
-	    size = 4 * 1024;	/* 4kB */
-	    break;
-	case 2:
-	    size = 64 * 1024;	/* 64kB */
-	    break;
-	case 3:
-	    size = 1024 * 1024;	/* 1MB */
-	    break;
-	default:
-	    assert(0);
-	}
-#endif
-	start = (entries[i].vpn << 10) & ~(entries[i].size - 1);
-	end = start + entries[i].size - 1;
-	if (address >= start && address <= end) {	/* Match */
-	    if (match != MMU_DTLB_MISS)
-		return MMU_DTLB_MULTIPLE;	/* Multiple match */
-	    match = i;
-	}
+    /* In case we are looking at the utlbs. */
+    if (env->utlb == entries)
+    {
+        /* We are most likely to use 4k virtual pages. */
+        i = env->utlbs_4k[address >> 12];
+        if (unlikely (!is_tlb_matching (&entries[i], use_asid,
+                                        asid, address)))
+        {
+            /* Then huge pages */
+            i = env->utlbs_64k[address >> 14];
+            if (unlikely (!is_tlb_matching (&entries[i], use_asid,
+                                            asid, address)))
+            {
+                /* Or *HUGE* pages */
+                i = env->utlbs_1m[address >> 16];
+                if (unlikely (!is_tlb_matching (&entries[i], use_asid,
+                                                asid, address)))
+                {
+                    /* And finally tiny pages (who uses that ??). */
+                    i = env->utlbs_1k[address >> 10];
+                    if (likely (is_tlb_matching (&entries[i], use_asid,
+                                                 asid, address)))
+                        match = i;
+                }
+                else
+                    match = i;
+            }
+            else
+                match = i;
+        }
+        else
+            match = i;
+    }
+    else
+    {
+        /* We keep the traditional o(n) algorithm for itlbs (only 4 of
+         * them). */
+        for (i = 0; i < nbtlb; i++) {
+            if (!entries[i].v)
+                continue;		/* Invalid entry */
+            if (!entries[i].sh && use_asid && entries[i].asid != asid)
+                continue;		/* Bad ASID */
+
+            start = (entries[i].vpn << 10) & ~(entries[i].size - 1);
+            end = start + entries[i].size - 1;
+            if (address >= start && address <= end) {	/* Match */
+                if (match != MMU_DTLB_MISS)
+                    return MMU_DTLB_MULTIPLE;	/* Multiple match */
+                match = i;
+            }
+        }
     }
+
     return match;
 }
 
@@ -403,6 +448,7 @@ static int get_mmu_address(CPUState * env, target_ulong * physical,
 		*prot = (rw == 1)? PAGE_WRITE : PAGE_READ;
 		break;
 	    }
+
 	} else if (n == MMU_DTLB_MISS) {
 	    n = (rw == 1) ? MMU_DTLB_MISS_WRITE :
 		MMU_DTLB_MISS_READ;
@@ -412,8 +458,8 @@ static int get_mmu_address(CPUState * env, target_ulong * physical,
 	*physical = ((matching->ppn << 10) & ~(matching->size - 1)) |
 	    (address & (matching->size - 1));
 	if ((rw == 1) & !matching->d)
-	    n = MMU_DTLB_INITIAL_WRITE;
-	else
+            n = MMU_DTLB_INITIAL_WRITE;
+        else
 	    n = MMU_OK;
     }
     return n;
@@ -545,18 +591,23 @@ void cpu_load_tlb(CPUState * env)
     entry->v    = (uint8_t)cpu_ptel_v(env->ptel);
     entry->ppn  = cpu_ptel_ppn(env->ptel);
     entry->sz   = (uint8_t)cpu_ptel_sz(env->ptel);
+
     switch (entry->sz) {
     case 0: /* 00 */
         entry->size = 1024; /* 1K */
+        env->utlbs_1k[entry->vpn] = n;
         break;
     case 1: /* 01 */
         entry->size = 1024 * 4; /* 4K */
+        env->utlbs_4k[entry->vpn >> 2] = n;
         break;
     case 2: /* 10 */
         entry->size = 1024 * 64; /* 64K */
+        env->utlbs_64k[entry->vpn >> 6] = n;
         break;
     case 3: /* 11 */
         entry->size = 1024 * 1024; /* 1M */
+        env->utlbs_1m[entry->vpn >> 10] = n;
         break;
     default:
         assert(0);
@@ -571,7 +622,7 @@ void cpu_load_tlb(CPUState * env)
     entry->tc   = (uint8_t)cpu_ptea_tc(env->ptea);
 }
 
-void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
+void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *env, target_phys_addr_t addr,
 				    uint32_t mem_value)
 {
     int associate = addr & 0x0000080;
@@ -579,7 +630,7 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
     uint8_t d = (uint8_t)((mem_value & 0x00000200) >> 9);
     uint8_t v = (uint8_t)((mem_value & 0x00000100) >> 8);
     uint8_t asid = (uint8_t)(mem_value & 0x000000ff);
-    int use_asid = (s->mmucr & MMUCR_SV) == 0 || (s->sr & SR_MD) == 0;
+    int use_asid = (env->mmucr & MMUCR_SV) == 0 || (env->sr & SR_MD) == 0;
 
     if (associate) {
         int i;
@@ -588,7 +639,7 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
 
 	/* search UTLB */
 	for (i = 0; i < UTLB_SIZE; i++) {
-            tlb_t * entry = &s->utlb[i];
+            tlb_t * entry = &env->utlb[i];
             if (!entry->v)
 	        continue;
 
@@ -596,8 +647,8 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
                 && (!use_asid || entry->asid == asid || entry->sh)) {
 	        if (utlb_match_entry) {
 		    /* Multiple TLB Exception */
-		    s->exception_index = 0x140;
-		    s->tea = addr;
+		    env->exception_index = 0x140;
+		    env->tea = addr;
 		    break;
 	        }
 		if (entry->v && !v)
@@ -606,12 +657,12 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
 		entry->d = d;
 	        utlb_match_entry = entry;
 	    }
-	    increment_urc(s); /* per utlb access */
+	    increment_urc(env); /* per utlb access */
 	}
 
 	/* search ITLB */
 	for (i = 0; i < ITLB_SIZE; i++) {
-            tlb_t * entry = &s->itlb[i];
+            tlb_t * entry = &env->itlb[i];
             if (entry->vpn == vpn
                 && (!use_asid || entry->asid == asid || entry->sh)) {
 	        if (entry->v && !v)
@@ -625,23 +676,40 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
 	}
 
 	if (needs_tlb_flush)
-	    tlb_flush_page(s, vpn << 10);
+	    tlb_flush_page(env, vpn << 10);
 
     } else {
         int index = (addr & 0x00003f00) >> 8;
-        tlb_t * entry = &s->utlb[index];
+        tlb_t * entry = &env->utlb[index];
 	if (entry->v) {
 	    /* Overwriting valid entry in utlb. */
             target_ulong address = entry->vpn << 10;
-	    if (!same_tlb_entry_exists(s->itlb, ITLB_SIZE, entry)) {
-	        tlb_flush_page(s, address);
+	    if (!same_tlb_entry_exists(env->itlb, ITLB_SIZE, entry)) {
+	        tlb_flush_page(env, address);
 	    }
 	}
 	entry->asid = asid;
 	entry->vpn = vpn;
 	entry->d = d;
 	entry->v = v;
-	increment_urc(s);
+        switch (entry->sz) {
+        case 0: /* 00 */
+            env->utlbs_1k[entry->vpn] = index;
+            break;
+        case 1: /* 01 */
+            env->utlbs_4k[entry->vpn >> 2] = index;
+            break;
+        case 2: /* 10 */
+            env->utlbs_64k[entry->vpn >> 6] = index;
+            break;
+        case 3: /* 11 */
+            env->utlbs_1m[entry->vpn >> 10] = index;
+            break;
+        default:
+            assert(0);
+            break;
+        }
+	increment_urc(env);
     }
 }
 
diff --git a/target-sh4/translate.c b/target-sh4/translate.c
index df81052..eab21a6 100644
--- a/target-sh4/translate.c
+++ b/target-sh4/translate.c
@@ -202,6 +202,11 @@ static void cpu_sh4_reset(CPUSH4State * env)
     set_float_rounding_mode(float_round_to_zero, &env->fp_status);
 #endif
     env->mmucr = 0;
+
+    memset (env->utlbs_1k, 0, sizeof (env->utlbs_1k));
+    memset (env->utlbs_4k, 0, sizeof (env->utlbs_4k));
+    memset (env->utlbs_64k, 0, sizeof (env->utlbs_64k));
+    memset (env->utlbs_1m, 0, sizeof (env->utlbs_1m));
 }
 
 typedef struct {
@@ -510,7 +515,7 @@ static void _decode_opc(DisasContext * ctx)
        to this list.  When we see an instruction that is neither movca.l
        nor ocbi, we perform the stores recorded in this list.  When we see
        ocbi, we check if the stores list has the address being invalidated.
-       If so, we remove the address from the list.  
+       If so, we remove the address from the list.
 
        To optimize, we only try to flush stores when we're at the start of
        TB, or if we already saw movca.l in this TB and did not flush stores
@@ -1565,7 +1570,7 @@ static void _decode_opc(DisasContext * ctx)
 	}
 	return;
     case 0x00c3:		/* movca.l R0,@Rm */
-	gen_helper_movcal (REG(B11_8), REG(0));	
+	gen_helper_movcal (REG(B11_8), REG(0));
 	ctx->has_movcal = 1;
 	return;
     case 0x40a9:
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-02 23:58 ` Lionel Landwerlin
@ 2009-03-03  0:09   ` Lionel Landwerlin
  2009-03-03 15:46   ` Shin-ichiro KAWASAKI
  1 sibling, 0 replies; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03  0:09 UTC (permalink / raw)
  To: qemu-devel

You might want this patch too


-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

[PATCH] SH4: Fixed last UTLB unused and URB/URC management

Signed-off-by: Lionel Landwerlin <lionel.landwerlin@openwide.fr>
---
 target-sh4/helper.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-sh4/helper.c b/target-sh4/helper.c
index cd0f392..54a3f1f 100644
--- a/target-sh4/helper.c
+++ b/target-sh4/helper.c
@@ -297,7 +297,7 @@ static int same_tlb_entry_exists(const tlb_t *
haystack, uint8_t nbtlb,
     return 0;
 }
 
-static void increment_urc(CPUState * env)
+static inline void increment_urc(CPUState * env)
 {
     uint8_t urb, urc;
 
@@ -305,7 +305,7 @@ static void increment_urc(CPUState * env)
     urb = ((env->mmucr) >> 18) & 0x3f;
     urc = ((env->mmucr) >> 10) & 0x3f;
     urc++;
-    if (urc == urb || urc == UTLB_SIZE - 1)
+    if ((urb > 0 && urc > urb) || urc > (UTLB_SIZE - 1))
 	urc = 0;
     env->mmucr = (env->mmucr & 0xffff03ff) | (urc << 10);
 }
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-02 22:30     ` Lionel Landwerlin
@ 2009-03-03 15:32       ` Shin-ichiro KAWASAKI
  0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-03 15:32 UTC (permalink / raw)
  To: qemu-devel

Lionel Landwerlin wrote:
> Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
>> Hi, Lionel.
>> Thank you for the response!
>>
>> Lionel Landwerlin wrote:
>>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
>>> écrit :
>>>
>>> Hi Shin-ichiro,
>>>
>>> I'm interested by your patch, but it does not apply on top of the r6215
>>> (qemu svn) + qemu-sh patches
>>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary) 
>>>
>>> Should I try on the last qemu svn ?
>> Yes.  When I send patch to qemu-devel ml, I make diff against
>> qemu main repository.  Could you try on it?
>>
>>
> 
> Are you working on a 32 or 64bits host system ?
> 
> I experiencing a lot of random behavior with qemu...

My host system is 32 bit. Core 2 Duo and VMware.
I often see sometimes user process fails with segmentation fault.
I know nothing about sh4 system emulation on 64 bit host.

> >From now I don't a have video anymore...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-02 23:58 ` Lionel Landwerlin
  2009-03-03  0:09   ` Lionel Landwerlin
@ 2009-03-03 15:46   ` Shin-ichiro KAWASAKI
  2009-03-03 18:57     ` Lionel Landwerlin
  2009-03-03 23:11     ` Lionel Landwerlin
  1 sibling, 2 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-03 15:46 UTC (permalink / raw)
  To: qemu-devel

Lionel Landwerlin wrote:
> Shin-ichiro,
> 
> Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
> 
> Instead, I would like to try another approach. The patch you proposed to
> find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
> something like that if I remember) instead here is a patch with a
> complexity of O(1).

Good work.  I evaluated your patch on my environment, measuring
compile time for empty main() with gcc.

  sh4 : 5.8 [seconds]     O(n) utlb search.
  sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
  sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
  arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)

Your patch has a nice score!

Now I've done the work to increase number of utlb entries from 64 to 256,
and found that the score get arround 2.4 seconds.
I'm trying to increase it to 4096.  Your O(1) search will be more important
as the entry number increase.

> +#if !defined(CONFIG_USER_ONLY)
> +    /* vpn to utlb entry caches (too much space for user emulation) */
> +    uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
> +    uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
> +    uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
> +    uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
> +#endif
>  } CPUSH4State;

Isn't it too gorgeous?
How about allocating them on demand?
I guess sh-linux uses only utlbs_4k[], in general.
If so, 4 Mb utlbs_1k[] is waste.


Regards,
Shin-ichiro KAWASAKI

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 15:46   ` Shin-ichiro KAWASAKI
@ 2009-03-03 18:57     ` Lionel Landwerlin
  2009-03-03 19:25       ` Laurent Desnogues
  2009-03-03 23:11     ` Lionel Landwerlin
  1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 18:57 UTC (permalink / raw)
  To: qemu-devel

Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
> Lionel Landwerlin wrote:
> > Shin-ichiro,
> > 
> > Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
> > 
> > Instead, I would like to try another approach. The patch you proposed to
> > find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
> > something like that if I remember) instead here is a patch with a
> > complexity of O(1).
> 
> Good work.  I evaluated your patch on my environment, measuring
> compile time for empty main() with gcc.
> 
>   sh4 : 5.8 [seconds]     O(n) utlb search.
>   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
>   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
>   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)
> 
> Your patch has a nice score!

Great :) But we're still far from arm :(

> 
> Now I've done the work to increase number of utlb entries from 64 to 256,
> and found that the score get arround 2.4 seconds.
> I'm trying to increase it to 4096.  Your O(1) search will be more important
> as the entry number increase.
> 

What do you mean by more important ?

Is the arm emulation increasing the number of TLB ?

> > +#if !defined(CONFIG_USER_ONLY)
> > +    /* vpn to utlb entry caches (too much space for user emulation) */
> > +    uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
> > +    uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
> > +    uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
> > +    uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
> > +#endif
> >  } CPUSH4State;
> 
> Isn't it too gorgeous?
> How about allocating them on demand?
> I guess sh-linux uses only utlbs_4k[], in general.
> If so, 4 Mb utlbs_1k[] is waste.
> 

sh-linux can also use huge pages of 64k and 1M.

I think it is important to keep the emulation as close as possible from
the real the cpu capabilities.


Regards,

-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 18:57     ` Lionel Landwerlin
@ 2009-03-03 19:25       ` Laurent Desnogues
  2009-03-03 22:28         ` Lionel Landwerlin
  0 siblings, 1 reply; 20+ messages in thread
From: Laurent Desnogues @ 2009-03-03 19:25 UTC (permalink / raw)
  To: qemu-devel

On Tue, Mar 3, 2009 at 7:57 PM, Lionel Landwerlin
<lionel.landwerlin@openwide.fr> wrote:
> Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
[...]
>>   sh4 : 5.8 [seconds]     O(n) utlb search.
>>   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
>>   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
>>   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)
>>
>> Your patch has a nice score!
>
> Great :) But we're still far from arm :(

It would be interesting if you could run oprofile of both platforms
(and even better if that was with call-graph output).

>> > +#if !defined(CONFIG_USER_ONLY)
>> > +    /* vpn to utlb entry caches (too much space for user emulation) */
>> > +    uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
>> > +    uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
>> > +    uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
>> > +    uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
>> > +#endif
>> >  } CPUSH4State;
>>
>> Isn't it too gorgeous?
>> How about allocating them on demand?
>> I guess sh-linux uses only utlbs_4k[], in general.
>> If so, 4 Mb utlbs_1k[] is waste.
>>
>
> sh-linux can also use huge pages of 64k and 1M.
>
> I think it is important to keep the emulation as close as possible from
> the real the cpu capabilities.

I think Shin-ichiro was rather pointing a the number of 1k pages,
which are probably not used.  OTOH I agree being close to
CPU capabilities is important.


Laurent

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 19:25       ` Laurent Desnogues
@ 2009-03-03 22:28         ` Lionel Landwerlin
  2009-03-04  2:59           ` Paul Brook
  2009-03-04 15:39           ` Shin-ichiro KAWASAKI
  0 siblings, 2 replies; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 22:28 UTC (permalink / raw)
  To: qemu-devel

Le mardi 03 mars 2009 à 20:25 +0100, Laurent Desnogues a écrit :
> On Tue, Mar 3, 2009 at 7:57 PM, Lionel Landwerlin
> <lionel.landwerlin@openwide.fr> wrote:
> > Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
> [...]
> >>   sh4 : 5.8 [seconds]     O(n) utlb search.
> >>   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
> >>   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
> >>   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)
> >>
> >> Your patch has a nice score!
> >
> > Great :) But we're still far from arm :(
> 
> It would be interesting if you could run oprofile of both platforms
> (and even better if that was with call-graph output).
> 
> >> > +#if !defined(CONFIG_USER_ONLY)
> >> > +    /* vpn to utlb entry caches (too much space for user emulation) */
> >> > +    uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
> >> > +    uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
> >> > +    uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
> >> > +    uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
> >> > +#endif
> >> >  } CPUSH4State;
> >>
> >> Isn't it too gorgeous?
> >> How about allocating them on demand?
> >> I guess sh-linux uses only utlbs_4k[], in general.
> >> If so, 4 Mb utlbs_1k[] is waste.
> >>
> >
> > sh-linux can also use huge pages of 64k and 1M.
> >
> > I think it is important to keep the emulation as close as possible from
> > the real the cpu capabilities.
> 
> I think Shin-ichiro was rather pointing a the number of 1k pages,
> which are probably not used.  OTOH I agree being close to
> CPU capabilities is important.

The problem is, when a virtual address space must be resolved, the cpu
emulator is not able to know what page size has been (or not) used to
map the address. Thus we must handle the worst case...

And after all, 5Mb isn't so much, considering the fact that the physical
guest memory, in system emulation, can be mapped anywhere in the host
virtual address space, contrary to the user emulation that have harder
contraints...



By the way, does someone know why there is some kind of "tlb management
code" in exec.c ??

Does the SH4 architecture have special features that can't be handled in
a generic code ? Or are we just rewriting some code that is already
there ... ?


-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 15:46   ` Shin-ichiro KAWASAKI
  2009-03-03 18:57     ` Lionel Landwerlin
@ 2009-03-03 23:11     ` Lionel Landwerlin
  2009-03-04 15:12       ` Shin-ichiro KAWASAKI
  1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 23:11 UTC (permalink / raw)
  To: qemu-devel

Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
> Lionel Landwerlin wrote:
> > Shin-ichiro,
> > 
> > Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
> > 
> > Instead, I would like to try another approach. The patch you proposed to
> > find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
> > something like that if I remember) instead here is a patch with a
> > complexity of O(1).
> 
> Good work.  I evaluated your patch on my environment, measuring
> compile time for empty main() with gcc.
> 
>   sh4 : 5.8 [seconds]     O(n) utlb search.
>   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
>   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
>   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)
> 
> Your patch has a nice score!
> 
> Now I've done the work to increase number of utlb entries from 64 to 256,
> and found that the score get arround 2.4 seconds.
> I'm trying to increase it to 4096.  Your O(1) search will be more important
> as the entry number increase.

We should setup a git repository to improve that or at least work on the
same basis.

Do you have a lot of patch on top of the last svn ?


-- 
Lionel Landwerlin <lionel.landwerlin@openwide.fr>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 22:28         ` Lionel Landwerlin
@ 2009-03-04  2:59           ` Paul Brook
  2009-03-04 15:22             ` Shin-ichiro KAWASAKI
  2009-03-04 15:39           ` Shin-ichiro KAWASAKI
  1 sibling, 1 reply; 20+ messages in thread
From: Paul Brook @ 2009-03-04  2:59 UTC (permalink / raw)
  To: qemu-devel; +Cc: Lionel Landwerlin

> > > Great :) But we're still far from arm :(

> By the way, does someone know why there is some kind of "tlb management
> code" in exec.c ??
>
> Does the SH4 architecture have special features that can't be handled in
> a generic code ? Or are we just rewriting some code that is already
> there ... ?

I think you're missing the most important difference; SH uses a software 
managed TLB, whereas ARM uses a hardware managed TLB.

The main consequence of this is that we don't have to model the actual ARM TLB 
at all, it is never directly visible. We effectively implement an infinitely 
large TLB.

For SH the TLB is programmed directly, so we end up having to maintain two 
TLBs: The qemu TLB and the architectural SH TLB. For correct operation pages 
must be removed from the qemu TLB when they are evicted/replaced in the SH 
TLB. The SH TLB is quite small, and flushing qemu TLB entries is quite 
expensive, so this results in fairly poor performance.

MIPS has a similar problem. However in that case the most common TLB 
operations do not directly expose the TLB state. In particular when setting a 
new TLB entry it is unspecified which TLB entry is replaced. At that point 
the OS can't know which ehtry was evicted, so we can lie, and not evict pages 
until the guest does something that allows it to determine the exact TLB 
state. In practice this is sufficient to make mips-linux workreasonably well.

I'm not sure if the same is posible for SH. It probably depends whether URC is 
visible to/used by the guest.

Large pages add even more complications. The qemu tlb canonly handle a single 
page size. In practice means that when large pages are used invalidating a 
single page entry requires the whole qemu tlb to be flushed. I'm pretty sure 
x86 getsand works mainly be chance (nothing actually ues large pages enough 
to notice it's broken). ARM takes the hit of a full TLB flush (linux breakss 
if you only flush a 1k region of a  4k entry), but single pge flushes are 
rare so in practice this doesn't hurt too much

Paul

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 23:11     ` Lionel Landwerlin
@ 2009-03-04 15:12       ` Shin-ichiro KAWASAKI
  0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-04 15:12 UTC (permalink / raw)
  To: qemu-devel

Lionel Landwerlin wrote:
> Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
>> Lionel Landwerlin wrote:
>>> Shin-ichiro,
>>>
>>> Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
>>>
>>> Instead, I would like to try another approach. The patch you proposed to
>>> find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
>>> something like that if I remember) instead here is a patch with a
>>> complexity of O(1).
>> Good work.  I evaluated your patch on my environment, measuring
>> compile time for empty main() with gcc.
>>
>>   sh4 : 5.8 [seconds]     O(n) utlb search.
>>   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
>>   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
>>   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)
>>
>> Your patch has a nice score!
>>
>> Now I've done the work to increase number of utlb entries from 64 to 256,
>> and found that the score get arround 2.4 seconds.
>> I'm trying to increase it to 4096.  Your O(1) search will be more important
>> as the entry number increase.
> 
> We should setup a git repository to improve that or at least work on the
> same basis.
> 
> Do you have a lot of patch on top of the last svn ?

I personally manages my staging repository corresponds to 'qemu-sh'.

  git://github.com/kawasaki/qemu-sh.git
  http://github.com/kawasaki/qemu-sh/tree/master

Patches in it can be applied to current qemu subversion trunk.
I put your O(1) patch and my utlb entry increasing patch also.
Will it work as the basis?

I attach the utlb entry increasing patch to this mail for reference.
It should be applied with other patches in my staging repository, or
patch command will fail.
It increases the number of utlb entry from 64 to 256.
URC and URB values in MMUCR can take the number from 0 to 255.

Here's the benchmark value.

   sh4 : 5.8 [seconds]     O(n) utlb search.
   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
   sh4 : 2.1 [seconds]     O(1) utlb search, and 256 utlb enries  <= 
   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)

Getting better :) but still behind arm.

Regards,
Shin-ichiro KAWASAKI


Index: trunk/target-sh4/cpu.h
===================================================================
--- trunk/target-sh4/cpu.h	(revision 6676)
+++ trunk/target-sh4/cpu.h	(working copy)
@@ -88,7 +88,8 @@
     uint8_t tc;			/* timing control */
 } tlb_t;
 
-#define UTLB_SIZE 64
+#define UTLB_BITS 8     /* real hard : 6 */
+#define UTLB_SIZE (1 << UTLB_BITS)
 #define ITLB_SIZE 4
 
 #define NB_MMU_MODES 2
@@ -228,14 +229,18 @@
 #define MMUCR    0x1F000010
 #define MMUCR_AT (1<<0)
 #define MMUCR_SV (1<<8)
-#define MMUCR_URC_BITS (6)
 #define MMUCR_URC_OFFSET (10)
-#define MMUCR_URC_SIZE (1 << MMUCR_URC_BITS)
-#define MMUCR_URC_MASK (((MMUCR_URC_SIZE) - 1) << MMUCR_URC_OFFSET)
+#define MMUCR_URC_MASK ((UTLB_SIZE - 1) << MMUCR_URC_OFFSET)
 static inline int cpu_mmucr_urc (uint32_t mmucr)
 {
     return ((mmucr & MMUCR_URC_MASK) >> MMUCR_URC_OFFSET);
 }
+#define MMUCR_URB_OFFSET (18)
+#define MMUCR_URB_MASK ((UTLB_SIZE - 1) << MMUCR_URB_OFFSET)
+static inline int cpu_mmucr_urb (uint32_t mmucr)
+{
+    return ((mmucr & MMUCR_URB_MASK) >> MMUCR_URB_OFFSET);
+}
 
 /* PTEH : Page Translation Entry High register */
 #define PTEH_ASID_BITS (8)
Index: trunk/target-sh4/helper.c
===================================================================
--- trunk/target-sh4/helper.c	(revision 6676)
+++ trunk/target-sh4/helper.c	(work copy)
@@ -346,12 +346,12 @@
     uint8_t urb, urc;
 
     /* Increment URC */
-    urb = ((env->mmucr) >> 18) & 0x3f;
-    urc = ((env->mmucr) >> 10) & 0x3f;
+    urb = cpu_mmucr_urb(env->mmucr);
+    urc = cpu_mmucr_urc(env->mmucr);
     urc++;
     if ((urb > 0 && urc > urb) || urc > (UTLB_SIZE - 1))
-	urc = 0;
-    env->mmucr = (env->mmucr & 0xffff03ff) | (urc << 10);
+        urc = 0;
+    env->mmucr = (env->mmucr & ~MMUCR_URC_MASK) | (urc << MMUCR_URC_OFFSET);
 }
 
 /* Find itlb entry - update itlb from utlb if necessary and asked for

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-04  2:59           ` Paul Brook
@ 2009-03-04 15:22             ` Shin-ichiro KAWASAKI
  0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-04 15:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: Lionel Landwerlin

Paul Brook wrote:
>>>> Great :) But we're still far from arm :(
> 
>> By the way, does someone know why there is some kind of "tlb management
>> code" in exec.c ??
>>
>> Does the SH4 architecture have special features that can't be handled in
>> a generic code ? Or are we just rewriting some code that is already
>> there ... ?
> 
> I think you're missing the most important difference; SH uses a software 
> managed TLB, whereas ARM uses a hardware managed TLB.
> 
> The main consequence of this is that we don't have to model the actual ARM TLB 
> at all, it is never directly visible. We effectively implement an infinitely 
> large TLB.
> 
> For SH the TLB is programmed directly, so we end up having to maintain two 
> TLBs: The qemu TLB and the architectural SH TLB. For correct operation pages 
> must be removed from the qemu TLB when they are evicted/replaced in the SH 
> TLB. The SH TLB is quite small, and flushing qemu TLB entries is quite 
> expensive, so this results in fairly poor performance.
> 
> MIPS has a similar problem. However in that case the most common TLB 
> operations do not directly expose the TLB state. In particular when setting a 
> new TLB entry it is unspecified which TLB entry is replaced. At that point 
> the OS can't know which ehtry was evicted, so we can lie, and not evict pages 
> until the guest does something that allows it to determine the exact TLB 
> state. In practice this is sufficient to make mips-linux workreasonably well.
> 
> I'm not sure if the same is posible for SH. It probably depends whether URC is 
> visible to/used by the guest.

Thank you Paul for your clear explanation.  It confirms my guess, and answers my
question also.

As I posted for other mail, I tried to increase the number of tlb entry from 64
to 256 and get performance improvement.  This approach modifies real hardware
specification including URC, but same SH-Linux kernel works fine.  It implies
that SH-Linux does not refer URC, and MIPS approach might be the solution for
SH too.


Regards,
Shin-ichiro KAWASAKI

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] sh : performance problem
  2009-03-03 22:28         ` Lionel Landwerlin
  2009-03-04  2:59           ` Paul Brook
@ 2009-03-04 15:39           ` Shin-ichiro KAWASAKI
  1 sibling, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-04 15:39 UTC (permalink / raw)
  To: qemu-devel

Lionel Landwerlin wrote:
> Le mardi 03 mars 2009 à 20:25 +0100, Laurent Desnogues a écrit :
>> On Tue, Mar 3, 2009 at 7:57 PM, Lionel Landwerlin
>> <lionel.landwerlin@openwide.fr> wrote:
>>> Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
>> [...]
>>>>   sh4 : 5.8 [seconds]     O(n) utlb search.
>>>>   sh4 : 4.6 [seconds]     O(log2(n)) utlb search.
>>>>   sh4 : 4.1 [seconds]     O(1) utlb search by Lionel
>>>>   arm : 0.8 [seconds]     (-M versatilepb + Debian ARM)
>>>>
>>>> Your patch has a nice score!
>>> Great :) But we're still far from arm :(
>> It would be interesting if you could run oprofile of both platforms
>> (and even better if that was with call-graph output).
>>
>>>>> +#if !defined(CONFIG_USER_ONLY)
>>>>> +    /* vpn to utlb entry caches (too much space for user emulation) */
>>>>> +    uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
>>>>> +    uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
>>>>> +    uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
>>>>> +    uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
>>>>> +#endif
>>>>>  } CPUSH4State;
>>>> Isn't it too gorgeous?
>>>> How about allocating them on demand?
>>>> I guess sh-linux uses only utlbs_4k[], in general.
>>>> If so, 4 Mb utlbs_1k[] is waste.
>>>>
>>> sh-linux can also use huge pages of 64k and 1M.
>>>
>>> I think it is important to keep the emulation as close as possible from
>>> the real the cpu capabilities.
>> I think Shin-ichiro was rather pointing a the number of 1k pages,
>> which are probably not used.  OTOH I agree being close to
>> CPU capabilities is important.
> 
> The problem is, when a virtual address space must be resolved, the cpu
> emulator is not able to know what page size has been (or not) used to
> map the address. Thus we must handle the worst case...
> 
> And after all, 5Mb isn't so much, considering the fact that the physical
> guest memory, in system emulation, can be mapped anywhere in the host
> virtual address space, contrary to the user emulation that have harder
> contraints...

IMO, it will be another patch which introduces lines like this.

  if (!env->utlbs_1k)
       env->utlbs_1k = qemu_mallocz(4194304);

But footprint tuning can be left as future task.
Sorry for being too critical.

Regards,
Shin-ichiro KAWASAKI

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2009-03-04 15:39 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-02-26 16:28 [Qemu-devel] sh : performance problem Shin-ichiro KAWASAKI
2009-02-28 19:15 ` Lionel Landwerlin
2009-03-01  6:05   ` Shin-ichiro KAWASAKI
2009-03-01 12:46     ` Lionel Landwerlin
2009-03-01 16:10       ` Shin-ichiro KAWASAKI
2009-03-01 17:39         ` Lionel Landwerlin
2009-03-02 14:46           ` Shin-ichiro KAWASAKI
2009-03-02 22:30     ` Lionel Landwerlin
2009-03-03 15:32       ` Shin-ichiro KAWASAKI
2009-03-02 23:58 ` Lionel Landwerlin
2009-03-03  0:09   ` Lionel Landwerlin
2009-03-03 15:46   ` Shin-ichiro KAWASAKI
2009-03-03 18:57     ` Lionel Landwerlin
2009-03-03 19:25       ` Laurent Desnogues
2009-03-03 22:28         ` Lionel Landwerlin
2009-03-04  2:59           ` Paul Brook
2009-03-04 15:22             ` Shin-ichiro KAWASAKI
2009-03-04 15:39           ` Shin-ichiro KAWASAKI
2009-03-03 23:11     ` Lionel Landwerlin
2009-03-04 15:12       ` Shin-ichiro KAWASAKI

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).