* [Qemu-devel] sh : performance problem
@ 2009-02-26 16:28 Shin-ichiro KAWASAKI
2009-02-28 19:15 ` Lionel Landwerlin
2009-03-02 23:58 ` Lionel Landwerlin
0 siblings, 2 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-02-26 16:28 UTC (permalink / raw)
To: qemu-devel
Hi, all.
One of the current problems of qemu-sh4 system emulation is performance.
Kernel boot process seems not so slow, but userland process performance
is too bad. This problem can be seen numerically when you compile simple '.c'
source with empty main(). The result is as follows.
sh4 : 5.8 [seconds] (-M r2d + Fedora 6)
arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
sh4 is 7 times slower than arm...
/* I repeated the compile work until the measured time converges. */
Using Oprofile, we can see what qemu doing while the compile work on sh4
system emulation. Top 20 time consuming functions are as follows.
---------------------------------------------------------------------------
CPU: Core 2, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name app name symbol name
38481 25.0550 qemu-system-sh4 qemu-system-sh4 find_tlb_entry
19836 12.9152 anon (tgid:2483 range:0xb2181000-0xb3381000) qemu-system-sh4 (no symbols)
9333 6.0767 qemu-system-sh4 qemu-system-sh4 cpu_sh4_exec
8180 5.3260 qemu-system-sh4 qemu-system-sh4 tcg_reg_alloc_op
7236 4.7114 qemu-system-sh4 qemu-system-sh4 temp_save
7051 4.5909 qemu-system-sh4 qemu-system-sh4 tcg_liveness_analysis
4339 2.8251 libc-2.8.90.so libc-2.8.90.so (no symbols)
3414 2.2229 qemu-system-sh4 qemu-system-sh4 tlb_set_page_exec
3017 1.9644 qemu-system-sh4 qemu-system-sh4 decode_opc
2802 1.8244 qemu-system-sh4 qemu-system-sh4 get_physical_address
sa2358 1.5353 qemu-system-sh4 qemu-system-sh4 find_itlb_entry
2215 1.4422 qemu-system-sh4 qemu-system-sh4 cpu_sh4_write_mmaped_utlb_addr
1771 1.1531 qemu-system-sh4 qemu-system-sh4 tcg_gen_code_search_pc
1746 1.1368 qemu-system-sh4 qemu-system-sh4 __ldl_mmu
1663 1.0828 qemu-system-sh4 qemu-system-sh4 cpu_sh4_handle_mmu_fault
1557 1.0138 qemu-system-sh4 qemu-system-sh4 gen_intermediate_code_pc
1328 0.8647 qemu-system-sh4 qemu-system-sh4 tcg_temp_new_internal_i32
1233 0.8028 qemu-system-arm qemu-system-arm cpu_arm_exec
1229 0.8002 vmlinux vmlinux vmi_activate
1046 0.6811 Xvnc4 Xvnc4 (no symbols)
---------------------------------------------------------------------------
Most time consuming function is 'find_tlb_entry()', which search the sh4's TLB
entries to find an entry which matches with given address.
If my understanding is right, this search happens when TLB miss happens.
Too many TLB misses causes bad perfomance, I guess.
The actions to solve this problem will be,
(i) tune up 'find_tlb_entry()'
(ii) reduce TLB miss by expanding page size
(iii) reduce TLB miss by increase the number of TLB entries virtually,
more than the real cpu has.
First, I tried (i). The attached patch introduces binary search.
It shortens the gcc compile time from 5.8 seconds to 4.6 seconds :
it make sh4 system emulation 20% faster.
'find_tlb_entry()' searches unified tlb array with 64 entries, and
instruction tlb array with 4 entries. This patch focus only on unified
tlb array search.
This patch is rather rough one. Any advise to brush it up will be appreciated.
I'm going to work on approaches (ii) and (iii). Advises for them will be
thanked too.
Regards,
Shin-ichiro KAWASAKI
Index: trunk/target-sh4/cpu.h
===================================================================
--- trunk/target-sh4/cpu.h (revision 6628)
+++ trunk/target-sh4/cpu.h (working copy)
@@ -137,13 +137,16 @@
uint32_t intevt; /* interrupt event register */
uint32_t pvr; /* Processor Version Register */
uint32_t prr; /* Processor Revision Register */
uint32_t cvr; /* Cache Version Register */
- CPU_COMMON tlb_t utlb[UTLB_SIZE]; /* unified translation table */
+ CPU_COMMON
+ tlb_t utlb[UTLB_SIZE]; /* unified translation table */
+ tlb_t * sorted_utlb[UTLB_SIZE];
+ uint32_t sorted_utlb_num;
tlb_t itlb[ITLB_SIZE]; /* instruction translation table */
void *intc_handle;
int intr_at_halt; /* SR_BL ignored during sleep */
} CPUSH4State;
CPUSH4State *cpu_sh4_init(const char *cpu_model);
Index: trunk/target-sh4/helper.c
===================================================================
--- trunk/target-sh4/helper.c (revision 6628)
+++ trunk/target-sh4/helper.c (working copy)
@@ -322,7 +322,7 @@
if (e == MMU_DTLB_MULTIPLE)
e = MMU_ITLB_MULTIPLE;
else if (e == MMU_DTLB_MISS && update) {
- e = find_tlb_entry(env, address, env->utlb, UTLB_SIZE, use_asid);
+ e = find_utlb_entry(env, address, use_asid);
if (e >= 0) {
tlb_t * ientry;
n = itlb_replacement(env);
@@ -342,15 +342,69 @@
return e;
}
+static inline int is_utlb_match(tlb_t * e, uint32_t addr)
+{
+ return (e->vpn << 10 & ~(e->size - 1)) == (addr & ~(e->size - 1));
+}
+
/* Find utlb entry
Return entry, MMU_DTLB_MISS, MMU_DTLB_MULTIPLE */
int find_utlb_entry(CPUState * env, target_ulong address, int use_asid)
{
+ int min = 0;
+ int max = env->sorted_utlb_num - 1;
+ int cur = (min + max) / 2 ;
+ int save;
+ uint8_t asid = env->pteh & 0xff;
+ int ret = MMU_DTLB_MISS;
+
/* per utlb access */
increment_urc(env);
- /* Return entry */
- return find_tlb_entry(env, address, env->utlb, UTLB_SIZE, use_asid);
+ if (env->sorted_utlb_num <= 0)
+ return MMU_DTLB_MISS;
+
+ /* binary search */
+ while (!is_utlb_match(env->sorted_utlb[cur], address)) {
+ if (min >= max)
+ return MMU_DTLB_MISS;
+ if (((env->sorted_utlb[cur]->vpn << 10)
+ & ~(env->sorted_utlb[cur]->size - 1)) <
+ (address & ~(env->sorted_utlb[cur]->size - 1))) {
+ min = cur + 1;
+ } else {
+ max = cur - 1;
+ }
+ cur = (min + max) / 2;
+ }
+
+ save = cur;
+
+ /* minus search */
+ do {
+ tlb_t * e = env->sorted_utlb[cur];
+ if (e->sh || (!use_asid) || e->asid == asid) {
+ if (ret != MMU_DTLB_MISS)
+ return MMU_DTLB_MULTIPLE;
+ ret = e - &env->utlb[0];
+ }
+ cur--;
+ } while(cur >= 0 && is_utlb_match(env->sorted_utlb[cur], address));
+
+ /* plus search */
+ cur = save + 1;
+ while (cur < env->sorted_utlb_num &&
+ is_utlb_match(env->sorted_utlb[cur], address)) {
+ tlb_t * e = env->sorted_utlb[cur];
+ if (e->sh || (!use_asid) || e->asid == asid) {
+ if (ret != MMU_DTLB_MISS)
+ return MMU_DTLB_MULTIPLE;
+ ret = e - &env->utlb[0];
+ }
+ cur++;
+ }
+
+ return ret;
}
/* Match address against MMU
@@ -525,6 +579,63 @@
return physical;
}
+static void add_entry_to_sorted_utlb(CPUState * env, tlb_t * entry)
+{
+ int i;
+
+ if (env->sorted_utlb_num == 0) {
+ env->sorted_utlb[0] = entry;
+ env->sorted_utlb_num = 1;
+ return;
+ }
+
+ if (entry->vpn <= env->sorted_utlb[0]->vpn) {
+ memmove(&env->sorted_utlb[1],
+ &env->sorted_utlb[0],
+ sizeof(tlb_t *) * env->sorted_utlb_num);
+ env->sorted_utlb[0] = entry;
+ env->sorted_utlb_num++;
+ return;
+ }
+
+ for (i = 0; i < env->sorted_utlb_num - 1; i++) {
+ tlb_t * e1 = env->sorted_utlb[i];
+ tlb_t * e2 = env->sorted_utlb[i + 1];
+ if (e1->vpn <= entry->vpn && entry->vpn <= e2->vpn) {
+ memmove(&env->sorted_utlb[i + 2],
+ &env->sorted_utlb[i + 1],
+ sizeof(tlb_t *) * (env->sorted_utlb_num - i - 1));
+ env->sorted_utlb[i + 1] = entry;
+ env->sorted_utlb_num++;
+ return;
+ }
+ }
+
+#if 0
+ assert(env->sorted_utlb_num < UTLB_SIZE);
+#endif
+ env->sorted_utlb[env->sorted_utlb_num] = entry;
+ env->sorted_utlb_num++;
+}
+
+static void remove_entry_from_sorted_utlb(CPUState * env, tlb_t * entry)
+{
+ int i;
+#if 0
+ assert(env->sorted_utlb_num > 0);
+#endif
+
+ for (i = 0; i < env->sorted_utlb_num; i++) {
+ if (env->sorted_utlb[i] == entry) {
+ env->sorted_utlb_num--;
+ memmove(&env->sorted_utlb[i],
+ &env->sorted_utlb[i + 1],
+ sizeof(tlb_t *) * (env->sorted_utlb_num - i));
+ return;
+ }
+ }
+}
+
void cpu_load_tlb(CPUState * env)
{
int n = cpu_mmucr_urc(env->mmucr);
@@ -536,6 +647,8 @@
if (!same_tlb_entry_exists(env->itlb, ITLB_SIZE, entry)) {
tlb_flush_page(env, address);
}
+
+ remove_entry_from_sorted_utlb(env, entry);
}
/* Take values into cpu status from registers. */
@@ -568,6 +681,9 @@
entry->wt = (uint8_t)cpu_ptel_wt(env->ptel);
entry->sa = (uint8_t)cpu_ptea_sa(env->ptea);
entry->tc = (uint8_t)cpu_ptea_tc(env->ptea);
+
+ /* add to sorted list */
+ add_entry_to_sorted_utlb(env, entry);
}
void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
@@ -599,8 +715,12 @@
s->tea = addr;
break;
}
- if (entry->v && !v)
+ if (entry->v && !v) {
needs_tlb_flush = 1;
+ remove_entry_from_sorted_utlb(s, entry);
+ } else if (!entry->v && v){
+ add_entry_to_sorted_utlb(s, entry);
+ }
entry->v = v;
entry->d = d;
utlb_match_entry = entry;
@@ -635,6 +755,7 @@
if (!same_tlb_entry_exists(s->itlb, ITLB_SIZE, entry)) {
tlb_flush_page(s, address);
}
+ remove_entry_from_sorted_utlb(s, entry);
}
entry->asid = asid;
entry->vpn = vpn;
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-02-26 16:28 [Qemu-devel] sh : performance problem Shin-ichiro KAWASAKI
@ 2009-02-28 19:15 ` Lionel Landwerlin
2009-03-01 6:05 ` Shin-ichiro KAWASAKI
2009-03-02 23:58 ` Lionel Landwerlin
1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-02-28 19:15 UTC (permalink / raw)
To: qemu-devel
Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
écrit :
Hi Shin-ichiro,
I'm interested by your patch, but it does not apply on top of the r6215
(qemu svn) + qemu-sh patches
( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
Should I try on the last qemu svn ?
Regards,
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-02-28 19:15 ` Lionel Landwerlin
@ 2009-03-01 6:05 ` Shin-ichiro KAWASAKI
2009-03-01 12:46 ` Lionel Landwerlin
2009-03-02 22:30 ` Lionel Landwerlin
0 siblings, 2 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-01 6:05 UTC (permalink / raw)
To: qemu-devel
Hi, Lionel.
Thank you for the response!
Lionel Landwerlin wrote:
> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> écrit :
>
> Hi Shin-ichiro,
>
> I'm interested by your patch, but it does not apply on top of the r6215
> (qemu svn) + qemu-sh patches
> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
>
> Should I try on the last qemu svn ?
Yes. When I send patch to qemu-devel ml, I make diff against
qemu main repository. Could you try on it?
Regards,
Shin-ichiro KAWASAKI
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-01 6:05 ` Shin-ichiro KAWASAKI
@ 2009-03-01 12:46 ` Lionel Landwerlin
2009-03-01 16:10 ` Shin-ichiro KAWASAKI
2009-03-02 22:30 ` Lionel Landwerlin
1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-01 12:46 UTC (permalink / raw)
To: qemu-devel; +Cc: Shin-ichiro KAWASAKI
Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
> Hi, Lionel.
> Thank you for the response!
>
> Lionel Landwerlin wrote:
> > Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> > écrit :
> >
> > Hi Shin-ichiro,
> >
> > I'm interested by your patch, but it does not apply on top of the r6215
> > (qemu svn) + qemu-sh patches
> > ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
> >
> > Should I try on the last qemu svn ?
>
> Yes. When I send patch to qemu-devel ml, I make diff against
> qemu main repository. Could you try on it?
Yes, I will.
Do you know when the qemu-sh patches will be integrated to qemu ?
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-01 12:46 ` Lionel Landwerlin
@ 2009-03-01 16:10 ` Shin-ichiro KAWASAKI
2009-03-01 17:39 ` Lionel Landwerlin
0 siblings, 1 reply; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-01 16:10 UTC (permalink / raw)
To: Lionel Landwerlin; +Cc: qemu-devel
Lionel Landwerlin wrote:
> Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
>> Hi, Lionel.
>> Thank you for the response!
>>
>> Lionel Landwerlin wrote:
>>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
>>> écrit :
>>>
>>> Hi Shin-ichiro,
>>>
>>> I'm interested by your patch, but it does not apply on top of the r6215
>>> (qemu svn) + qemu-sh patches
>>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
>>>
>>> Should I try on the last qemu svn ?
>> Yes. When I send patch to qemu-devel ml, I make diff against
>> qemu main repository. Could you try on it?
>
> Yes, I will.
>
> Do you know when the qemu-sh patches will be integrated to qemu ?
Nope.
Sometimes the main developers afford some time for sh kindly, then
some of the patches in the qemu-sh staging repository are already
integrated. But some patches are left. I guess they are busy on
other CPU archs.
Especially, it's not a good time to apply patches which is not so
serious, because version cutting work is going now.
Regards,
Shin-ichiro KAWASAKI
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-01 16:10 ` Shin-ichiro KAWASAKI
@ 2009-03-01 17:39 ` Lionel Landwerlin
2009-03-02 14:46 ` Shin-ichiro KAWASAKI
0 siblings, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-01 17:39 UTC (permalink / raw)
To: Shin-ichiro KAWASAKI; +Cc: qemu-devel
Le lundi 02 mars 2009 à 01:10 +0900, Shin-ichiro KAWASAKI a écrit :
> Lionel Landwerlin wrote:
> > Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
> >> Hi, Lionel.
> >> Thank you for the response!
> >>
> >> Lionel Landwerlin wrote:
> >>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> >>> écrit :
> >>>
> >>> Hi Shin-ichiro,
> >>>
> >>> I'm interested by your patch, but it does not apply on top of the r6215
> >>> (qemu svn) + qemu-sh patches
> >>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
> >>>
> >>> Should I try on the last qemu svn ?
> >> Yes. When I send patch to qemu-devel ml, I make diff against
> >> qemu main repository. Could you try on it?
> >
> > Yes, I will.
> >
> > Do you know when the qemu-sh patches will be integrated to qemu ?
>
> Nope.
>
> Sometimes the main developers afford some time for sh kindly, then
> some of the patches in the qemu-sh staging repository are already
> integrated. But some patches are left. I guess they are busy on
> other CPU archs.
>
> Especially, it's not a good time to apply patches which is not so
> serious, because version cutting work is going now.
>
> Regards,
> Shin-ichiro KAWASAKI
>
>
There is also one thing that is annoying me.
With the algorithm you proposed in your patch, you're not able anymore
to raise the MMU_ITLB_MULTIPLE and MMU_DTLB_MULTIPLE exceptions. This
makes the MMU emulation different from the real processor (even if
thoses cases aren't even handled in Linux and probably others OS).
For example you can probably not map the same physical area with
differents virtual page sizes.
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-01 17:39 ` Lionel Landwerlin
@ 2009-03-02 14:46 ` Shin-ichiro KAWASAKI
0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-02 14:46 UTC (permalink / raw)
To: qemu-devel
Lionel Landwerlin wrote:
> Le lundi 02 mars 2009 à 01:10 +0900, Shin-ichiro KAWASAKI a écrit :
>> Lionel Landwerlin wrote:
>>> Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
>>>> Hi, Lionel.
>>>> Thank you for the response!
>>>>
>>>> Lionel Landwerlin wrote:
>>>>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
>>>>> écrit :
>>>>>
>>>>> Hi Shin-ichiro,
>>>>>
>>>>> I'm interested by your patch, but it does not apply on top of the r6215
>>>>> (qemu svn) + qemu-sh patches
>>>>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
>>>>>
>>>>> Should I try on the last qemu svn ?
>>>> Yes. When I send patch to qemu-devel ml, I make diff against
>>>> qemu main repository. Could you try on it?
>>> Yes, I will.
>>>
>>> Do you know when the qemu-sh patches will be integrated to qemu ?
>> Nope.
>>
>> Sometimes the main developers afford some time for sh kindly, then
>> some of the patches in the qemu-sh staging repository are already
>> integrated. But some patches are left. I guess they are busy on
>> other CPU archs.
>>
>> Especially, it's not a good time to apply patches which is not so
>> serious, because version cutting work is going now.
>>
>> Regards,
>> Shin-ichiro KAWASAKI
>>
>>
>
> There is also one thing that is annoying me.
>
> With the algorithm you proposed in your patch, you're not able anymore
> to raise the MMU_ITLB_MULTIPLE and MMU_DTLB_MULTIPLE exceptions. This
> makes the MMU emulation different from the real processor (even if
> thoses cases aren't even handled in Linux and probably others OS).
>
> For example you can probably not map the same physical area with
> differents virtual page sizes.
Right. I'll consider on it for next version of the patch.
Thank you!
Regards,
Shin-ichiro KAWASAKI
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-01 6:05 ` Shin-ichiro KAWASAKI
2009-03-01 12:46 ` Lionel Landwerlin
@ 2009-03-02 22:30 ` Lionel Landwerlin
2009-03-03 15:32 ` Shin-ichiro KAWASAKI
1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-02 22:30 UTC (permalink / raw)
To: qemu-devel
Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
> Hi, Lionel.
> Thank you for the response!
>
> Lionel Landwerlin wrote:
> > Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
> > écrit :
> >
> > Hi Shin-ichiro,
> >
> > I'm interested by your patch, but it does not apply on top of the r6215
> > (qemu svn) + qemu-sh patches
> > ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
> >
> > Should I try on the last qemu svn ?
>
> Yes. When I send patch to qemu-devel ml, I make diff against
> qemu main repository. Could you try on it?
>
>
Are you working on a 32 or 64bits host system ?
I experiencing a lot of random behavior with qemu...
>From now I don't a have video anymore...
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-02-26 16:28 [Qemu-devel] sh : performance problem Shin-ichiro KAWASAKI
2009-02-28 19:15 ` Lionel Landwerlin
@ 2009-03-02 23:58 ` Lionel Landwerlin
2009-03-03 0:09 ` Lionel Landwerlin
2009-03-03 15:46 ` Shin-ichiro KAWASAKI
1 sibling, 2 replies; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-02 23:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Shin-ichiro KAWASAKI
Shin-ichiro,
Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
Instead, I would like to try another approach. The patch you proposed to
find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
something like that if I remember) instead here is a patch with a
complexity of O(1).
This patch has the same problem concerning the multiple hit exception
that is not raised anymore, but I think we can handle it with some more
variables in tlb entries.
The patch apply on
qemu-svn r6215
+
git://git.kernel.org/pub/scm/virt/qemu/lethal/qemu-sh.git patches
Regards,
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
[PATCH] SH4: Use O(1) algorithm instead of O(n) for utlb lookup
The current algorithm to know whether a virtual address belongs to the
TLB cache is to parse all the TLB one by one. The complexity of this
algorithm is O(n).
The following patch implements a TLB entry cache, by keeping for each
virtual page number the associated TLB entry. As usual the complexity
gain is balanced by an overall memory comsumption that is about 5Mb.
This patches also introduce a behavior divergence with the real SH4
processor, because no more 'multiple hit exception' is raised, when 2
TLB map the same physical area. This will be fixed in another
iteration of this patch.
Signed-off-by: Lionel Landwerlin <lionel.landwerlin@openwide.fr>
---
target-sh4/cpu.h | 8 +++
target-sh4/helper.c | 156 ++++++++++++++++++++++++++++++++++--------------
target-sh4/translate.c | 9 ++-
3 files changed, 127 insertions(+), 46 deletions(-)
diff --git a/target-sh4/cpu.h b/target-sh4/cpu.h
index 4a9ddea..2b17a93 100644
--- a/target-sh4/cpu.h
+++ b/target-sh4/cpu.h
@@ -153,6 +153,14 @@ typedef struct CPUSH4State {
int intr_at_halt; /* SR_BL ignored during sleep */
store_request_t *store_requests;
store_request_t **store_request_tail;
+
+#if !defined(CONFIG_USER_ONLY)
+ /* vpn to utlb entry caches (too much space for user emulation) */
+ uint8_t utlbs_1k[4194304]; /* 2^22 => 4 Mb */
+ uint8_t utlbs_4k[1048576]; /* 2^20 => 1 Mb */
+ uint8_t utlbs_64k[65536]; /* 2^16 => 64 Kb */
+ uint8_t utlbs_1m[4096]; /* 2^12 => 4 Kb */
+#endif
} CPUSH4State;
CPUSH4State *cpu_sh4_init(const char *cpu_model);
diff --git a/target-sh4/helper.c b/target-sh4/helper.c
index 54a3f1f..c9b2ef0 100644
--- a/target-sh4/helper.c
+++ b/target-sh4/helper.c
@@ -240,6 +240,29 @@ static int itlb_replacement(CPUState * env)
assert(0);
}
+static inline int is_tlb_matching (tlb_t *entry, int use_asid,
+ uint8_t asid, target_ulong address)
+{
+ target_ulong start, end;
+
+ /* Invalid entry */
+ if (unlikely(!entry->v))
+ return 0;
+
+ /* Bad ASID */
+ if (unlikely (!entry->sh && use_asid && entry->asid != asid))
+ return 0;
+
+ start = (entry->vpn << 10) & ~(entry->size - 1);
+ end = start + entry->size - 1;
+
+ /* Match */
+ if (likely (address >= start && address <= end))
+ return 1;
+
+ return 0;
+}
+
/* Find the corresponding entry in the right TLB
Return entry, MMU_DTLB_MISS or MMU_DTLB_MULTIPLE
*/
@@ -253,37 +276,59 @@ static int find_tlb_entry(CPUState * env, target_ulong address,
asid = env->pteh & 0xff;
- for (i = 0; i < nbtlb; i++) {
- if (!entries[i].v)
- continue; /* Invalid entry */
- if (!entries[i].sh && use_asid && entries[i].asid != asid)
- continue; /* Bad ASID */
-#if 0
- switch (entries[i].sz) {
- case 0:
- size = 1024; /* 1kB */
- break;
- case 1:
- size = 4 * 1024; /* 4kB */
- break;
- case 2:
- size = 64 * 1024; /* 64kB */
- break;
- case 3:
- size = 1024 * 1024; /* 1MB */
- break;
- default:
- assert(0);
- }
-#endif
- start = (entries[i].vpn << 10) & ~(entries[i].size - 1);
- end = start + entries[i].size - 1;
- if (address >= start && address <= end) { /* Match */
- if (match != MMU_DTLB_MISS)
- return MMU_DTLB_MULTIPLE; /* Multiple match */
- match = i;
- }
+ /* In case we are looking at the utlbs. */
+ if (env->utlb == entries)
+ {
+ /* We are most likely to use 4k virtual pages. */
+ i = env->utlbs_4k[address >> 12];
+ if (unlikely (!is_tlb_matching (&entries[i], use_asid,
+ asid, address)))
+ {
+ /* Then huge pages */
+ i = env->utlbs_64k[address >> 14];
+ if (unlikely (!is_tlb_matching (&entries[i], use_asid,
+ asid, address)))
+ {
+ /* Or *HUGE* pages */
+ i = env->utlbs_1m[address >> 16];
+ if (unlikely (!is_tlb_matching (&entries[i], use_asid,
+ asid, address)))
+ {
+ /* And finally tiny pages (who uses that ??). */
+ i = env->utlbs_1k[address >> 10];
+ if (likely (is_tlb_matching (&entries[i], use_asid,
+ asid, address)))
+ match = i;
+ }
+ else
+ match = i;
+ }
+ else
+ match = i;
+ }
+ else
+ match = i;
+ }
+ else
+ {
+ /* We keep the traditional o(n) algorithm for itlbs (only 4 of
+ * them). */
+ for (i = 0; i < nbtlb; i++) {
+ if (!entries[i].v)
+ continue; /* Invalid entry */
+ if (!entries[i].sh && use_asid && entries[i].asid != asid)
+ continue; /* Bad ASID */
+
+ start = (entries[i].vpn << 10) & ~(entries[i].size - 1);
+ end = start + entries[i].size - 1;
+ if (address >= start && address <= end) { /* Match */
+ if (match != MMU_DTLB_MISS)
+ return MMU_DTLB_MULTIPLE; /* Multiple match */
+ match = i;
+ }
+ }
}
+
return match;
}
@@ -403,6 +448,7 @@ static int get_mmu_address(CPUState * env, target_ulong * physical,
*prot = (rw == 1)? PAGE_WRITE : PAGE_READ;
break;
}
+
} else if (n == MMU_DTLB_MISS) {
n = (rw == 1) ? MMU_DTLB_MISS_WRITE :
MMU_DTLB_MISS_READ;
@@ -412,8 +458,8 @@ static int get_mmu_address(CPUState * env, target_ulong * physical,
*physical = ((matching->ppn << 10) & ~(matching->size - 1)) |
(address & (matching->size - 1));
if ((rw == 1) & !matching->d)
- n = MMU_DTLB_INITIAL_WRITE;
- else
+ n = MMU_DTLB_INITIAL_WRITE;
+ else
n = MMU_OK;
}
return n;
@@ -545,18 +591,23 @@ void cpu_load_tlb(CPUState * env)
entry->v = (uint8_t)cpu_ptel_v(env->ptel);
entry->ppn = cpu_ptel_ppn(env->ptel);
entry->sz = (uint8_t)cpu_ptel_sz(env->ptel);
+
switch (entry->sz) {
case 0: /* 00 */
entry->size = 1024; /* 1K */
+ env->utlbs_1k[entry->vpn] = n;
break;
case 1: /* 01 */
entry->size = 1024 * 4; /* 4K */
+ env->utlbs_4k[entry->vpn >> 2] = n;
break;
case 2: /* 10 */
entry->size = 1024 * 64; /* 64K */
+ env->utlbs_64k[entry->vpn >> 6] = n;
break;
case 3: /* 11 */
entry->size = 1024 * 1024; /* 1M */
+ env->utlbs_1m[entry->vpn >> 10] = n;
break;
default:
assert(0);
@@ -571,7 +622,7 @@ void cpu_load_tlb(CPUState * env)
entry->tc = (uint8_t)cpu_ptea_tc(env->ptea);
}
-void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
+void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *env, target_phys_addr_t addr,
uint32_t mem_value)
{
int associate = addr & 0x0000080;
@@ -579,7 +630,7 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
uint8_t d = (uint8_t)((mem_value & 0x00000200) >> 9);
uint8_t v = (uint8_t)((mem_value & 0x00000100) >> 8);
uint8_t asid = (uint8_t)(mem_value & 0x000000ff);
- int use_asid = (s->mmucr & MMUCR_SV) == 0 || (s->sr & SR_MD) == 0;
+ int use_asid = (env->mmucr & MMUCR_SV) == 0 || (env->sr & SR_MD) == 0;
if (associate) {
int i;
@@ -588,7 +639,7 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
/* search UTLB */
for (i = 0; i < UTLB_SIZE; i++) {
- tlb_t * entry = &s->utlb[i];
+ tlb_t * entry = &env->utlb[i];
if (!entry->v)
continue;
@@ -596,8 +647,8 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
&& (!use_asid || entry->asid == asid || entry->sh)) {
if (utlb_match_entry) {
/* Multiple TLB Exception */
- s->exception_index = 0x140;
- s->tea = addr;
+ env->exception_index = 0x140;
+ env->tea = addr;
break;
}
if (entry->v && !v)
@@ -606,12 +657,12 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
entry->d = d;
utlb_match_entry = entry;
}
- increment_urc(s); /* per utlb access */
+ increment_urc(env); /* per utlb access */
}
/* search ITLB */
for (i = 0; i < ITLB_SIZE; i++) {
- tlb_t * entry = &s->itlb[i];
+ tlb_t * entry = &env->itlb[i];
if (entry->vpn == vpn
&& (!use_asid || entry->asid == asid || entry->sh)) {
if (entry->v && !v)
@@ -625,23 +676,40 @@ void cpu_sh4_write_mmaped_utlb_addr(CPUSH4State *s, target_phys_addr_t addr,
}
if (needs_tlb_flush)
- tlb_flush_page(s, vpn << 10);
+ tlb_flush_page(env, vpn << 10);
} else {
int index = (addr & 0x00003f00) >> 8;
- tlb_t * entry = &s->utlb[index];
+ tlb_t * entry = &env->utlb[index];
if (entry->v) {
/* Overwriting valid entry in utlb. */
target_ulong address = entry->vpn << 10;
- if (!same_tlb_entry_exists(s->itlb, ITLB_SIZE, entry)) {
- tlb_flush_page(s, address);
+ if (!same_tlb_entry_exists(env->itlb, ITLB_SIZE, entry)) {
+ tlb_flush_page(env, address);
}
}
entry->asid = asid;
entry->vpn = vpn;
entry->d = d;
entry->v = v;
- increment_urc(s);
+ switch (entry->sz) {
+ case 0: /* 00 */
+ env->utlbs_1k[entry->vpn] = index;
+ break;
+ case 1: /* 01 */
+ env->utlbs_4k[entry->vpn >> 2] = index;
+ break;
+ case 2: /* 10 */
+ env->utlbs_64k[entry->vpn >> 6] = index;
+ break;
+ case 3: /* 11 */
+ env->utlbs_1m[entry->vpn >> 10] = index;
+ break;
+ default:
+ assert(0);
+ break;
+ }
+ increment_urc(env);
}
}
diff --git a/target-sh4/translate.c b/target-sh4/translate.c
index df81052..eab21a6 100644
--- a/target-sh4/translate.c
+++ b/target-sh4/translate.c
@@ -202,6 +202,11 @@ static void cpu_sh4_reset(CPUSH4State * env)
set_float_rounding_mode(float_round_to_zero, &env->fp_status);
#endif
env->mmucr = 0;
+
+ memset (env->utlbs_1k, 0, sizeof (env->utlbs_1k));
+ memset (env->utlbs_4k, 0, sizeof (env->utlbs_4k));
+ memset (env->utlbs_64k, 0, sizeof (env->utlbs_64k));
+ memset (env->utlbs_1m, 0, sizeof (env->utlbs_1m));
}
typedef struct {
@@ -510,7 +515,7 @@ static void _decode_opc(DisasContext * ctx)
to this list. When we see an instruction that is neither movca.l
nor ocbi, we perform the stores recorded in this list. When we see
ocbi, we check if the stores list has the address being invalidated.
- If so, we remove the address from the list.
+ If so, we remove the address from the list.
To optimize, we only try to flush stores when we're at the start of
TB, or if we already saw movca.l in this TB and did not flush stores
@@ -1565,7 +1570,7 @@ static void _decode_opc(DisasContext * ctx)
}
return;
case 0x00c3: /* movca.l R0,@Rm */
- gen_helper_movcal (REG(B11_8), REG(0));
+ gen_helper_movcal (REG(B11_8), REG(0));
ctx->has_movcal = 1;
return;
case 0x40a9:
--
1.5.6.5
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-02 23:58 ` Lionel Landwerlin
@ 2009-03-03 0:09 ` Lionel Landwerlin
2009-03-03 15:46 ` Shin-ichiro KAWASAKI
1 sibling, 0 replies; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 0:09 UTC (permalink / raw)
To: qemu-devel
You might want this patch too
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
[PATCH] SH4: Fixed last UTLB unused and URB/URC management
Signed-off-by: Lionel Landwerlin <lionel.landwerlin@openwide.fr>
---
target-sh4/helper.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/target-sh4/helper.c b/target-sh4/helper.c
index cd0f392..54a3f1f 100644
--- a/target-sh4/helper.c
+++ b/target-sh4/helper.c
@@ -297,7 +297,7 @@ static int same_tlb_entry_exists(const tlb_t *
haystack, uint8_t nbtlb,
return 0;
}
-static void increment_urc(CPUState * env)
+static inline void increment_urc(CPUState * env)
{
uint8_t urb, urc;
@@ -305,7 +305,7 @@ static void increment_urc(CPUState * env)
urb = ((env->mmucr) >> 18) & 0x3f;
urc = ((env->mmucr) >> 10) & 0x3f;
urc++;
- if (urc == urb || urc == UTLB_SIZE - 1)
+ if ((urb > 0 && urc > urb) || urc > (UTLB_SIZE - 1))
urc = 0;
env->mmucr = (env->mmucr & 0xffff03ff) | (urc << 10);
}
--
1.5.6.5
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-02 22:30 ` Lionel Landwerlin
@ 2009-03-03 15:32 ` Shin-ichiro KAWASAKI
0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-03 15:32 UTC (permalink / raw)
To: qemu-devel
Lionel Landwerlin wrote:
> Le dimanche 01 mars 2009 à 15:05 +0900, Shin-ichiro KAWASAKI a écrit :
>> Hi, Lionel.
>> Thank you for the response!
>>
>> Lionel Landwerlin wrote:
>>> Le vendredi 27 février 2009 à 01:28 +0900, Shin-ichiro KAWASAKI a
>>> écrit :
>>>
>>> Hi Shin-ichiro,
>>>
>>> I'm interested by your patch, but it does not apply on top of the r6215
>>> (qemu svn) + qemu-sh patches
>>> ( http://git.kernel.org/?p=virt/qemu/lethal/qemu-sh.git;a=summary)
>>>
>>> Should I try on the last qemu svn ?
>> Yes. When I send patch to qemu-devel ml, I make diff against
>> qemu main repository. Could you try on it?
>>
>>
>
> Are you working on a 32 or 64bits host system ?
>
> I experiencing a lot of random behavior with qemu...
My host system is 32 bit. Core 2 Duo and VMware.
I often see sometimes user process fails with segmentation fault.
I know nothing about sh4 system emulation on 64 bit host.
> >From now I don't a have video anymore...
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-02 23:58 ` Lionel Landwerlin
2009-03-03 0:09 ` Lionel Landwerlin
@ 2009-03-03 15:46 ` Shin-ichiro KAWASAKI
2009-03-03 18:57 ` Lionel Landwerlin
2009-03-03 23:11 ` Lionel Landwerlin
1 sibling, 2 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-03 15:46 UTC (permalink / raw)
To: qemu-devel
Lionel Landwerlin wrote:
> Shin-ichiro,
>
> Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
>
> Instead, I would like to try another approach. The patch you proposed to
> find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
> something like that if I remember) instead here is a patch with a
> complexity of O(1).
Good work. I evaluated your patch on my environment, measuring
compile time for empty main() with gcc.
sh4 : 5.8 [seconds] O(n) utlb search.
sh4 : 4.6 [seconds] O(log2(n)) utlb search.
sh4 : 4.1 [seconds] O(1) utlb search by Lionel
arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
Your patch has a nice score!
Now I've done the work to increase number of utlb entries from 64 to 256,
and found that the score get arround 2.4 seconds.
I'm trying to increase it to 4096. Your O(1) search will be more important
as the entry number increase.
> +#if !defined(CONFIG_USER_ONLY)
> + /* vpn to utlb entry caches (too much space for user emulation) */
> + uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
> + uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
> + uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
> + uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
> +#endif
> } CPUSH4State;
Isn't it too gorgeous?
How about allocating them on demand?
I guess sh-linux uses only utlbs_4k[], in general.
If so, 4 Mb utlbs_1k[] is waste.
Regards,
Shin-ichiro KAWASAKI
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 15:46 ` Shin-ichiro KAWASAKI
@ 2009-03-03 18:57 ` Lionel Landwerlin
2009-03-03 19:25 ` Laurent Desnogues
2009-03-03 23:11 ` Lionel Landwerlin
1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 18:57 UTC (permalink / raw)
To: qemu-devel
Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
> Lionel Landwerlin wrote:
> > Shin-ichiro,
> >
> > Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
> >
> > Instead, I would like to try another approach. The patch you proposed to
> > find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
> > something like that if I remember) instead here is a patch with a
> > complexity of O(1).
>
> Good work. I evaluated your patch on my environment, measuring
> compile time for empty main() with gcc.
>
> sh4 : 5.8 [seconds] O(n) utlb search.
> sh4 : 4.6 [seconds] O(log2(n)) utlb search.
> sh4 : 4.1 [seconds] O(1) utlb search by Lionel
> arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
>
> Your patch has a nice score!
Great :) But we're still far from arm :(
>
> Now I've done the work to increase number of utlb entries from 64 to 256,
> and found that the score get arround 2.4 seconds.
> I'm trying to increase it to 4096. Your O(1) search will be more important
> as the entry number increase.
>
What do you mean by more important ?
Is the arm emulation increasing the number of TLB ?
> > +#if !defined(CONFIG_USER_ONLY)
> > + /* vpn to utlb entry caches (too much space for user emulation) */
> > + uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
> > + uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
> > + uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
> > + uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
> > +#endif
> > } CPUSH4State;
>
> Isn't it too gorgeous?
> How about allocating them on demand?
> I guess sh-linux uses only utlbs_4k[], in general.
> If so, 4 Mb utlbs_1k[] is waste.
>
sh-linux can also use huge pages of 64k and 1M.
I think it is important to keep the emulation as close as possible from
the real the cpu capabilities.
Regards,
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 18:57 ` Lionel Landwerlin
@ 2009-03-03 19:25 ` Laurent Desnogues
2009-03-03 22:28 ` Lionel Landwerlin
0 siblings, 1 reply; 20+ messages in thread
From: Laurent Desnogues @ 2009-03-03 19:25 UTC (permalink / raw)
To: qemu-devel
On Tue, Mar 3, 2009 at 7:57 PM, Lionel Landwerlin
<lionel.landwerlin@openwide.fr> wrote:
> Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
[...]
>> sh4 : 5.8 [seconds] O(n) utlb search.
>> sh4 : 4.6 [seconds] O(log2(n)) utlb search.
>> sh4 : 4.1 [seconds] O(1) utlb search by Lionel
>> arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
>>
>> Your patch has a nice score!
>
> Great :) But we're still far from arm :(
It would be interesting if you could run oprofile of both platforms
(and even better if that was with call-graph output).
>> > +#if !defined(CONFIG_USER_ONLY)
>> > + /* vpn to utlb entry caches (too much space for user emulation) */
>> > + uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
>> > + uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
>> > + uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
>> > + uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
>> > +#endif
>> > } CPUSH4State;
>>
>> Isn't it too gorgeous?
>> How about allocating them on demand?
>> I guess sh-linux uses only utlbs_4k[], in general.
>> If so, 4 Mb utlbs_1k[] is waste.
>>
>
> sh-linux can also use huge pages of 64k and 1M.
>
> I think it is important to keep the emulation as close as possible from
> the real the cpu capabilities.
I think Shin-ichiro was rather pointing a the number of 1k pages,
which are probably not used. OTOH I agree being close to
CPU capabilities is important.
Laurent
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 19:25 ` Laurent Desnogues
@ 2009-03-03 22:28 ` Lionel Landwerlin
2009-03-04 2:59 ` Paul Brook
2009-03-04 15:39 ` Shin-ichiro KAWASAKI
0 siblings, 2 replies; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 22:28 UTC (permalink / raw)
To: qemu-devel
Le mardi 03 mars 2009 à 20:25 +0100, Laurent Desnogues a écrit :
> On Tue, Mar 3, 2009 at 7:57 PM, Lionel Landwerlin
> <lionel.landwerlin@openwide.fr> wrote:
> > Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
> [...]
> >> sh4 : 5.8 [seconds] O(n) utlb search.
> >> sh4 : 4.6 [seconds] O(log2(n)) utlb search.
> >> sh4 : 4.1 [seconds] O(1) utlb search by Lionel
> >> arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
> >>
> >> Your patch has a nice score!
> >
> > Great :) But we're still far from arm :(
>
> It would be interesting if you could run oprofile of both platforms
> (and even better if that was with call-graph output).
>
> >> > +#if !defined(CONFIG_USER_ONLY)
> >> > + /* vpn to utlb entry caches (too much space for user emulation) */
> >> > + uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
> >> > + uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
> >> > + uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
> >> > + uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
> >> > +#endif
> >> > } CPUSH4State;
> >>
> >> Isn't it too gorgeous?
> >> How about allocating them on demand?
> >> I guess sh-linux uses only utlbs_4k[], in general.
> >> If so, 4 Mb utlbs_1k[] is waste.
> >>
> >
> > sh-linux can also use huge pages of 64k and 1M.
> >
> > I think it is important to keep the emulation as close as possible from
> > the real the cpu capabilities.
>
> I think Shin-ichiro was rather pointing a the number of 1k pages,
> which are probably not used. OTOH I agree being close to
> CPU capabilities is important.
The problem is, when a virtual address space must be resolved, the cpu
emulator is not able to know what page size has been (or not) used to
map the address. Thus we must handle the worst case...
And after all, 5Mb isn't so much, considering the fact that the physical
guest memory, in system emulation, can be mapped anywhere in the host
virtual address space, contrary to the user emulation that have harder
contraints...
By the way, does someone know why there is some kind of "tlb management
code" in exec.c ??
Does the SH4 architecture have special features that can't be handled in
a generic code ? Or are we just rewriting some code that is already
there ... ?
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 15:46 ` Shin-ichiro KAWASAKI
2009-03-03 18:57 ` Lionel Landwerlin
@ 2009-03-03 23:11 ` Lionel Landwerlin
2009-03-04 15:12 ` Shin-ichiro KAWASAKI
1 sibling, 1 reply; 20+ messages in thread
From: Lionel Landwerlin @ 2009-03-03 23:11 UTC (permalink / raw)
To: qemu-devel
Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
> Lionel Landwerlin wrote:
> > Shin-ichiro,
> >
> > Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
> >
> > Instead, I would like to try another approach. The patch you proposed to
> > find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
> > something like that if I remember) instead here is a patch with a
> > complexity of O(1).
>
> Good work. I evaluated your patch on my environment, measuring
> compile time for empty main() with gcc.
>
> sh4 : 5.8 [seconds] O(n) utlb search.
> sh4 : 4.6 [seconds] O(log2(n)) utlb search.
> sh4 : 4.1 [seconds] O(1) utlb search by Lionel
> arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
>
> Your patch has a nice score!
>
> Now I've done the work to increase number of utlb entries from 64 to 256,
> and found that the score get arround 2.4 seconds.
> I'm trying to increase it to 4096. Your O(1) search will be more important
> as the entry number increase.
We should setup a git repository to improve that or at least work on the
same basis.
Do you have a lot of patch on top of the last svn ?
--
Lionel Landwerlin <lionel.landwerlin@openwide.fr>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 22:28 ` Lionel Landwerlin
@ 2009-03-04 2:59 ` Paul Brook
2009-03-04 15:22 ` Shin-ichiro KAWASAKI
2009-03-04 15:39 ` Shin-ichiro KAWASAKI
1 sibling, 1 reply; 20+ messages in thread
From: Paul Brook @ 2009-03-04 2:59 UTC (permalink / raw)
To: qemu-devel; +Cc: Lionel Landwerlin
> > > Great :) But we're still far from arm :(
> By the way, does someone know why there is some kind of "tlb management
> code" in exec.c ??
>
> Does the SH4 architecture have special features that can't be handled in
> a generic code ? Or are we just rewriting some code that is already
> there ... ?
I think you're missing the most important difference; SH uses a software
managed TLB, whereas ARM uses a hardware managed TLB.
The main consequence of this is that we don't have to model the actual ARM TLB
at all, it is never directly visible. We effectively implement an infinitely
large TLB.
For SH the TLB is programmed directly, so we end up having to maintain two
TLBs: The qemu TLB and the architectural SH TLB. For correct operation pages
must be removed from the qemu TLB when they are evicted/replaced in the SH
TLB. The SH TLB is quite small, and flushing qemu TLB entries is quite
expensive, so this results in fairly poor performance.
MIPS has a similar problem. However in that case the most common TLB
operations do not directly expose the TLB state. In particular when setting a
new TLB entry it is unspecified which TLB entry is replaced. At that point
the OS can't know which ehtry was evicted, so we can lie, and not evict pages
until the guest does something that allows it to determine the exact TLB
state. In practice this is sufficient to make mips-linux workreasonably well.
I'm not sure if the same is posible for SH. It probably depends whether URC is
visible to/used by the guest.
Large pages add even more complications. The qemu tlb canonly handle a single
page size. In practice means that when large pages are used invalidating a
single page entry requires the whole qemu tlb to be flushed. I'm pretty sure
x86 getsand works mainly be chance (nothing actually ues large pages enough
to notice it's broken). ARM takes the hit of a full TLB flush (linux breakss
if you only flush a 1k region of a 4k entry), but single pge flushes are
rare so in practice this doesn't hurt too much
Paul
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 23:11 ` Lionel Landwerlin
@ 2009-03-04 15:12 ` Shin-ichiro KAWASAKI
0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-04 15:12 UTC (permalink / raw)
To: qemu-devel
Lionel Landwerlin wrote:
> Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
>> Lionel Landwerlin wrote:
>>> Shin-ichiro,
>>>
>>> Sorry, but I cannot apply your patch cleanly on the last qemu-svn.
>>>
>>> Instead, I would like to try another approach. The patch you proposed to
>>> find (or not) a valid TLB entry has a complexity of O(log2(n)) (or
>>> something like that if I remember) instead here is a patch with a
>>> complexity of O(1).
>> Good work. I evaluated your patch on my environment, measuring
>> compile time for empty main() with gcc.
>>
>> sh4 : 5.8 [seconds] O(n) utlb search.
>> sh4 : 4.6 [seconds] O(log2(n)) utlb search.
>> sh4 : 4.1 [seconds] O(1) utlb search by Lionel
>> arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
>>
>> Your patch has a nice score!
>>
>> Now I've done the work to increase number of utlb entries from 64 to 256,
>> and found that the score get arround 2.4 seconds.
>> I'm trying to increase it to 4096. Your O(1) search will be more important
>> as the entry number increase.
>
> We should setup a git repository to improve that or at least work on the
> same basis.
>
> Do you have a lot of patch on top of the last svn ?
I personally manages my staging repository corresponds to 'qemu-sh'.
git://github.com/kawasaki/qemu-sh.git
http://github.com/kawasaki/qemu-sh/tree/master
Patches in it can be applied to current qemu subversion trunk.
I put your O(1) patch and my utlb entry increasing patch also.
Will it work as the basis?
I attach the utlb entry increasing patch to this mail for reference.
It should be applied with other patches in my staging repository, or
patch command will fail.
It increases the number of utlb entry from 64 to 256.
URC and URB values in MMUCR can take the number from 0 to 255.
Here's the benchmark value.
sh4 : 5.8 [seconds] O(n) utlb search.
sh4 : 4.6 [seconds] O(log2(n)) utlb search.
sh4 : 4.1 [seconds] O(1) utlb search by Lionel
sh4 : 2.1 [seconds] O(1) utlb search, and 256 utlb enries <=
arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
Getting better :) but still behind arm.
Regards,
Shin-ichiro KAWASAKI
Index: trunk/target-sh4/cpu.h
===================================================================
--- trunk/target-sh4/cpu.h (revision 6676)
+++ trunk/target-sh4/cpu.h (working copy)
@@ -88,7 +88,8 @@
uint8_t tc; /* timing control */
} tlb_t;
-#define UTLB_SIZE 64
+#define UTLB_BITS 8 /* real hard : 6 */
+#define UTLB_SIZE (1 << UTLB_BITS)
#define ITLB_SIZE 4
#define NB_MMU_MODES 2
@@ -228,14 +229,18 @@
#define MMUCR 0x1F000010
#define MMUCR_AT (1<<0)
#define MMUCR_SV (1<<8)
-#define MMUCR_URC_BITS (6)
#define MMUCR_URC_OFFSET (10)
-#define MMUCR_URC_SIZE (1 << MMUCR_URC_BITS)
-#define MMUCR_URC_MASK (((MMUCR_URC_SIZE) - 1) << MMUCR_URC_OFFSET)
+#define MMUCR_URC_MASK ((UTLB_SIZE - 1) << MMUCR_URC_OFFSET)
static inline int cpu_mmucr_urc (uint32_t mmucr)
{
return ((mmucr & MMUCR_URC_MASK) >> MMUCR_URC_OFFSET);
}
+#define MMUCR_URB_OFFSET (18)
+#define MMUCR_URB_MASK ((UTLB_SIZE - 1) << MMUCR_URB_OFFSET)
+static inline int cpu_mmucr_urb (uint32_t mmucr)
+{
+ return ((mmucr & MMUCR_URB_MASK) >> MMUCR_URB_OFFSET);
+}
/* PTEH : Page Translation Entry High register */
#define PTEH_ASID_BITS (8)
Index: trunk/target-sh4/helper.c
===================================================================
--- trunk/target-sh4/helper.c (revision 6676)
+++ trunk/target-sh4/helper.c (work copy)
@@ -346,12 +346,12 @@
uint8_t urb, urc;
/* Increment URC */
- urb = ((env->mmucr) >> 18) & 0x3f;
- urc = ((env->mmucr) >> 10) & 0x3f;
+ urb = cpu_mmucr_urb(env->mmucr);
+ urc = cpu_mmucr_urc(env->mmucr);
urc++;
if ((urb > 0 && urc > urb) || urc > (UTLB_SIZE - 1))
- urc = 0;
- env->mmucr = (env->mmucr & 0xffff03ff) | (urc << 10);
+ urc = 0;
+ env->mmucr = (env->mmucr & ~MMUCR_URC_MASK) | (urc << MMUCR_URC_OFFSET);
}
/* Find itlb entry - update itlb from utlb if necessary and asked for
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-04 2:59 ` Paul Brook
@ 2009-03-04 15:22 ` Shin-ichiro KAWASAKI
0 siblings, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-04 15:22 UTC (permalink / raw)
To: qemu-devel; +Cc: Lionel Landwerlin
Paul Brook wrote:
>>>> Great :) But we're still far from arm :(
>
>> By the way, does someone know why there is some kind of "tlb management
>> code" in exec.c ??
>>
>> Does the SH4 architecture have special features that can't be handled in
>> a generic code ? Or are we just rewriting some code that is already
>> there ... ?
>
> I think you're missing the most important difference; SH uses a software
> managed TLB, whereas ARM uses a hardware managed TLB.
>
> The main consequence of this is that we don't have to model the actual ARM TLB
> at all, it is never directly visible. We effectively implement an infinitely
> large TLB.
>
> For SH the TLB is programmed directly, so we end up having to maintain two
> TLBs: The qemu TLB and the architectural SH TLB. For correct operation pages
> must be removed from the qemu TLB when they are evicted/replaced in the SH
> TLB. The SH TLB is quite small, and flushing qemu TLB entries is quite
> expensive, so this results in fairly poor performance.
>
> MIPS has a similar problem. However in that case the most common TLB
> operations do not directly expose the TLB state. In particular when setting a
> new TLB entry it is unspecified which TLB entry is replaced. At that point
> the OS can't know which ehtry was evicted, so we can lie, and not evict pages
> until the guest does something that allows it to determine the exact TLB
> state. In practice this is sufficient to make mips-linux workreasonably well.
>
> I'm not sure if the same is posible for SH. It probably depends whether URC is
> visible to/used by the guest.
Thank you Paul for your clear explanation. It confirms my guess, and answers my
question also.
As I posted for other mail, I tried to increase the number of tlb entry from 64
to 256 and get performance improvement. This approach modifies real hardware
specification including URC, but same SH-Linux kernel works fine. It implies
that SH-Linux does not refer URC, and MIPS approach might be the solution for
SH too.
Regards,
Shin-ichiro KAWASAKI
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] sh : performance problem
2009-03-03 22:28 ` Lionel Landwerlin
2009-03-04 2:59 ` Paul Brook
@ 2009-03-04 15:39 ` Shin-ichiro KAWASAKI
1 sibling, 0 replies; 20+ messages in thread
From: Shin-ichiro KAWASAKI @ 2009-03-04 15:39 UTC (permalink / raw)
To: qemu-devel
Lionel Landwerlin wrote:
> Le mardi 03 mars 2009 à 20:25 +0100, Laurent Desnogues a écrit :
>> On Tue, Mar 3, 2009 at 7:57 PM, Lionel Landwerlin
>> <lionel.landwerlin@openwide.fr> wrote:
>>> Le mercredi 04 mars 2009 à 00:46 +0900, Shin-ichiro KAWASAKI a écrit :
>> [...]
>>>> sh4 : 5.8 [seconds] O(n) utlb search.
>>>> sh4 : 4.6 [seconds] O(log2(n)) utlb search.
>>>> sh4 : 4.1 [seconds] O(1) utlb search by Lionel
>>>> arm : 0.8 [seconds] (-M versatilepb + Debian ARM)
>>>>
>>>> Your patch has a nice score!
>>> Great :) But we're still far from arm :(
>> It would be interesting if you could run oprofile of both platforms
>> (and even better if that was with call-graph output).
>>
>>>>> +#if !defined(CONFIG_USER_ONLY)
>>>>> + /* vpn to utlb entry caches (too much space for user emulation) */
>>>>> + uint8_t utlbs_1k[4194304]; /* 222 => 4 Mb */
>>>>> + uint8_t utlbs_4k[1048576]; /* 220 => 1 Mb */
>>>>> + uint8_t utlbs_64k[65536]; /* 216 => 64 Kb */
>>>>> + uint8_t utlbs_1m[4096]; /* 212 => 4 Kb */
>>>>> +#endif
>>>>> } CPUSH4State;
>>>> Isn't it too gorgeous?
>>>> How about allocating them on demand?
>>>> I guess sh-linux uses only utlbs_4k[], in general.
>>>> If so, 4 Mb utlbs_1k[] is waste.
>>>>
>>> sh-linux can also use huge pages of 64k and 1M.
>>>
>>> I think it is important to keep the emulation as close as possible from
>>> the real the cpu capabilities.
>> I think Shin-ichiro was rather pointing a the number of 1k pages,
>> which are probably not used. OTOH I agree being close to
>> CPU capabilities is important.
>
> The problem is, when a virtual address space must be resolved, the cpu
> emulator is not able to know what page size has been (or not) used to
> map the address. Thus we must handle the worst case...
>
> And after all, 5Mb isn't so much, considering the fact that the physical
> guest memory, in system emulation, can be mapped anywhere in the host
> virtual address space, contrary to the user emulation that have harder
> contraints...
IMO, it will be another patch which introduces lines like this.
if (!env->utlbs_1k)
env->utlbs_1k = qemu_mallocz(4194304);
But footprint tuning can be left as future task.
Sorry for being too critical.
Regards,
Shin-ichiro KAWASAKI
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2009-03-04 15:39 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-02-26 16:28 [Qemu-devel] sh : performance problem Shin-ichiro KAWASAKI
2009-02-28 19:15 ` Lionel Landwerlin
2009-03-01 6:05 ` Shin-ichiro KAWASAKI
2009-03-01 12:46 ` Lionel Landwerlin
2009-03-01 16:10 ` Shin-ichiro KAWASAKI
2009-03-01 17:39 ` Lionel Landwerlin
2009-03-02 14:46 ` Shin-ichiro KAWASAKI
2009-03-02 22:30 ` Lionel Landwerlin
2009-03-03 15:32 ` Shin-ichiro KAWASAKI
2009-03-02 23:58 ` Lionel Landwerlin
2009-03-03 0:09 ` Lionel Landwerlin
2009-03-03 15:46 ` Shin-ichiro KAWASAKI
2009-03-03 18:57 ` Lionel Landwerlin
2009-03-03 19:25 ` Laurent Desnogues
2009-03-03 22:28 ` Lionel Landwerlin
2009-03-04 2:59 ` Paul Brook
2009-03-04 15:22 ` Shin-ichiro KAWASAKI
2009-03-04 15:39 ` Shin-ichiro KAWASAKI
2009-03-03 23:11 ` Lionel Landwerlin
2009-03-04 15:12 ` Shin-ichiro KAWASAKI
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).