* [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
@ 2005-04-23 17:23 Joakim Tjernlund
From: Joakim Tjernlund @ 2005-04-23 17:23 UTC (permalink / raw)
To: linuxppc-embedded, marcelo.tosatti
> Now, what is the best way to bring the performance back to v2.4 levels?
>
> For this "dd" test, which is dominated by "sys_read/sys_write", I thought
> of trying to bring the hotpath functions into the same pages, thus
> decreasing the number of page translations required for such tasks.
>
> Comments are appreciated
Does CONFIG_PIN_TLB make a difference?
Jocke
* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Marcelo Tosatti @ 2005-04-23 12:42 UTC
To: Joakim Tjernlund; +Cc: linuxppc-embedded

Hi Joakim,

On Sat, Apr 23, 2005 at 07:23:51PM +0200, Joakim Tjernlund wrote:
> > Now, what is the best way to bring the performance back to v2.4 levels?
> >
> > For this "dd" test, which is dominated by "sys_read/sys_write", I thought
> > of trying to bring the hotpath functions into the same pages, thus
> > decreasing the number of page translations required for such tasks.
> >
> > Comments are appreciated
>
> Does CONFIG_PIN_TLB make a difference?

No it does not.

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-04-23 21:31 UTC
To: Marcelo Tosatti; +Cc: linuxppc-embedded

> Hi Joakim,
>
> On Sat, Apr 23, 2005 at 07:23:51PM +0200, Joakim Tjernlund wrote:
> > > Now, what is the best way to bring the performance back to v2.4 levels?
> > >
> > > For this "dd" test, which is dominated by "sys_read/sys_write", I thought
> > > of trying to bring the hotpath functions into the same pages, thus
> > > decreasing the number of page translations required for such tasks.
> > >
> > > Comments are appreciated
> >
> > Does CONFIG_PIN_TLB make a difference?
>
> No it does not.

Hmm, strange. I would expect the kernel ITLB misses to be zero and
the kernel DTLB misses to be a lot smaller. Do you have any numbers handy?

 Jocke

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-04-23 21:32 UTC
To: Marcelo Tosatti; +Cc: Joakim Tjernlund, linuxppc-embedded

On Apr 23, 2005, at 8:42 AM, Marcelo Tosatti wrote:

>> Does CONFIG_PIN_TLB make a difference?
>
> No it does not.

For some reason this option and code didn't make it from
2.4 to 2.6. It should have some effect on small memory (16Mbyte)
systems with processors that have more than 16 TLB entries.

 -- Dan

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-04-23 21:55 UTC
To: Dan Malek, Marcelo Tosatti; +Cc: linuxppc-embedded

> On Apr 23, 2005, at 8:42 AM, Marcelo Tosatti wrote:
>
> >> Does CONFIG_PIN_TLB make a difference?
> >
> > No it does not.
>
> For some reason this option and code didn't make it from
> 2.4 to 2.6. It should have some effect on small memory (16Mbyte)
> systems with processor that have more than 16 TLB entries.

Oh, but the CONFIG_PIN_TLB code in head_8xx.S is there. I guess something is missing.

 Jocke

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-04-23 22:12 UTC
To: Joakim.Tjernlund; +Cc: linuxppc-embedded

On Apr 23, 2005, at 5:55 PM, Joakim Tjernlund wrote:

> Oh, but the CONFIG_PIN_TLB code in head_8xx.S is there. I guess
> something is missing.

Ooops, I searched for the wrong name, I missed it :-)
Maybe it needs a different set up for something other
than the 860.

 -- Dan

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-04-23 17:35 UTC
To: linuxppc-embedded, marcelo.tosatti

> > Now, what is the best way to bring the performance back to v2.4 levels?
> >
> > For this "dd" test, which is dominated by "sys_read/sys_write", I thought
> > of trying to bring the hotpath functions into the same pages, thus
> > decreasing the number of page translations required for such tasks.
> >
> > Comments are appreciated
>
> Does CONFIG_PIN_TLB make a difference?
>
> Jocke

Is it possible to do the _PAGE_ACCESSED handling at pte creation in fault.c
instead of doing it for every TLB miss? That should make the TLB Miss handler
faster.

 Jocke

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-04-23 21:29 UTC
To: Joakim.Tjernlund; +Cc: linuxppc-embedded

On Apr 23, 2005, at 1:35 PM, Joakim Tjernlund wrote:

> Is it possible to handle the _PAGE_ACCESSED handling at pte creation
> in fault.c instead
> of doing it for every TLB miss? That should make the TLB Miss handler
> faster.

No. As part of VM management to determine working sets, it's possible
to have this flag change state but the page to remain valid. The cost
of setting this properly in the miss handler is minimal compared to
the other stuff that we should try and streamline.

Thanks.

 -- Dan

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-04-23 21:51 UTC
To: Dan Malek; +Cc: linuxppc-embedded

> On Apr 23, 2005, at 1:35 PM, Joakim Tjernlund wrote:
>
> > Is it possible to handle the _PAGE_ACCESSED handling at pte creation
> > in fault.c instead
> > of doing it for every TLB miss? That should make the TLB Miss handler
> > faster.
>
> No. As part of VM management to determine working sets, it's possible
> to have this flag change state but the page to remain valid.

OK, strange though. I would have expected this flag to stay untouched until
the pte is invalidated.

> The cost of setting this properly in the miss handler is minimal
> compared to the other stuff that we should try and streamline.

Well, every instruction counts. In this case we would have saved
2 in ITLB Miss, 3 in DTLB Miss and a cache line write in both.

Would be nice to do away with the kernel space test, but that's a lot harder.

> Thanks.
>
> -- Dan

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-04-23 22:09 UTC
To: Joakim.Tjernlund; +Cc: linuxppc-embedded

On Apr 23, 2005, at 5:51 PM, Joakim Tjernlund wrote:

> Well, every instruction counts. In this case we would have saved
> 2 in ITLB Miss, 3 in DTLB Miss and a cache line write in both.

You have already read the PTE and instructions into the cache,
there are no branches, but not a big deal.

> Would be nice to do away with the kernel space test, but that's a lot
> harder.

With some clever first level pointer page creation and management
we could do this, but it would be custom 8xx code in generic files.

Thanks.

 -- Dan

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-04-23 23:12 UTC
To: Joakim.Tjernlund; +Cc: linuxppc-embedded

On Apr 23, 2005, at 1:23 PM, Joakim Tjernlund wrote:

> Does CONFIG_PIN_TLB make a difference?

While looking at the code, I noticed this will only work
if you can map all of the memory with wired TLB entries.
If you have more than 24M of memory, make sure you
enable CONFIG_MODULES to eliminate the TLB miss
optimization that will certainly crash a system by failing
to look up kernel PTEs correctly.

Thanks.

 -- Dan

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-04-23 23:51 UTC
To: Dan Malek; +Cc: linuxppc-embedded

> On Apr 23, 2005, at 1:23 PM, Joakim Tjernlund wrote:
>
> > Does CONFIG_PIN_TLB make a difference?
>
> While looking at the code, I noticed this will only work
> if you can map all of the memory with wired TLB entries.
> If you have more than 24M of memory, make sure you
> enable CONFIG_MODULES to eliminate the TLB miss
> optimization that will certainly crash a system by failing
> to look up kernel PTEs correctly.

Hmm, I have more than 24MB of memory and I can run CONFIG_PIN_TLB just
fine with modules off in kernel 2.4. Haven't tried 2.6 yet.

 Jocke

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-04-24 0:00 UTC
To: Joakim.Tjernlund; +Cc: linuxppc-embedded

On Apr 23, 2005, at 7:51 PM, Joakim Tjernlund wrote:

> Hmm, I have more than 24MB of memory and I can run CONFIG_PIN_TLB just
> fine with modules off in kernel 2.4. Haven't tried 2.6 yet.

Doh. Oh, I see. We only do the optimization for the instruction misses.
I'll have to take a closer look at Marcelo's 2.6 tests.

 -- Dan

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Marcelo Tosatti @ 2005-04-24 16:55 UTC
To: Dan Malek; +Cc: Joakim.Tjernlund, linuxppc-embedded

Hi Dan, Joakim,

On Sat, Apr 23, 2005 at 08:00:39PM -0400, Dan Malek wrote:
>
> On Apr 23, 2005, at 7:51 PM, Joakim Tjernlund wrote:
>
> >Hmm, I have more than 24MB of memory and I can run CONFIG_PIN_TLB just
> >fine with modules off in kernel 2.4. Haven't tried 2.6 yet.
>
> Doh. Oh, I see. We only do the optimization for the instruction
> misses.
> I'll have to take a closer look at Marcelo's 2.6 tests.

The PIN TLB entry option does not make much difference in my tests,
never did.

Who wrote the code? Are there results which indicate a performance gain
from TLB pinning on 8xx? If so, where are such results?

One problem that I've noted is that initial_mmu sets the {I,D}TLB index
to 28 (binary 11100, the 0x1c00 in the code).

MI_RSV4I protects TLB entries 28...31.

Given that both {I,D}TLB INDEXes are _decreased_ on each update, it seems
to me that initial_mmu should set the {I,D}TLB INDEX to 31, which will then
decrease down to 27 after 4 TLB entries are created.

Another question that comes to mind is why initial_mmu creates
additional 8Meg TLB entries for the D-TLB but not for the I-TLB:

#ifdef CONFIG_PIN_TLB
        /* Map two more 8M kernel data pages.
        */
        ...
#endif

I'll do some more CONFIG_PIN_TLB tests this week...

--- head_8xx.S.orig2    2005-04-24 17:55:59.000000000 -0300
+++ head_8xx.S  2005-04-24 17:57:44.000000000 -0300
@@ -697,7 +697,7 @@
        tlbia                   /* Invalidate all TLB entries */
 #ifdef CONFIG_PIN_TLB
        lis     r8, MI_RSV4I@h
-       ori     r8, r8, 0x1c00
+       ori     r8, r8, 0x1f00
 #else
        li      r8, 0
 #endif
@@ -705,7 +705,7 @@
 
 #ifdef CONFIG_PIN_TLB
        lis     r10, (MD_RSV4I | MD_RESETVAL)@h
-       ori     r10, r10, 0x1c00
+       ori     r10, r10, 0x1f00
        mr      r8, r10
 #else
        lis     r10, MD_RESETVAL@h

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-04-25 9:57 UTC
To: Marcelo Tosatti, Dan Malek; +Cc: linuxppc-embedded

> Hi Dan, Joakim,
>
> On Sat, Apr 23, 2005 at 08:00:39PM -0400, Dan Malek wrote:
> >
> > On Apr 23, 2005, at 7:51 PM, Joakim Tjernlund wrote:
> >
> > >Hmm, I have more than 24MB of memory and I can run CONFIG_PIN_TLB just
> > >fine with modules off in kernel 2.4. Haven't tried 2.6 yet.
> >
> > Doh. Oh, I see. We only do the optimization for the instruction
> > misses.
> > I'll have to take a closer look at Marcelo's 2.6 tests.
>
> The PIN TLB entry option does not make much difference in my tests,
> never did.

Don't your TLB Miss counters look different for kernel space? If they don't,
there must be something very wrong with the CONFIG_PIN_TLB code.

> Who wrote the code? Are there results which indicate a performance gain
> from TLB pinning on 8xx? If so, where are such results?

I think Dan wrote this code. In 2.4 I improved the ITLB Miss handler a little
for pinned ITLBs.

> One problem that I've noted is that initial_mmu sets the {I,D}TLB index
> to 28 (binary 11100, the 0x1c00 in the code).
>
> MI_RSV4I protects TLB entries 28...31.
>
> Given that both {I,D}TLB INDEXes are _decreased_ on each update, it seems
> to me that initial_mmu should set the {I,D}TLB INDEX to 31, which will then
> decrease down to 27 after 4 TLB entries are created.

Makes sense but I can't say for sure. I tried the patch below on my 2.4 tree
and it works fine.

> Another question that comes to mind is why initial_mmu creates
> additional 8Meg TLB entries for the D-TLB but not for the I-TLB:

Because the kernel code will never grow beyond 8MB, but data will due to
kmalloc() etc.

[SNIP]

* RE: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Joakim Tjernlund @ 2005-05-07 18:10 UTC
To: Marcelo Tosatti, Dan Malek; +Cc: linuxppc-embedded

> Hi Dan, Joakim,
>
> On Sat, Apr 23, 2005 at 08:00:39PM -0400, Dan Malek wrote:
> >
> > On Apr 23, 2005, at 7:51 PM, Joakim Tjernlund wrote:
> >
> > >Hmm, I have more than 24MB of memory and I can run CONFIG_PIN_TLB just
> > >fine with modules off in kernel 2.4. Haven't tried 2.6 yet.
> >
> > Doh. Oh, I see. We only do the optimization for the instruction
> > misses.
> > I'll have to take a closer look at Marcelo's 2.6 tests.
>
> The PIN TLB entry option does not make much difference in my tests,
> never did.
>
> Who wrote the code? Are there results which indicate a performance gain
> from TLB pinning on 8xx? If so, where are such results?
>
> One problem that I've noted is that initial_mmu sets the {I,D}TLB index
> to 28 (binary 11100, the 0x1c00 in the code).
>
> MI_RSV4I protects TLB entries 28...31.
>
> Given that both {I,D}TLB INDEXes are _decreased_ on each update, it seems
> to me that initial_mmu should set the {I,D}TLB INDEX to 31, which will then
> decrease down to 27 after 4 TLB entries are created.
>
> Another question that comes to mind is why initial_mmu creates
> additional 8Meg TLB entries for the D-TLB but not for the I-TLB:
>
> #ifdef CONFIG_PIN_TLB
>         /* Map two more 8M kernel data pages.
>         */
>         ...
> #endif

Not completely sure that this is correct. There are a few:
        addi    r10, r10, 0x0100
        mtspr   SPRN_MD_CTR, r10
later on which will "overflow" 0x1f00 into 0x2000 etc.

 Jocke

> I'll do some more CONFIG_PIN_TLB tests this week...
>
> --- head_8xx.S.orig2    2005-04-24 17:55:59.000000000 -0300
> +++ head_8xx.S  2005-04-24 17:57:44.000000000 -0300
> @@ -697,7 +697,7 @@
>         tlbia                   /* Invalidate all TLB entries */
>  #ifdef CONFIG_PIN_TLB
>         lis     r8, MI_RSV4I@h
> -       ori     r8, r8, 0x1c00
> +       ori     r8, r8, 0x1f00
>  #else
>         li      r8, 0
>  #endif
> @@ -705,7 +705,7 @@
>
>  #ifdef CONFIG_PIN_TLB
>         lis     r10, (MD_RSV4I | MD_RESETVAL)@h
> -       ori     r10, r10, 0x1c00
> +       ori     r10, r10, 0x1f00
>         mr      r8, r10
>  #else
>         lis     r10, MD_RESETVAL@h

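The overflow Jocke describes can be checked with a little arithmetic. What follows is only a user-space C sketch, not kernel code. It assumes the TLB index is the five-bit field covered by mask 0x1f00 in MI_CTR/MD_CTR (which matches 0x1c00 being binary 11100 in the thread) and that each "addi r10, r10, 0x0100" is a plain add on the whole register value. The macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed layout: 5-bit TLB index in the bits covered by 0x1f00. */
#define TLB_IDX_MASK  0x00001f00u
#define TLB_IDX_SHIFT 8
#define TLB_IDX_STEP  0x00000100u   /* the "addi r10, r10, 0x0100" step */

/* Extract the TLB entry number from a control register value. */
static unsigned tlb_index(uint32_t ctr)
{
    return (ctr & TLB_IDX_MASK) >> TLB_IDX_SHIFT;
}

/* Model one "addi r10, r10, 0x0100": a plain add, so it can carry
 * out of the index field into the next higher bit. */
static uint32_t tlb_ctr_step(uint32_t ctr)
{
    return ctr + TLB_IDX_STEP;
}

/* Return nonzero if `steps` increments starting from `start` change
 * any bit outside the index field. */
static int tlb_ctr_overflows(uint32_t start, unsigned steps)
{
    uint32_t ctr = start;

    while (steps--)
        ctr = tlb_ctr_step(ctr);
    return (ctr & ~TLB_IDX_MASK) != (start & ~TLB_IDX_MASK);
}
```

Under these assumptions, starting at 0x1c00 leaves room for three increments (entries 28 through 31), while starting at 0x1f00 carries into 0x2000 on the first increment and the index field reads back as 0.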
* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Marcelo Tosatti @ 2005-05-07 14:42 UTC
To: Joakim Tjernlund; +Cc: linuxppc-embedded

On Sat, May 07, 2005 at 08:10:38PM +0200, Joakim Tjernlund wrote:
> > Hi Dan, Joakim,
> >
> > On Sat, Apr 23, 2005 at 08:00:39PM -0400, Dan Malek wrote:
> > >
> > > On Apr 23, 2005, at 7:51 PM, Joakim Tjernlund wrote:
> > >
> > > >Hmm, I have more than 24MB of memory and I can run CONFIG_PIN_TLB just
> > > >fine with modules off in kernel 2.4. Haven't tried 2.6 yet.
> > >
> > > Doh. Oh, I see. We only do the optimization for the instruction
> > > misses.
> > > I'll have to take a closer look at Marcelo's 2.6 tests.
> >
> > The PIN TLB entry option does not make much difference in my tests,
> > never did.
> >
> > Who wrote the code? Are there results which indicate a performance gain
> > from TLB pinning on 8xx? If so, where are such results?
> >
> > One problem that I've noted is that initial_mmu sets the {I,D}TLB index
> > to 28 (binary 11100, the 0x1c00 in the code).
> >
> > MI_RSV4I protects TLB entries 28...31.
> >
> > Given that both {I,D}TLB INDEXes are _decreased_ on each update, it seems
> > to me that initial_mmu should set the {I,D}TLB INDEX to 31, which will then
> > decrease down to 27 after 4 TLB entries are created.
> >
> > Another question that comes to mind is why initial_mmu creates
> > additional 8Meg TLB entries for the D-TLB but not for the I-TLB:
> >
> > #ifdef CONFIG_PIN_TLB
> >         /* Map two more 8M kernel data pages.
> >         */
> >         ...
> > #endif
>
> Not completely sure that this is correct. There are a few:
>         addi    r10, r10, 0x0100
>         mtspr   SPRN_MD_CTR, r10
> later on which will "overflow" 0x1f00 into 0x2000 etc.

Yep. This is not correct at all: the TLB index is increased at each miss,
not decreased as the manual says.

I have confirmed it with the BDI...

> Jocke
>
> > I'll do some more CONFIG_PIN_TLB tests this week...
> >
> > --- head_8xx.S.orig2    2005-04-24 17:55:59.000000000 -0300
> > +++ head_8xx.S  2005-04-24 17:57:44.000000000 -0300
> > @@ -697,7 +697,7 @@
> >         tlbia                   /* Invalidate all TLB entries */
> >  #ifdef CONFIG_PIN_TLB
> >         lis     r8, MI_RSV4I@h
> > -       ori     r8, r8, 0x1c00
> > +       ori     r8, r8, 0x1f00
> >  #else
> >         li      r8, 0
> >  #endif
> > @@ -705,7 +705,7 @@
> >
> >  #ifdef CONFIG_PIN_TLB
> >         lis     r10, (MD_RSV4I | MD_RESETVAL)@h
> > -       ori     r10, r10, 0x1c00
> > +       ori     r10, r10, 0x1f00
> >         mr      r8, r10
> >  #else
> >         lis     r10, MD_RESETVAL@h

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Dan Malek @ 2005-05-07 20:24 UTC
To: Joakim.Tjernlund; +Cc: linuxppc-embedded

On May 7, 2005, at 2:10 PM, Joakim Tjernlund wrote:

> Not completely sure that this is correct. There are a few:
>         addi    r10, r10, 0x0100
>         mtspr   SPRN_MD_CTR, r10
> later on which will "overflow" 0x1f00 into 0x2000 etc.

Oh right, I forgot I did that. I explicitly set the tlb index
before each write. Sorry, I thought it was due to more bits
of index in the 885. So, I guess what was there should have
worked.

OK, so the reason TLB pinning doesn't work is a tlbie() can
evict the pinned entry. That stupid code in the cpm reset will
throw them out, plus anything else that would do a tlbie() of
a kernel address within the pinned space (like the
update_mmu_cache() hack). We have to fix those, and look
for any others where that may happen.

Thanks.

 -- Dan

* v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Marcelo Tosatti @ 2005-04-21 18:32 UTC
To: 26-devel, linux-ppc-embedded

Hi everyone,

I found out that the previous TLB counter numbers were wrong, two of the
values were switched!

CPU is a 48MHz 855T with 32 TLB entries, and 128MB of RAM.

Now I've got valid results.

With an idle machine, these are the results of a /proc/tlbmiss capture session
with 1 second interval.

Note that idle actually means about 4/5 processes (AcsWeb, cy_pmd, cy_alarm,
cy_wdt, kernel's keventd) running and switching over, but CPU is about
96-97% idle.

As you can see, the rate at which TLB misses happen in v2.6 is significantly
higher, for both I/D TLBs, even with an almost idle machine.

The v2.6 kernel has grown in size relative to TLB usage (cache footprint),
which is, I start to believe, the major cause for this issue.

If that is the case other platforms will also suffer.

As one example, the number of page addresses which the "sys_read()" system call
needs to fetch to the I-cache in order to execute the task (the calltree) is
about twice in size as in v2.4.

Pantelis Antoniou reported that the 64 TLB-entry versions of MPC8xx processors
do not suffer such significant performance slowdown.

One point in reading these numbers is that v2.6 will count twice for page
fault misses which result in pte creation (DataTLBMiss->DataTLBError), but
I hope to change that for better precision. In this specific case I guess it
should not be significant given that no processes are being created, mostly
already mapped (periodic) routines are running.

I hope that capturing the TLB miss difference between v2.4 and v2.6 on a simple
CPU intense benchmark such as the "dd" I've been using before and multiplying that
by translation cache miss penalty (20-23 clocks on a miss versus 1 clock on a hit)
should give us a good estimate of the real cost of these misses.

And I wonder, have no other arches noticed this?

Comments are appreciated.

Capture session of /proc/tlbmiss with 1 second interval:

v2.6:                                   v2.4:

I-TLB userspace misses: 2577            I-TLB userspace misses: 2192
I-TLB kernel misses: 1557               I-TLB kernel misses: 1328
D-TLB userspace misses: 7173            D-TLB userspace misses: 6801
D-TLB kernel misses: 4442               D-TLB kernel misses: 4260
*                                       *
I-TLB userspace misses: 5324            I-TLB userspace misses: 4557
I-TLB kernel misses: 3277               I-TLB kernel misses: 2821
D-TLB userspace misses: 14399           D-TLB userspace misses: 13816
D-TLB kernel misses: 9069               D-TLB kernel misses: 8734
*                                       *
I-TLB userspace misses: 8078            I-TLB userspace misses: 7003
I-TLB kernel misses: 4960               I-TLB kernel misses: 4360
D-TLB userspace misses: 22038           D-TLB userspace misses: 20952
D-TLB kernel misses: 13929              D-TLB kernel misses: 13299
*                                       *
I-TLB userspace misses: 10791           I-TLB userspace misses: 9404
I-TLB kernel misses: 6643               I-TLB kernel misses: 5874
D-TLB userspace misses: 29350           D-TLB userspace misses: 27963
D-TLB kernel misses: 18555              D-TLB kernel misses: 17768
*                                       *
I-TLB userspace misses: 13531           I-TLB userspace misses: 11801
I-TLB kernel misses: 8311               I-TLB kernel misses: 7390
D-TLB userspace misses: 36750           D-TLB userspace misses: 35123
D-TLB kernel misses: 23271              D-TLB kernel misses: 22416
*                                       *
I-TLB userspace misses: 16434           I-TLB userspace misses: 14229
I-TLB kernel misses: 10172              I-TLB kernel misses: 8925
D-TLB userspace misses: 51096           D-TLB userspace misses: 42241
D-TLB kernel misses: 34982              D-TLB kernel misses: 26995
*                                       *
I-TLB userspace misses: 19183           I-TLB userspace misses: 16646
I-TLB kernel misses: 11890              I-TLB kernel misses: 10445
D-TLB userspace misses: 58557           D-TLB userspace misses: 49291
D-TLB kernel misses: 39726              D-TLB kernel misses: 31479
*                                       *
I-TLB userspace misses: 21973           I-TLB userspace misses: 19125
I-TLB kernel misses: 13596              I-TLB kernel misses: 12011
D-TLB userspace misses: 65933           D-TLB userspace misses: 56376
D-TLB kernel misses: 44401              D-TLB kernel misses: 36025
*                                       *
I-TLB userspace misses: 24644           I-TLB userspace misses: 21509
I-TLB kernel misses: 15231              I-TLB kernel misses: 13526
D-TLB userspace misses: 73345           D-TLB userspace misses: 63431
D-TLB kernel misses: 49083              D-TLB kernel misses: 40567
*                                       *
I-TLB userspace misses: 27451           I-TLB userspace misses: 23894
I-TLB kernel misses: 16974              I-TLB kernel misses: 15031
D-TLB userspace misses: 80652           D-TLB userspace misses: 70467
D-TLB kernel misses: 53739              D-TLB kernel misses: 45089

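The counters in the capture are cumulative, so the per-second miss rate is the difference between consecutive readings. A minimal C sketch of that subtraction follows. The struct and function names are hypothetical and are not part of the /proc/tlbmiss patch.

```c
/* One /proc/tlbmiss reading; the field names are illustrative only. */
struct tlbmiss_sample {
    unsigned long itlb_user;
    unsigned long itlb_kernel;
    unsigned long dtlb_user;
    unsigned long dtlb_kernel;
};

/* Misses accumulated between two consecutive readings. */
static struct tlbmiss_sample tlbmiss_delta(struct tlbmiss_sample prev,
                                           struct tlbmiss_sample cur)
{
    struct tlbmiss_sample d;

    d.itlb_user   = cur.itlb_user   - prev.itlb_user;
    d.itlb_kernel = cur.itlb_kernel - prev.itlb_kernel;
    d.dtlb_user   = cur.dtlb_user   - prev.dtlb_user;
    d.dtlb_kernel = cur.dtlb_kernel - prev.dtlb_kernel;
    return d;
}

/* Sum of all four miss kinds in one sample or delta. */
static unsigned long tlbmiss_total(struct tlbmiss_sample s)
{
    return s.itlb_user + s.itlb_kernel + s.dtlb_user + s.dtlb_kernel;
}
```

Applied to the first two readings above, this gives 16320 misses in one second for v2.6 against 15347 for v2.4.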
* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Marcelo Tosatti @ 2005-04-21 18:50 UTC
To: 26-devel, linux-ppc-embedded

[-- Attachment #1: Type: text/plain, Size: 175 bytes --]

On Thu, Apr 21, 2005 at 03:32:39PM -0300, Marcelo Tosatti wrote:

> Capture session of /proc/tlbmiss with 1 second interval:

Forgot to attach /proc/tlbmiss patch, here it is.

[-- Attachment #2: tlbmiss-count-2.4.patch --]
[-- Type: text/plain, Size: 4835 bytes --]

--- linux-216.orig/arch/ppc/kernel/head_8xx.S   2005-01-19 10:37:12.000000000 -0200
+++ linux-216/arch/ppc/kernel/head_8xx.S        2005-03-04 18:56:38.351004576 -0300
@@ -331,10 +331,21 @@
         * kernel page tables.
         */
        andi.   r21, r20, 0x0800        /* Address >= 0x80000000 */
-       beq     3f
+       beq     4f
        lis     r21, swapper_pg_dir@h
        ori     r21, r21, swapper_pg_dir@l
        rlwimi  r20, r21, 0, 2, 19
+
+       lis     r3,(itlbkernel_miss-KERNELBASE)@ha
+       lwz     r11,(itlbkernel_miss-KERNELBASE)@l(r3)
+       addi    r11,r11,1
+       stw     r11,(itlbkernel_miss-KERNELBASE)@l(r3)
+       beq     3f
+4:
+       lis     r3,(itlbuser_miss-KERNELBASE)@ha
+       lwz     r11,(itlbuser_miss-KERNELBASE)@l(r3)
+       addi    r11,r11,1
+       stw     r11,(itlbuser_miss-KERNELBASE)@l(r3)
 3:
        lwz     r21, 0(r20)             /* Get the level 1 entry */
        rlwinm. r20, r21,0,0,19         /* Extract page descriptor page address */
@@ -414,10 +425,23 @@
         * kernel page tables.
         */
        andi.   r21, r20, 0x0800
-       beq     3f
+       beq     4f
        lis     r21, swapper_pg_dir@h
        ori     r21, r21, swapper_pg_dir@l
        rlwimi  r20, r21, 0, 2, 19
+
+       lis     r3,(dtlbkernel_miss-KERNELBASE)@ha
+       lwz     r11,(dtlbkernel_miss-KERNELBASE)@l(r3)
+       addi    r11,r11,1
+       stw     r11,(dtlbkernel_miss-KERNELBASE)@l(r3)
+       beq     3f
+
+4:
+       lis     r3,(dtlbuser_miss-KERNELBASE)@ha
+       lwz     r11,(dtlbuser_miss-KERNELBASE)@l(r3)
+       addi    r11,r11,1
+       stw     r11,(dtlbuser_miss-KERNELBASE)@l(r3)
+
 3:
        lwz     r21, 0(r20)             /* Get the level 1 entry */
        rlwinm. r20, r21,0,0,19         /* Extract page descriptor page address */
@@ -989,3 +1013,14 @@
        .space  16
 #endif
 
+_GLOBAL(itlbuser_miss)
+       .space 4
+
+_GLOBAL(itlbkernel_miss)
+       .space 4
+
+_GLOBAL(dtlbuser_miss)
+       .long 0
+
+_GLOBAL(dtlbkernel_miss)
+       .long 0
--- linux-216.orig/fs/proc/proc_misc.c  2005-01-19 10:37:12.000000000 -0200
+++ linux-216/fs/proc/proc_misc.c       2005-03-04 18:57:37.241051928 -0300
@@ -621,6 +621,12 @@
                if (entry)
                        entry->proc_fops = &ppc_htab_operations;
        }
+       {
+               extern struct file_operations ppc_tlbmiss_operations;
+               entry = create_proc_entry("tlbmiss", S_IRUGO|S_IWUSR, NULL);
+               if (entry)
+                       entry->proc_fops = &ppc_tlbmiss_operations;
+       }
 #endif
        entry = create_proc_read_entry("slabinfo", S_IWUSR | S_IRUGO, NULL,
                                       slabinfo_read_proc, NULL);
--- linux-216.orig/arch/ppc/kernel/ppc_htab.c   2005-01-19 10:37:12.000000000 -0200
+++ linux-216/arch/ppc/kernel/ppc_htab.c        2005-03-04 19:04:05.276061640 -0300
@@ -21,6 +21,7 @@
 #include <linux/sysctl.h>
 #include <linux/ctype.h>
 #include <linux/threads.h>
+#include <linux/seq_file.h>
 
 #include <asm/uaccess.h>
 #include <asm/bitops.h>
@@ -32,6 +33,51 @@
 #include <asm/cputable.h>
 #include <asm/system.h>
 
+#if 1
+
+extern unsigned long itlbuser_miss, itlbkernel_miss;
+extern unsigned long dtlbuser_miss, dtlbkernel_miss;
+
+static ssize_t ppc_tlbmiss_write(struct file *file, const char * buffer,
+                                size_t count, loff_t *ppos);
+static int ppc_tlbmiss_show(struct seq_file *m, void *v);
+static int ppc_tlbmiss_open(struct inode *inode, struct file *file);
+
+struct file_operations ppc_tlbmiss_operations = {
+       .open           = ppc_tlbmiss_open,
+       .read           = seq_read,
+       .llseek         = seq_lseek,
+       .write          = ppc_tlbmiss_write,
+       .release        = seq_release,
+};
+
+static int ppc_tlbmiss_open(struct inode *inode, struct file *file)
+{
+       return seq_open(file, &ppc_tlbmiss_show);
+}
+
+static int ppc_tlbmiss_show(struct seq_file *m, void *v)
+{
+       seq_printf(m, "I-TLB userspace misses: %lu\n"
+                     "I-TLB kernel misses: %lu\n"
+                     "D-TLB userspace misses: %lu\n"
+                     "D-TLB kernel misses: %lu\n",
+                     itlbuser_miss, itlbkernel_miss,
+                     dtlbuser_miss, dtlbkernel_miss);
+       return 0;
+}
+
+static ssize_t ppc_tlbmiss_write(struct file *file, const char * buffer,
+                                size_t count, loff_t *ppos)
+{
+       itlbuser_miss = 0;
+       itlbkernel_miss = 0;
+       dtlbuser_miss = 0;
+       dtlbkernel_miss = 0;
+}
+#endif
+
+
 static ssize_t ppc_htab_read(struct file * file, char * buf,
                             size_t count, loff_t *ppos);
 static ssize_t ppc_htab_write(struct file * file, const char * buffer,

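For readers who do not read PowerPC assembly, the bookkeeping the patch is aiming for can be modelled in a few lines of user-space C. This is only a sketch of the intended behaviour, assuming (as the patch comment "Address >= 0x80000000" says) that addresses at or above 0x80000000 count as kernel space. The counter names mirror the symbols the patch adds, while the helper functions are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

/* Counters named after the symbols the patch adds to head_8xx.S. */
static unsigned long itlbuser_miss, itlbkernel_miss;
static unsigned long dtlbuser_miss, dtlbkernel_miss;

/* Assumption taken from the patch comment: kernel space starts at 0x80000000. */
static int is_kernel_address(uint32_t ea)
{
    return ea >= 0x80000000u;
}

/* Hypothetical helper: account for one TLB miss on effective address `ea`.
 * `is_instruction` selects the I-TLB counters, otherwise the D-TLB ones. */
static void count_tlb_miss(uint32_t ea, int is_instruction)
{
    if (is_instruction) {
        if (is_kernel_address(ea))
            itlbkernel_miss++;
        else
            itlbuser_miss++;
    } else {
        if (is_kernel_address(ea))
            dtlbkernel_miss++;
        else
            dtlbuser_miss++;
    }
}

/* Model of a write to /proc/tlbmiss: any write clears all four counters. */
static void reset_tlb_miss_counters(void)
{
    itlbuser_miss = 0;
    itlbkernel_miss = 0;
    dtlbuser_miss = 0;
    dtlbkernel_miss = 0;
}
```

In this model each miss bumps exactly one of the four counters, and "echo 0 > /proc/tlbmiss" corresponds to reset_tlb_miss_counters().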
* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Pantelis Antoniou @ 2005-04-22 6:18 UTC
To: Marcelo Tosatti; +Cc: 26-devel, linux-ppc-embedded

Marcelo Tosatti wrote:
> On Thu, Apr 21, 2005 at 03:32:39PM -0300, Marcelo Tosatti wrote:
>
>>Capture session of /proc/tlbmiss with 1 second interval:
>
>
> Forgot to attach /proc/tlbmiss patch, here it is.
>

[snip]

Thanks Marcelo.

I'll try to run this on my 870 board & mail the results.

Regards

Pantelis

* Re: [26-devel] v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
From: Marcelo Tosatti @ 2005-04-22 15:39 UTC
To: Pantelis Antoniou; +Cc: 26-devel, linux-ppc-embedded

On Fri, Apr 22, 2005 at 09:18:17AM +0300, Pantelis Antoniou wrote:
> Marcelo Tosatti wrote:
> >On Thu, Apr 21, 2005 at 03:32:39PM -0300, Marcelo Tosatti wrote:
> >
> >>Capture session of /proc/tlbmiss with 1 second interval:
> >
> >
> >Forgot to attach /proc/tlbmiss patch, here it is.
> >
>
> [snip]
>
> Thanks Marcelo.
>
> I'll try to run this on my 870 board & mail the results.

Hi,

Here goes more data about the v2.6 performance slowdown on MPC8xx.

Thanks Benjamin for the TLB miss counter idea!

These are the results of the following test script, which zeroes the TLB
counters, copies 15MB of data (3840 blocks of 4k) from memory to memory
using "dd", and reads the counters again.

--
#!/bin/bash
echo 0 > /proc/tlbmiss
time dd if=/dev/zero of=file bs=4k count=3840
cat /proc/tlbmiss
--

The results:

v2.6:                               v2.4:                               delta

[root@CAS root]# sh script          [root@CAS root]# sh script
real    0m4.241s                    real    0m3.440s
user    0m0.140s                    user    0m0.090s
sys     0m3.820s                    sys     0m3.330s

I-TLB userspace misses: 142369      I-TLB userspace misses: 2179        ITLB u: 139190
I-TLB kernel misses: 118288         I-TLB kernel misses: 1369           ITLB k: 116319
D-TLB userspace misses: 222916      D-TLB userspace misses: 180249      DTLB u: 38667
D-TLB kernel misses: 207773         D-TLB kernel misses: 167236         DTLB k: 38273

The sum of all TLB miss counter deltas between v2.4 and v2.6 is:

139190 + 116319 + 38667 + 38273 = 332449

Multiplied by 23 cycles, which is the average wait time to read a
page translation miss from memory:

332449 * 23 = 7646327 cycles.

Which is about 16% of 48000000, the total number of cycles this CPU
performs in one second.

It's very likely that there is a significant indirect effect of this TLB miss
increase, other than the wasted cycles to bring the page tables from memory:
exception execution time and context switching.

Checking "time" output, we can see 1s of slowdown:

[root@CAS root]# time dd if=/dev/zero of=file bs=4k count=3840

v2.4:               v2.6:               diff
real    0m3.366s    real    0m4.360s    0.994s
user    0m0.080s    user    0m0.111s    0.031s
sys     0m3.260s    sys     0m4.218s    0.958s

Mostly caused by increased kernel execution time.

This proves that the slowdown is, in great part, due to increased
translation cache thrashing.

Now, what is the best way to bring the performance back to v2.4 levels?

For this "dd" test, which is dominated by "sys_read/sys_write", I thought
of trying to bring the hotpath functions into the same pages, thus
decreasing the number of page translations required for such tasks.

Comments are appreciated.

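The cycle estimate above is simple enough to write down as code. The following C sketch just reproduces the arithmetic from the message, using the figures given in the thread (a 48MHz CPU, 23 clocks per translation miss, and the quoted delta of 332449 extra misses). The macro and function names are illustrative assumptions.

```c
/* Figures taken from the thread; the names are made up for this sketch. */
#define CPU_HZ            48000000UL   /* 48MHz 855T */
#define MISS_PENALTY_CLKS 23UL         /* clocks per translation miss */

/* Cycles spent on `extra_misses` additional TLB misses. */
static unsigned long miss_cycles(unsigned long extra_misses)
{
    return extra_misses * MISS_PENALTY_CLKS;
}

/* The same cost expressed as a percentage of one second of CPU cycles. */
static double miss_cost_percent(unsigned long extra_misses)
{
    return 100.0 * (double)miss_cycles(extra_misses) / (double)CPU_HZ;
}
```

With the deltas as quoted in the message, miss_cycles(332449) is 7646327, which is a little under 16% of one second's worth of cycles. This only covers the direct refill penalty, not the indirect costs the message mentions.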