* LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-29 3:08 UTC
To: linuxppc-embedded; +Cc: Paul Mackerras

I did some LMBench runs to observe the effect of CONFIG_PIN_TLB. I've
run the tests in three cases:
a) linuxppc_2_4_devel with the CONFIG_PIN_TLB option disabled
("nopintlb")
b) linuxppc_2_4_devel with the CONFIG_PIN_TLB option enabled
("2pintlb")
c) linuxppc_2_4_devel with the CONFIG_PIN_TLB option enabled,
but modified so that only one 16MB page is pinned rather than two
(i.e. only the first 16MB rather than the first 32MB are mapped with
pinned entries) ("1pintlb")
These tests were done on an IBM Walnut board with 200MHz 405GP. Root
filesystem was ext3 on an IDE disk attached to a Promise PCI IDE
controller.
Overall summary:
Having pinned entries (1 or 2) performs as well as or better than
not having them on virtually everything; the difference ranges from
nothing (lost in the noise) to around 15% (fork proc). The only
measurement where no pinned entries might be argued to win is
LMbench's main memory latency measurement. The difference is < 0.1%
and may just be chance fluctuation.
The difference between 1 and 2 pinned entries is very small.
There are a few cases where 1 might be better (but it might just be
random noise) and a very few where 2 might be better than one. On the
basis of that there seems little point in pinning 2 entries.
Using pinned TLB entries also means it's easier to make sure the
exception exit path is safe, especially in 2.5 (we mustn't take a TLB
miss after SRR0 or SRR1 is loaded).
It's certainly possible to construct a workload that will work poorly
with pinned TLB entries compared to without (make it have an
instruction+data working set of precisely 64 pages), but similarly
it's possible to construct a workload that will work well with 65
available TLB entries and not 64. Unless someone can come up with a
real life workload which works poorly with pinned TLBs, I see little
point in keeping the option - pinned TLBs should always be on (pinning
1 entry).
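
As a rough check on the "around 15%" figure, the fork proc numbers can
be averaged by hand or with a throwaway helper like the one here. The
values are simply copied from the fork proc column of the table that
follows; the program is illustrative only and is not part of the
LMBench harness.

    #include <stdio.h>

    int main(void)
    {
        /* fork proc times in microseconds, copied from the table below */
        double pin1[]  = { 1784, 1768, 1762 };  /* 1pintlb runs */
        double nopin[] = { 2014, 2070, 2059 };  /* nopintlb runs */
        double avg_pin = 0, avg_nopin = 0;
        int i;

        for (i = 0; i < 3; i++) {
            avg_pin   += pin1[i] / 3.0;
            avg_nopin += nopin[i] / 3.0;
        }
        /* roughly 1771us vs 2048us: nopintlb is about 15.6% slower */
        printf("1pintlb %.0f us, nopintlb %.0f us, nopintlb/1pintlb = %.3f\n",
               avg_pin, avg_nopin, avg_nopin / avg_pin);
        return 0;
    }
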
L M B E N C H 2 . 0 S U M M A R Y
------------------------------------
Basic system parameters
----------------------------------------------------
Host OS Description Mhz
--------- ------------- ----------------------- ----
1pintlb Linux 2.4.19- powerpc-linux-gnu 199
1pintlb Linux 2.4.19- powerpc-linux-gnu 199
1pintlb Linux 2.4.19- powerpc-linux-gnu 199
2pintlb Linux 2.4.19- powerpc-linux-gnu 199
2pintlb Linux 2.4.19- powerpc-linux-gnu 199
2pintlb Linux 2.4.19- powerpc-linux-gnu 199
nopintlb Linux 2.4.19- powerpc-linux-gnu 199
nopintlb Linux 2.4.19- powerpc-linux-gnu 199
nopintlb Linux 2.4.19- powerpc-linux-gnu 199
Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
1pintlb Linux 2.4.19- 199 1.44 3.21 16.0 24.1 152.2 5.60 16.5 1784 8231 30.K
1pintlb Linux 2.4.19- 199 1.44 3.20 16.1 24.3 152.4 5.60 16.5 1768 8186 30.K
1pintlb Linux 2.4.19- 199 1.44 3.20 16.1 24.8 152.4 5.60 16.5 1762 8199 30.K
2pintlb Linux 2.4.19- 199 1.44 3.20 16.8 25.0 152.4 5.60 16.4 1773 8191 30.K
2pintlb Linux 2.4.19- 199 1.44 3.21 17.0 25.2 151.9 5.58 17.1 1765 8241 30.K
2pintlb Linux 2.4.19- 199 1.44 3.21 16.8 24.6 153.9 5.60 16.9 1731 8102 30.K
nopintlb Linux 2.4.19- 199 1.46 3.34 17.2 24.6 156.1 5.66 16.5 2014 9012 33.K
nopintlb Linux 2.4.19- 199 1.46 3.35 17.0 25.2 157.9 5.66 16.5 2070 9091 33.K
nopintlb Linux 2.4.19- 199 1.46 3.35 17.2 25.1 154.7 5.65 16.5 2059 9044 33.K
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
1pintlb Linux 2.4.19- 5.260 81.1 269.1 96.1 275.8 95.8 276.7
1pintlb Linux 2.4.19- 3.460 81.7 272.0 95.9 276.5 96.1 276.4
1pintlb Linux 2.4.19- 2.820 82.0 268.4 95.1 275.2 96.2 274.9
2pintlb Linux 2.4.19- 3.930 80.6 280.7 95.3 276.8 95.5 275.1
2pintlb Linux 2.4.19- 6.350 84.0 265.2 95.0 273.7 96.0 273.7
2pintlb Linux 2.4.19- 2.780 82.5 257.8 93.5 272.8 95.6 273.4
nopintlb Linux 2.4.19- 3.590 93.4 282.2 101.5 284.4 101.7 284.1
nopintlb Linux 2.4.19- 0.780 83.1 284.3 100.0 283.1 99.7 282.7
nopintlb Linux 2.4.19- 1.540 93.3 282.4 99.2 281.1 99.1 282.9
*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
1pintlb Linux 2.4.19- 5.260 28.2 72.0 248.3 909.
1pintlb Linux 2.4.19- 3.460 33.0 73.8 268.6 902.
1pintlb Linux 2.4.19- 2.820 30.0 71.8 279.6 903.
2pintlb Linux 2.4.19- 3.930 27.9 73.9 258.6 923.
2pintlb Linux 2.4.19- 6.350 23.9 81.0 244.6 918.
2pintlb Linux 2.4.19- 2.780 27.9 77.5 287.9 910.
nopintlb Linux 2.4.19- 3.590 29.7 75.9 386.9 1194
nopintlb Linux 2.4.19- 0.780 29.0 77.2 388.4 1208
nopintlb Linux 2.4.19- 1.540 31.8 83.4 391.9 1190
File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
1pintlb Linux 2.4.19- 579.4 160.5 1231.5 300.6 1448.0 3.358 18.0
1pintlb Linux 2.4.19- 579.7 160.1 1231.5 315.7 1442.0 3.443 18.0
1pintlb Linux 2.4.19- 579.7 160.6 1236.1 300.8 1456.0 3.405 18.0
2pintlb Linux 2.4.19- 579.0 161.1 1231.5 304.7 1454.0 3.495 18.0
2pintlb Linux 2.4.19- 580.0 159.1 1236.1 317.0 1446.0 2.816 18.0
2pintlb Linux 2.4.19- 579.0 159.8 1228.5 317.7 1444.0 3.342 18.0
nopintlb Linux 2.4.19- 643.5 213.9 1426.5 404.0 1810.0 3.540 21.0
nopintlb Linux 2.4.19- 643.9 213.2 1418.4 394.9 1761.0 3.637 21.0
nopintlb Linux 2.4.19- 645.6 217.2 1436.8 420.2 1776.0 4.233 21.0
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
1pintlb Linux 2.4.19- 39.9 41.9 31.5 47.2 115.6 85.3 83.6 115. 128.0
1pintlb Linux 2.4.19- 43.1 41.6 30.8 48.1 115.6 85.7 84.2 115. 128.9
1pintlb Linux 2.4.19- 42.5 41.1 31.6 48.2 115.6 86.2 84.4 115. 130.6
2pintlb Linux 2.4.19- 42.6 42.4 32.0 48.4 115.6 85.6 84.1 115. 128.7
2pintlb Linux 2.4.19- 42.3 42.4 62.7 48.1 115.6 85.5 84.0 115. 129.4
2pintlb Linux 2.4.19- 44.4 43.7 64.6 48.5 115.6 86.0 84.3 115. 129.4
nopintlb Linux 2.4.19- 39.0 39.3 29.3 46.9 115.5 85.5 83.9 115. 127.8
nopintlb Linux 2.4.19- 41.7 39.3 59.9 47.2 115.5 85.2 84.1 115. 130.1
nopintlb Linux 2.4.19- 41.1 38.2 29.4 47.0 115.5 85.7 84.1 115. 130.5
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
1pintlb Linux 2.4.19- 199 15.0 134.0 149.2 No L2 cache?
1pintlb Linux 2.4.19- 199 15.0 133.9 149.2 No L2 cache?
1pintlb Linux 2.4.19- 199 15.0 133.8 149.2 No L2 cache?
2pintlb Linux 2.4.19- 199 15.0 133.8 149.2 No L2 cache?
2pintlb Linux 2.4.19- 199 15.0 133.8 149.2 No L2 cache?
2pintlb Linux 2.4.19- 199 15.0 133.8 149.1 No L2 cache?
nopintlb Linux 2.4.19- 199 15.0 134.0 149.1 No L2 cache?
nopintlb Linux 2.4.19- 199 15.0 134.1 149.1 No L2 cache?
nopintlb Linux 2.4.19- 199 15.0 133.9 149.0 No L2 cache?
--
David Gibson | For every complex problem there is a
david@gibson.dropbear.id.au | solution which is simple, neat and
| wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson

* Re: LMBench and CONFIG_PIN_TLB
From: Dan Malek @ 2002-05-29 14:40 UTC
To: David Gibson; +Cc: linuxppc-embedded, Paul Mackerras

David Gibson wrote:

> I did some LMBench runs to observe the effect of CONFIG_PIN_TLB.

I implemented the TLB pinning for two reasons. One, politics, since
everyone "just knows it is significantly better", and two, to alleviate
the exception return path problem of taking a TLB miss after loading
SRR0/1.

> .... the difference ranges from
> nothing (lost in the noise) to around 15% (fork proc). The only
> measurement where no pinned entries might be argued to win is
> LMbench's main memory latency measurement. The difference is < 0.1%
> and may just be chance fluctuation.

It has been my experience over the last 20 years that in general
applications that show high TLB miss activity are making inefficient
use of all system resources and aren't likely to be doing any useful
work. Why aren't we measuring cache efficiency? Why aren't we profiling
the kernel to see where code changes will really make a difference?
Why aren't we measuring TLB performance on all processors? If you want
to improve TLB performance, get a processor with larger TLBs or better
hardware support.

Pinning TLB entries simply reduces the resource availability. When I'm
running a real application, doing real work in a real product, I don't
want these resources allocated for something else that is seldom used.
There are lots of other TLB management implementations that can really
improve performance; they just don't fit well into the current
Linux/PowerPC design.

I have seen exactly one application where TLB pinning actually
improved the performance of the system. It was a real-time system,
based on Linux using an MPC8xx, where the maximum event response
latency had to be guaranteed. With the proper locking of pages and TLB
pins this could be done. It didn't improve the performance of the
application, but did ensure the system operated properly.

> The difference between 1 and 2 pinned entries is very small.
> There are a few cases where 1 might be better (but it might just be
> random noise) and a very few where 2 might be better than one. On the
> basis of that there seems little point in pinning 2 entries.

What kind of scientific analysis is this? Run controlled tests, post
the results, explain the variances, and allow it to be repeatable by
others. Is there any consistency to the results?

> ..... Unless someone can come up with a
> real life workload which works poorly with pinned TLBs, I see little
> point in keeping the option - pinned TLBs should always be on (pinning
> 1 entry).

Where is your data that supports this? Where is your "real life
workload" that actually supports what you want to do? From my
perspective, your data shows we shouldn't do it. A "real life workload"
is not a fork proc test, but rather a main memory latency test, where
your tests showed it was better to not pin entries but you can't
explain the "fluctuation." I contend the difference is due to the fact
you have reduced the TLB resources, increasing the number of TLB misses
to an application that is trying to do real work.

I suggest you heed the quote you always attach to your messages. This
isn't a simple solution that is suitable for all applications. It's one
option among many that needs to be tuned to meet the requirements of an
application.

Thanks.

-- Dan

* Re: LMBench and CONFIG_PIN_TLB
From: Paul Mackerras @ 2002-05-29 23:04 UTC
To: Dan Malek; +Cc: David Gibson, linuxppc-embedded

Dan Malek writes:

> I implemented the TLB pinning for two reasons. One, politics, since
> everyone "just knows it is significantly better", and two, to alleviate
> the exception return path problem of taking a TLB miss after loading
> SRR0/1.

The second thing there is important, but there may be other ways around
that problem.

> Pinning TLB entries simply reduces the resource availability. When I'm
> running a real application, doing real work in a real product, I don't
> want these resources allocated for something else that is seldom used.
> There are lots of other TLB management implementations that can really
> improve performance; they just don't fit well into the current
> Linux/PowerPC design.

I suspect we are all confusing two things here: (1) having pinned TLB
entries and (2) using large-page TLB entries for the kernel. At the
moment the first is a prerequisite for the second. The second gives us
a significant performance improvement, and David's measurements show
that.

We could have (2) without pinning any TLB entries but it would take
more code in the TLB miss handler to do that. It is an interesting
question whether the benefit of having the 64th TLB slot available for
applications would outweigh the cost of the slightly slower TLB misses.
My feeling is that it would be a close-run thing either way.

> I have seen exactly one application where TLB pinning actually
> improved the performance of the system. It was a real-time system,
> based on Linux using an MPC8xx, where the maximum event response
> latency had to be guaranteed. With the proper locking of pages and TLB
> pins this could be done. It didn't improve the performance of the
> application, but did ensure the system operated properly.

Were you using any large-page TLB entries at all?

The other point that comes to mind is that the downside of pinning a
TLB entry is going to be much larger when you have fewer TLB entries
available. Tom Rini mentioned the other day that some 8xx processors
only have 8 (I assume he meant 8 data + 8 instruction). Having one
pinned entry out of 8 is going to be a lot more significant than one
out of 64.

David's suggestion was purely in the context of the 405 processor,
which has 64. I don't think he was advocating removing the config
option on the 8xx processors (actually, why is there the "860 only"
comment in there?)

Paul.
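
To make that trade-off concrete, here is a C-level sketch of the extra
check being described. The real 405 miss handlers are written in
assembler, and every helper name below (write_large_tlb_entry,
write_pte_tlb_entry, walk_linux_pte) is a made-up stub for
illustration; this is a sketch of the idea, not kernel code.

    #include <stdio.h>

    #define KERNELBASE  0xc0000000UL        /* start of kernel lowmem */
    #define LARGE_PAGE  0x01000000UL        /* one 16MB TLB entry */

    /* Stand-ins for the real TLB-load and page-table-walk primitives. */
    static void write_large_tlb_entry(unsigned long epn)
    {
        printf("load 16MB TLB entry covering 0x%08lx\n", epn);
    }

    static unsigned long walk_linux_pte(unsigned long ea)
    {
        return ea & ~0xfffUL;               /* pretend this is the PTE */
    }

    static void write_pte_tlb_entry(unsigned long pte)
    {
        printf("load 4K TLB entry from PTE 0x%08lx\n", pte);
    }

    static void tlb_miss(unsigned long ea, int kernel_mode)
    {
        if (kernel_mode && ea >= KERNELBASE && ea < KERNELBASE + LARGE_PAGE) {
            /* Extra test: the kernel's first 16MB is served by a
             * large-page entry synthesized on the fly instead of a
             * pinned one.  This costs a few instructions on every
             * miss but leaves all 64 TLB slots replaceable. */
            write_large_tlb_entry(ea & ~(LARGE_PAGE - 1));
            return;
        }
        write_pte_tlb_entry(walk_linux_pte(ea));    /* normal 4K path */
    }

    int main(void)
    {
        tlb_miss(0xc0123456UL, 1);          /* kernel text: large page */
        tlb_miss(0x10002000UL, 0);          /* user address: 4K PTE */
        return 0;
    }

Whether those few extra instructions per miss outweigh the freed TLB
slot is exactly the close-run question above.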

* Re: LMBench and CONFIG_PIN_TLB
From: Tom Rini @ 2002-05-29 23:16 UTC
To: Paul Mackerras; +Cc: Dan Malek, David Gibson, linuxppc-embedded

On Thu, May 30, 2002 at 09:04:31AM +1000, Paul Mackerras wrote:

> available. Tom Rini mentioned the other day that some 8xx processors
> only have 8 (I assume he meant 8 data + 8 instruction).

Quite probably, yes. :)

[snip]

> processor, which has 64. I don't think he was advocating removing the
> config option on the 8xx processors (actually, why is there the "860
> only" comment in there?)

Because the current code goes and pins 8 or so TLBs (4 data, 4
instruction), which won't fly on the ones which only allow for 2/8 to
be pinned. So 860 is a slight mislabeling, if I read it all correctly.

-- Tom Rini (TR1265) http://gate.crashing.org/~trini/

* Re: LMBench and CONFIG_PIN_TLB
From: Dan Malek @ 2002-05-30 1:34 UTC
To: Paul Mackerras; +Cc: David Gibson, linuxppc-embedded

Paul Mackerras wrote:

> I suspect we are all confusing two things here: (1) having pinned TLB
> entries and (2) using large-page TLB entries for the kernel.

I wasn't confusing them :-). I know that large page sizes are
beneficial. Someday I hope to finish the code that allows large page
sizes in the Linux page tables, so we can just load them.

> We could have (2) without pinning any TLB entries but it would take
> more code in the TLB miss handler to do that.

Only on the 4xx. I have code for the 8xx that loads them using the
standard lookup. Unfortunately, I have found something that isn't quite
stable with the large page sizes, but I don't know what it is.

> .... It is an interesting
> question whether the benefit of having the 64th TLB slot available for
> applications would outweigh the cost of the slightly slower TLB
> misses.

Removing the entry will increase the TLB miss rate by 1/64 * 100
percent, or a little over 1.5%, right? Any application that is
thrashing the TLB cache by removing one entry is running on luck
anyway, so we can't consider those. When you have applications using
lots of CPU in user space (which is usually a good thing :-), increased
TLB misses will add up.

> .... My feeling is that it would be a close-run thing either way.

So, if you have a product that runs better one way or the other, just
select the option that suits your needs. If the 4xx didn't require the
extra code in the miss handler to fangle the PTE, large pages without
pinning would clearly be the way to go (that's why it's an easy
decision on 8xx and I'm using it for testing).

> Were you using any large-page TLB entries at all?

Yes, but the problem was taking the TLB hit to get the first couple of
pages loaded and hitting the hardware register in time. It was a hack
from the first line of code :-) If you are going to pin a kernel entry,
you may as well map the whole space. I don't think it would even work
if we were loading large pages out of the PTE tables.

> .... Tom Rini mentioned the other day that some 8xx processors
> only have 8 (I assume he meant 8 data + 8 instruction).

Yes, there are a number of variants now that have everything from 8 to
64 I believe. It was just easier to pick out the 860 (which always has
lots of entries) for testing purposes.

The 8xx also has hardware support for pinning entries that basically
emulates BATs. It doesn't require any software changes except for the
initial programming of the MMU control and loading of the pinned
entries.

> .... David's suggestion was purely in the context of the 405
> processor, which has 64.

There is an option to enable it, so just enable it by default. What
do you gain by removing the option, except the possibility to prevent
someone from using it when it may be to their benefit? It certainly
isn't a proven modification, as there may be some latent bugs associated
with dual mapping pages that may be covered by the large page and
some other mapping (I think this is the problem I see on the 8xx).

> .... (actually, why is there the "860
> only" comment in there?)

Because the MMU control registers are slightly different among the 8xx
processor variants, and I only wrote the code to work with the 860 :-)

Thanks.

-- Dan

* Re: LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-30 5:14 UTC
To: Dan Malek; +Cc: Paul Mackerras, linuxppc-embedded

On Wed, May 29, 2002 at 09:34:54PM -0400, Dan Malek wrote:
> Paul Mackerras wrote:
>
> > I suspect we are all confusing two things here: (1) having pinned TLB
> > entries and (2) using large-page TLB entries for the kernel.
>
> I wasn't confusing them :-). I know that large page sizes are
> beneficial. Someday I hope to finish the code that allows large page
> sizes in the Linux page tables, so we can just load them.

Well it so happens that Paul and I have tried implementing that this
morning. More data coming in the next day or two.

> > We could have (2) without pinning any TLB entries but it would take
> > more code in the TLB miss handler to do that.
>
> Only on the 4xx. I have code for the 8xx that loads them using the
> standard lookup. Unfortunately, I have found something that isn't quite
> stable with the large page sizes, but I don't know what it is.

I'm only talking about 4xx.

> > .... It is an interesting
> > question whether the benefit of having the 64th TLB slot available for
> > applications would outweigh the cost of the slightly slower TLB
> > misses.
>
> Removing the entry will increase the TLB miss rate by 1/64 * 100
> percent, or a little over 1.5%, right? Any application that is
> thrashing the TLB cache by removing one entry is running on luck
> anyway, so we can't consider those. When you have applications using
> lots of CPU in user space (which is usually a good thing :-), increased
> TLB misses will add up.

Um, assuming a program with some degree of locality, I'd expect it to
increase the miss rate by somewhat less than 1/64, but it will
certainly increase them to an extent. So, show us the data.

> > .... My feeling is that it would be a close-run thing either way.
>
> So, if you have a product that runs better one way or the other, just
> select the option that suits your needs. If the 4xx didn't require the
> extra code in the miss handler to fangle the PTE, large pages without
> pinning would clearly be the way to go (that's why it's an easy
> decision on 8xx and I'm using it for testing).

Actually from the looks of this implementation doing large pages won't
be too bad - we can hijack an existing test so we only hit the extra
code if we hit a large page entry. Tests coming soon, I would expect
it to beat the current CONFIG_PIN_TLB.

> > .... David's suggestion was purely in the context of the 405
> > processor, which has 64.
>
> There is an option to enable it, so just enable it by default. What
> do you gain by removing the option, except the possibility to prevent
> someone from using it when it may be to their benefit? It certainly
> isn't a proven modification, as there may be some latent bugs associated
> with dual mapping pages that may be covered by the large page and
> some other mapping (I think this is the problem I see on the 8xx).

We gain simplicity of code. Feeping creaturism isn't a good thing.

--
David Gibson                  | For every complex problem there is a
david@gibson.dropbear.id.au   | solution which is simple, neat and
                              | wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson
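
The "working set of precisely 64 pages" point from the original mail,
and the disagreement above about whether losing one entry costs 1/64 or
"somewhat less", can be illustrated with a toy model. The sketch below
drives a strict-LRU TLB of 63 or 64 entries with a purely cyclic access
pattern; both the LRU policy and the access pattern are assumptions made
for illustration, not a model of the 405's real software-managed
replacement or of any real workload.

    #include <stdio.h>
    #include <string.h>

    #define MAX_TLB 64

    /* Count misses for a TLB of tlb_size entries (strict LRU) under a
     * cyclic reference string over wset distinct pages. */
    static long misses(int tlb_size, int wset, long refs)
    {
        int tlb[MAX_TLB];           /* tlb[0] is most recently used */
        int valid = 0;
        long m = 0, r;

        for (r = 0; r < refs; r++) {
            int page = (int)(r % wset), i, hit = -1;

            for (i = 0; i < valid; i++)
                if (tlb[i] == page)
                    hit = i;
            if (hit < 0) {          /* miss: evict LRU (last slot) */
                m++;
                if (valid < tlb_size)
                    valid++;
                hit = valid - 1;
            }
            /* promote the page to most-recently-used */
            memmove(&tlb[1], &tlb[0], hit * sizeof(int));
            tlb[0] = page;
        }
        return m;
    }

    int main(void)
    {
        int w;

        for (w = 60; w <= 66; w++)
            printf("working set %2d pages: %6ld misses with 64 entries, "
                   "%6ld with 63\n",
                   w, misses(64, w, 100000), misses(63, w, 100000));
        return 0;
    }

In this toy the lost entry costs nothing until the working set lands
exactly on the boundary, where it suddenly costs everything; real
replacement policies and access patterns smear that out, which is why
measurement rather than a 1/64 rule of thumb is the only way to settle
the argument.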

* Re: LMBench and CONFIG_PIN_TLB
From: Matthew Locke @ 2002-05-30 16:09 UTC
To: Dan Malek; +Cc: Paul Mackerras, David Gibson, linuxppc-embedded

Dan Malek wrote:

[snip]

> There is an option to enable it, so just enable it by default. What
> do you gain by removing the option, except the possibility to prevent
> someone from using it when it may be to their benefit? It certainly
> isn't a proven modification, as there may be some latent bugs associated
> with dual mapping pages that may be covered by the large page and
> some other mapping (I think this is the problem I see on the 8xx).

btw, there are bugs with it. Starting several processes with init or
even telnetd will expose the bug.

[snip]

* Re: LMBench and CONFIG_PIN_TLB
From: Paul Mackerras @ 2002-05-30 23:50 UTC
To: Matthew Locke; +Cc: Dan Malek, David Gibson, linuxppc-embedded

> btw, there are bugs with it. Starting several processes with init or
> even telnetd will expose the bug.

David and I haven't been able to reproduce this on the Walnut or the
EP405. What sort of machine are you using, what processor, how much
RAM, and what distro are you using?

Paul.

* Re: LMBench and CONFIG_PIN_TLB
From: Matthew Locke @ 2002-05-30 23:01 UTC
To: Paul Mackerras; +Cc: Dan Malek, David Gibson, linuxppc-embedded

Paul Mackerras wrote:

> > btw, there are bugs with it. Starting several processes with init or
> > even telnetd will expose the bug.
>
> David and I haven't been able to reproduce this on the Walnut or the
> EP405. What sort of machine are you using, what processor, how much
> RAM, and what distro are you using?

I run MVL (of course) on a Walnut with 32MB of RAM. What is your
environment? btw, MVL uses soft-float in glibc, not floating point
emulation in the kernel.

* Re: LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-31 2:39 UTC
To: Matthew Locke; +Cc: Paul Mackerras, Dan Malek, linuxppc-embedded

On Thu, May 30, 2002 at 04:01:32PM -0700, Matthew Locke wrote:
> Paul Mackerras wrote:
>
> > > btw, there are bugs with it. Starting several processes with init or
> > > even telnetd will expose the bug.
> >
> > David and I haven't been able to reproduce this on the Walnut or the
> > EP405. What sort of machine are you using, what processor, how much
> > RAM, and what distro are you using?
>
> I run MVL (of course) on a Walnut with 32MB of RAM. What is your
> environment? btw, MVL uses soft-float in glibc, not floating point
> emulation in the kernel.

I've tried it both on a Walnut (PVR 401100c4) with 128MB of RAM, root
filesystem on an IDE disk attached to a Promise PCI IDE controller, and
on an EP405PC board (PVR 40110145) with 64MB of RAM with NFS root. In
both cases userland is Debian/sid running with kernel math emulation.

--
David Gibson                  | For every complex problem there is a
david@gibson.dropbear.id.au   | solution which is simple, neat and
                              | wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson

* Re: LMBench and CONFIG_PIN_TLB
From: Tom Rini @ 2002-05-31 0:10 UTC
To: Paul Mackerras; +Cc: Matthew Locke, Dan Malek, David Gibson, linuxppc-embedded

On Fri, May 31, 2002 at 09:50:39AM +1000, Paul Mackerras wrote:
> > btw, there are bugs with it. Starting several processes with init or
> > even telnetd will expose the bug.
>
> David and I haven't been able to reproduce this on the Walnut or the
> EP405. What sort of machine are you using, what processor, how much
> RAM, and what distro are you using?

A Walnut (PVR 40110145) with 32MB of RAM and Debian/Woody shows it off
quite nicely here. Login via serial, telnet to localhost, login, do it
again.

-- Tom Rini (TR1265) http://gate.crashing.org/~trini/

* Re: LMBench and CONFIG_PIN_TLB
From: Tom Rini @ 2002-05-31 14:48 UTC
To: Paul Mackerras; +Cc: Matthew Locke, Dan Malek, David Gibson, linuxppc-embedded

On Thu, May 30, 2002 at 05:10:05PM -0700, Tom Rini wrote:
> On Fri, May 31, 2002 at 09:50:39AM +1000, Paul Mackerras wrote:
> > > btw, there are bugs with it. Starting several processes with init or
> > > even telnetd will expose the bug.
> >
> > David and I haven't been able to reproduce this on the Walnut or the
> > EP405. What sort of machine are you using, what processor, how much
> > RAM, and what distro are you using?
>
> A Walnut (PVR 40110145) with 32MB of RAM and Debian/Woody shows it off
> quite nicely here.

With nfsroot, and logging in via serial console initially even.

-- Tom Rini (TR1265) http://gate.crashing.org/~trini/

* Re: LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-30 5:05 UTC
To: Dan Malek; +Cc: linuxppc-embedded, Paul Mackerras

On Wed, May 29, 2002 at 10:40:02AM -0400, Dan Malek wrote:
> David Gibson wrote:
>
> > I did some LMBench runs to observe the effect of CONFIG_PIN_TLB.
>
> I implemented the TLB pinning for two reasons. One, politics, since
> everyone "just knows it is significantly better", and two, to alleviate
> the exception return path problem of taking a TLB miss after loading
> SRR0/1.

Ok.

> > .... the difference ranges from
> > nothing (lost in the noise) to around 15% (fork proc). The only
> > measurement where no pinned entries might be argued to win is
> > LMbench's main memory latency measurement. The difference is < 0.1%
> > and may just be chance fluctuation.
>
> It has been my experience over the last 20 years that in general
> applications that show high TLB miss activity are making inefficient
> use of all system resources and aren't likely to be doing any useful
> work. Why aren't we measuring cache efficiency? Why aren't we profiling
> the kernel to see where code changes will really make a difference?
> Why aren't we measuring TLB performance on all processors? If you want
> to improve TLB performance, get a processor with larger TLBs or better
> hardware support.

Good question. Because we all have finite time. I figure an LMBench
run on CONFIG_PIN_TLB, while admittedly quite incomplete information,
is better than no data at all.

> Pinning TLB entries simply reduces the resource availability. When I'm
> running a real application, doing real work in a real product, I don't
> want these resources allocated for something else that is seldom used.
> There are lots of other TLB management implementations that can really
> improve performance; they just don't fit well into the current
> Linux/PowerPC design.

As paulus also points out, there are two issues here. Pinning the TLB
entries per se reduces resource availability. However it provides an
easy way to use a large page TLB entry for the kernel, which for a
number of not infrequent kernel activities is a win according to
LMBench.

> I have seen exactly one application where TLB pinning actually
> improved the performance of the system. It was a real-time system,
> based on Linux using an MPC8xx, where the maximum event response
> latency had to be guaranteed. With the proper locking of pages and TLB
> pins this could be done. It didn't improve the performance of the
> application, but did ensure the system operated properly.
>
> > The difference between 1 and 2 pinned entries is very small.
> > There are a few cases where 1 might be better (but it might just be
> > random noise) and a very few where 2 might be better than one. On the
> > basis of that there seems little point in pinning 2 entries.
>
> What kind of scientific analysis is this? Run controlled tests, post
> the results, explain the variances, and allow it to be repeatable by
> others. Is there any consistency to the results?

Ok, put it like this:
a) this LMbench run shows very weak evidence that 1 pinned entry is
   better than 2, but certainly no evidence that 2 beats 1.
b) I see no theoretical reason that 2 pinned entries would do
   significantly better than 1 (16MB being sufficient to cover all the
   kernel text, static data and BSS),
c) 1 pinned entry is slightly simpler than 2 and therefore wins by
   default.

> > ..... Unless someone can come up with a
> > real life workload which works poorly with pinned TLBs, I see little
> > point in keeping the option - pinned TLBs should always be on (pinning
> > 1 entry).
>
> Where is your data that supports this? Where is your "real life
> workload" that actually supports what you want to do?

Ok, put it this way:

Pro CONFIG_PIN_TLB (as currently implemented):
  - LMbench results, admittedly inconclusive
  - Makes ensuring the exception exit is safe easier

Con CONFIG_PIN_TLB (as currently implemented):
  - You think it isn't a good idea
  - Possible minuscule improvement in main memory latency

Data from a real life workload would certainly trump all the "pro"
arguments I've listed there. Give me some numbers supporting your case
and I'll probably agree with you, but given no other data this suggests
that CONFIG_PIN_TLB wins.

Oh, incidentally a kernel compile also appears to be slightly faster
with CONFIG_PIN_TLB.

> From my perspective, your data shows we shouldn't do it. A "real life
> workload" is not a fork proc test, but rather a main memory latency
> test, where your tests showed it was better to not pin entries but you
> can't explain the "fluctuation." I contend the difference is due to
> the fact you have reduced the TLB resources, increasing the number of
> TLB misses to an application that is trying to do real work.

Dan, either you're not reading or you're not thinking. The difference
between the memory latency numbers is tiny, less than 0.1%. If you
actually look at the LMbench numbers (I have three runs in each
situation), the random variation between each run is around the same
size. Therefore the data is inconclusive, but possibly suggests a
slowdown with CONFIG_PIN_TLB - particularly given that there are at
least two plausible explanations for the slowdown: (a) because we have
fewer free TLB entries we are taking more TLB misses, and (b) with
CONFIG_PIN_TLB the TLB fault handler has a few extra instructions.

*But* any such slowdown is <0.1%. It doesn't take that many page
faults (say), which appear to be around 15% faster with
CONFIG_PIN_TLB, for that to be a bigger win than the (possible) memory
access slowdown.

> I suggest you heed the quote you always attach to your messages. This
> isn't a simple solution that is suitable for all applications. It's one
> option among many that needs to be tuned to meet the requirements of
> an application.

Ok. Show me an application where CONFIG_PIN_TLB loses. I'm perfectly
willing to accept they exist. At the moment I've presented little
data, but you've presented none.

--
David Gibson                  | For every complex problem there is a
david@gibson.dropbear.id.au   | solution which is simple, neat and
                              | wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson