* LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-29 3:08 UTC
To: linuxppc-embedded; +Cc: Paul Mackerras

I did some LMBench runs to observe the effect of CONFIG_PIN_TLB. I've
run the tests in three cases:
a) linuxppc_2_4_devel with the CONFIG_PIN_TLB option disabled
("nopintlb")
b) linuxppc_2_4_devel with the CONFIG_PIN_TLB option enabled
("2pintlb")
c) linuxppc_2_4_devel with the CONFIG_PIN_TLB option enabled,
but modified so that only one 16MB page is pinned rather than two
(i.e. only the first 16MB rather than the first 32MB are mapped with
pinned entries) ("1pintlb")
These tests were done on an IBM Walnut board with 200MHz 405GP. Root
filesystem was ext3 on an IDE disk attached to a Promise PCI IDE
controller.
Overall summary:
Having pinned entries (1 or 2) performs as well as or better than
not having them on virtually everything; the difference ranges from
nothing (lost in the noise) to around 15% (fork proc). The only
measurement where no pinned entries might be argued to win is
LMbench's main memory latency measurement. The difference is < 0.1%
and may just be chance fluctuation.
The difference between 1 and 2 pinned entries is very small.
There are a few cases where 1 might be better (but it might just be
random noise) and a very few where 2 might be better than one. On the
basis of that there seems little point in pinning 2 entries.
Using pinned TLB entries also means it's easier to make sure the
exception exit path is safe, especially in 2.5 (we mustn't take a TLB
miss after SRR0 or SRR1 is loaded).
It's certainly possible to construct a workload that will work poorly
with pinned TLB entries compared to without (make it have an
instruction+data working set of precisely 64 pages), but similarly
it's possible to construct a workload that will work well with 65
available TLB entries and not 64. Unless someone can come up with a
real life workload which works poorly with pinned TLBs, I see little
point in keeping the option - pinned TLBs should always be on (pinning
1 entry).
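
As a rough check on the "around 15%" figure, the fork proc numbers can
be averaged by hand or with a throwaway helper like the one here. The
values are simply copied from the fork proc column of the table that
follows; the program is illustrative only and is not part of the
LMBench harness.

    #include <stdio.h>

    int main(void)
    {
        /* fork proc times in microseconds, copied from the table below */
        double pin1[]  = { 1784, 1768, 1762 };  /* 1pintlb runs */
        double nopin[] = { 2014, 2070, 2059 };  /* nopintlb runs */
        double avg_pin = 0, avg_nopin = 0;
        int i;

        for (i = 0; i < 3; i++) {
            avg_pin   += pin1[i] / 3.0;
            avg_nopin += nopin[i] / 3.0;
        }
        /* roughly 1771us vs 2048us: nopintlb is about 15.6% slower */
        printf("1pintlb %.0f us, nopintlb %.0f us, nopintlb/1pintlb = %.3f\n",
               avg_pin, avg_nopin, avg_nopin / avg_pin);
        return 0;
    }
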
L M B E N C H 2 . 0 S U M M A R Y
------------------------------------
Basic system parameters
----------------------------------------------------
Host OS Description Mhz
--------- ------------- ----------------------- ----
1pintlb Linux 2.4.19- powerpc-linux-gnu 199
1pintlb Linux 2.4.19- powerpc-linux-gnu 199
1pintlb Linux 2.4.19- powerpc-linux-gnu 199
2pintlb Linux 2.4.19- powerpc-linux-gnu 199
2pintlb Linux 2.4.19- powerpc-linux-gnu 199
2pintlb Linux 2.4.19- powerpc-linux-gnu 199
nopintlb Linux 2.4.19- powerpc-linux-gnu 199
nopintlb Linux 2.4.19- powerpc-linux-gnu 199
nopintlb Linux 2.4.19- powerpc-linux-gnu 199
Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
1pintlb Linux 2.4.19- 199 1.44 3.21 16.0 24.1 152.2 5.60 16.5 1784 8231 30.K
1pintlb Linux 2.4.19- 199 1.44 3.20 16.1 24.3 152.4 5.60 16.5 1768 8186 30.K
1pintlb Linux 2.4.19- 199 1.44 3.20 16.1 24.8 152.4 5.60 16.5 1762 8199 30.K
2pintlb Linux 2.4.19- 199 1.44 3.20 16.8 25.0 152.4 5.60 16.4 1773 8191 30.K
2pintlb Linux 2.4.19- 199 1.44 3.21 17.0 25.2 151.9 5.58 17.1 1765 8241 30.K
2pintlb Linux 2.4.19- 199 1.44 3.21 16.8 24.6 153.9 5.60 16.9 1731 8102 30.K
nopintlb Linux 2.4.19- 199 1.46 3.34 17.2 24.6 156.1 5.66 16.5 2014 9012 33.K
nopintlb Linux 2.4.19- 199 1.46 3.35 17.0 25.2 157.9 5.66 16.5 2070 9091 33.K
nopintlb Linux 2.4.19- 199 1.46 3.35 17.2 25.1 154.7 5.65 16.5 2059 9044 33.K
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
1pintlb Linux 2.4.19- 5.260 81.1 269.1 96.1 275.8 95.8 276.7
1pintlb Linux 2.4.19- 3.460 81.7 272.0 95.9 276.5 96.1 276.4
1pintlb Linux 2.4.19- 2.820 82.0 268.4 95.1 275.2 96.2 274.9
2pintlb Linux 2.4.19- 3.930 80.6 280.7 95.3 276.8 95.5 275.1
2pintlb Linux 2.4.19- 6.350 84.0 265.2 95.0 273.7 96.0 273.7
2pintlb Linux 2.4.19- 2.780 82.5 257.8 93.5 272.8 95.6 273.4
nopintlb Linux 2.4.19- 3.590 93.4 282.2 101.5 284.4 101.7 284.1
nopintlb Linux 2.4.19- 0.780 83.1 284.3 100.0 283.1 99.7 282.7
nopintlb Linux 2.4.19- 1.540 93.3 282.4 99.2 281.1 99.1 282.9
*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
1pintlb Linux 2.4.19- 5.260 28.2 72.0 248.3 909.
1pintlb Linux 2.4.19- 3.460 33.0 73.8 268.6 902.
1pintlb Linux 2.4.19- 2.820 30.0 71.8 279.6 903.
2pintlb Linux 2.4.19- 3.930 27.9 73.9 258.6 923.
2pintlb Linux 2.4.19- 6.350 23.9 81.0 244.6 918.
2pintlb Linux 2.4.19- 2.780 27.9 77.5 287.9 910.
nopintlb Linux 2.4.19- 3.590 29.7 75.9 386.9 1194
nopintlb Linux 2.4.19- 0.780 29.0 77.2 388.4 1208
nopintlb Linux 2.4.19- 1.540 31.8 83.4 391.9 1190
File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
1pintlb Linux 2.4.19- 579.4 160.5 1231.5 300.6 1448.0 3.358 18.0
1pintlb Linux 2.4.19- 579.7 160.1 1231.5 315.7 1442.0 3.443 18.0
1pintlb Linux 2.4.19- 579.7 160.6 1236.1 300.8 1456.0 3.405 18.0
2pintlb Linux 2.4.19- 579.0 161.1 1231.5 304.7 1454.0 3.495 18.0
2pintlb Linux 2.4.19- 580.0 159.1 1236.1 317.0 1446.0 2.816 18.0
2pintlb Linux 2.4.19- 579.0 159.8 1228.5 317.7 1444.0 3.342 18.0
nopintlb Linux 2.4.19- 643.5 213.9 1426.5 404.0 1810.0 3.540 21.0
nopintlb Linux 2.4.19- 643.9 213.2 1418.4 394.9 1761.0 3.637 21.0
nopintlb Linux 2.4.19- 645.6 217.2 1436.8 420.2 1776.0 4.233 21.0
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
1pintlb Linux 2.4.19- 39.9 41.9 31.5 47.2 115.6 85.3 83.6 115. 128.0
1pintlb Linux 2.4.19- 43.1 41.6 30.8 48.1 115.6 85.7 84.2 115. 128.9
1pintlb Linux 2.4.19- 42.5 41.1 31.6 48.2 115.6 86.2 84.4 115. 130.6
2pintlb Linux 2.4.19- 42.6 42.4 32.0 48.4 115.6 85.6 84.1 115. 128.7
2pintlb Linux 2.4.19- 42.3 42.4 62.7 48.1 115.6 85.5 84.0 115. 129.4
2pintlb Linux 2.4.19- 44.4 43.7 64.6 48.5 115.6 86.0 84.3 115. 129.4
nopintlb Linux 2.4.19- 39.0 39.3 29.3 46.9 115.5 85.5 83.9 115. 127.8
nopintlb Linux 2.4.19- 41.7 39.3 59.9 47.2 115.5 85.2 84.1 115. 130.1
nopintlb Linux 2.4.19- 41.1 38.2 29.4 47.0 115.5 85.7 84.1 115. 130.5
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
1pintlb Linux 2.4.19- 199 15.0 134.0 149.2 No L2 cache?
1pintlb Linux 2.4.19- 199 15.0 133.9 149.2 No L2 cache?
1pintlb Linux 2.4.19- 199 15.0 133.8 149.2 No L2 cache?
2pintlb Linux 2.4.19- 199 15.0 133.8 149.2 No L2 cache?
2pintlb Linux 2.4.19- 199 15.0 133.8 149.2 No L2 cache?
2pintlb Linux 2.4.19- 199 15.0 133.8 149.1 No L2 cache?
nopintlb Linux 2.4.19- 199 15.0 134.0 149.1 No L2 cache?
nopintlb Linux 2.4.19- 199 15.0 134.1 149.1 No L2 cache?
nopintlb Linux 2.4.19- 199 15.0 133.9 149.0 No L2 cache?
--
David Gibson | For every complex problem there is a
david@gibson.dropbear.id.au | solution which is simple, neat and
| wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson

* Re: LMBench and CONFIG_PIN_TLB
From: Dan Malek @ 2002-05-29 14:40 UTC
To: David Gibson; +Cc: linuxppc-embedded, Paul Mackerras

David Gibson wrote:

> I did some LMBench runs to observe the effect of CONFIG_PIN_TLB.

I implemented the TLB pinning for two reasons. One, politics, since
everyone "just knows it is significantly better", and two, to alleviate
the exception return path problem of taking a TLB miss after loading
SRR0/1.

> .... the difference ranges from
> nothing (lost in the noise) to around 15% (fork proc). The only
> measurement where no pinned entries might be argued to win is
> LMbench's main memory latency measurement. The difference is < 0.1%
> and may just be chance fluctuation.

It has been my experience over the last 20 years that in general
applications that show high TLB miss activity are making inefficient
use of all system resources and aren't likely to be doing any useful
work. Why aren't we measuring cache efficiency? Why aren't we profiling
the kernel to see where code changes will really make a difference?
Why aren't we measuring TLB performance on all processors? If you want
to improve TLB performance, get a processor with larger TLBs or better
hardware support.

Pinning TLB entries simply reduces the resource availability. When I'm
running a real application, doing real work in a real product, I don't
want these resources allocated for something else that is seldom used.
There are lots of other TLB management implementations that can really
improve performance; they just don't fit well into the current
Linux/PowerPC design.

I have seen exactly one application where TLB pinning actually
improved the performance of the system. It was a real-time system,
based on Linux using an MPC8xx, where the maximum event response
latency had to be guaranteed. With the proper locking of pages and TLB
pins this could be done. It didn't improve the performance of the
application, but did ensure the system operated properly.

> The difference between 1 and 2 pinned entries is very small.
> There are a few cases where 1 might be better (but it might just be
> random noise) and a very few where 2 might be better than one. On the
> basis of that there seems little point in pinning 2 entries.

What kind of scientific analysis is this? Run controlled tests, post
the results, explain the variances, and allow it to be repeatable by
others. Is there any consistency to the results?

> ..... Unless someone can come up with a
> real life workload which works poorly with pinned TLBs, I see little
> point in keeping the option - pinned TLBs should always be on (pinning
> 1 entry).

Where is your data that supports this? Where is your "real life
workload" that actually supports what you want to do? From my
perspective, your data shows we shouldn't do it. A "real life workload"
is not a fork proc test, but rather a main memory latency test, where
your tests showed it was better to not pin entries but you can't
explain the "fluctuation." I contend the difference is due to the fact
you have reduced the TLB resources, increasing the number of TLB misses
to an application that is trying to do real work.

I suggest you heed the quote you always attach to your messages. This
isn't a simple solution that is suitable for all applications. It's one
option among many that needs to be tuned to meet the requirements of an
application.

Thanks.

-- Dan

* Re: LMBench and CONFIG_PIN_TLB
From: Paul Mackerras @ 2002-05-29 23:04 UTC
To: Dan Malek; +Cc: David Gibson, linuxppc-embedded

Dan Malek writes:

> I implemented the TLB pinning for two reasons. One, politics, since
> everyone "just knows it is significantly better", and two, to alleviate
> the exception return path problem of taking a TLB miss after loading
> SRR0/1.

The second thing there is important, but there may be other ways around
that problem.

> Pinning TLB entries simply reduces the resource availability. When I'm
> running a real application, doing real work in a real product, I don't
> want these resources allocated for something else that is seldom used.
> There are lots of other TLB management implementations that can really
> improve performance; they just don't fit well into the current
> Linux/PowerPC design.

I suspect we are all confusing two things here: (1) having pinned TLB
entries and (2) using large-page TLB entries for the kernel. At the
moment the first is a prerequisite for the second. The second gives us
a significant performance improvement, and David's measurements show
that.

We could have (2) without pinning any TLB entries but it would take
more code in the TLB miss handler to do that. It is an interesting
question whether the benefit of having the 64th TLB slot available for
applications would outweigh the cost of the slightly slower TLB misses.
My feeling is that it would be a close-run thing either way.

> I have seen exactly one application where TLB pinning actually
> improved the performance of the system. It was a real-time system,
> based on Linux using an MPC8xx, where the maximum event response
> latency had to be guaranteed. With the proper locking of pages and TLB
> pins this could be done. It didn't improve the performance of the
> application, but did ensure the system operated properly.

Were you using any large-page TLB entries at all?

The other point that comes to mind is that the downside of pinning a
TLB entry is going to be much larger when you have fewer TLB entries
available. Tom Rini mentioned the other day that some 8xx processors
only have 8 (I assume he meant 8 data + 8 instruction). Having one
pinned entry out of 8 is going to be a lot more significant than one
out of 64.

David's suggestion was purely in the context of the 405 processor,
which has 64. I don't think he was advocating removing the config
option on the 8xx processors (actually, why is there the "860 only"
comment in there?)

Paul.
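
To make that trade-off concrete, here is a C-level sketch of the extra
check being described. The real 405 miss handlers are written in
assembler, and every helper name below (write_large_tlb_entry,
write_pte_tlb_entry, walk_linux_pte) is a made-up stub for
illustration; this is a sketch of the idea, not kernel code.

    #include <stdio.h>

    #define KERNELBASE  0xc0000000UL        /* start of kernel lowmem */
    #define LARGE_PAGE  0x01000000UL        /* one 16MB TLB entry */

    /* Stand-ins for the real TLB-load and page-table-walk primitives. */
    static void write_large_tlb_entry(unsigned long epn)
    {
        printf("load 16MB TLB entry covering 0x%08lx\n", epn);
    }

    static unsigned long walk_linux_pte(unsigned long ea)
    {
        return ea & ~0xfffUL;               /* pretend this is the PTE */
    }

    static void write_pte_tlb_entry(unsigned long pte)
    {
        printf("load 4K TLB entry from PTE 0x%08lx\n", pte);
    }

    static void tlb_miss(unsigned long ea, int kernel_mode)
    {
        if (kernel_mode && ea >= KERNELBASE && ea < KERNELBASE + LARGE_PAGE) {
            /* Extra test: the kernel's first 16MB is served by a
             * large-page entry synthesized on the fly instead of a
             * pinned one.  This costs a few instructions on every
             * miss but leaves all 64 TLB slots replaceable. */
            write_large_tlb_entry(ea & ~(LARGE_PAGE - 1));
            return;
        }
        write_pte_tlb_entry(walk_linux_pte(ea));    /* normal 4K path */
    }

    int main(void)
    {
        tlb_miss(0xc0123456UL, 1);          /* kernel text: large page */
        tlb_miss(0x10002000UL, 0);          /* user address: 4K PTE */
        return 0;
    }

Whether those few extra instructions per miss outweigh the freed TLB
slot is exactly the close-run question above.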

* Re: LMBench and CONFIG_PIN_TLB
From: Tom Rini @ 2002-05-29 23:16 UTC
To: Paul Mackerras; +Cc: Dan Malek, David Gibson, linuxppc-embedded

On Thu, May 30, 2002 at 09:04:31AM +1000, Paul Mackerras wrote:

> available. Tom Rini mentioned the other day that some 8xx processors
> only have 8 (I assume he meant 8 data + 8 instruction).

Quite probably, yes. :)

[snip]

> processor, which has 64. I don't think he was advocating removing the
> config option on the 8xx processors (actually, why is there the "860
> only" comment in there?)

Because the current code goes and pins 8 or so TLBs (4 data, 4
instruction), which won't fly on the ones which only allow for 2/8 to
be pinned. So 860 is a slight mislabeling, if I read it all correctly.

-- Tom Rini (TR1265) http://gate.crashing.org/~trini/

* Re: LMBench and CONFIG_PIN_TLB
From: Dan Malek @ 2002-05-30 1:34 UTC
To: Paul Mackerras; +Cc: David Gibson, linuxppc-embedded

Paul Mackerras wrote:

> I suspect we are all confusing two things here: (1) having pinned TLB
> entries and (2) using large-page TLB entries for the kernel.

I wasn't confusing them :-). I know that large page sizes are
beneficial. Someday I hope to finish the code that allows large page
sizes in the Linux page tables, so we can just load them.

> We could have (2) without pinning any TLB entries but it would take
> more code in the TLB miss handler to do that.

Only on the 4xx. I have code for the 8xx that loads them using the
standard lookup. Unfortunately, I have found something that isn't quite
stable with the large page sizes, but I don't know what it is.

> .... It is an interesting
> question whether the benefit of having the 64th TLB slot available for
> applications would outweigh the cost of the slightly slower TLB
> misses.

Removing the entry will increase the TLB miss rate by 1/64 * 100
percent, or a little over 1.5%, right? Any application that is
thrashing the TLB cache by removing one entry is running on luck
anyway, so we can't consider those. When you have applications using
lots of CPU in user space (which is usually a good thing :-), increased
TLB misses will add up.

> .... My feeling is that it would be a close-run thing either way.

So, if you have a product that runs better one way or the other, just
select the option that suits your needs. If the 4xx didn't require the
extra code in the miss handler to fangle the PTE, large pages without
pinning would clearly be the way to go (that's why it's an easy
decision on 8xx and I'm using it for testing).

> Were you using any large-page TLB entries at all?

Yes, but the problem was taking the TLB hit to get the first couple of
pages loaded and hitting the hardware register in time. It was a hack
from the first line of code :-) If you are going to pin a kernel entry,
you may as well map the whole space. I don't think it would even work
if we were loading large pages out of the PTE tables.

> .... Tom Rini mentioned the other day that some 8xx processors
> only have 8 (I assume he meant 8 data + 8 instruction).

Yes, there are a number of variants now that have everything from 8 to
64 I believe. It was just easier to pick out the 860 (which always has
lots of entries) for testing purposes.

The 8xx also has hardware support for pinning entries that basically
emulates BATs. It doesn't require any software changes except for the
initial programming of the MMU control and loading of the pinned
entries.

> .... David's suggestion was purely in the context of the 405
> processor, which has 64.

There is an option to enable it, so just enable it by default. What
do you gain by removing the option, except the possibility to prevent
someone from using it when it may be to their benefit? It certainly
isn't a proven modification, as there may be some latent bugs associated
with dual mapping pages that may be covered by the large page and
some other mapping (I think this is the problem I see on the 8xx).

> .... (actually, why is there the "860
> only" comment in there?)

Because the MMU control registers are slightly different among the 8xx
processor variants, and I only wrote the code to work with the 860 :-)

Thanks.

-- Dan

* Re: LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-30 5:14 UTC
To: Dan Malek; +Cc: Paul Mackerras, linuxppc-embedded

On Wed, May 29, 2002 at 09:34:54PM -0400, Dan Malek wrote:
> Paul Mackerras wrote:
>
> > I suspect we are all confusing two things here: (1) having pinned TLB
> > entries and (2) using large-page TLB entries for the kernel.
>
> I wasn't confusing them :-). I know that large page sizes are
> beneficial. Someday I hope to finish the code that allows large page
> sizes in the Linux page tables, so we can just load them.

Well it so happens that Paul and I have tried implementing that this
morning. More data coming in the next day or two.

> > We could have (2) without pinning any TLB entries but it would take
> > more code in the TLB miss handler to do that.
>
> Only on the 4xx. I have code for the 8xx that loads them using the
> standard lookup. Unfortunately, I have found something that isn't quite
> stable with the large page sizes, but I don't know what it is.

I'm only talking about 4xx.

> > .... It is an interesting
> > question whether the benefit of having the 64th TLB slot available for
> > applications would outweigh the cost of the slightly slower TLB
> > misses.
>
> Removing the entry will increase the TLB miss rate by 1/64 * 100
> percent, or a little over 1.5%, right? Any application that is
> thrashing the TLB cache by removing one entry is running on luck
> anyway, so we can't consider those. When you have applications using
> lots of CPU in user space (which is usually a good thing :-), increased
> TLB misses will add up.

Um, assuming a program with some degree of locality, I'd expect it to
increase the miss rate by somewhat less than 1/64, but it will
certainly increase them to an extent. So, show us the data.

> > .... My feeling is that it would be a close-run thing either way.
>
> So, if you have a product that runs better one way or the other, just
> select the option that suits your needs. If the 4xx didn't require the
> extra code in the miss handler to fangle the PTE, large pages without
> pinning would clearly be the way to go (that's why it's an easy
> decision on 8xx and I'm using it for testing).

Actually from the looks of this implementation doing large pages won't
be too bad - we can hijack an existing test so we only hit the extra
code if we hit a large page entry. Tests coming soon, I would expect
it to beat the current CONFIG_PIN_TLB.

> > .... David's suggestion was purely in the context of the 405
> > processor, which has 64.
>
> There is an option to enable it, so just enable it by default. What
> do you gain by removing the option, except the possibility to prevent
> someone from using it when it may be to their benefit? It certainly
> isn't a proven modification, as there may be some latent bugs associated
> with dual mapping pages that may be covered by the large page and
> some other mapping (I think this is the problem I see on the 8xx).

We gain simplicity of code. Feeping creaturism isn't a good thing.

--
David Gibson                  | For every complex problem there is a
david@gibson.dropbear.id.au   | solution which is simple, neat and
                              | wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson
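
The "working set of precisely 64 pages" point from the original mail,
and the disagreement above about whether losing one entry costs 1/64 or
"somewhat less", can be illustrated with a toy model. The sketch below
drives a strict-LRU TLB of 63 or 64 entries with a purely cyclic access
pattern; both the LRU policy and the access pattern are assumptions made
for illustration, not a model of the 405's real software-managed
replacement or of any real workload.

    #include <stdio.h>
    #include <string.h>

    #define MAX_TLB 64

    /* Count misses for a TLB of tlb_size entries (strict LRU) under a
     * cyclic reference string over wset distinct pages. */
    static long misses(int tlb_size, int wset, long refs)
    {
        int tlb[MAX_TLB];           /* tlb[0] is most recently used */
        int valid = 0;
        long m = 0, r;

        for (r = 0; r < refs; r++) {
            int page = (int)(r % wset), i, hit = -1;

            for (i = 0; i < valid; i++)
                if (tlb[i] == page)
                    hit = i;
            if (hit < 0) {          /* miss: evict LRU (last slot) */
                m++;
                if (valid < tlb_size)
                    valid++;
                hit = valid - 1;
            }
            /* promote the page to most-recently-used */
            memmove(&tlb[1], &tlb[0], hit * sizeof(int));
            tlb[0] = page;
        }
        return m;
    }

    int main(void)
    {
        int w;

        for (w = 60; w <= 66; w++)
            printf("working set %2d pages: %6ld misses with 64 entries, "
                   "%6ld with 63\n",
                   w, misses(64, w, 100000), misses(63, w, 100000));
        return 0;
    }

In this toy the lost entry costs nothing until the working set lands
exactly on the boundary, where it suddenly costs everything; real
replacement policies and access patterns smear that out, which is why
measurement rather than a 1/64 rule of thumb is the only way to settle
the argument.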

* Re: LMBench and CONFIG_PIN_TLB
From: Matthew Locke @ 2002-05-30 16:09 UTC
To: Dan Malek; +Cc: Paul Mackerras, David Gibson, linuxppc-embedded

Dan Malek wrote:

[snip]

> There is an option to enable it, so just enable it by default. What
> do you gain by removing the option, except the possibility to prevent
> someone from using it when it may be to their benefit? It certainly
> isn't a proven modification, as there may be some latent bugs associated
> with dual mapping pages that may be covered by the large page and
> some other mapping (I think this is the problem I see on the 8xx).

btw, there are bugs with it. Starting several processes with init or
even telnetd will expose the bug.

[snip]

* Re: LMBench and CONFIG_PIN_TLB
From: Paul Mackerras @ 2002-05-30 23:50 UTC
To: Matthew Locke; +Cc: Dan Malek, David Gibson, linuxppc-embedded

> btw, there are bugs with it. Starting several processes with init or
> even telnetd will expose the bug.

David and I haven't been able to reproduce this on the Walnut or the
EP405. What sort of machine are you using, what processor, how much
RAM, and what distro are you using?

Paul.

* Re: LMBench and CONFIG_PIN_TLB
From: Matthew Locke @ 2002-05-30 23:01 UTC
To: Paul Mackerras; +Cc: Dan Malek, David Gibson, linuxppc-embedded

Paul Mackerras wrote:

> > btw, there are bugs with it. Starting several processes with init or
> > even telnetd will expose the bug.
>
> David and I haven't been able to reproduce this on the Walnut or the
> EP405. What sort of machine are you using, what processor, how much
> RAM, and what distro are you using?

I run MVL (of course) on a Walnut with 32MB of RAM. What is your
environment? btw, MVL uses soft-float in glibc, not floating point
emulation in the kernel.

* Re: LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-31 2:39 UTC
To: Matthew Locke; +Cc: Paul Mackerras, Dan Malek, linuxppc-embedded

On Thu, May 30, 2002 at 04:01:32PM -0700, Matthew Locke wrote:
> Paul Mackerras wrote:
>
> > > btw, there are bugs with it. Starting several processes with init or
> > > even telnetd will expose the bug.
> >
> > David and I haven't been able to reproduce this on the Walnut or the
> > EP405. What sort of machine are you using, what processor, how much
> > RAM, and what distro are you using?
>
> I run MVL (of course) on a Walnut with 32MB of RAM. What is your
> environment? btw, MVL uses soft-float in glibc, not floating point
> emulation in the kernel.

I've tried it both on a Walnut (PVR 401100c4) with 128MB of RAM, root
filesystem on an IDE disk attached to a Promise PCI IDE controller, and
on an EP405PC board (PVR 40110145) with 64MB of RAM with NFS root. In
both cases userland is Debian/sid running with kernel math emulation.

--
David Gibson                  | For every complex problem there is a
david@gibson.dropbear.id.au   | solution which is simple, neat and
                              | wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson

* Re: LMBench and CONFIG_PIN_TLB
From: Tom Rini @ 2002-05-31 0:10 UTC
To: Paul Mackerras; +Cc: Matthew Locke, Dan Malek, David Gibson, linuxppc-embedded

On Fri, May 31, 2002 at 09:50:39AM +1000, Paul Mackerras wrote:
> > btw, there are bugs with it. Starting several processes with init or
> > even telnetd will expose the bug.
>
> David and I haven't been able to reproduce this on the Walnut or the
> EP405. What sort of machine are you using, what processor, how much
> RAM, and what distro are you using?

A Walnut (PVR 40110145) with 32MB of RAM and Debian/Woody shows it off
quite nicely here. Login via serial, telnet to localhost, login, do it
again.

-- Tom Rini (TR1265) http://gate.crashing.org/~trini/

* Re: LMBench and CONFIG_PIN_TLB
From: Tom Rini @ 2002-05-31 14:48 UTC
To: Paul Mackerras; +Cc: Matthew Locke, Dan Malek, David Gibson, linuxppc-embedded

On Thu, May 30, 2002 at 05:10:05PM -0700, Tom Rini wrote:
> On Fri, May 31, 2002 at 09:50:39AM +1000, Paul Mackerras wrote:
> > > btw, there are bugs with it. Starting several processes with init or
> > > even telnetd will expose the bug.
> >
> > David and I haven't been able to reproduce this on the Walnut or the
> > EP405. What sort of machine are you using, what processor, how much
> > RAM, and what distro are you using?
>
> A Walnut (PVR 40110145) with 32MB of RAM and Debian/Woody shows it off
> quite nicely here.

With nfsroot, and logging in via serial console initially even.

-- Tom Rini (TR1265) http://gate.crashing.org/~trini/

* Re: LMBench and CONFIG_PIN_TLB
From: David Gibson @ 2002-05-30 5:05 UTC
To: Dan Malek; +Cc: linuxppc-embedded, Paul Mackerras

On Wed, May 29, 2002 at 10:40:02AM -0400, Dan Malek wrote:
> David Gibson wrote:
>
> > I did some LMBench runs to observe the effect of CONFIG_PIN_TLB.
>
> I implemented the TLB pinning for two reasons. One, politics, since
> everyone "just knows it is significantly better", and two, to alleviate
> the exception return path problem of taking a TLB miss after loading
> SRR0/1.

Ok.

> > .... the difference ranges from
> > nothing (lost in the noise) to around 15% (fork proc). The only
> > measurement where no pinned entries might be argued to win is
> > LMbench's main memory latency measurement. The difference is < 0.1%
> > and may just be chance fluctuation.
>
> It has been my experience over the last 20 years that in general
> applications that show high TLB miss activity are making inefficient
> use of all system resources and aren't likely to be doing any useful
> work. Why aren't we measuring cache efficiency? Why aren't we profiling
> the kernel to see where code changes will really make a difference?
> Why aren't we measuring TLB performance on all processors? If you want
> to improve TLB performance, get a processor with larger TLBs or better
> hardware support.

Good question. Because we all have finite time. I figure an LMBench
run on CONFIG_PIN_TLB, while admittedly quite incomplete information,
is better than no data at all.

> Pinning TLB entries simply reduces the resource availability. When I'm
> running a real application, doing real work in a real product, I don't
> want these resources allocated for something else that is seldom used.
> There are lots of other TLB management implementations that can really
> improve performance; they just don't fit well into the current
> Linux/PowerPC design.

As paulus also points out, there are two issues here. Pinning the TLB
entries per se reduces resource availability. However it provides an
easy way to use a large page TLB entry for the kernel, which for a
number of not infrequent kernel activities is a win according to
LMBench.

> I have seen exactly one application where TLB pinning actually
> improved the performance of the system. It was a real-time system,
> based on Linux using an MPC8xx, where the maximum event response
> latency had to be guaranteed. With the proper locking of pages and TLB
> pins this could be done. It didn't improve the performance of the
> application, but did ensure the system operated properly.
>
> > The difference between 1 and 2 pinned entries is very small.
> > There are a few cases where 1 might be better (but it might just be
> > random noise) and a very few where 2 might be better than one. On the
> > basis of that there seems little point in pinning 2 entries.
>
> What kind of scientific analysis is this? Run controlled tests, post
> the results, explain the variances, and allow it to be repeatable by
> others. Is there any consistency to the results?

Ok, put it like this:
a) this LMbench run shows very weak evidence that 1 pinned entry is
   better than 2, but certainly no evidence that 2 beats 1.
b) I see no theoretical reason that 2 pinned entries would do
   significantly better than 1 (16MB being sufficient to cover all the
   kernel text, static data and BSS),
c) 1 pinned entry is slightly simpler than 2 and therefore wins by
   default.

> > ..... Unless someone can come up with a
> > real life workload which works poorly with pinned TLBs, I see little
> > point in keeping the option - pinned TLBs should always be on (pinning
> > 1 entry).
>
> Where is your data that supports this? Where is your "real life
> workload" that actually supports what you want to do?

Ok, put it this way:

Pro CONFIG_PIN_TLB (as currently implemented):
  - LMbench results, admittedly inconclusive
  - Makes ensuring the exception exit is safe easier

Con CONFIG_PIN_TLB (as currently implemented):
  - You think it isn't a good idea
  - Possible minuscule improvement in main memory latency

Data from a real life workload would certainly trump all the "pro"
arguments I've listed there. Give me some numbers supporting your case
and I'll probably agree with you, but given no other data this suggests
that CONFIG_PIN_TLB wins.

Oh, incidentally a kernel compile also appears to be slightly faster
with CONFIG_PIN_TLB.

> From my perspective, your data shows we shouldn't do it. A "real life
> workload" is not a fork proc test, but rather a main memory latency
> test, where your tests showed it was better to not pin entries but you
> can't explain the "fluctuation." I contend the difference is due to
> the fact you have reduced the TLB resources, increasing the number of
> TLB misses to an application that is trying to do real work.

Dan, either you're not reading or you're not thinking. The difference
between the memory latency numbers is tiny, less than 0.1%. If you
actually look at the LMbench numbers (I have three runs in each
situation), the random variation between each run is around the same
size. Therefore the data is inconclusive, but possibly suggests a
slowdown with CONFIG_PIN_TLB - particularly given that there are at
least two plausible explanations for the slowdown: (a) because we have
fewer free TLB entries we are taking more TLB misses, and (b) with
CONFIG_PIN_TLB the TLB fault handler has a few extra instructions.

*But* any such slowdown is <0.1%. It doesn't take that many page
faults (say), which appear to be around 15% faster with
CONFIG_PIN_TLB, for that to be a bigger win than the (possible) memory
access slowdown.

> I suggest you heed the quote you always attach to your messages. This
> isn't a simple solution that is suitable for all applications. It's one
> option among many that needs to be tuned to meet the requirements of
> an application.

Ok. Show me an application where CONFIG_PIN_TLB loses. I'm perfectly
willing to accept they exist. At the moment I've presented little
data, but you've presented none.

--
David Gibson                  | For every complex problem there is a
david@gibson.dropbear.id.au   | solution which is simple, neat and
                              | wrong. -- H.L. Mencken
http://www.ozlabs.org/people/dgibson