* increased translation cache footprint in v2.6
@ 2005-06-26 17:23 Marcelo Tosatti
2005-06-26 23:42 ` Paul Mackerras
2005-06-26 23:49 ` Andrew Morton
0 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2005-06-26 17:23 UTC (permalink / raw)
To: linux-kernel; +Cc: Benjamin LaHaise
Hi,
I'm sending this info because others also seem to be interested (and I assume there
are other machines with small address translation caches where the following is an issue).
We've noticed a slowdown while moving from v2.4 to v2.6 on a small PPC platform
(an 855T CPU running at 48MHz, with a pair of separate I/D TLB caches of
32 entries each), with a relatively recent kernel (v2.6.11).
Test in question is a "dd" copying 16MB from /dev/zero to RAMDISK.
Pinning an 8Mbyte TLB entry at KERNELBASE brought performance back to v2.4 levels.
The following "itlb-content-before.txt" and "itlb-content-after.txt" files list the
virtual addresses cached in the I-TLB before and after a "sys_read" system call.
[marcelo@logos itlb]$ diff -u 24-itlb-content-before.txt 24-itlb-content-after.txt | grep SPR | grep 816 | grep "+"
+SPR 816 : 0x0ffe800f 268337167
+SPR 816 : 0x0ffeb00f 268349455
+SPR 816 : 0xc009e01f -1073094625
+SPR 816 : 0xc009d01f -1073098721
+SPR 816 : 0xc000301f -1073729505
+SPR 816 : 0xc009c01f -1073102817
[marcelo@logos itlb]$ diff -u 24-itlb-content-before.txt 24-itlb-content-after.txt | grep SPR | grep 818 | grep "+" | wc -l
6
[marcelo@logos itlb]$ diff -u 26-itlb-before.txt 26-itlb-after.txt | grep 816 | grep SPR | grep "+"
+SPR 816 : 0x0feda16f 267231599
+SPR 816 : 0xc004b17f -1073434241
+SPR 816 : 0xc004a17f -1073438337
+SPR 816 : 0x0ff7e16f 267903343
+SPR 816 : 0x1001016f 268501359
+SPR 816 : 0xc000217f -1073733249
+SPR 816 : 0xc001617f -1073651329
+SPR 816 : 0xc002e17f -1073553025
+SPR 816 : 0xc010e17f -1072635521
+SPR 816 : 0xc002d17f -1073557121
+SPR 816 : 0xc010d17f -1072639617
+SPR 816 : 0xc000c17f -1073692289
+SPR 816 : 0xc000317f -1073729153
[marcelo@logos itlb]$ diff -u 26-itlb-before.txt 26-itlb-after.txt | grep 816 | grep SPR | grep "+" | wc -l
13
As can be seen, the number of entries has more than doubled (dominated by kernel addresses).
Sorry, I've got no list of functions for these addresses, but it was pretty obvious at the time,
looking at the sys_read() codepath and the respective virtual addresses.
Manual reorganization of the functions sounded too messy, although BenL mentioned that
fget_light() can and should be optimized.
I suppose the compiler/linker should be doing function-ordering optimization based on
knowledge of the hottest paths (e.g. sys_read/sys_write). Why is it not doing this already,
and what are the implications?
Date: Fri, 22 Apr 2005 12:39:21 -0300
Subject: v2.6 performance slowdown on MPC8xx: Measuring TLB cache misses
Here goes more data about the v2.6 performance slowdown on MPC8xx.
Thanks Benjamin for the TLB miss counter idea!
These are the results of the following test script, which zeroes the TLB counters,
copies 16MB of data from memory to memory using "dd", and reads the counters
again.
--
#!/bin/bash
echo 0 > /proc/tlbmiss
time dd if=/dev/zero of=file bs=4k count=3840
cat /proc/tlbmiss
--
The results:
v2.6: v2.4: delta
[root@CAS root]# sh script [root@CAS root]# sh script
real 0m4.241s real 0m3.440s
user 0m0.140s user 0m0.090s
sys 0m3.820s sys 0m3.330s
I-TLB userspace misses: 142369 I-TLB userspace misses: 2179 ITLB u: 139190
I-TLB kernel misses: 118288 I-TLB kernel misses: 1369 ITLB k: 116319
D-TLB userspace misses: 222916 D-TLB userspace misses: 180249 DTLB u: 38667
D-TLB kernel misses: 207773 D-TLB kernel misses: 167236 DTLB k: 38273
The sum of all TLB miss counter deltas between v2.4 and v2.6 is:
139190 + 116319 + 38667 + 38273 = 332449
Multiplied by 23 cycles, which is the average wait time to read a
page translation from memory on a miss:
332449 * 23 = 7646327 cycles.
That is about 16% of 48000000, the total number of cycles this CPU
performs in one second. It's very likely that there is a significant
indirect effect of this TLB miss increase, other than the wasted
cycles to bring the page tables from memory: exception execution time
and context switching.
Checking "time" output, we can see 1s of slowdown:
[root@CAS root]# time dd if=/dev/zero of=file bs=4k count=3840
v2.4: v2.6: diff
real 0m3.366s real 0m4.360s 0.994s
user 0m0.080s user 0m0.111s 0.031s
sys 0m3.260s sys 0m4.218s 0.958s
Mostly caused by increased kernel execution time.
This proves that the slowdown is, in great part, due to increased
translation cache thrashing.
Now, what is the best way to bring the performance back to v2.4 levels?
For this "dd" test, which is dominated by "sys_read/sys_write", I thought
of trying to bring the hotpath functions into the same pages, thus
decreasing the number of page translations required for such tasks.
Comments are appreciated.
* Re: increased translation cache footprint in v2.6
2005-06-26 23:42 ` Paul Mackerras
@ 2005-06-26 18:31 ` Marcelo Tosatti
0 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2005-06-26 18:31 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linux-kernel, Benjamin LaHaise
On Mon, Jun 27, 2005 at 09:42:52AM +1000, Paul Mackerras wrote:
> Marcelo Tosatti writes:
>
> > We've noticed a slowdown while moving from v2.4 to v2.6 on a small PPC platform
> > (855T CPU running at 48Mhz, containing pair of separate I/D TLB caches with
> > 32 entries each), with a relatively recent kernel (v2.6.11).
> >
> > Test in question is a "dd" copying 16MB from /dev/zero to RAMDISK.
> >
> > Pinning an 8Mbyte TLB entry at KERNELBASE brought performance back to v2.4 levels.
>
> Why are we not pinning a large TLB entry at KERNELBASE in 2.6? Was
> that taken out to reduce the size of the tlb miss handler or
> something?
Paul,
There are buggy instances of tlbie() destroying the 8Mbyte TLB entry -
this is going to be fixed soon (it's MPC8xx-specific...).
I worry about machines that can't pin entries and/or have a smaller number of TLB entries.
* Re: increased translation cache footprint in v2.6
2005-06-26 23:49 ` Andrew Morton
@ 2005-06-26 18:52 ` Marcelo Tosatti
2005-06-27 0:33 ` David S. Miller
0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2005-06-26 18:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, bcrl
On Sun, Jun 26, 2005 at 04:49:39PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > As can be seen the number of entries is more than twice (dominated by kernel addresses).
>
> But doesn't this:
>
> I-TLB userspace misses: 142369 I-TLB userspace misses: 2179 ITLB u: 139190
> I-TLB kernel misses: 118288 I-TLB kernel misses: 1369 ITLB k: 116319
> D-TLB userspace misses: 222916 D-TLB userspace misses: 180249 DTLB u: 38667
> D-TLB kernel misses: 207773 D-TLB kernel misses: 167236 DTLB k: 38273
>
> mean that we're mainly missing on data accesses?
The input files that "diff -u" operates on list only entries from the
instruction cache.
I mention that because I suppose you thought those files could have
mixed I/D cache entries.
Yes, the ratio of instruction to data misses is about 1/2, but the number
of data misses is about the same between v2.4 and v2.6.
> > Sorry, I've got no list of functions for these addresses, but it was pretty obvious at the time
> > looking at the sys_read() codepath and respective virtual addresses.
> >
> > Manual reorganization of the functions sounded too messy, although BenL mentions something about
> > fget_light() can and should be optimized.
>
> The workload you're using also does write(), and the write() paths got
> significantly deeper.
What can be done to bring the functions which compose these paths into
as few pages as possible?
> Stack misses, perhaps.
Can you elaborate? The deltas of data cache misses are about the same.
> But a tlb entry caches the translation for a single
> page, yes?
Well, a TLB entry might cache different-sized pages. The platform supports 4kB,
16kB and 8MB (IIRC, maybe some other size as well).
The bigger (8MB) pages are only used to map 8Mbytes of instructions at KERNELBASE,
24Mbytes of data (three 8Mbyte entries) also at KERNELBASE, and another 8Mbytes of the
configuration-register memory space, which lives outside RAM space.
There was a bug causing the first 8Mbyte entry to be invalidated, which led the
system to use translations from the 4kB pagetables at KERNELBASE.
So, the issue has been "solved" for this particular machine, but it's still there
(and potentially affects other platforms, I suspect).
* Re: increased translation cache footprint in v2.6
2005-06-27 0:33 ` David S. Miller
@ 2005-06-26 19:09 ` Marcelo Tosatti
2005-06-27 0:53 ` David S. Miller
2005-06-27 15:46 ` Dan Malek
2005-06-27 1:55 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2005-06-26 19:09 UTC (permalink / raw)
To: David S. Miller, Dan Malek; +Cc: akpm, linux-kernel
On Sun, Jun 26, 2005 at 05:33:38PM -0700, David S. Miller wrote:
> From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> Date: Sun, 26 Jun 2005 15:52:10 -0300
>
> > Well, a TLB entry might cache different sized pages. The platform
> > support 4kb, 16kb and 8Mb (IIRC, maybe some other size also). The
> > bigger pages (8Mb) are only used to map 8Mbytes of instruction at
> > KERNELBASE, 24Mbytes of data (3 8Mbyte entries) also at KERNELBASE
> > and another 8Mbytes of the configuration registers memory space,
> > which lives outside RAM space.
>
> Why don't you use 8MB TLB entries when there is a miss to
> one of the PAGE_OFFSET pages? I'm not saying to lock them,
> just to use large 8MB TLB entries when a miss is taken for
> kernel data accesses to where the kernel maps all of lowmem.
David,
That's a very interesting idea; it will probably optimize performance in
general (of the "why did nobody think of it before?" kind).
The increase in TLB miss handler size might be offset by the reduced
kernel misses...
Dan, what do you think?
* Re: increased translation cache footprint in v2.6
2005-06-26 17:23 increased translation cache footprint in v2.6 Marcelo Tosatti
@ 2005-06-26 23:42 ` Paul Mackerras
2005-06-26 18:31 ` Marcelo Tosatti
2005-06-26 23:49 ` Andrew Morton
1 sibling, 1 reply; 18+ messages in thread
From: Paul Mackerras @ 2005-06-26 23:42 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel, Benjamin LaHaise
Marcelo Tosatti writes:
> We've noticed a slowdown while moving from v2.4 to v2.6 on a small PPC platform
> (855T CPU running at 48Mhz, containing pair of separate I/D TLB caches with
> 32 entries each), with a relatively recent kernel (v2.6.11).
>
> Test in question is a "dd" copying 16MB from /dev/zero to RAMDISK.
>
> Pinning an 8Mbyte TLB entry at KERNELBASE brought performance back to v2.4 levels.
Why are we not pinning a large TLB entry at KERNELBASE in 2.6? Was
that taken out to reduce the size of the tlb miss handler or
something?
Paul.
* Re: increased translation cache footprint in v2.6
2005-06-26 17:23 increased translation cache footprint in v2.6 Marcelo Tosatti
2005-06-26 23:42 ` Paul Mackerras
@ 2005-06-26 23:49 ` Andrew Morton
2005-06-26 18:52 ` Marcelo Tosatti
1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2005-06-26 23:49 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel, bcrl
Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> As can be seen the number of entries is more than twice (dominated by kernel addresses).
But doesn't this:
I-TLB userspace misses: 142369 I-TLB userspace misses: 2179 ITLB u: 139190
I-TLB kernel misses: 118288 I-TLB kernel misses: 1369 ITLB k: 116319
D-TLB userspace misses: 222916 D-TLB userspace misses: 180249 DTLB u: 38667
D-TLB kernel misses: 207773 D-TLB kernel misses: 167236 DTLB k: 38273
mean that we're mainly missing on data accesses?
> Sorry, I've got no list of functions for these addresses, but it was pretty obvious at the time
> looking at the sys_read() codepath and respective virtual addresses.
>
> Manual reorganization of the functions sounded too messy, although BenL mentions something about
> fget_light() can and should be optimized.
The workload you're using also does write(), and the write() paths got
significantly deeper.
Stack misses, perhaps. But a tlb entry caches the translation for a single
page, yes?
* Re: increased translation cache footprint in v2.6
2005-06-26 18:52 ` Marcelo Tosatti
@ 2005-06-27 0:33 ` David S. Miller
2005-06-26 19:09 ` Marcelo Tosatti
2005-06-27 1:55 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 18+ messages in thread
From: David S. Miller @ 2005-06-27 0:33 UTC (permalink / raw)
To: marcelo.tosatti; +Cc: akpm, linux-kernel, bcrl
From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
Date: Sun, 26 Jun 2005 15:52:10 -0300
> Well, a TLB entry might cache different sized pages. The platform
> support 4kb, 16kb and 8Mb (IIRC, maybe some other size also). The
> bigger pages (8Mb) are only used to map 8Mbytes of instruction at
> KERNELBASE, 24Mbytes of data (3 8Mbyte entries) also at KERNELBASE
> and another 8Mbytes of the configuration registers memory space,
> which lives outside RAM space.
Why don't you use 8MB TLB entries when there is a miss to
one of the PAGE_OFFSET pages? I'm not saying to lock them,
just to use large 8MB TLB entries when a miss is taken for
kernel data accesses to where the kernel maps all of lowmem.
* Re: increased translation cache footprint in v2.6
2005-06-26 19:09 ` Marcelo Tosatti
@ 2005-06-27 0:53 ` David S. Miller
2005-06-27 15:57 ` Dan Malek
2005-06-27 15:46 ` Dan Malek
1 sibling, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-06-27 0:53 UTC (permalink / raw)
To: marcelo.tosatti; +Cc: dan, akpm, linux-kernel
From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
Date: Sun, 26 Jun 2005 16:09:44 -0300
> Thats a very interesting idea, will probably optimize performance in
> general ("why did nobody thought of it before?" kind).
>
> The increase in TLB miss handler size might be offset by the reduced
> kernel misses...
>
> Dan, what do you think?
I doubt it; it costs a single comparison on sparc64 to implement this.
Basically, the TLB miss handler for data accesses on sparc64 looks
like the following.
Load the miss information:
ldxa [%g1 + %g1] ASI_DMMU, %g4 ! Get TAG_ACCESS
If "TAG_CONTEXT_BITS" is zero, it's for the kernel:
andcc %g4, TAG_CONTEXT_BITS, %g0 ! From Nucleus?
be,pn %xcc, 3f ! Yep, special processing
...
If the virtual address has the top-most bit set (and thus it's
"negative"), then it's in the physical memory direct mapping
area. The trap handler register %g2 is preloaded with a fixed
value, that when XOR'd with the fault address information
produces a suitable PTE for loading right into the TLB. This
PTE uses 4MB pages.
3: brlz,pt %g4, 9b ! Kernel virtual map?
xor %g2, %g4, %g5 ! Finish bit twiddles
...
Store the calculated TLB entry and return from the trap.
9: stxa %g5, [%g0] ASI_DTLB_DATA_IN ! Reload TLB
retry ! Trap return
So that's 7 instructions, 2 instruction cache lines, with no main
memory accesses. Surely the PPC folks can do something similar. :-)
* Re: increased translation cache footprint in v2.6
2005-06-27 0:33 ` David S. Miller
2005-06-26 19:09 ` Marcelo Tosatti
@ 2005-06-27 1:55 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 18+ messages in thread
From: Benjamin Herrenschmidt @ 2005-06-27 1:55 UTC (permalink / raw)
To: David S. Miller; +Cc: marcelo.tosatti, akpm, linux-kernel, bcrl
On Sun, 2005-06-26 at 17:33 -0700, David S. Miller wrote:
> From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> Date: Sun, 26 Jun 2005 15:52:10 -0300
>
> > Well, a TLB entry might cache different sized pages. The platform
> > support 4kb, 16kb and 8Mb (IIRC, maybe some other size also). The
> > bigger pages (8Mb) are only used to map 8Mbytes of instruction at
> > KERNELBASE, 24Mbytes of data (3 8Mbyte entries) also at KERNELBASE
> > and another 8Mbytes of the configuration registers memory space,
> > which lives outside RAM space.
>
> Why don't you use 8MB TLB entries when there is a miss to
> one of the PAGE_OFFSET pages? I'm not saying to lock them,
> just to use large 8MB TLB entries when a miss is taken for
> kernel data accesses to where the kernel maps all of lowmem.
Looks like the right thing to do indeed. It should be fairly easy: just
test whether the address is negative (you'll never use >2Gb of address space
on these) and ... heh, you already do that in your 8xx TLB miss handlers to
separate user page tables from kernel page tables :) So the normal user TLB
miss path shouldn't be any different, while the kernel TLB miss path would
be separate and faster due to taking far fewer misses.
Also, you may want to bump the whole page size to 64k on these; an
interesting exercise but probably not very difficult. Our ABI is already
clean at the userland level for up to 64k. We did some experiments on
ppc64 with pseudo-64k pages (emulating them in software) and common
ppc32 userland is just fine.
Using a larger page size makes a lot of sense on those embedded CPUs
with small TLBs, where you usually use little or no swap.
Ben.
* Re: increased translation cache footprint in v2.6
2005-06-26 19:09 ` Marcelo Tosatti
2005-06-27 0:53 ` David S. Miller
@ 2005-06-27 15:46 ` Dan Malek
2005-06-28 6:21 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 18+ messages in thread
From: Dan Malek @ 2005-06-27 15:46 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: akpm, David S. Miller, linux-kernel
On Jun 26, 2005, at 3:09 PM, Marcelo Tosatti wrote:
> Thats a very interesting idea, will probably optimize performance in
> general ("why did nobody thought of it before?" kind).
I've done this before, using the pgd/pmd or pte to hold large-page-size
entries. The problem is the amount of code needed in the tlbmiss handler
to implement this. The Linux page table structure doesn't allow us to
easily format this information, so we have lots of code in the handler
to fabricate these entries. It's a significant overhead for the normal
4K path that was hard to justify.
> The increase in TLB miss handler size might be offset by the reduced
> kernel misses...
We need to be optimizing the applications, since that is where the
real work is done and where the system spends most of its time.
The kernel is easy to optimize with pinned entries, and then we have the
best solution: minimal overhead for the 4K pages, plus an optimal
kernel mapping.
I do want the solution of variable page sizes in the kernel, because
then we don't have to reserve wired entries, providing the best solution.
I'm always thinking of this and experiment with it from time to time, but
I haven't found a solution that is satisfactory to me :-) Maybe something
like an early kernel/user test and separate code paths, but I now have
a solution that eliminates our current test, and I don't want to put it
back in :-) My holy grail is a 4-instruction tlb miss handler, but I haven't
been able to get the PTEs formatted correctly so everyone is happy.
Thanks.
-- Dan
* Re: increased translation cache footprint in v2.6
2005-06-27 0:53 ` David S. Miller
@ 2005-06-27 15:57 ` Dan Malek
2005-06-27 19:50 ` David S. Miller
0 siblings, 1 reply; 18+ messages in thread
From: Dan Malek @ 2005-06-27 15:57 UTC (permalink / raw)
To: David S. Miller; +Cc: akpm, marcelo.tosatti, linux-kernel
On Jun 26, 2005, at 8:53 PM, David S. Miller wrote:
> So that's 7 instructions, 2 instruction cache lines, with no main
> memory accesses. Surely the PPC folks can do something similar. :-)
It's not that easy on the 8xx. It actually implements a two-level
hardware page table. Basically, I want to load the PMD into the
first level of the hardware, then the PTE into the second-level
register.
We have to load both registers with some information, but I can't
get the control bits organized in the pmd/pte to do this easily.
There is also a fair amount of hardware assist in the MMU for
initializing these registers and providing page table offset computation
that we need to utilize.
With the right page table structure the tlb miss handler is very
trivial. Without it, we have to spend lots of time building the entries
dynamically.
Because of the configurability of the address space among text, data,
IO, and uncached mapping, we simply can't test an address bit and
build a new TLB entry. So, I want to use the existing page tables to
represent the spaces, then have the tlb miss handler just use that
information. I'll take a closer look at the kernel/user separate code
paths again.
Thanks.
-- Dan
* RE: increased translation cache footprint in v2.6
@ 2005-06-27 19:04 Al Boldi
0 siblings, 0 replies; 18+ messages in thread
From: Al Boldi @ 2005-06-27 19:04 UTC (permalink / raw)
To: linux-kernel
Marcelo Tosatti wrote:
> We've noticed a slowdown while moving from v2.4 to v2.6 on a small PPC
> platform (855T CPU running at 48Mhz, containing pair of separate I/D
> TLB caches with
> 32 entries each), with a relatively recent kernel (v2.6.11).
>
> Test in question is a "dd" copying 16MB from /dev/zero to RAMDISK.
>
> Which is about 16% of 48000000, the total number of cycles this CPU
> performs in one second.
Testing with hdparm -tT /dev/hda shows an 18% slowdown on a PII-400MHz using
2.4.31/2.6.11.
WHAT'S going on?
* Re: increased translation cache footprint in v2.6
2005-06-27 15:57 ` Dan Malek
@ 2005-06-27 19:50 ` David S. Miller
2005-06-27 20:35 ` Dan Malek
0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-06-27 19:50 UTC (permalink / raw)
To: dan; +Cc: akpm, marcelo.tosatti, linux-kernel
From: Dan Malek <dan@embeddededge.com>
Date: Mon, 27 Jun 2005 11:57:51 -0400
> Because of the configurability of the address space among text, data,
> IO, and uncached mapping, we simply can't test an address bit and
> build a new TLB entry.
Maybe not by testing a bit, but instead via a range test.
cmp %reg, PAGE_OFFSET_BEGIN
bl not_kernel
cmp %reg, PAGE_OFFSET_END
bge not_kernel
Calculate 8MB PTE here
not_kernel:
That's 4 instructions, completely trivial.
I think you're making this problem more complex than it really
is. There is no reason at all to hold page tables for the direct
physical memory mappings of lowmem if you have any control whatsoever
over the TLB miss handler.
You'll be saving tons of memory accesses, and that alone should
count for some significant performance savings especially on
embedded setups. What's more, you'll get 8MB mappings as well,
decreasing the TLB miss rate.
* Re: increased translation cache footprint in v2.6
2005-06-27 19:50 ` David S. Miller
@ 2005-06-27 20:35 ` Dan Malek
2005-06-28 6:18 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 18+ messages in thread
From: Dan Malek @ 2005-06-27 20:35 UTC (permalink / raw)
To: David S. Miller; +Cc: akpm, marcelo.tosatti, linux-kernel
On Jun 27, 2005, at 3:50 PM, David S. Miller wrote:
> I think you're making this problem more complex than it really
> is. There is no reason at all to hold page tables for the direct
> physical memory mappings of lowmem if you have any control whatsoever
> over the TLB miss handler.
I'm not one to make it more complex, I just want to cover
all of the possibilities :-) The "compute kernel" part of
it needs to be generic for all of the possibilities. Like I mentioned,
we have a quite configurable address space for mapping text,
data, IO, and uncached spaces. In addition, we have execute
in place out of flash and other embedded custom options.
I was hoping to find a solution where the kernel TLBs could
be dynamically loaded as well, with the standard look up
algorithm. Yes, I still need a kernel path for the special
case processing of other than 4K pages, but it would be nice
to keep that generic as well.
I agree, it's rather trivial to fabricate the kernel text/data
large page sizes on the fly, but there are other address spaces
that would also benefit from such mapping. I'd like to find a
complete solution to this, and I am working on it as time permits.
I'd rather not define a half-baked solution that has to be lived with
in the future, as I sometimes have to do with code written years ago :-)
Even if we don't use the page tables, we still need to create
them, as the Abatron BDI2000 has knowledge of Linux page
tables. When using this jtag debugger, it performs virtual->physical
translations of addresses not currently mapped by the MMU.
Thanks.
-- Dan
* Re: increased translation cache footprint in v2.6
2005-06-27 20:35 ` Dan Malek
@ 2005-06-28 6:18 ` Benjamin Herrenschmidt
2005-06-28 13:42 ` Dan Malek
0 siblings, 1 reply; 18+ messages in thread
From: Benjamin Herrenschmidt @ 2005-06-28 6:18 UTC (permalink / raw)
To: Dan Malek; +Cc: David S. Miller, akpm, marcelo.tosatti, linux-kernel
On Mon, 2005-06-27 at 16:35 -0400, Dan Malek wrote:
> On Jun 27, 2005, at 3:50 PM, David S. Miller wrote:
>
> > I think you're making this problem more complex than it really
> > is. There is no reason at all to hold page tables for the direct
> > physical memory mappings of lowmem if you have any control whatsoever
> > over the TLB miss handler.
>
> I'm not one to make it more complex, I just want to cover
> all of the possibilities :-) The "compute kernel" part of
> it needs to be generic for all of the possibilities. Like I mentioned.
> we have a quite configurable address space for mapping text,
> data, IO, and uncached spaces. In addition, we have execute
> in place out of flash and other embedded custom options.
> I was hoping to find a solution where the kernel TLBs could
> be dynamically loaded as well, with the standard look up
> algorithm. Yes, I still need a kernel path for the special
> case processing of other than 4K pages, but it would be nice
> to keep that generic as well.
Can't you put the "8Mb page" flag at the PMD level and use normal kernel
page tables ? You'll have to fill PMD entries two by two but that
shouldn't be too difficult.
You can then have the kernel linear mapping use 8Mb pages, along with
some "block IO" translations and leave the rest to normal 4k page tables
Ben.
* Re: increased translation cache footprint in v2.6
2005-06-27 15:46 ` Dan Malek
@ 2005-06-28 6:21 ` Benjamin Herrenschmidt
2005-06-28 13:49 ` Dan Malek
0 siblings, 1 reply; 18+ messages in thread
From: Benjamin Herrenschmidt @ 2005-06-28 6:21 UTC (permalink / raw)
To: Dan Malek; +Cc: Marcelo Tosatti, akpm, David S. Miller, linux-kernel
On Mon, 2005-06-27 at 11:46 -0400, Dan Malek wrote:
> On Jun 26, 2005, at 3:09 PM, Marcelo Tosatti wrote:
>
> > Thats a very interesting idea, will probably optimize performance in
> > general ("why did nobody thought of it before?" kind).
>
> I've done this before, used the pgd/pmd or pte to hold large page
> size entries. The problem is the amount of code needed in the
> tlbmiss handler to implement this. The Linux page table structure
> doesn't allow us to easily format this information, so we have lots
> of code in the handler to fabricate these entries. It's a significant
> overhead for the normal 4K path that was hard to justify.
How so ? The Linux page table structure allows you to format the PTE and
PMD contents pretty much the way you want ...
> We need to be optimizing the applications, since that is where the
> real work is done and where the system spends most of it's time.
> The kernel is easy to optimize with pinned entries, then we have the
> best solution. A minimal overhead for the 4K pages, plus an optimal
> kernel mapping.
Pinned entries are never a good solution, more like a workaround... It's
never good to pin an entry on such a small TLB (though I can understand
that you may want to always pin the first kernel entry); I don't think
it's necessary.
> I do want the solution of variable page sizes in the kernel, because
> we don't have to reserve wired entries, providing the best solution.
> I'm always thinking of this and experiment with it from time to time, but
> I haven't found a solution that is satisfactory to me :-) Maybe something
> like an early kernel/user test and separate code paths, but I now have
> a solution that eliminates our current test, and I don't want to put it
> back in :-) My holy grail is a 4 instruction tlb miss handler, but I haven't
> been able to get the PTEs formatted correctly so everyone is happy.
Paul told me the 8xx has some restrictions about what goes at the "PMD"
level that are a problem for us (is it the cache-inhibited bit ?), and thus
we cannot completely do the PMD/PTE thingy, but I don't know the details;
can you tell me more ?
For the kernel address space, however, we are pretty much free to do
what we want. The only thing for which the kernel needs page tables is
the vmalloc space. The rest can be implemented the way you want by arch
code (though it's often useful to also use page tables for io space).
Ben.
* Re: increased translation cache footprint in v2.6
2005-06-28 6:18 ` Benjamin Herrenschmidt
@ 2005-06-28 13:42 ` Dan Malek
0 siblings, 0 replies; 18+ messages in thread
From: Dan Malek @ 2005-06-28 13:42 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: akpm, David S. Miller, marcelo.tosatti, linux-kernel
On Jun 28, 2005, at 2:18 AM, Benjamin Herrenschmidt wrote:
> Can't you put the "8Mb page" flag at the PMD level and use normal kernel
> page tables ? You'll have to fill PMD entries two by two but that
> shouldn't be too difficult.
Yep. You guys are suggesting all of the things that have been tried,
some more successfully than others. The values in the pmd are just
half of the battle; you also need plenty of code to adjust the MMU
assist registers, copy stuff out of the ptes into the pmds, and so on.
None of this is new to me; I've been working on various solutions
for quite some time. The only thing I'm going to do differently now is
separate the paths for the user and kernel TLB miss handlers. I
just never wanted the more frequently used 4K path to pay the
overhead of these other page sizes. In the past, I tried to write
a single, fast code path for all cases, but it just doesn't work out.
Thanks.
-- Dan
* Re: increased translation cache footprint in v2.6
2005-06-28 6:21 ` Benjamin Herrenschmidt
@ 2005-06-28 13:49 ` Dan Malek
0 siblings, 0 replies; 18+ messages in thread
From: Dan Malek @ 2005-06-28 13:49 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: akpm, Marcelo Tosatti, David S. Miller, linux-kernel
On Jun 28, 2005, at 2:21 AM, Benjamin Herrenschmidt wrote:
> How so ? the linux page table structure allow you to format the PTE and
> PMD contents pretty much the way you want ...
Pretty much the way, but not exactly :-)
> Paul told me the 8xx has some restrictions about what goes at the "PMD"
> level that is a problem for us (is it cache inhibited bit ?) and thus
> we
> cannot completely do the PMD/PTE thingy,
There are some page control options that would be more easily handled at
the pmd boundary. Cache mode is one of them.
-- Dan