* Re: larger default page sizes...
@ 2008-03-25 23:47 J.C. Pizarro
2008-03-26 15:57 ` H. Peter Anvin
0 siblings, 1 reply; 36+ messages in thread
From: J.C. Pizarro @ 2008-03-25 23:47 UTC (permalink / raw)
To: David Miller, LKML
On Tue, 25 Mar 2008 16:22:44 -0700 (PDT), David Miller wrote:
> > On Mon, 24 Mar 2008, David Miller wrote:
> >
> > > There are ways to get large pages into the process address space for
> > > compute bound tasks, without suffering the well known negative side
> > > effects of using larger pages for everything.
> >
> > These hacks have limitations. F.e. they do not deal with I/O and
> > require application changes.
>
> Transparent automatic hugepages are definitely doable, I don't know
> why you think this requires application changes.
>
> People want these larger pages for HPC apps.
But there is a general problem with larger pages on systems that
don't support them natively (in hardware), depending on how the
kernel's memory manager implements them:

"Doubling the soft page size implies
halving the TLB soft-entries on the old hardware."

"4x soft page size => 1/4 TLB soft-entries, ... and so on."

Assuming one soft double-sized page is backed by 2 real-sized pages,
replacing one soft double-sized page implies replacing the
2 TLB entries containing the 2 real-sized pages.

The TLB is very small: around 24 entries in some processors!

Assuming a soft 64 KiB page built from real 4 KiB pages => 1/16 TLB soft-entries.
If the TLB has 24 entries, then 24/16 = 1.5 soft-entries,
so the TLB effectively holds only 1 soft-entry for soft 64 KiB pages! Weird!

The usual soft sizes for non-native processors are 8 KiB or 16 KiB, not more.
So a TLB of 24 entries of real 4 KiB pages will hold 12 or 6
soft-entries respectively.
^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:47 larger default page sizes J.C. Pizarro
@ 2008-03-26 15:57 ` H. Peter Anvin
  0 siblings, 0 replies; 36+ messages in thread
From: H. Peter Anvin @ 2008-03-26 15:57 UTC (permalink / raw)
To: J.C. Pizarro; +Cc: David Miller, LKML

J.C. Pizarro wrote:
>
> But there is a general problem with larger pages on systems that
> don't support them natively (in hardware), depending on how the
> kernel's memory manager implements them:
>
> "Doubling the soft page size implies
> halving the TLB soft-entries on the old hardware."
>
> "4x soft page size => 1/4 TLB soft-entries, ... and so on."
>
> Assuming one soft double-sized page is backed by 2 real-sized pages,
> replacing one soft double-sized page implies replacing the
> 2 TLB entries containing the 2 real-sized pages.
>
> The TLB is very small: around 24 entries in some processors!
>

That's not a problem, actually, since the TLB entries can get shuffled
like any other (for software TLBs it's a little different, but it can
be dealt with there too.)

The *real* problem is ABI breakage.

	-hpa

^ permalink raw reply	[flat|nested] 36+ messages in thread

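[To spell out the arithmetic in J.C. Pizarro's message above, here is a
minimal C sketch of how many "soft" pages a fixed hardware TLB can cover
when each soft page is backed by several native entries. The 24-entry TLB
and 4 KiB native page size are assumptions carried over from the message,
not data for any particular CPU.]

    /* Soft-page TLB coverage, per the message above (assumed numbers). */
    #include <stdio.h>

    int main(void)
    {
            const int tlb_entries = 24;       /* assumed hardware TLB size */
            const int hw_page = 4 * 1024;     /* assumed native page size  */
            int soft_page;

            for (soft_page = 8 * 1024; soft_page <= 64 * 1024; soft_page *= 2) {
                    int per_soft = soft_page / hw_page;
                    /* each soft page needs one TLB entry per hardware page */
                    printf("soft %2d KiB -> %2d hw pages -> %d soft entries fit\n",
                           soft_page / 1024, per_soft, tlb_entries / per_soft);
            }
            return 0;
    }
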
* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
@ 2008-03-21 17:40 Christoph Lameter
2008-03-21 21:57 ` David Miller
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2008-03-21 17:40 UTC (permalink / raw)
To: David Miller; +Cc: linux-mm, linux-kernel
On Fri, 21 Mar 2008, David Miller wrote:
> I would be very careful with this especially on IA64.
>
> If the TLB miss or other low-level trap handler depends upon being
> able to dereference thread info, task struct, or kernel stack stuff
> without causing a fault outside of the linear PAGE_OFFSET area, this
> patch will cause problems.
Hmmm. Does not sound good for arches that cannot handle TLB misses in
hardware. I wonder how arch specific this is? Last time around I was told
that some arches already virtually map their stacks.
^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 17:40 [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86 Christoph Lameter
@ 2008-03-21 21:57 ` David Miller
  2008-03-24 18:27   ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2008-03-21 21:57 UTC (permalink / raw)
To: clameter; +Cc: linux-mm, linux-kernel

From: Christoph Lameter <clameter@sgi.com>
Date: Fri, 21 Mar 2008 10:40:18 -0700 (PDT)

> On Fri, 21 Mar 2008, David Miller wrote:
>
> > I would be very careful with this especially on IA64.
> >
> > If the TLB miss or other low-level trap handler depends upon being
> > able to dereference thread info, task struct, or kernel stack stuff
> > without causing a fault outside of the linear PAGE_OFFSET area, this
> > patch will cause problems.
>
> Hmmm. Does not sound good for arches that cannot handle TLB misses in
> hardware. I wonder how arch specific this is? Last time around I was told
> that some arches already virtually map their stacks.

I'm not saying there is a problem, I'm saying "tread lightly" because
there might be one.

The thing to do is to first validate the way that IA64 handles
recursive TLB misses occurring during an initial TLB miss, and if
there are any limitations therein.

That's the kind of thing I'm talking about.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86
  2008-03-21 21:57 ` David Miller
@ 2008-03-24 18:27 ` Christoph Lameter
  2008-03-24 20:37   ` larger default page sizes David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2008-03-24 18:27 UTC (permalink / raw)
To: David Miller; +Cc: linux-mm, linux-kernel, linux-ia64

On Fri, 21 Mar 2008, David Miller wrote:

> The thing to do is to first validate the way that IA64
> handles recursive TLB misses occurring during an initial
> TLB miss, and if there are any limitations therein.

I am familiar with that area and I am reasonably sure that this is an
issue on IA64 under some conditions (the processor decides to spill
some registers either onto the stack or into the register backing
store during tlb processing). Recursion (in the kernel context) still
expects the stack and register backing store to be available.

ccing linux-ia64 for any thoughts to the contrary.

The move to 64k page size on IA64 is another way that this issue can
be addressed though. So I think it's best to drop the IA64 portion.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* larger default page sizes...
  2008-03-24 18:27 ` Christoph Lameter
@ 2008-03-24 20:37 ` larger default page sizes David Miller
  2008-03-24 21:05   ` Christoph Lameter
                     ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: David Miller @ 2008-03-24 20:37 UTC (permalink / raw)
To: clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)

> The move to 64k page size on IA64 is another way that this issue can
> be addressed though.

This is such a huge mistake I wish platforms such as powerpc and IA64
would not make such decisions so lightly.

The memory wastage is just ridiculous.

I already see several distributions moving to 64K pages for powerpc,
so I want to nip this in the bud before this monkey-see-monkey-do
thing gets any more out of hand.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-24 20:37 ` larger default page sizes David Miller
@ 2008-03-24 21:05 ` Christoph Lameter
  2008-03-24 21:43   ` David Miller
  2008-03-24 21:25 ` Luck, Tony
  2008-03-25  3:29 ` Paul Mackerras
  2 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2008-03-24 21:05 UTC (permalink / raw)
To: David Miller; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

On Mon, 24 Mar 2008, David Miller wrote:

> From: Christoph Lameter <clameter@sgi.com>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
>
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
>
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.

It's certainly not a light decision if your customer tells you that the box
is almost unusable with 16k page size. For our new 2k and 4k processor
systems this seems to be a requirement. Customers start hacking SLES10 to
run with 64k pages....

> The memory wastage is just ridiculous.

Well yes, if you would use such a box for kernel compiles and small files
then it's a bad move. However, if you have to process terabytes of data
then this is significantly reducing the VM and I/O overhead.

> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.

powerpc also runs HPC codes. They certainly see the same results
that we see.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-24 21:05 ` Christoph Lameter
@ 2008-03-24 21:43 ` David Miller
  2008-03-25 17:48   ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2008-03-24 21:43 UTC (permalink / raw)
To: clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Mon, 24 Mar 2008 14:05:02 -0700 (PDT)

> On Mon, 24 Mar 2008, David Miller wrote:
>
> > From: Christoph Lameter <clameter@sgi.com>
> > Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
> >
> > > The move to 64k page size on IA64 is another way that this issue can
> > > be addressed though.
> >
> > This is such a huge mistake I wish platforms such as powerpc and IA64
> > would not make such decisions so lightly.
>
> It's certainly not a light decision if your customer tells you that the box
> is almost unusable with 16k page size. For our new 2k and 4k processor
> systems this seems to be a requirement. Customers start hacking SLES10 to
> run with 64k pages....

We should fix the underlying problems.

I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
stuff like contention on the per-zone page allocator locks.

Which is very fixable, without going to larger pages.

> powerpc also runs HPC codes. They certainly see the same results
> that we see.

There are ways to get large pages into the process address space for
compute bound tasks, without suffering the well known negative side
effects of using larger pages for everything.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-24 21:43 ` David Miller
@ 2008-03-25 17:48 ` Christoph Lameter
  2008-03-25 23:22   ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2008-03-25 17:48 UTC (permalink / raw)
To: David Miller; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

On Mon, 24 Mar 2008, David Miller wrote:

> We should fix the underlying problems.
>
> I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
> stuff like contention on the per-zone page allocator locks.
>
> Which is very fixable, without going to larger pages.

No, it's not fixable. You are doing linear optimizations to a slowdown
that grows exponentially. Going just one order up for page size reduces
the necessary locks and handling of the kernel by 50%.

> > powerpc also runs HPC codes. They certainly see the same results
> > that we see.
>
> There are ways to get large pages into the process address space for
> compute bound tasks, without suffering the well known negative side
> effects of using larger pages for everything.

These hacks have limitations. F.e. they do not deal with I/O and
require application changes.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 17:48 ` Christoph Lameter
@ 2008-03-25 23:22 ` David Miller
  2008-03-25 23:41   ` Peter Chubb
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2008-03-25 23:22 UTC (permalink / raw)
To: clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Tue, 25 Mar 2008 10:48:19 -0700 (PDT)

> On Mon, 24 Mar 2008, David Miller wrote:
>
> > There are ways to get large pages into the process address space for
> > compute bound tasks, without suffering the well known negative side
> > effects of using larger pages for everything.
>
> These hacks have limitations. F.e. they do not deal with I/O and
> require application changes.

Transparent automatic hugepages are definitely doable, I don't know
why you think this requires application changes.

People want these larger pages for HPC apps.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:22 ` David Miller
@ 2008-03-25 23:41 ` Peter Chubb
  2008-03-25 23:49   ` David Miller
  2008-03-26  0:34   ` David Mosberger-Tang
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Chubb @ 2008-03-25 23:41 UTC (permalink / raw)
To: David Miller; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: Christoph Lameter <clameter@sgi.com> Date: Tue, 25 Mar
David> 2008 10:48:19 -0700 (PDT)

>> On Mon, 24 Mar 2008, David Miller wrote:
>>
>> > There are ways to get large pages into the process address space
>> > for compute bound tasks, without suffering the well known
>> > negative side effects of using larger pages for everything.
>>
>> These hacks have limitations. F.e. they do not deal with I/O and
>> require application changes.

David> Transparent automatic hugepages are definitely doable, I don't
David> know why you think this requires application changes.

It's actually harder than it looks.  Ian Wienand just finished his
Master's project in this area, so we have *lots* of data.  The main
issue is that, at least on Itanium, you have to turn off the hardware
page table walker for hugepages if you want to mix superpages and
standard pages in the same region.  (The long format VHPT isn't the
panacea we'd like it to be because the hash function it uses depends
on the page size.)  This means that although you have fewer TLB misses
with larger pages, the cost of those TLB misses is three to four times
higher than with the standard pages.  In addition, setting up a large
page takes more effort... and it turns out there are few applications
where the cost is amortised enough, so on SpecCPU for example, some
tests improved performance slightly, some got slightly worse.

What we saw was essentially that we could almost eliminate DTLB
misses, other than the first, for a huge page.  For most applications,
though, the extra cost of that first miss, plus the cost of setting up
the huge page, was greater than the few hundred DTLB misses we
avoided.

I'm expecting Ian to publish the full results soon.

Other architectures (where the page size isn't tied into the hash
function, so the hardware walker can be used for superpages) will have
different tradeoffs.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:41 ` Peter Chubb
@ 2008-03-25 23:49 ` David Miller
  2008-03-26  0:25   ` Peter Chubb
  2008-03-26  0:34 ` David Mosberger-Tang
  1 sibling, 1 reply; 36+ messages in thread
From: David Miller @ 2008-03-25 23:49 UTC (permalink / raw)
To: peterc; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

From: Peter Chubb <peterc@gelato.unsw.edu.au>
Date: Wed, 26 Mar 2008 10:41:32 +1100

> It's actually harder than it looks.  Ian Wienand just finished his
> Master's project in this area, so we have *lots* of data.  The main
> issue is that, at least on Itanium, you have to turn off the hardware
> page table walker for hugepages if you want to mix superpages and
> standard pages in the same region.  (The long format VHPT isn't the
> panacea we'd like it to be because the hash function it uses depends
> on the page size.)  This means that although you have fewer TLB misses
> with larger pages, the cost of those TLB misses is three to four times
> higher than with the standard pages.

If the hugepage is more than 3 to 4 times larger than the base page
size, which it almost certainly is, it's still an enormous win.

> Other architectures (where the page size isn't tied into the hash
> function, so the hardware walker can be used for superpages) will have
> different tradeoffs.

Right, admittedly this is just a (one of many) strange IA64 quirk.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:49 ` David Miller
@ 2008-03-26  0:25 ` Peter Chubb
  2008-03-26  0:31   ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Chubb @ 2008-03-26 0:25 UTC (permalink / raw)
To: David Miller
Cc: peterc, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: Peter Chubb <peterc@gelato.unsw.edu.au> Date: Wed, 26 Mar
David> 2008 10:41:32 +1100

>> It's actually harder than it looks.  Ian Wienand just finished his
>> Master's project in this area, so we have *lots* of data.  The main
>> issue is that, at least on Itanium, you have to turn off the
>> hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region.  (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size.)  This means that although you
>> have fewer TLB misses with larger pages, the cost of those TLB
>> misses is three to four times higher than with the standard pages.

David> If the hugepage is more than 3 to 4 times larger than the base
David> page size, which it almost certainly is, it's still an enormous
David> win.

That depends on the access pattern.  We measured a small win for some
workloads, and a small loss for others, using 4k base pages, and
allowing up to 4G superpages (the actual sizes used depended on the
size of the objects being allocated, and the amount of contiguous
memory available).

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:25 ` Peter Chubb
@ 2008-03-26  0:31 ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2008-03-26 0:31 UTC (permalink / raw)
To: peterc; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

From: Peter Chubb <peterc@gelato.unsw.edu.au>
Date: Wed, 26 Mar 2008 11:25:58 +1100

> That depends on the access pattern.

Absolutely.

FWIW, I bet it helps enormously for gcc which, even for small
compiles, swims around chaotically in an 8MB pool of GC'd memory.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:41 ` Peter Chubb
  2008-03-25 23:49   ` David Miller
@ 2008-03-26  0:34 ` David Mosberger-Tang
  2008-03-26  0:39   ` David Miller
  2008-03-26  0:57   ` Peter Chubb
  1 sibling, 2 replies; 36+ messages in thread
From: David Mosberger-Tang @ 2008-03-26 0:34 UTC (permalink / raw)
To: Peter Chubb
Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb <peterc@gelato.unsw.edu.au> wrote:

> The main issue is that, at least on Itanium, you have to turn off the hardware
> page table walker for hugepages if you want to mix superpages and
> standard pages in the same region.  (The long format VHPT isn't the
> panacea we'd like it to be because the hash function it uses depends
> on the page size).

Why not just repeat the PTEs for super-pages?  That won't work for
huge pages, but for superpages that are a reasonable multiple (e.g.,
16-times) the base-page size, it should work nicely.

  --david
--
Mosberger Consulting LLC, http://www.mosberger-consulting.com/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:34 ` David Mosberger-Tang
@ 2008-03-26  0:39 ` David Miller
  2008-03-26  0:57 ` Peter Chubb
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2008-03-26 0:39 UTC (permalink / raw)
To: dmosberger
Cc: peterc, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

From: "David Mosberger-Tang" <dmosberger@gmail.com>
Date: Tue, 25 Mar 2008 18:34:13 -0600

> Why not just repeat the PTEs for super-pages?

This is basically how we implement hugepages in the page tables
on sparc64.

^ permalink raw reply	[flat|nested] 36+ messages in thread

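[For readers unfamiliar with the "repeat the PTEs" idea discussed above,
here is a rough, architecture-neutral sketch of it in C. It is
illustrative only: every identifier in it is invented for the example,
and it is not the actual sparc64 or ia64 code.]

    /*
     * Sketch: a superpage spanning 2^order base pages is represented by
     * that many identical leaf entries, differing only in the physical
     * frame, so a walker that only understands the base page size still
     * finds a valid translation for every address inside the superpage.
     */
    #define BASE_PAGE_SHIFT 12	/* assumed 4 KiB base pages */

    static void fill_superpage_ptes(unsigned long *ptes, unsigned long paddr,
                                    unsigned long prot_bits, unsigned int order)
    {
            unsigned long nr = 1UL << order;	/* base pages per superpage */
            unsigned long i;

            for (i = 0; i < nr; i++)
                    ptes[i] = (paddr + (i << BASE_PAGE_SHIFT)) | prot_bits;
    }
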
* Re: larger default page sizes...
  2008-03-26  0:34 ` David Mosberger-Tang
  2008-03-26  0:39   ` David Miller
@ 2008-03-26  0:57 ` Peter Chubb
  2008-03-26  4:16   ` John Marvin
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Chubb @ 2008-03-26 0:57 UTC (permalink / raw)
To: David Mosberger-Tang
Cc: Peter Chubb, David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, ianw

>>>>> "David" == David Mosberger-Tang <dmosberger@gmail.com> writes:

David> On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb
David> <peterc@gelato.unsw.edu.au> wrote:

>> The main issue is that, at least on Itanium, you have to turn off
>> the hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region.  (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size).

David> Why not just repeat the PTEs for super-pages?  That won't work
David> for huge pages, but for superpages that are a reasonable
David> multiple (e.g., 16-times) the base-page size, it should work
David> nicely.

You end up having to repeat PTEs to fit into Linux's page table
structure *anyway* (unless we can change Linux's page table).  But
there's no place in the short format hardware-walked page table (that
reuses the leaf entries in Linux's table) for a page size.  And if you
use some of the holes in the format, the hardware walker doesn't
understand it --- so you have to turn off the hardware walker for
*any* regions where there might be a superpage.

If you use the long format VHPT, you have a choice: load the hash
table with just the translation that caused the miss, load all
possible hash entries that could have caused the miss for the page, or
preload the hash table when the page is instantiated, with all
possible entries that could hash to the huge page.  I don't remember
the details, but I seem to remember all these being bad choices for
one reason or other ... Ian, can you elaborate?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26  0:57 ` Peter Chubb
@ 2008-03-26  4:16 ` John Marvin
  2008-03-26  4:36   ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: John Marvin @ 2008-03-26 4:16 UTC (permalink / raw)
To: linux-ia64; +Cc: linux-mm, linux-kernel

Peter Chubb wrote:
>
> You end up having to repeat PTEs to fit into Linux's page table
> structure *anyway* (unless we can change Linux's page table).  But
> there's no place in the short format hardware-walked page table (that
> reuses the leaf entries in Linux's table) for a page size.  And if you
> use some of the holes in the format, the hardware walker doesn't
> understand it --- so you have to turn off the hardware walker for
> *any* regions where there might be a superpage.

No, you can set an illegal memory attribute in the pte for any superpage
entry, and leave the hardware walker enabled for the base page size. The
software tlb miss handler can then install the superpage tlb entry. I
posted a working prototype of Shimizu superpages working on ia64 using
short format vhpt's to the linux kernel list a while back.

>
> If you use the long format VHPT, you have a choice: load the hash
> table with just the translation that caused the miss, load all
> possible hash entries that could have caused the miss for the page, or
> preload the hash table when the page is instantiated, with all
> possible entries that could hash to the huge page.  I don't remember
> the details, but I seem to remember all these being bad choices for
> one reason or other ... Ian, can you elaborate?

When I was doing measurements of long format vs. short format, the two
main problems with long format (and why I eventually chose to stick with
short format) were:

1) There was no easy way of determining what size the long format vhpt
cache should be automatically, and changing it dynamically would be too
painful. Different workloads performed better with different size vhpt
caches.

2) Regardless of the size, the vhpt cache is duplicated information.
Using long format vhpt's significantly increased the number of cache
misses for some workloads.

Theoretically there should have been some cases where the long format
solution would have performed better than the short format solution, but
I was never able to create such a case. In many cases the performance of
the long format solution and the short format solution was essentially
the same. In other cases the short format vhpt solution outperformed the
long format solution, and in those cases there was a significant
difference in cache misses that I believe explained the performance
difference.

John

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26  4:16 ` John Marvin
@ 2008-03-26  4:36 ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2008-03-26 4:36 UTC (permalink / raw)
To: jsm; +Cc: linux-ia64, linux-mm, linux-kernel

From: John Marvin <jsm@fc.hp.com>
Date: Tue, 25 Mar 2008 22:16:00 -0600

> 1) There was no easy way of determining what size the long format vhpt cache
> should be automatically, and changing it dynamically would be too painful.
> Different workloads performed better with different size vhpt caches.

This is exactly what sparc64 does btw, dynamic TLB miss hash table
sizing based upon task RSS

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: larger default page sizes...
  2008-03-24 20:37 ` larger default page sizes David Miller
  2008-03-24 21:05   ` Christoph Lameter
@ 2008-03-24 21:25 ` Luck, Tony
  2008-03-24 21:46   ` David Miller
  2008-03-25  3:29 ` Paul Mackerras
  2 siblings, 1 reply; 36+ messages in thread
From: Luck, Tony @ 2008-03-24 21:25 UTC (permalink / raw)
To: David Miller, clameter; +Cc: linux-mm, linux-kernel, linux-ia64, torvalds

> The memory wastage is just ridiculous.

In an ideal world we'd have variable sized pages ... but since
most architectures have no h/w support for these it may be a long
time before that comes to Linux.

In a fixed page size world the right page size to use depends on
the workload and the capacity of the system.  When memory capacity
is measured in hundreds of GB, then a larger page size doesn't
look so ridiculous.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-24 21:25 ` Luck, Tony
@ 2008-03-24 21:46 ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2008-03-24 21:46 UTC (permalink / raw)
To: tony.luck; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: "Luck, Tony" <tony.luck@intel.com>
Date: Mon, 24 Mar 2008 14:25:11 -0700

> When memory capacity is measured in hundreds of GB, then
> a larger page size doesn't look so ridiculous.

We have hugepages and such for a reason.

And this can be made more dynamic and flexible, as needed.

Increasing the page size is a "stick your head in the sand" type
solution by my book.  Especially when you can make the hugepage
facility stronger and thus get what you want without the memory
wastage side effects.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-24 20:37 ` larger default page sizes David Miller
  2008-03-24 21:05   ` Christoph Lameter
  2008-03-24 21:25   ` Luck, Tony
@ 2008-03-25  3:29 ` Paul Mackerras
  2008-03-25  4:15   ` David Miller
                     ` (2 more replies)
  2 siblings, 3 replies; 36+ messages in thread
From: Paul Mackerras @ 2008-03-25 3:29 UTC (permalink / raw)
To: David Miller; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

David Miller writes:

> From: Christoph Lameter <clameter@sgi.com>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
>
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
>
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.

The performance advantage of using hardware 64k pages is pretty
compelling, on a wide range of programs, and particularly on HPC apps.

> The memory wastage is just ridiculous.

Depends on the distribution of file sizes you have.

> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.

I just tried a kernel compile on a 4.2GHz POWER6 partition with 4
threads (2 cores) and 2GB of RAM, with two kernels.  One was
configured with 4kB pages and the other with 64kB pages, but they
were otherwise identically configured.  Here are the times for the
same kernel compile (total time across all threads, for a fairly
full-featured config):

4kB pages:  444.051s user + 34.406s system time
64kB pages: 419.963s user + 16.869s system time

That's nearly 10% faster with 64kB pages -- on a kernel compile.

Yes, the fragmentation in the page cache can be a pain in some
circumstances, but on the whole I think the performance advantage is
worth that pain, particularly for the sort of applications that
people will tend to be running on RHEL on Power boxes.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25  3:29 ` Paul Mackerras
@ 2008-03-25  4:15 ` David Miller
  2008-03-25 11:50   ` Paul Mackerras
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2008-03-25 4:15 UTC (permalink / raw)
To: paulus; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: Paul Mackerras <paulus@samba.org>
Date: Tue, 25 Mar 2008 14:29:55 +1100

> The performance advantage of using hardware 64k pages is pretty
> compelling, on a wide range of programs, and particularly on HPC apps.

Please read the rest of my responses in this thread, you
can have your HPC cake and eat it too.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25  4:15 ` David Miller
@ 2008-03-25 11:50 ` Paul Mackerras
  2008-03-25 23:32   ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Mackerras @ 2008-03-25 11:50 UTC (permalink / raw)
To: David Miller; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

David Miller writes:

> From: Paul Mackerras <paulus@samba.org>
> Date: Tue, 25 Mar 2008 14:29:55 +1100
>
> > The performance advantage of using hardware 64k pages is pretty
> > compelling, on a wide range of programs, and particularly on HPC apps.
>
> Please read the rest of my responses in this thread, you
> can have your HPC cake and eat it too.

It's not just HPC, as I pointed out, it's pretty much everything,
including kernel compiles.  And "use hugepages" is a pretty inadequate
answer given the restrictions of hugepages and the difficulty of using
them.  How do I get gcc to use hugepages, for instance?

Using 64k pages gives us a performance boost for almost everything
without the user having to do anything.

If the hugepage stuff was in a state where it enabled large pages to
be used for mapping an existing program, where possible, without any
changes to the executable, then I would agree with you.  But it isn't,
it's a long way from that, and (as I understand it) Linus has in the
past opposed the suggestion that we should move in that direction.

Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 11:50 ` Paul Mackerras
@ 2008-03-25 23:32 ` David Miller
  2008-03-25 23:49   ` Luck, Tony
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2008-03-25 23:32 UTC (permalink / raw)
To: paulus; +Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: Paul Mackerras <paulus@samba.org>
Date: Tue, 25 Mar 2008 22:50:00 +1100

> How do I get gcc to use hugepages, for instance?

Implementing transparent automatic usage of hugepages has been
discussed many times, it's definitely doable and other OSs have
implemented this for years.

This is what I was implying.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: larger default page sizes...
  2008-03-25 23:32 ` David Miller
@ 2008-03-25 23:49 ` Luck, Tony
  2008-03-26  0:16   ` David Miller
  2008-03-26 15:54   ` Nish Aravamudan
  0 siblings, 2 replies; 36+ messages in thread
From: Luck, Tony @ 2008-03-25 23:49 UTC (permalink / raw)
To: David Miller, paulus
Cc: clameter, linux-mm, linux-kernel, linux-ia64, torvalds

> > How do I get gcc to use hugepages, for instance?
>
> Implementing transparent automatic usage of hugepages has been
> discussed many times, it's definitely doable and other OSs have
> implemented this for years.
>
> This is what I was implying.

"large" pages, or "super" pages perhaps ... but Linux "huge" pages
seem pretty hard to adapt for generic use by applications.  They
are generally somewhere between a bit too big (2MB on X86) and
way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.

Right now they also suffer from making the sysadmin pick at
boot time how much memory to allocate as huge pages (while it
is possible to break huge pages into normal pages, going in
the reverse direction requires a memory defragmenter that
doesn't exist).

Making an application use huge pages as heap may be simple
(just link with a different library to provide a different
version of malloc()) ... code, stack, mmap'd files are all
a lot harder to do transparently.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:49 ` Luck, Tony
@ 2008-03-26  0:16 ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2008-03-26 0:16 UTC (permalink / raw)
To: tony.luck; +Cc: paulus, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

From: "Luck, Tony" <tony.luck@intel.com>
Date: Tue, 25 Mar 2008 16:49:23 -0700

> Making an application use huge pages as heap may be simple
> (just link with a different library to provide a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.

The kernel should be able to do this transparently, at the very
least for the anonymous page case.

It should also be able to handle just fine chips that provide
multiple page size support, as many do.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 23:49 ` Luck, Tony
  2008-03-26  0:16   ` David Miller
@ 2008-03-26 15:54 ` Nish Aravamudan
  2008-03-26 17:05   ` Luck, Tony
  1 sibling, 1 reply; 36+ messages in thread
From: Nish Aravamudan @ 2008-03-26 15:54 UTC (permalink / raw)
To: Luck, Tony
Cc: David Miller, paulus, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, agl, Mel Gorman

On 3/25/08, Luck, Tony <tony.luck@intel.com> wrote:
> > > How do I get gcc to use hugepages, for instance?
> >
> > Implementing transparent automatic usage of hugepages has been
> > discussed many times, it's definitely doable and other OSs have
> > implemented this for years.
> >
> > This is what I was implying.
>
> "large" pages, or "super" pages perhaps ... but Linux "huge" pages
> seem pretty hard to adapt for generic use by applications.  They
> are generally somewhere between a bit too big (2MB on X86) and
> way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.
>
> Right now they also suffer from making the sysadmin pick at
> boot time how much memory to allocate as huge pages (while it
> is possible to break huge pages into normal pages, going in
> the reverse direction requires a memory defragmenter that
> doesn't exist).

That's not entirely true. We have a dynamic pool now, thanks to Adam
Litke [added to Cc], which can be treated as a high watermark for the
hugetlb pool (and the static pool value serves as a low watermark).
Unless by hugepages you mean something other than what I think (but
referring to a 2M size on x86 implies you are not). And with the
antifragmentation improvements, hugepage pool changes at run-time are
more likely to succeed [added Mel to Cc].

> Making an application use huge pages as heap may be simple
> (just link with a different library to provide a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.

I feel like I should promote libhugetlbfs here. We're trying to make
things easier for applications to use. You can back the heap by
hugepages via LD_PRELOAD. But even that isn't always simple (what
happens when something is already allocated on the heap?, which we've
seen happen even in our constructor in the library, for instance).
We're working on hugepage stack support. Text/BSS/Data segment
remapping exists now, too, but does require relinking to be more
successful. We have a mode that allows libhugetlbfs to try to fit the
segments into hugepages, or even just those parts that might fit --
but we have limitations on power and IA64, for instance, where
hugepages are restricted in their placement (either depending on the
process' existing mappings or generally).

libhugetlbfs has, at least, been tested a bit on IA64 to validate the
heap backing (IIRC) and the various kernel tests. We also have basic
sparc support -- however, I don't have any boxes handy to test on
(working on getting them added to our testing grid and then will
revisit them), and then one box I used before gave me semi-spurious
soft-lockups (old bug, unclear if it is software or just buggy
hardware).

In any case, my point is people are trying to work on this from
various angles. Both making hugepages more available at run-time (in
a dynamic fashion, based upon need) and making them easier to use for
applications. Is it easy? Not necessarily. Is it guaranteed to work?
I like to think we make a best effort. But as others have pointed
out, it doesn't seem like we're going to get mainline transparent
hugepage support anytime soon.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 36+ messages in thread

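[For context on the explicit route that libhugetlbfs automates, here is a
minimal sketch of an application mapping hugepages by hand through
hugetlbfs, roughly what users had to do themselves at the time of this
thread. The mount point /mnt/huge and the 16 MiB mapping size are
assumptions made for the example, not values taken from the discussion.]

    /* Explicit hugepage use via a hugetlbfs file (illustrative sketch). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LENGTH (16UL * 1024 * 1024)   /* must be a hugepage multiple */

    int main(void)
    {
            int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
            void *p;

            if (fd < 0) {
                    perror("open (is hugetlbfs mounted at /mnt/huge?)");
                    return 1;
            }
            p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap (are enough hugepages reserved?)");
                    return 1;
            }
            /* Memory at p is now backed by hugepages. */
            munmap(p, LENGTH);
            close(fd);
            unlink("/mnt/huge/example");
            return 0;
    }
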
* RE: larger default page sizes...
  2008-03-26 15:54 ` Nish Aravamudan
@ 2008-03-26 17:05 ` Luck, Tony
  2008-03-26 18:54   ` Mel Gorman
  0 siblings, 1 reply; 36+ messages in thread
From: Luck, Tony @ 2008-03-26 17:05 UTC (permalink / raw)
To: Nish Aravamudan
Cc: David Miller, paulus, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, agl, Mel Gorman

> That's not entirely true. We have a dynamic pool now, thanks to Adam
> Litke [added to Cc], which can be treated as a high watermark for the
> hugetlb pool (and the static pool value serves as a low watermark).
> Unless by hugepages you mean something other than what I think (but
> referring to a 2M size on x86 implies you are not). And with the
> antifragmentation improvements, hugepage pool changes at run-time are
> more likely to succeed [added Mel to Cc].

Things are better than I thought ... though the phrase "more likely
to succeed" doesn't fill me with confidence.  Instead I imagine a
system where an occasional spike in memory load causes some memory
fragmentation that can't be handled, and so from that point many of
the applications that relied on huge pages take a 10% performance
hit.  This results in sysadmins scheduling regular reboots to unjam
things.  [Reminds me of the instructions that came with my first
flatbed scanner that recommended rebooting the system before and
after each use :-( ]

> I feel like I should promote libhugetlbfs here.

This is also better than I thought ... sounds like some really
good things have already happened here.

-Tony

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26 17:05 ` Luck, Tony
@ 2008-03-26 18:54 ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2008-03-26 18:54 UTC (permalink / raw)
To: Luck, Tony
Cc: Nish Aravamudan, David Miller, paulus, clameter, linux-mm, linux-kernel, linux-ia64, torvalds, agl

On (26/03/08 10:05), Luck, Tony didst pronounce:
> > That's not entirely true. We have a dynamic pool now, thanks to Adam
> > Litke [added to Cc], which can be treated as a high watermark for the
> > hugetlb pool (and the static pool value serves as a low watermark).
> > Unless by hugepages you mean something other than what I think (but
> > referring to a 2M size on x86 implies you are not). And with the
> > antifragmentation improvements, hugepage pool changes at run-time are
> > more likely to succeed [added Mel to Cc].
>
> Things are better than I thought ... though the phrase "more likely
> to succeed" doesn't fill me with confidence.

It's a lot more likely to succeed since 2.6.24 than it has in the past.
On workloads where it is mainly user data that is occupying memory, the
chances are even better. If min_free_kbytes is
hugepage_size*num_online_nodes(), it becomes harder again to fragment
memory.

> Instead I imagine a
> system where an occasional spike in memory load causes some memory
> fragmentation that can't be handled, and so from that point many of
> the applications that relied on huge pages take a 10% performance
> hit.

If it was found to be a problem and normal anti-frag is not coping for
hugepage pool resizes, then specify
movablecore=MAX_POSSIBLE_POOL_SIZE_YOU_WOULD_NEED on the command-line
and the hugepage pool will be able to expand to that size independent
of workload.  This would avoid the need to schedule regular reboots.

> This results in sysadmins scheduling regular reboots to unjam
> things.  [Reminds me of the instructions that came with my first
> flatbed scanner that recommended rebooting the system before and
> after each use :-( ]
>
> > I feel like I should promote libhugetlbfs here.
>
> This is also better than I thought ... sounds like some really
> good things have already happened here.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25  3:29 ` Paul Mackerras
  2008-03-25  4:15   ` David Miller
@ 2008-03-25 12:05 ` Andi Kleen
  2008-03-25 21:27   ` Paul Mackerras
  2008-03-26  5:24   ` Paul Mackerras
  2 siblings, 2 replies; 36+ messages in thread
From: Andi Kleen @ 2008-03-25 12:05 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

Paul Mackerras <paulus@samba.org> writes:
>
> 4kB pages:  444.051s user + 34.406s system time
> 64kB pages: 419.963s user + 16.869s system time
>
> That's nearly 10% faster with 64kB pages -- on a kernel compile.

Do you have some idea where the improvement mainly comes from?
Is it TLB misses or reduced in-kernel overhead? Ok, I assume both
play together, but which part of the equation is more important?

-Andi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 12:05 ` Andi Kleen
@ 2008-03-25 21:27 ` Paul Mackerras
  0 siblings, 0 replies; 36+ messages in thread
From: Paul Mackerras @ 2008-03-25 21:27 UTC (permalink / raw)
To: Andi Kleen
Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

Andi Kleen writes:

> Paul Mackerras <paulus@samba.org> writes:
> >
> > 4kB pages:  444.051s user + 34.406s system time
> > 64kB pages: 419.963s user + 16.869s system time
> >
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
>
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced in-kernel overhead? Ok, I assume both
> play together, but which part of the equation is more important?

I think that to a first approximation, the improvement in user time
(24 seconds) is due to the increased TLB reach and reduced TLB misses,
and the improvement in system time (18 seconds) is due to the reduced
number of page faults and reductions in other kernel overheads.

As Dave Hansen points out, I can separate the two effects by having
the kernel use 64k pages at the VM level but 4k pages in the hardware
page table, which is easy since we have support for 64k base page
size on machines that don't have hardware 64k page support.  I'll do
that today.

Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25 12:05 ` Andi Kleen
  2008-03-25 21:27   ` Paul Mackerras
@ 2008-03-26  5:24 ` Paul Mackerras
  2008-03-26 15:59   ` Linus Torvalds
  2008-03-26 17:56   ` Christoph Lameter
  1 sibling, 2 replies; 36+ messages in thread
From: Paul Mackerras @ 2008-03-26 5:24 UTC (permalink / raw)
To: Andi Kleen
Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

Andi Kleen writes:

> Paul Mackerras <paulus@samba.org> writes:
> >
> > 4kB pages:  444.051s user + 34.406s system time
> > 64kB pages: 419.963s user + 16.869s system time
> >
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
>
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced in-kernel overhead? Ok, I assume both
> play together, but which part of the equation is more important?

With the kernel configured for a 64k page size, but using 4k pages in
the hardware page table, I get:

64k/4k: 441.723s user + 27.258s system time

So the improvement in the user time is almost all due to the reduced
TLB misses (as one would expect).  For the system time, using 64k
pages in the VM reduces it by about 21%, and using 64k hardware pages
reduces it by another 30%.  So the reduction in kernel overhead is
significant but not as large as the impact of reducing TLB misses.

Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26  5:24 ` Paul Mackerras
@ 2008-03-26 15:59 ` Linus Torvalds
  2008-03-27  1:08   ` Paul Mackerras
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2008-03-26 15:59 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andi Kleen, David Miller, clameter, linux-mm, linux-kernel, linux-ia64

On Wed, 26 Mar 2008, Paul Mackerras wrote:
>
> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect).  For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%.  So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.

I realize that getting the POWER people to accept that they have been
total morons when it comes to VM for the last three decades is hard, but
somebody in the POWER hardware design camp should (a) be told and (b) be
really ashamed of themselves.

Is this a POWER6 or what? Because 21% overhead from TLB handling on
something like gcc shows that some piece of hardware is absolute crap.

May I suggest people inside IBM try to fix this some day, and in the
meantime people outside should probably continue to buy Intel/AMD CPU's
until the others can get their act together.

		Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26 15:59 ` Linus Torvalds
@ 2008-03-27  1:08 ` Paul Mackerras
  0 siblings, 0 replies; 36+ messages in thread
From: Paul Mackerras @ 2008-03-27 1:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andi Kleen, David Miller, clameter, linux-mm, linux-kernel, linux-ia64

Linus Torvalds writes:

> On Wed, 26 Mar 2008, Paul Mackerras wrote:
> >
> > So the improvement in the user time is almost all due to the reduced
> > TLB misses (as one would expect).  For the system time, using 64k
> > pages in the VM reduces it by about 21%, and using 64k hardware pages
> > reduces it by another 30%.  So the reduction in kernel overhead is
> > significant but not as large as the impact of reducing TLB misses.
>
> I realize that getting the POWER people to accept that they have been
> total morons when it comes to VM for the last three decades is hard, but
> somebody in the POWER hardware design camp should (a) be told and (b) be
> really ashamed of themselves.
>
> Is this a POWER6 or what? Because 21% overhead from TLB handling on
> something like gcc shows that some piece of hardware is absolute crap.

You have misunderstood the 21% number.  That number has *nothing* to
do with hardware TLB miss handling, and everything to do with how long
the generic Linux virtual memory code spends doing its thing (page
faults, setting up and tearing down Linux page tables, etc.).

It doesn't even have anything to do with the hash table (hardware page
table), because both cases are using 4k hardware pages.  Thus in both
cases the TLB misses and hash-table misses would have been the same.
The *only* difference between the cases is the page size that the
generic Linux virtual memory code is using.  With the 64k page size
our architecture-independent kernel code runs 21% faster.

Thus the 21% is not about the TLB or any hardware thing at all, it's
about the larger per-byte overhead of our kernel code when using the
smaller page size.

The thing you were ranting about -- hardware TLB handling overhead --
comes in at 5%, comparing 4k hardware pages to 64k hardware pages
(444 seconds vs. 420 seconds user time for the kernel compile).

And yes, it's a POWER6.

Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26  5:24 ` Paul Mackerras
  2008-03-26 15:59   ` Linus Torvalds
@ 2008-03-26 17:56 ` Christoph Lameter
  2008-03-26 23:21   ` David Miller
  2008-03-27  3:00   ` Paul Mackerras
  1 sibling, 2 replies; 36+ messages in thread
From: Christoph Lameter @ 2008-03-26 17:56 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andi Kleen, David Miller, linux-mm, linux-kernel, linux-ia64, torvalds

On Wed, 26 Mar 2008, Paul Mackerras wrote:

> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect).  For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%.  So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.

One should emphasize that this test was a kernel compile which is not
a load that gains much from larger pages.  4k pages are mostly okay for
loads that use large amounts of small files.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26 17:56 ` Christoph Lameter
@ 2008-03-26 23:21 ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2008-03-26 23:21 UTC (permalink / raw)
To: clameter; +Cc: paulus, andi, linux-mm, linux-kernel, linux-ia64, torvalds

From: Christoph Lameter <clameter@sgi.com>
Date: Wed, 26 Mar 2008 10:56:17 -0700 (PDT)

> One should emphasize that this test was a kernel compile which is not
> a load that gains much from larger pages.

Actually, ever since gcc went to a garbage collecting allocator, I've
found it to be a TLB thrasher.

It will repeatedly randomly walk over a GC pool of at least 8MB in
size, which to fit fully in the TLB with 4K pages requires a TLB with
2048 entries, assuming gcc touches no other data which is of course a
false assumption.

For some compiles this GC pool is more than 100MB in size.

GCC does not fit into any modern TLB using its base page size.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-26 17:56 ` Christoph Lameter
  2008-03-26 23:21   ` David Miller
@ 2008-03-27  3:00 ` Paul Mackerras
  1 sibling, 0 replies; 36+ messages in thread
From: Paul Mackerras @ 2008-03-27 3:00 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andi Kleen, David Miller, linux-mm, linux-kernel, linux-ia64, torvalds

Christoph Lameter writes:

> One should emphasize that this test was a kernel compile which is not
> a load that gains much from larger pages.  4k pages are mostly okay for
> loads that use large amounts of small files.

It's also worth emphasizing that 1.5% of the total time, or 21% of the
system time, is pure software overhead in the Linux kernel that has
nothing to do with the TLB or with gcc's memory access patterns.
That's the cost of handling memory in small (i.e. 4kB) chunks inside
the generic Linux VM code, rather than bigger chunks.

Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: larger default page sizes...
  2008-03-25  3:29 ` Paul Mackerras
  2008-03-25  4:15   ` David Miller
  2008-03-25 12:05   ` Andi Kleen
@ 2008-03-25 18:27 ` Dave Hansen
  2 siblings, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2008-03-25 18:27 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Miller, clameter, linux-mm, linux-kernel, linux-ia64, torvalds

On Tue, 2008-03-25 at 14:29 +1100, Paul Mackerras wrote:
> 4kB pages:  444.051s user + 34.406s system time
> 64kB pages: 419.963s user + 16.869s system time
>
> That's nearly 10% faster with 64kB pages -- on a kernel compile.

Can you do the same thing with the 4k MMU pages and 64k PAGE_SIZE?
Wouldn't that easily break out whether the advantage is from the TLB
or from less kernel overhead?

-- Dave

^ permalink raw reply	[flat|nested] 36+ messages in thread
