Re: RFC: Transparent Hugepage support - Benjamin Herrenschmidt


From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, Marcelo Tosatti <mtosatti@redhat.com>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	Izik Eidus <ieidus@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Nick Piggin <npiggin@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: RFC: Transparent Hugepage support
Date: Wed, 04 Nov 2009 15:10:26 +1100
Message-ID: <1257307826.13611.45.camel@pasglop>
In-Reply-To: <20091103111829.GJ11981@random.random>

On Tue, 2009-11-03 at 12:18 +0100, Andrea Arcangeli wrote:
> On Sun, Nov 01, 2009 at 08:29:27AM +1100, Benjamin Herrenschmidt wrote:
> > This isn't possible on all architectures. Some archs have "segment"
> > constraints which mean only one page size per such "segment". Server
> > ppc's for example (segment size being either 256M or 1T depending on the
> > CPU).
> 
> Hmm 256M is already too large for a transparent allocation. 

Right.

> It will
> require reservation and hugetlbfs to me actually seems a perfect fit
> for this hardware limitation. The software limits of hugetlbfs match
> the hardware limit perfectly and it already provides all necessary
> permission and reservation features needed to deal with extremely huge
> page sizes that probabilistically would never be found in the buddy
> (even if we were to extend it to make it not impossible). 

Yes. Note that powerpc -embedded- processors don't have that limitation
though (in large part because they are mostly SW loaded TLBs and they
support a wider collection of page sizes). So it would be possible to
implement your transparent scheme on those.

> They are also
> hugely expensive to defrag dynamically even if we could [and we can't
> hope to defrag many of those because of slab]. Just in case it's not
> obvious the probability we can defrag degrades exponentially with the
> increase of the hugepagesize (which also means 256M is already orders
> of magnitude more realistic to function than 1G).

True. 256M might even be worth toying with as an experiment on huge
machines with TBs of memory in fact :-)
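
The quoted claim that the odds of defragmenting degrade exponentially with hugepage size can be illustrated with a toy model. This is only a sketch under a loud assumption that is not from the thread: each 4k base page is taken to be pinned by an unmovable allocation (e.g. slab) independently, with a made-up probability `p_pinned`. The function name and the default value are hypothetical.

```python
BASE_PAGE = 4096  # 4k base page size, as in the thread


def prob_region_movable(region_bytes, p_pinned=1e-5):
    """Probability that an aligned region contains no pinned base page.

    ASSUMPTION (illustrative only): every base page is pinned
    independently with probability p_pinned. Real fragmentation is
    not independent, so treat the numbers as qualitative.
    """
    n_base_pages = region_bytes // BASE_PAGE
    # All n pages must be movable, hence the exponent in the region size.
    return (1.0 - p_pinned) ** n_base_pages


if __name__ == "__main__":
    MiB = 1 << 20
    for name, size in (("2M", 2 * MiB), ("256M", 256 * MiB), ("1G", 1024 * MiB)):
        print(name, prob_region_movable(size))
```

Under this model the number of base pages sits in the exponent, so the probability falls off quickly as the hugepage size grows, whatever value is picked for `p_pinned`.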

>  Clearly if we
> increase slab to allocate with a front allocator in 256M chunks then
> our probability increases substantially, but to make something
> realistic there needs to be a factor on the order of 10000 between
> hugepagesize and total ram size. I.e. if 2M pages make some
> probabilistic sense with slab front-allocating 2M pages on a 64G
> system, for 256M pages to make equivalent sense, the system would
> require a minimum of 8 Terabytes of ram.

Well... such systems aren't that far off, so as I said, it
might still make sense to toy a bit with it. That would definitely -not-
include my G5 workstation though :-)
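
The sizing argument in the quoted text (2M pages on a 64G system, scaled up to 256M and, as the quote continues, 1G pages) is plain ratio arithmetic. A minimal sketch follows; the helper name `equivalent_ram` is hypothetical, and sizes are treated as binary units (an assumption, since the thread does not say).

```python
MiB = 1 << 20
GiB = 1 << 30
TiB = 1 << 40


def equivalent_ram(huge_page_bytes, ref_huge=2 * MiB, ref_ram=64 * GiB):
    """RAM needed to keep the same RAM/hugepage ratio as the reference.

    The reference point (2M hugepages on 64G of RAM) is taken from the
    quoted text; everything else is straightforward scaling.
    """
    ratio = ref_ram // ref_huge  # hugepages that fit in RAM at the reference point
    return huge_page_bytes * ratio


if __name__ == "__main__":
    print("ratio:", (64 * GiB) // (2 * MiB))
    print("256M pages need", equivalent_ram(256 * MiB) // TiB, "TiB")
    print("1G pages need", equivalent_ram(1 * GiB) // TiB, "TiB")
```

The reference ratio works out to 32768, which is the "order of 10000" in the quote, and scaling by it reproduces the 8 Terabyte and 32 Terabyte figures.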

>  If pages were 1G sized, the system would
> require 32 Terabytes of ram (and we would have bigger overhead and
> trouble, considering that some allocations would still happen in 4k ptes
> and the fixed overhead of relocating those 4k ranges would be much
> bigger if the hugepage size is a lot bigger than 2M and the regular
> page size is still 4k).
> 
> > > The most important design choice is: always fall back to 4k allocation
> > > if the hugepage allocation fails! This is the _very_ opposite of some
> > > large pagecache patches that failed with -EIO back then if a 64k (or
> > > similar) allocation failed...
> > 
> > Precisely because the approach cannot work on all architectures ?
> 
> I thought the main reason for those patches was to allow a fs
> blocksize bigger than PAGE_SIZE, a PAGE_CACHE_SIZE of 64k would allow
> for a 64k fs blocksize without many fs changes. But yes, if the mmu
> can't fall back, then software can't fall back either and so it impedes
> the transparent design on those architectures... To me hugetlbfs looks
> as good as you can get on those mmus.

Right.
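
The interaction between the "always fall back to 4k" rule and the one-page-size-per-segment constraint can be sketched as a small decision function. This is a toy model, not kernel code: the function, its parameters, and the 2M hugepage size are all assumptions made for illustration.

```python
HUGE = 2 * 1024 * 1024  # assumed hugepage size for the sketch
SMALL = 4096            # regular 4k page


def choose_page_size(huge_available, segment_size_in_use=None,
                     mmu_mixes_sizes=True):
    """Pick the page size for a new mapping in this toy model.

    mmu_mixes_sizes=True  models an MMU that allows several page sizes
                          side by side, so software can always fall
                          back to 4k.
    mmu_mixes_sizes=False models a one-page-size-per-segment MMU: once
                          a segment uses a size, it is locked to it.
    """
    preferred = HUGE if huge_available else SMALL
    if mmu_mixes_sizes or segment_size_in_use is None:
        return preferred  # transparent fallback is possible
    if segment_size_in_use == HUGE and not huge_available:
        # The segment cannot take a 4k page, so there is no fallback.
        raise MemoryError("segment is hugepage-only and no hugepage is free")
    return segment_size_in_use


if __name__ == "__main__":
    print(choose_page_size(huge_available=False))  # falls back to 4k
    print(choose_page_size(True, segment_size_in_use=SMALL,
                           mmu_mixes_sizes=False))  # stays at 4k
```

In the segment-constrained case the only failure-free options are to stay at the size already in use, which is why a reservation-based scheme fits that hardware better than a transparent one.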

I need to look at whether your patch would work "better" for us with our
embedded processors though.

Cheers,
Ben.


