public inbox for linux-kernel@vger.kernel.org
* Hugetlbpages in very large memory machines.......
@ 2004-03-13  3:44 Ray Bryant
  2004-03-13  3:48 ` Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Ray Bryant @ 2004-03-13  3:44 UTC
  To: lse-tech, linux-ia64@vger.kernel.org, linux-kernel

We've run into a scaling problem using hugetlbpages on very large memory machines, e.g. machines
with 1 TB or more of main memory.  The problem is that hugetlb pages are not faulted in; rather,
they are zeroed and mapped in by hugetlb_prefault() (at least on ia64), which is called in
response to the user's mmap() request.  The net result is that all of the hugetlb pages end up being
allocated and zeroed by a single thread, and if most of the machine's memory is allocated to hugetlb
pages, and there is 1 TB or more of main memory, zeroing and allocating all of those pages can take
a long time (500 s or more).
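
As a rough check of the figures above, the sketch below works out the zeroing rate they imply and the best-case time if the work were spread across CPUs. It is arithmetic only, not measured data: 1 TB is taken as 2^40 bytes, the function names are made up for illustration, and perfect scaling is an assumption that a real NUMA machine would not reach.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes zeroed per second implied by clearing `bytes` in `seconds`. */
double implied_zero_rate(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds;
}

/*
 * Best-case wall-clock time if `ncpus` CPUs zero disjoint pages in
 * parallel.  Assumes perfect scaling (no memory-bandwidth or lock
 * contention), which is an idealization.
 */
double ideal_parallel_seconds(double serial_seconds, unsigned ncpus)
{
    assert(ncpus > 0);
    return serial_seconds / (double)ncpus;
}
```

With these inputs, implied_zero_rate(1ULL << 40, 500.0) comes to roughly 2.2e9 bytes/s for the single thread, and ideal_parallel_seconds(500.0, 64) gives about 7.8 s for a hypothetical 64-CPU machine.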

We've looked at allocating and zeroing hugetlbpages at fault time, which would at least allow 
multiple processors to be thrown at the problem.  Question is, has anyone else been working on
this problem and might they have prototype code they could share with us?

Thanks,
-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------


* RE: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
@ 2004-03-15 23:31 Seth, Rohit
  0 siblings, 0 replies; 32+ messages in thread
From: Seth, Rohit @ 2004-03-15 23:31 UTC
  To: Ray Bryant, Andrew Morton; +Cc: ak, lse-tech, linux-ia64, linux-kernel



>-----Original Message-----
>From: Ray Bryant
>Andrew Morton wrote:
><unrelated text snipped>
>>
>> As for holding mmap_sem for too long, well, that can presumably be worked
>> around by not mmapping the whole lot in one hit?
>>
>
>There are a number of places that one could do this (explicitly in user code,
>hidden at the library level, or in do_mmap2() where the mm->mmap_sem is taken).
>I'm not happy with requiring the user to make a modification to solve this
>kernel problem.  Hiding the split has the problem of making sure that if any
>of the sub mmap() operations fail then the rest of the mmap() operations have
>to be undone, and this all has to happen in a way that makes the mmap() look
>like a single system call.
>
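
A minimal user-space sketch of the "hidden split" idea discussed above: the region is mapped in chunks, and any failure undoes the partial work so the caller sees a single failed call. The helper name mmap_in_chunks() is hypothetical, and the sketch uses ordinary anonymous memory rather than hugetlbfs so that it runs anywhere; reserving the range with PROT_NONE first is one possible way to keep the pieces contiguous, not something proposed in the thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes as a series of `chunk`-sized sub-mappings that still
 * form one contiguous region.  `chunk` must be a multiple of the page
 * size.  Hypothetical helper, not a kernel or libc interface.
 * Returns the base address, or MAP_FAILED with every partial mapping undone.
 */
void *mmap_in_chunks(size_t len, size_t chunk)
{
    assert(len > 0 && chunk > 0);

    /* Reserve the whole address range first so the pieces stay contiguous. */
    char *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;

        /* Each sub-mmap() replaces one slice of the reservation. */
        void *p = mmap(base + off, n, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED) {
            int saved = errno;      /* keep the original error for the caller */
            munmap(base, len);      /* undo every slice mapped so far */
            errno = saved;
            return MAP_FAILED;
        }
    }
    return base;
}
```

Each sub-mmap() is a separate system call, so the lock is held only for one chunk at a time; the cost is the rollback path shown in the loop.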
>An alternative would be to put some info in the mm_struct indicating that a
>hugetlb_prefault() is in progress, then drop the mm->mmap_sem while
>hugetlb_prefault() is running.  Once it is done, regrab the mm->mmap_sem,
>clear the "in progress" flag and finish up processing.  Any other mmap()
>that got the mmap_sem and found the "in progress" flag set would have to
>fail, perhaps with -EAGAIN (again, an mmap() extension).  One can also
>implement more elaborate schemes where there is a list of pending hugetlb
>mmap()s with the associated address space ranges being listed; one could
>check this list in get_unmapped_area() and return -EAGAIN if there is
>a conflict.
>
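
The "in progress" flag scheme described above can be modeled in user space as follows. This is a toy sketch, not kernel code: struct toy_mm and both functions are hypothetical names, and mm->mmap_sem (a read/write semaphore in the kernel) is modeled as a plain pthread mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Toy stand-in for mm_struct; field names are hypothetical. */
struct toy_mm {
    pthread_mutex_t mmap_sem;   /* models mm->mmap_sem as a plain mutex */
    int prefault_in_progress;   /* the proposed "in progress" flag */
};

/*
 * Start a hugetlb mmap().  Returns 0 and sets the flag, or -EAGAIN if
 * another prefault is already running.  The lock is dropped before the
 * (slow) prefault work would begin.
 */
int toy_mmap_begin(struct toy_mm *mm)
{
    int ret = 0;

    pthread_mutex_lock(&mm->mmap_sem);
    if (mm->prefault_in_progress)
        ret = -EAGAIN;
    else
        mm->prefault_in_progress = 1;
    pthread_mutex_unlock(&mm->mmap_sem);
    return ret;
}

/* Prefault finished: retake the lock, clear the flag, finish up. */
void toy_mmap_end(struct toy_mm *mm)
{
    pthread_mutex_lock(&mm->mmap_sem);
    assert(mm->prefault_in_progress);
    mm->prefault_in_progress = 0;
    pthread_mutex_unlock(&mm->mmap_sem);
}
```

A caller would run the long prefault between toy_mmap_begin() and toy_mmap_end(); any second mmap() attempted in that window gets -EAGAIN, which is the behavioral extension the message points out.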

I think both of the above options are a bit of a stretch.

>I'd still rather see us do the "allocate on fault" approach with prereservation
>to maintain the current ENOMEM return code from mmap() for hugepages.  Let me
>work on that and get back to y'all with a patch and see where we can go from
>there.

I think this allocation-on-fault behavior will become essential when
Andi's mbind becomes part of the base kernel.  This scheme also has the
added advantage of following the normal semantics of page allocation (if a
user wants preallocation, then MAP_LOCKED can be used).  As Andrew said
earlier in the thread, though, this runs the risk of changing behavior
for applications that currently assume pre-faulting in terms of
performance (even if you decrement the count up front but do lazy
allocation), as they will get penalized at fault time.  But this is the
kind of optimization that apps can do when porting to 2.6-based
distributions....
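
To make "decrement the count up front but do lazy allocation" concrete, here is a toy model of prereservation. It is a sketch under stated assumptions: struct toy_pool and its functions are hypothetical names, not kernel symbols, and a real implementation would need locking around the counters.

```c
#include <assert.h>
#include <errno.h>

/* Toy hugepage pool; names are hypothetical, not kernel symbols. */
struct toy_pool {
    unsigned long free_pages;   /* pages not yet allocated */
    unsigned long reserved;     /* pages promised to mappings, not yet faulted */
};

/* mmap() time: promise pages without allocating or zeroing them. */
int toy_reserve(struct toy_pool *p, unsigned long npages)
{
    if (p->free_pages - p->reserved < npages)
        return -ENOMEM;         /* same error the prefault scheme reports */
    p->reserved += npages;
    return 0;
}

/* Fault time: turn one reserved page into an allocated one. */
void toy_fault_one(struct toy_pool *p)
{
    assert(p->reserved > 0 && p->free_pages > 0);
    p->reserved--;
    p->free_pages--;
    /* A real handler would zero the page here, on the faulting CPU. */
}
```

Because the reservation is taken at mmap() time, the caller still sees -ENOMEM immediately when the pool is too small, while the zeroing cost moves to whichever CPU takes each fault.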

> I'll start by taking a look at all of the arch dependent hugetlbpage.c's and
>see how common they all are and move the common code up to mm/hugetlbpage.c.
>(or did WLI's note imply that this is impossible?)
>

You should be able to move the prefault code to the common tree.

>However, is this set of changes something that would still be accepted in 2.6,
>or is this now a 2.7 discussion?
>

Thread overview: 32+ messages
2004-03-13  3:44 Hugetlbpages in very large memory machines Ray Bryant
2004-03-13  3:48 ` Andi Kleen
2004-03-13  5:49   ` William Lee Irwin III
2004-03-13 16:10     ` [Lse-tech] " Andi Kleen
2004-03-14  0:05       ` William Lee Irwin III
2004-03-14  5:22         ` Peter Chubb
     [not found]     ` <844231526.20040313030948@adinet.com.uy>
     [not found]       ` <20040313061232.GB655@holomorphy.com>
2004-03-13 16:32         ` Re[2]: " Luis Mirabal
2004-03-14  2:45   ` Andrew Morton
2004-03-14  4:06     ` [Lse-tech] " Anton Blanchard
2004-03-17 19:05       ` Andy Whitcroft
2004-03-18 20:25         ` Andrew Morton
2004-03-18 21:22           ` Stephen Smalley
2004-03-18 22:21             ` Andy Whitcroft
2004-03-23 17:30         ` Andy Whitcroft
2004-03-24 17:38           ` Andy Whitcroft
2004-03-14  8:38     ` Ray Bryant
2004-03-14  8:48       ` William Lee Irwin III
2004-03-14  8:57       ` Andrew Morton
2004-03-14  9:02         ` Andrew Morton
2004-03-14  9:07         ` William Lee Irwin III
2004-03-15  6:45         ` Ray Bryant
2004-03-15 23:54           ` William Lee Irwin III
2004-03-13  3:55 ` William Lee Irwin III
2004-03-13  4:56 ` Hirokazu Takahashi
2004-03-16  0:30   ` Nobuhiko Yoshida
2004-03-16  1:54     ` Andi Kleen
2004-03-16  2:32       ` Hirokazu Takahashi
2004-03-16  3:20         ` Hirokazu Takahashi
2004-03-16  3:15       ` Nobuhiko Yoshida
2004-04-01  9:10         ` Nobuhiko Yoshida
2004-03-15 15:28 ` jlnance
  -- strict thread matches above, loose matches on Subject: below --
2004-03-15 23:31 [Lse-tech] " Seth, Rohit
