public inbox for linux-kernel@vger.kernel.org
From: Ray Bryant <raybry@sgi.com>
To: "Martin J. Bligh" <mbligh@aracnet.com>
Cc: Andy Whitcroft <apw@shadowen.org>, Andrew Morton <akpm@osdl.org>,
	linux-kernel@vger.kernel.org, anton@samba.org,
	sds@epoch.ncsc.mil, ak@suse.de, lse-tech@lists.sourceforge.net,
	linux-ia64@vger.kernel.org
Subject: Re: [Lse-tech] Re: [PATCH] [0/6] HUGETLB memory commitment
Date: Sun, 28 Mar 2004 15:32:02 -0600
Message-ID: <40674452.1080305@sgi.com>
In-Reply-To: <98220000.1080501001@[10.10.2.4]>



Martin J. Bligh wrote:
>>As I understood this originally, the suggestion was to reserve hugetlb 
>>pages at mmap() or shmget() time so that the user would get an -ENOMEM 
>>at that time if there aren't enough hugetlb pages to (eventually) satisfy 
>>the request, as per the notion that we shouldn't modify the user API due 
>>to going with allocate on fault instead of hugetlb_prefault().
> 
> 
> Yup, but there were two parts to it:
> 
> 1. Stop hugepages using the existing overcommit pool for small pages, 
> which breaks small page allocations by prematurely exhausting the pool.
> 2. Give hugepages their own over-commit pool, instead of prefaulting.
> 
> Personally I think we need both (as you seem to), but (1) is probably
> more urgent.

Just to review:  even if we allocate hugetlb pages at fault rather than at mmap() time, hugetlb 
pages are created either at system boot time (kernel parameter "hugepages=") or by setting 
/proc/sys/vm/nr_hugepages (or by using the corresponding sysctl).  Once the set of hugepages is 
created this way, it is never changed by the act of allocating a huge page to a process.  (Changing 
nr_hugepages can cause the number of unallocated hugetlbpages to be increased or decreased.)

The reason for pointing this out (apologies if this was obvious to all) is to emphasize that 
hugetlbpages are not created at hugetlbpage allocation time (which is now done at mmap() time and 
which we'd like to change to happen at fault time).
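
To make the distinction concrete, here is a small user-space sketch of the pool (not kernel code).  The names struct hugepage_pool, pool_resize() and pool_alloc() are made up for illustration, and the rule that pages in use put a floor on a resize is an assumption of the sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the hugetlb page pool; not kernel code. */
struct hugepage_pool {
	long total;	/* pages created via "hugepages=" or nr_hugepages */
	long free;	/* pages not yet handed to any process */
};

/*
 * Models writing a new value to nr_hugepages.  Only the number of
 * unallocated pages changes; as an assumption of this sketch, pages
 * already in use put a floor on the new pool size.
 */
static void pool_resize(struct hugepage_pool *p, long want)
{
	long in_use = p->total - p->free;

	if (want < in_use)
		want = in_use;
	p->free = want - in_use;
	p->total = want;
}

/*
 * Models handing one page to a process (at mmap() time today, at fault
 * time in the proposal).  It takes a page from the free list and never
 * creates one, so the pool total is unchanged.
 */
static bool pool_alloc(struct hugepage_pool *p)
{
	if (p->free == 0)
		return false;
	p->free--;
	assert(p->free >= 0 && p->free <= p->total);
	return true;
}
```

Whether pool_alloc() is called at mmap() time or at fault time, only pool_resize() ever changes the total.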

So to stop hugepages from using the small page overcommit pool, we just need code in 
set_hugetlb_mem_size() to reduce the number of hugetlbpages created by that code.

As for (2), I'm a little confused there, as later you appear to agree with me that overcommitting 
hugetlbpages is likely not useful.   Is it possible that you meant that there should be a list of 
hugetlbpages from which all allocations are made?  If so, that is the way the code has always 
worked; step 1 was to create the list of hugetlbpages, and step 2 was to allocate them.

(Once again, if this is obvious to all, I apologize and we can dump the last 4 paragraphs into the 
bit bucket with no known effect on entropy in this universe, at least.)

> 
> 
>>Since the reservation belongs to the mapped object (file or segment), 
>>I've been storing the current file/segment's reservation in the file 
>>system dependent part of the inode.  That way, it is easily accessible 
>>when the hugetlbfs file or SysV segment is removed and we can reduce 
>>the total number of reserved pages by that file's reservation at that 
>>time.  This also allows us to handle the reservation in the absence 
>>of a vma, as per Andy's comment below.
> 
> 
> Do we need to store it there, or is one central pool number sufficient?
> I would have thought it was ...
> 

Yes, there is a central pool number indicating how many hugepages are reserved. The question is, 
when (and how) do you release that reservation?   My take is that the reservation is associated with 
the file (for mmap()) or the segment (for SysV).

For example, program A mmap()'s a hugetlbfs file, but only touches part of the pages.  Program B 
then mmap()'s the same file with the same size, etc.  When program B does the mmap() the previous 
reservation should still be in place, right?  (The file is persistent in the page cache even if it 
does not persist over reboot, so the 2nd program is expecting to see the data that the first 
program put there.)

Ditto for a SysV segment.

So one can't release the reservation when the current process doing the mmap() goes away; one has to 
release the reservation when the file/segment is deleted.  Since both mmap() and shmget() create an 
inode, and the inode is released by hugetlbfs_drop_inode() and friends, it seemed simplest to put 
the size of the mapped object's reservation in the inode.

The global count of reserved pages (the "central pool number" in your note) is incremented at 
mmap() time (well, actually by hugetlbfs_file_mmap(), for both mmap() and shmget()) and 
decremented at hugetlbfs_drop_inode() time.  If, at mmap() time, incrementing the global reservation 
count would make the global reserved pages count exceed the number of hugetlbpages, we fail the 
mmap() with -ENOMEM.

At least that is the way my 2.4.21 code works.  Does that make things clearer?
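
A rough user-space sketch of that accounting follows.  It is not the 2.4.21 patch: struct model_inode, model_file_mmap() and model_drop_inode() are hypothetical stand-ins for the hugetlbfs inode, hugetlbfs_file_mmap() and hugetlbfs_drop_inode(), and growing an existing reservation when a later mapping is larger is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model of the reservation accounting. */
static long nr_hugepages;	/* size of the hugetlb page pool */
static long reserved_pages;	/* global count of reserved pages */

struct model_inode {
	long reserved;		/* this file's/segment's reservation */
};

/*
 * mmap()/shmget() path: make sure the object has a reservation
 * covering `pages`, or fail without changing any counter.
 */
static int model_file_mmap(struct model_inode *inode, long pages)
{
	long extra = pages - inode->reserved;

	if (extra <= 0)
		return 0;	/* existing reservation already covers it */
	if (reserved_pages + extra > nr_hugepages)
		return -ENOMEM;	/* would reserve more than the pool holds */
	reserved_pages += extra;
	inode->reserved = pages;
	return 0;
}

/* File or segment deleted: give its whole reservation back. */
static void model_drop_inode(struct model_inode *inode)
{
	assert(reserved_pages >= inode->reserved);
	reserved_pages -= inode->reserved;
	inode->reserved = 0;
}
```

With a pool of 10 pages, program A mapping 6 pages of a file reserves 6; program B mapping the same file at the same size adds nothing; a second file asking for 5 pages gets -ENOMEM until the first file is deleted.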

> 
>>Admittedly this doesn't allow one to request that hugetlbpages be 
>>overcommitted, or to handle problems caused to the "normal" page 
>>overcommit code due to the presence of hugepages.  But we figure that 
>>anyone that is actually using hugetlb pages is likely to take over 
>>almost all of main memory anyway in a single job, so overcommit 
>>doesn't make much sense to us.
> 
> 
> Seeing as you can't swap them, overcommitting makes no sense to me
> either ;-)
> 
> M.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------

