linux-mm.kvack.org archive mirror
* [patch 00/21] hugetlb multi size, giant hugetlb support, etc
@ 2008-06-03  9:59 npiggin
  2008-06-03  9:59 ` [patch 01/21] hugetlb: factor out prep_new_huge_page npiggin
                   ` (23 more replies)
  0 siblings, 24 replies; 49+ messages in thread
From: npiggin @ 2008-06-03  9:59 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

Hi,

Here is my submission to be merged in -mm. Given the number of hunks in this
patchset, and the recent flurry of hugetlb development work, I'd like to get
it merged provided there aren't major issues (I would prefer to fix minor
ones with incremental patches). Tracking -mm while multiple lines of
development are happening concurrently is just a lot of error-prone work.

Patches are against the latest mmotm.


For the user API issues, this release takes the safe way out and simply
keeps the existing hugepages user interfaces unchanged. Nish's sysfs API is
integrated to control the other huge page sizes.
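
To give an idea of what that interface looks like (Nish is the authority
here, so treat the exact file names as an assumption rather than the final
ABI), there is one directory per supported size with the usual counters in
it, roughly:

  /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages
  /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages

so for example "echo 32 > .../hugepages-2048kB/nr_hugepages" resizes the
2MB pool independently of any other sizes.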

I had initially opted to drop just the /proc/sys/vm/* parts of the changes,
but the libhugetlbfs test suite continued to have failures, so I decided to
revert the multi-column /proc/meminfo changes as well, for now.

I say "for now" because it is very easy to later agree on some extension to
the API, but much harder to revert such an extension once it has been made.
The main thing at this point is to get the existing patchset merged; user
API changes are not something I want to worry about at the moment. The
point is: the infrastructure changes are a lot of work to code, but not so
hard to get right; the user API changes are easy to code but harder to get
right.

New to this patchset: I have implemented a default_hugepagesz= boot option
(defaulting to the arch's HPAGE_SIZE if unspecified), which specifies the
default hugepage size used for the /proc/* files, SHM, and the default
hugetlbfs mount size. This is the best compromise I could find to preserve
backwards compatibility while still allowing different sizes to be tried
with legacy code.
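
A rough example (sizes illustrative only; what is actually accepted depends
on the arch support elsewhere in the series): booting with

  default_hugepagesz=2M

keeps today's x86-64 behaviour, while default_hugepagesz=1G would make the
giant page size the one reported in /proc/meminfo and used for SHM and for
hugetlbfs mounts that don't specify a size, without any change to legacy
tooling.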

One thing I worry about is whether the sysfs API is going to be forward
compatible with NUMA allocation changes that might be in the pipeline.
This need not hold up a merge into -mm, but I'd like some reassurance
that this has been thought through before it goes upstream.

Lastly, embarrassingly, I'm not the best source of information for the
sysfs tunables, so incremental patches against Documentation/ABI would
be welcome :P

Thanks,
Nick
-- 


* [patch 01/21] hugetlb: factor out prep_new_huge_page
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
@ 2008-06-03  9:59 ` npiggin
  2008-06-03  9:59 ` [patch 02/21] hugetlb: modular state npiggin
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03  9:59 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: hugetlb-factor-page-prep.patch --]
[-- Type: text/plain, Size: 1627 bytes --]

Needed to avoid code duplication in follow-up patches.

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/hugetlb.c |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:52:57.000000000 +1000
@@ -331,6 +331,16 @@ static int adjust_pool_surplus(int delta
 	return ret;
 }
 
+static void prep_new_huge_page(struct page *page, int nid)
+{
+	set_compound_page_dtor(page, free_huge_page);
+	spin_lock(&hugetlb_lock);
+	nr_huge_pages++;
+	nr_huge_pages_node[nid]++;
+	spin_unlock(&hugetlb_lock);
+	put_page(page); /* free it into the hugepage allocator */
+}
+
 static struct page *alloc_fresh_huge_page_node(int nid)
 {
 	struct page *page;
@@ -344,12 +354,7 @@ static struct page *alloc_fresh_huge_pag
 			__free_pages(page, HUGETLB_PAGE_ORDER);
 			return NULL;
 		}
-		set_compound_page_dtor(page, free_huge_page);
-		spin_lock(&hugetlb_lock);
-		nr_huge_pages++;
-		nr_huge_pages_node[nid]++;
-		spin_unlock(&hugetlb_lock);
-		put_page(page); /* free it into the hugepage allocator */
+		prep_new_huge_page(page, nid);
 	}
 
 	return page;

-- 


* [patch 02/21] hugetlb: modular state
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
  2008-06-03  9:59 ` [patch 01/21] hugetlb: factor out prep_new_huge_page npiggin
@ 2008-06-03  9:59 ` npiggin
  2008-06-03 10:58   ` [patch 02/21] hugetlb: modular state (take 2) Nick Piggin
  2008-06-03  9:59 ` [patch 03/21] hugetlb: multiple hstates npiggin
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 49+ messages in thread
From: npiggin @ 2008-06-03  9:59 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: hugetlb-modular-state.patch --]
[-- Type: text/plain, Size: 54848 bytes --]

A large but rather mechanical patch that converts most of the hugetlb.c
globals into members of a new hstate structure and passes the hstate around.

Right now there is only a single global hstate structure, but most of the
infrastructure needed to extend it is there.
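
As a sketch of the conversion pattern (taken from the converted
copy_hugetlb_page_range() below), code that used to hard-code HPAGE_SIZE
now looks the size up through the vma's hstate:

	struct hstate *h = hstate_vma(vma);	/* always &default_hstate for now */
	unsigned long sz = huge_page_size(h);	/* PAGE_SIZE << h->order */

	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz)
		...;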

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 arch/ia64/mm/hugetlbpage.c    |    6 
 arch/powerpc/mm/hugetlbpage.c |    2 
 arch/sh/mm/hugetlbpage.c      |    2 
 arch/sparc64/mm/hugetlbpage.c |    4 
 arch/x86/mm/hugetlbpage.c     |    4 
 fs/hugetlbfs/inode.c          |   49 +++---
 include/asm-ia64/hugetlb.h    |    2 
 include/asm-powerpc/hugetlb.h |    2 
 include/asm-s390/hugetlb.h    |    2 
 include/asm-sh/hugetlb.h      |    2 
 include/asm-sparc64/hugetlb.h |    2 
 include/asm-x86/hugetlb.h     |    7 
 include/linux/hugetlb.h       |   81 +++++++++-
 ipc/shm.c                     |    3 
 mm/hugetlb.c                  |  321 ++++++++++++++++++++++--------------------
 mm/memory.c                   |    2 
 mm/mempolicy.c                |    9 -
 mm/mmap.c                     |    3 
 18 files changed, 308 insertions(+), 195 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:52:57.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:54:21.000000000 +1000
@@ -22,18 +22,12 @@
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
 unsigned long max_huge_pages;
 unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate default_hstate;
 
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -55,11 +49,11 @@ static pgoff_t vma_page_offset(struct vm
  * Convert the address within this vma to the page offset within
  * the mapping, in pagecache page units; huge pages here.
  */
-static pgoff_t vma_pagecache_offset(struct vm_area_struct *vma,
-					unsigned long address)
+static pgoff_t vma_pagecache_offset(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
 {
-	return ((address - vma->vm_start) >> HPAGE_SHIFT) +
-			(vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	return ((address - vma->vm_start) >> huge_page_shift(h)) +
+			(vma->vm_pgoff >> huge_page_order(h));
 }
 
 #define HPAGE_RESV_OWNER    (1UL << (BITS_PER_LONG - 1))
@@ -120,14 +114,15 @@ static int is_vma_resv_set(struct vm_are
 }
 
 /* Decrement the reserved pages in the hugepage pool by one */
-static void decrement_hugepage_resv_vma(struct vm_area_struct *vma)
+static void decrement_hugepage_resv_vma(struct hstate *h,
+			struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_NORESERVE)
 		return;
 
 	if (vma->vm_flags & VM_SHARED) {
 		/* Shared mappings always use reserves */
-		resv_huge_pages--;
+		h->resv_huge_pages--;
 	} else {
 		/*
 		 * Only the process that called mmap() has reserves for
@@ -135,7 +130,7 @@ static void decrement_hugepage_resv_vma(
 		 */
 		if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 			unsigned long flags, reserve;
-			resv_huge_pages--;
+			h->resv_huge_pages--;
 			flags = (unsigned long)vma->vm_private_data &
 							HPAGE_RESV_MASK;
 			reserve = (unsigned long)vma->vm_private_data - 1;
@@ -162,12 +157,13 @@ static int vma_has_private_reserves(stru
 	return 1;
 }
 
-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page,
+			unsigned long addr, unsigned long sz)
 {
 	int i;
 
 	might_sleep();
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+	for (i = 0; i < sz/PAGE_SIZE; i++) {
 		cond_resched();
 		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
 	}
@@ -177,41 +173,43 @@ static void copy_huge_page(struct page *
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
+	struct hstate *h = hstate_vma(vma);
 
 	might_sleep();
-	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+	for (i = 0; i < pages_per_huge_page(h); i++) {
 		cond_resched();
 		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
 	}
 }
 
-static void enqueue_huge_page(struct page *page)
+static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_add(&page->lru, &hugepage_freelists[nid]);
-	free_huge_pages++;
-	free_huge_pages_node[nid]++;
+	list_add(&page->lru, &h->hugepage_freelists[nid]);
+	h->free_huge_pages++;
+	h->free_huge_pages_node[nid]++;
 }
 
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct hstate *h)
 {
 	int nid;
 	struct page *page = NULL;
 
 	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
-		if (!list_empty(&hugepage_freelists[nid])) {
-			page = list_entry(hugepage_freelists[nid].next,
+		if (!list_empty(&h->hugepage_freelists[nid])) {
+			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
-			free_huge_pages--;
-			free_huge_pages_node[nid]--;
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
 			break;
 		}
 	}
 	return page;
 }
 
-static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma,
+static struct page *dequeue_huge_page_vma(struct hstate *h,
+				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
 	int nid;
@@ -229,26 +227,26 @@ static struct page *dequeue_huge_page_vm
 	 * not "stolen". The child may still get SIGKILLed
 	 */
 	if (!vma_has_private_reserves(vma) &&
-			free_huge_pages - resv_huge_pages == 0)
+			h->free_huge_pages - h->resv_huge_pages == 0)
 		return NULL;
 
 	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && free_huge_pages - resv_huge_pages == 0)
+	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		return NULL;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
 		nid = zone_to_nid(zone);
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&hugepage_freelists[nid])) {
-			page = list_entry(hugepage_freelists[nid].next,
+		    !list_empty(&h->hugepage_freelists[nid])) {
+			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
-			free_huge_pages--;
-			free_huge_pages_node[nid]--;
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
 
 			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(vma);
+				decrement_hugepage_resv_vma(h, vma);
 
 			break;
 		}
@@ -257,12 +255,13 @@ static struct page *dequeue_huge_page_vm
 	return page;
 }
 
-static void update_and_free_page(struct page *page)
+static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
-	nr_huge_pages--;
-	nr_huge_pages_node[page_to_nid(page)]--;
-	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+
+	h->nr_huge_pages--;
+	h->nr_huge_pages_node[page_to_nid(page)]--;
+	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
 				1 << PG_private | 1<< PG_writeback);
@@ -270,11 +269,16 @@ static void update_and_free_page(struct 
 	set_compound_page_dtor(page, NULL);
 	set_page_refcounted(page);
 	arch_release_hugepage(page);
-	__free_pages(page, HUGETLB_PAGE_ORDER);
+	__free_pages(page, huge_page_order(h));
 }
 
 static void free_huge_page(struct page *page)
 {
+	/*
+	 * Can't pass hstate in here because it is called from the
+	 * compound page destructor.
+	 */
+	struct hstate *h = &default_hstate;
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
@@ -284,12 +288,12 @@ static void free_huge_page(struct page *
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
-	if (surplus_huge_pages_node[nid]) {
-		update_and_free_page(page);
-		surplus_huge_pages--;
-		surplus_huge_pages_node[nid]--;
+	if (h->surplus_huge_pages_node[nid]) {
+		update_and_free_page(h, page);
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
 	} else {
-		enqueue_huge_page(page);
+		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
 	if (mapping)
@@ -301,7 +305,7 @@ static void free_huge_page(struct page *
  * balanced by operating on them in a round-robin fashion.
  * Returns 1 if an adjustment was made.
  */
-static int adjust_pool_surplus(int delta)
+static int adjust_pool_surplus(struct hstate *h, int delta)
 {
 	static int prev_nid;
 	int nid = prev_nid;
@@ -314,15 +318,15 @@ static int adjust_pool_surplus(int delta
 			nid = first_node(node_online_map);
 
 		/* To shrink on this node, there must be a surplus page */
-		if (delta < 0 && !surplus_huge_pages_node[nid])
+		if (delta < 0 && !h->surplus_huge_pages_node[nid])
 			continue;
 		/* Surplus cannot exceed the total number of pages */
-		if (delta > 0 && surplus_huge_pages_node[nid] >=
-						nr_huge_pages_node[nid])
+		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+						h->nr_huge_pages_node[nid])
 			continue;
 
-		surplus_huge_pages += delta;
-		surplus_huge_pages_node[nid] += delta;
+		h->surplus_huge_pages += delta;
+		h->surplus_huge_pages_node[nid] += delta;
 		ret = 1;
 		break;
 	} while (nid != prev_nid);
@@ -331,46 +335,46 @@ static int adjust_pool_surplus(int delta
 	return ret;
 }
 
-static void prep_new_huge_page(struct page *page, int nid)
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	set_compound_page_dtor(page, free_huge_page);
 	spin_lock(&hugetlb_lock);
-	nr_huge_pages++;
-	nr_huge_pages_node[nid]++;
+	h->nr_huge_pages++;
+	h->nr_huge_pages_node[nid]++;
 	spin_unlock(&hugetlb_lock);
 	put_page(page); /* free it into the hugepage allocator */
 }
 
-static struct page *alloc_fresh_huge_page_node(int nid)
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
 	page = alloc_pages_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
-		HUGETLB_PAGE_ORDER);
+		huge_page_order(h));
 	if (page) {
 		if (arch_prepare_hugepage(page)) {
 			__free_pages(page, HUGETLB_PAGE_ORDER);
 			return NULL;
 		}
-		prep_new_huge_page(page, nid);
+		prep_new_huge_page(h, page, nid);
 	}
 
 	return page;
 }
 
-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct hstate *h)
 {
 	struct page *page;
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hugetlb_next_nid;
+	start_nid = h->hugetlb_next_nid;
 
 	do {
-		page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
 		if (page)
 			ret = 1;
 		/*
@@ -384,11 +388,11 @@ static int alloc_fresh_huge_page(void)
 		 * if we just successfully allocated a hugepage so that
 		 * the next caller gets hugepages on the next node.
 		 */
-		next_nid = next_node(hugetlb_next_nid, node_online_map);
+		next_nid = next_node(h->hugetlb_next_nid, node_online_map);
 		if (next_nid == MAX_NUMNODES)
 			next_nid = first_node(node_online_map);
-		hugetlb_next_nid = next_nid;
-	} while (!page && hugetlb_next_nid != start_nid);
+		h->hugetlb_next_nid = next_nid;
+	} while (!page && h->hugetlb_next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -398,8 +402,8 @@ static int alloc_fresh_huge_page(void)
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
-						unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
 {
 	struct page *page;
 	unsigned int nid;
@@ -428,18 +432,18 @@ static struct page *alloc_buddy_huge_pag
 	 * per-node value is checked there.
 	 */
 	spin_lock(&hugetlb_lock);
-	if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		spin_unlock(&hugetlb_lock);
 		return NULL;
 	} else {
-		nr_huge_pages++;
-		surplus_huge_pages++;
+		h->nr_huge_pages++;
+		h->surplus_huge_pages++;
 	}
 	spin_unlock(&hugetlb_lock);
 
 	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
 					__GFP_REPEAT|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
+					huge_page_order(h));
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
@@ -454,12 +458,12 @@ static struct page *alloc_buddy_huge_pag
 		/*
 		 * We incremented the global counters already
 		 */
-		nr_huge_pages_node[nid]++;
-		surplus_huge_pages_node[nid]++;
+		h->nr_huge_pages_node[nid]++;
+		h->surplus_huge_pages_node[nid]++;
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
-		nr_huge_pages--;
-		surplus_huge_pages--;
+		h->nr_huge_pages--;
+		h->surplus_huge_pages--;
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 	}
 	spin_unlock(&hugetlb_lock);
@@ -471,16 +475,16 @@ static struct page *alloc_buddy_huge_pag
  * Increase the hugetlb pool such that it can accomodate a reservation
  * of size 'delta'.
  */
-static int gather_surplus_pages(int delta)
+static int gather_surplus_pages(struct hstate *h, int delta)
 {
 	struct list_head surplus_list;
 	struct page *page, *tmp;
 	int ret, i;
 	int needed, allocated;
 
-	needed = (resv_huge_pages + delta) - free_huge_pages;
+	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
 	if (needed <= 0) {
-		resv_huge_pages += delta;
+		h->resv_huge_pages += delta;
 		return 0;
 	}
 
@@ -491,7 +495,7 @@ static int gather_surplus_pages(int delt
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(NULL, 0);
+		page = alloc_buddy_huge_page(h, NULL, 0);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -512,7 +516,8 @@ retry:
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
 	spin_lock(&hugetlb_lock);
-	needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+	needed = (h->resv_huge_pages + delta) -
+			(h->free_huge_pages + allocated);
 	if (needed > 0)
 		goto retry;
 
@@ -525,7 +530,7 @@ retry:
 	 * before they are reserved.
 	 */
 	needed += allocated;
-	resv_huge_pages += delta;
+	h->resv_huge_pages += delta;
 	ret = 0;
 free:
 	/* Free the needed pages to the hugetlb pool */
@@ -533,7 +538,7 @@ free:
 		if ((--needed) < 0)
 			break;
 		list_del(&page->lru);
-		enqueue_huge_page(page);
+		enqueue_huge_page(h, page);
 	}
 
 	/* Free unnecessary surplus pages to the buddy allocator */
@@ -561,7 +566,8 @@ free:
  * allocated to satisfy the reservation must be explicitly freed if they were
  * never used.
  */
-static void return_unused_surplus_pages(unsigned long unused_resv_pages)
+static void return_unused_surplus_pages(struct hstate *h,
+					unsigned long unused_resv_pages)
 {
 	static int nid = -1;
 	struct page *page;
@@ -576,27 +582,27 @@ static void return_unused_surplus_pages(
 	unsigned long remaining_iterations = num_online_nodes();
 
 	/* Uncommit the reservation */
-	resv_huge_pages -= unused_resv_pages;
+	h->resv_huge_pages -= unused_resv_pages;
 
-	nr_pages = min(unused_resv_pages, surplus_huge_pages);
+	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
 
 	while (remaining_iterations-- && nr_pages) {
 		nid = next_node(nid, node_online_map);
 		if (nid == MAX_NUMNODES)
 			nid = first_node(node_online_map);
 
-		if (!surplus_huge_pages_node[nid])
+		if (!h->surplus_huge_pages_node[nid])
 			continue;
 
-		if (!list_empty(&hugepage_freelists[nid])) {
-			page = list_entry(hugepage_freelists[nid].next,
+		if (!list_empty(&h->hugepage_freelists[nid])) {
+			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
-			update_and_free_page(page);
-			free_huge_pages--;
-			free_huge_pages_node[nid]--;
-			surplus_huge_pages--;
-			surplus_huge_pages_node[nid]--;
+			update_and_free_page(h, page);
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
+			h->surplus_huge_pages--;
+			h->surplus_huge_pages_node[nid]--;
 			nr_pages--;
 			remaining_iterations = num_online_nodes();
 		}
@@ -733,13 +739,14 @@ static long region_truncate(struct list_
  * an instantiated the change should be committed via vma_commit_reservation.
  * No action is required on failure.
  */
-static int vma_needs_reservation(struct vm_area_struct *vma, unsigned long addr)
+static int vma_needs_reservation(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_SHARED) {
-		pgoff_t idx = vma_pagecache_offset(vma, addr);
+		pgoff_t idx = vma_pagecache_offset(h, vma, addr);
 		return region_chg(&inode->i_mapping->private_list,
 							idx, idx + 1);
 
@@ -750,14 +757,14 @@ static int vma_needs_reservation(struct 
 
 	return 0;
 }
-static void vma_commit_reservation(struct vm_area_struct *vma,
-							unsigned long addr)
+static void vma_commit_reservation(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_SHARED) {
-		pgoff_t idx = vma_pagecache_offset(vma, addr);
+		pgoff_t idx = vma_pagecache_offset(h, vma, addr);
 		region_add(&inode->i_mapping->private_list, idx, idx + 1);
 	}
 }
@@ -765,6 +772,7 @@ static void vma_commit_reservation(struc
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
+	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
@@ -777,7 +785,7 @@ static struct page *alloc_huge_page(stru
 	 * MAP_NORESERVE mappings may also need pages and quota allocated
 	 * if no reserve mapping overlaps.
 	 */
-	chg = vma_needs_reservation(vma, addr);
+	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(chg);
 	if (chg)
@@ -785,11 +793,11 @@ static struct page *alloc_huge_page(stru
 			return ERR_PTR(-ENOSPC);
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(vma, addr, avoid_reserve);
+	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(vma, addr);
+		page = alloc_buddy_huge_page(h, vma, addr);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_OOM);
@@ -799,7 +807,7 @@ static struct page *alloc_huge_page(stru
 	set_page_refcounted(page);
 	set_page_private(page, (unsigned long) mapping);
 
-	vma_commit_reservation(vma, addr);
+	vma_commit_reservation(h, vma, addr);
 
 	return page;
 }
@@ -807,21 +815,28 @@ static struct page *alloc_huge_page(stru
 static int __init hugetlb_init(void)
 {
 	unsigned long i;
+	struct hstate *h = &default_hstate;
 
 	if (HPAGE_SHIFT == 0)
 		return 0;
 
+	if (!h->order) {
+		h->order = HPAGE_SHIFT - PAGE_SHIFT;
+		h->mask = HPAGE_MASK;
+	}
+
 	for (i = 0; i < MAX_NUMNODES; ++i)
-		INIT_LIST_HEAD(&hugepage_freelists[i]);
+		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
 
-	hugetlb_next_nid = first_node(node_online_map);
+	h->hugetlb_next_nid = first_node(node_online_map);
 
 	for (i = 0; i < max_huge_pages; ++i) {
-		if (!alloc_fresh_huge_page())
+		if (!alloc_fresh_huge_page(h))
 			break;
 	}
-	max_huge_pages = free_huge_pages = nr_huge_pages = i;
-	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+	max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+	printk(KERN_INFO "Total HugeTLB memory allocated, %ld\n",
+			h->free_huge_pages);
 	return 0;
 }
 module_init(hugetlb_init);
@@ -847,34 +862,36 @@ static unsigned int cpuset_mems_nr(unsig
 
 #ifdef CONFIG_SYSCTL
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count)
 {
 	int i;
 
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
-		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
-			if (count >= nr_huge_pages)
+		struct list_head *freel = &h->hugepage_freelists[i];
+		list_for_each_entry_safe(page, next, freel, lru) {
+			if (count >= h->nr_huge_pages)
 				return;
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
 			update_and_free_page(page);
-			free_huge_pages--;
-			free_huge_pages_node[page_to_nid(page)]--;
+			h->free_huge_pages--;
+			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
 	}
 }
 #else
-static inline void try_to_free_low(unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count)
 {
 }
 #endif
 
-#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
+#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(unsigned long count)
 {
 	unsigned long min_count, ret;
+	struct hstate *h = &default_hstate;
 
 	/*
 	 * Increase the pool size
@@ -888,19 +905,19 @@ static unsigned long set_max_huge_pages(
 	 * within all the constraints specified by the sysctls.
 	 */
 	spin_lock(&hugetlb_lock);
-	while (surplus_huge_pages && count > persistent_huge_pages) {
-		if (!adjust_pool_surplus(-1))
+	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
+		if (!adjust_pool_surplus(h, -1))
 			break;
 	}
 
-	while (count > persistent_huge_pages) {
+	while (count > persistent_huge_pages(h)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page();
+		ret = alloc_fresh_huge_page(h);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -922,21 +939,21 @@ static unsigned long set_max_huge_pages(
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
 	 */
-	min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
+	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(min_count);
-	while (min_count < persistent_huge_pages) {
-		struct page *page = dequeue_huge_page();
+	try_to_free_low(h, min_count);
+	while (min_count < persistent_huge_pages(h)) {
+		struct page *page = dequeue_huge_page(h);
 		if (!page)
 			break;
-		update_and_free_page(page);
+		update_and_free_page(h, page);
 	}
-	while (count < persistent_huge_pages) {
-		if (!adjust_pool_surplus(1))
+	while (count < persistent_huge_pages(h)) {
+		if (!adjust_pool_surplus(h, 1))
 			break;
 	}
 out:
-	ret = persistent_huge_pages;
+	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
 	return ret;
 }
@@ -966,9 +983,10 @@ int hugetlb_overcommit_handler(struct ct
 			struct file *file, void __user *buffer,
 			size_t *length, loff_t *ppos)
 {
+	struct hstate *h = &default_hstate;
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 	spin_lock(&hugetlb_lock);
-	nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+	h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
 	spin_unlock(&hugetlb_lock);
 	return 0;
 }
@@ -977,37 +995,40 @@ int hugetlb_overcommit_handler(struct ct
 
 int hugetlb_report_meminfo(char *buf)
 {
+	struct hstate *h = &default_hstate;
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
 			"HugePages_Rsvd:  %5lu\n"
 			"HugePages_Surp:  %5lu\n"
 			"Hugepagesize:    %5lu kB\n",
-			nr_huge_pages,
-			free_huge_pages,
-			resv_huge_pages,
-			surplus_huge_pages,
-			HPAGE_SIZE/1024);
+			h->nr_huge_pages,
+			h->free_huge_pages,
+			h->resv_huge_pages,
+			h->surplus_huge_pages,
+			1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
 }
 
 int hugetlb_report_node_meminfo(int nid, char *buf)
 {
+	struct hstate *h = &default_hstate;
 	return sprintf(buf,
 		"Node %d HugePages_Total: %5u\n"
 		"Node %d HugePages_Free:  %5u\n"
 		"Node %d HugePages_Surp:  %5u\n",
-		nid, nr_huge_pages_node[nid],
-		nid, free_huge_pages_node[nid],
-		nid, surplus_huge_pages_node[nid]);
+		nid, h->nr_huge_pages_node[nid],
+		nid, h->free_huge_pages_node[nid],
+		nid, h->surplus_huge_pages_node[nid]);
 }
 
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {
-	return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+	struct hstate *h = &default_hstate;
+	return h->nr_huge_pages * pages_per_huge_page(h);
 }
 
-static int hugetlb_acct_memory(long delta)
+static int hugetlb_acct_memory(struct hstate *h, long delta)
 {
 	int ret = -ENOMEM;
 
@@ -1030,18 +1051,18 @@ static int hugetlb_acct_memory(long delt
 	 * semantics that cpuset has.
 	 */
 	if (delta > 0) {
-		if (gather_surplus_pages(delta) < 0)
+		if (gather_surplus_pages(h, delta) < 0)
 			goto out;
 
-		if (delta > cpuset_mems_nr(free_huge_pages_node)) {
-			return_unused_surplus_pages(delta);
+		if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+			return_unused_surplus_pages(h, delta);
 			goto out;
 		}
 	}
 
 	ret = 0;
 	if (delta < 0)
-		return_unused_surplus_pages((unsigned long) -delta);
+		return_unused_surplus_pages(h, (unsigned long) -delta);
 
 out:
 	spin_unlock(&hugetlb_lock);
@@ -1050,9 +1071,11 @@ out:
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
+	struct hstate *h = hstate_vma(vma);
 	unsigned long reserve = vma_resv_huge_pages(vma);
+
 	if (reserve)
-		hugetlb_acct_memory(-reserve);
+		hugetlb_acct_memory(h, -reserve);
 }
 
 /*
@@ -1108,14 +1131,16 @@ int copy_hugetlb_page_range(struct mm_st
 	struct page *ptepage;
 	unsigned long addr;
 	int cow;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, addr);
+		dst_pte = huge_pte_alloc(dst, addr, sz);
 		if (!dst_pte)
 			goto nomem;
 
@@ -1151,6 +1176,9 @@ void __unmap_hugepage_range(struct vm_ar
 	pte_t pte;
 	struct page *page;
 	struct page *tmp;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
+
 	/*
 	 * A page gathering list, protected by per file i_mmap_lock. The
 	 * lock is used to avoid list corruption from multiple unmapping
@@ -1159,11 +1187,11 @@ void __unmap_hugepage_range(struct vm_ar
 	LIST_HEAD(page_list);
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
-	BUG_ON(start & ~HPAGE_MASK);
-	BUG_ON(end & ~HPAGE_MASK);
+	BUG_ON(start & ~huge_page_mask(h));
+	BUG_ON(end & ~huge_page_mask(h));
 
 	spin_lock(&mm->page_table_lock);
-	for (address = start; address < end; address += HPAGE_SIZE) {
+	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
@@ -1276,6 +1304,7 @@ static int hugetlb_cow(struct mm_struct 
 			unsigned long address, pte_t *ptep, pte_t pte,
 			struct page *pagecache_page)
 {
+	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int avoidcopy;
 	int outside_reserve = 0;
@@ -1336,7 +1365,7 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 	spin_lock(&mm->page_table_lock);
 
-	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
@@ -1351,14 +1380,14 @@ retry_avoidcopy:
 }
 
 /* Return the pagecache page at a given address within a VMA */
-static struct page *hugetlbfs_pagecache_page(struct vm_area_struct *vma,
-			unsigned long address)
+static struct page *hugetlbfs_pagecache_page(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
 {
 	struct address_space *mapping;
 	pgoff_t idx;
 
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_pagecache_offset(vma, address);
+	idx = vma_pagecache_offset(h, vma, address);
 
 	return find_lock_page(mapping, idx);
 }
@@ -1366,6 +1395,7 @@ static struct page *hugetlbfs_pagecache_
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, int write_access)
 {
+	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
 	pgoff_t idx;
 	unsigned long size;
@@ -1386,7 +1416,7 @@ static int hugetlb_no_page(struct mm_str
 	}
 
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_pagecache_offset(vma, address);
+	idx = vma_pagecache_offset(h, vma, address);
 
 	/*
 	 * Use page lock to guard against racing truncation
@@ -1395,7 +1425,7 @@ static int hugetlb_no_page(struct mm_str
 retry:
 	page = find_lock_page(mapping, idx);
 	if (!page) {
-		size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
 		page = alloc_huge_page(vma, address, 0);
@@ -1403,7 +1433,7 @@ retry:
 			ret = -PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, address);
+		clear_huge_page(page, address, huge_page_size(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_SHARED) {
@@ -1419,14 +1449,14 @@ retry:
 			}
 
 			spin_lock(&inode->i_lock);
-			inode->i_blocks += BLOCKS_PER_HUGEPAGE;
+			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
 		} else
 			lock_page(page);
 	}
 
 	spin_lock(&mm->page_table_lock);
-	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
 
@@ -1462,8 +1492,9 @@ int hugetlb_fault(struct mm_struct *mm, 
 	pte_t entry;
 	int ret;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	struct hstate *h = hstate_vma(vma);
 
-	ptep = huge_pte_alloc(mm, address);
+	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
 
@@ -1487,7 +1518,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
 		if (write_access && !pte_write(entry)) {
 			struct page *page;
-			page = hugetlbfs_pagecache_page(vma, address);
+			page = hugetlbfs_pagecache_page(h, vma, address);
 			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
 			if (page) {
 				unlock_page(page);
@@ -1508,6 +1539,7 @@ int follow_hugetlb_page(struct mm_struct
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
 	int remainder = *length;
+	struct hstate *h = hstate_vma(vma);
 
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
@@ -1519,7 +1551,7 @@ int follow_hugetlb_page(struct mm_struct
 		 * each hugepage.  We have to make * sure we get the
 		 * first, for the page indexing below to work.
 		 */
-		pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
 
 		if (!pte || huge_pte_none(huge_ptep_get(pte)) ||
 		    (write && !pte_write(huge_ptep_get(pte)))) {
@@ -1537,7 +1569,7 @@ int follow_hugetlb_page(struct mm_struct
 			break;
 		}
 
-		pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
+		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 same_page:
 		if (pages) {
@@ -1553,7 +1585,7 @@ same_page:
 		--remainder;
 		++i;
 		if (vaddr < vma->vm_end && remainder &&
-				pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+				pfn_offset < pages_per_huge_page(h)) {
 			/*
 			 * We use pfn_offset to avoid touching the pageframes
 			 * of this compound page.
@@ -1575,13 +1607,14 @@ void hugetlb_change_protection(struct vm
 	unsigned long start = address;
 	pte_t *ptep;
 	pte_t pte;
+	struct hstate *h = hstate_vma(vma);
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
 	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
-	for (; address < end; address += HPAGE_SIZE) {
+	for (; address < end; address += huge_page_size(h)) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
@@ -1604,6 +1637,7 @@ int hugetlb_reserve_pages(struct inode *
 					struct vm_area_struct *vma)
 {
 	long ret, chg;
+	struct hstate *h = hstate_inode(inode);
 
 	if (vma && vma->vm_flags & VM_NORESERVE)
 		return 0;
@@ -1627,7 +1661,7 @@ int hugetlb_reserve_pages(struct inode *
 
 	if (hugetlb_get_quota(inode->i_mapping, chg))
 		return -ENOSPC;
-	ret = hugetlb_acct_memory(chg);
+	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0) {
 		hugetlb_put_quota(inode->i_mapping, chg);
 		return ret;
@@ -1639,12 +1673,13 @@ int hugetlb_reserve_pages(struct inode *
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
+	struct hstate *h = hstate_inode(inode);
 	long chg = region_truncate(&inode->i_mapping->private_list, offset);
 
 	spin_lock(&inode->i_lock);
-	inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
+	inode->i_blocks -= blocks_per_huge_page(h);
 	spin_unlock(&inode->i_lock);
 
 	hugetlb_put_quota(inode->i_mapping, (chg - freed));
-	hugetlb_acct_memory(-(chg - freed));
+	hugetlb_acct_memory(h, -(chg - freed));
 }
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
@@ -128,7 +128,8 @@ pte_t *huge_pte_offset(struct mm_struct 
 	return NULL;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pg;
 	pud_t *pu;
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
@@ -175,7 +175,7 @@ hugetlb_get_unmapped_area(struct file *f
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
@@ -195,7 +195,8 @@ hugetlb_get_unmapped_area(struct file *f
 				pgoff, flags);
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/arch/sh/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
@@ -22,7 +22,8 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
@@ -24,7 +24,7 @@
 unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
 
 pte_t *
-huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
+huge_pte_alloc (struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
@@ -75,7 +75,8 @@ int huge_pmd_unshare(struct mm_struct *m
  * Don't actually need to do any preparation, but need to make sure
  * the address is in the right region.
  */
-int prepare_hugepage_range(unsigned long addr, unsigned long len)
+int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
@@ -149,7 +150,7 @@ unsigned long hugetlb_get_unmapped_area(
 
 	/* Handle MAP_FIXED */
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
@@ -124,7 +124,8 @@ int huge_pmd_unshare(struct mm_struct *m
 	return 1;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -368,7 +369,7 @@ hugetlb_get_unmapped_area(struct file *f
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -8,7 +8,6 @@
 #include <linux/mempolicy.h>
 #include <linux/shm.h>
 #include <asm/tlbflush.h>
-#include <asm/hugetlb.h>
 
 struct ctl_table;
 
@@ -45,7 +44,8 @@ extern int sysctl_hugetlb_shm_group;
 
 /* arch callbacks */
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -80,7 +80,7 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_report_meminfo(buf)		0
 #define hugetlb_report_node_meminfo(n, buf)	0
 #define follow_huge_pmd(mm, addr, pmd, write)	NULL
-#define prepare_hugepage_range(addr,len)	(-EINVAL)
+#define prepare_hugepage_range(file, addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define is_hugepage_only_range(mm, addr, len)	0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
@@ -134,8 +134,6 @@ struct file *hugetlb_file_setup(const ch
 int hugetlb_get_quota(struct address_space *mapping, long delta);
 void hugetlb_put_quota(struct address_space *mapping, long delta);
 
-#define BLOCKS_PER_HUGEPAGE	(HPAGE_SIZE / 512)
-
 static inline int is_file_hugepages(struct file *file)
 {
 	if (file->f_op == &hugetlbfs_file_operations)
@@ -164,4 +162,83 @@ unsigned long hugetlb_get_unmapped_area(
 					unsigned long flags);
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Defines one hugetlb page size */
+struct hstate {
+	int hugetlb_next_nid;
+	unsigned int order;
+	unsigned long mask;
+	unsigned long max_huge_pages;
+	unsigned long nr_huge_pages;
+	unsigned long free_huge_pages;
+	unsigned long resv_huge_pages;
+	unsigned long surplus_huge_pages;
+	unsigned long nr_overcommit_huge_pages;
+	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int nr_huge_pages_node[MAX_NUMNODES];
+	unsigned int free_huge_pages_node[MAX_NUMNODES];
+	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+};
+
+extern struct hstate default_hstate;
+
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+{
+	return &default_hstate;
+}
+
+static inline struct hstate *hstate_file(struct file *f)
+{
+	return &default_hstate;
+}
+
+static inline struct hstate *hstate_inode(struct inode *i)
+{
+	return &default_hstate;
+}
+
+static inline unsigned long huge_page_size(struct hstate *h)
+{
+	return (unsigned long)PAGE_SIZE << h->order;
+}
+
+static inline unsigned long huge_page_mask(struct hstate *h)
+{
+	return h->mask;
+}
+
+static inline unsigned int huge_page_order(struct hstate *h)
+{
+	return h->order;
+}
+
+static inline unsigned huge_page_shift(struct hstate *h)
+{
+	return h->order + PAGE_SHIFT;
+}
+
+static inline unsigned int pages_per_huge_page(struct hstate *h)
+{
+	return 1 << h->order;
+}
+
+static inline unsigned int blocks_per_huge_page(struct hstate *h)
+{
+	return huge_page_size(h) / 512;
+}
+
+#else
+struct hstate {};
+#define hstate_file(f) NULL
+#define hstate_vma(v) NULL
+#define hstate_inode(i) NULL
+#define huge_page_size(h) PAGE_SIZE
+#define huge_page_mask(h) PAGE_MASK
+#define huge_page_order(h) 0
+#define huge_page_shift(h) PAGE_SHIFT
+#endif
+
+#include <asm/hugetlb.h>
+
 #endif /* _LINUX_HUGETLB_H */
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/fs/hugetlbfs/inode.c	2008-06-03 19:54:53.000000000 +1000
@@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
 	struct inode *inode = file->f_path.dentry->d_inode;
 	loff_t len, vma_len;
 	int ret;
+	struct hstate *h = hstate_file(file);
 
 	/*
 	 * vma address alignment (but not the pgoff alignment) has
@@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
 
-	if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
+	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
 		return -EINVAL;
 
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
 	if (hugetlb_reserve_pages(inode,
-				vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
-				len >> HPAGE_SHIFT, vma))
+				vma->vm_pgoff >> huge_page_order(h),
+				len >> huge_page_shift(h), vma))
 		goto out;
 
 	ret = 0;
@@ -130,20 +131,21 @@ hugetlb_get_unmapped_area(struct file *f
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long start_addr;
+	struct hstate *h = hstate_file(file);
 
-	if (len & ~HPAGE_MASK)
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
 
 	if (addr) {
-		addr = ALIGN(addr, HPAGE_SIZE);
+		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
 		if (TASK_SIZE - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
@@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
 		start_addr = TASK_UNMAPPED_BASE;
 
 full_search:
-	addr = ALIGN(start_addr, HPAGE_SIZE);
+	addr = ALIGN(start_addr, huge_page_size(h));
 
 	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
@@ -174,7 +176,7 @@ full_search:
 
 		if (!vma || addr + len <= vma->vm_start)
 			return addr;
-		addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+		addr = ALIGN(vma->vm_end, huge_page_size(h));
 	}
 }
 #endif
@@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page, 
 static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
 			      size_t len, loff_t *ppos)
 {
+	struct hstate *h = hstate_file(filp);
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-	unsigned long index = *ppos >> HPAGE_SHIFT;
-	unsigned long offset = *ppos & ~HPAGE_MASK;
+	unsigned long index = *ppos >> huge_page_shift(h);
+	unsigned long offset = *ppos & ~huge_page_mask(h);
 	unsigned long end_index;
 	loff_t isize;
 	ssize_t retval = 0;
@@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
 	if (!isize)
 		goto out;
 
-	end_index = (isize - 1) >> HPAGE_SHIFT;
+	end_index = (isize - 1) >> huge_page_shift(h);
 	for (;;) {
 		struct page *page;
-		int nr, ret;
+		unsigned long nr, ret;
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = HPAGE_SIZE;
+		nr = huge_page_size(h);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
-			nr = ((isize - 1) & ~HPAGE_MASK) + 1;
+			nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
 			if (nr <= offset) {
 				goto out;
 			}
@@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
 		offset += ret;
 		retval += ret;
 		len -= ret;
-		index += offset >> HPAGE_SHIFT;
-		offset &= ~HPAGE_MASK;
+		index += offset >> huge_page_shift(h);
+		offset &= ~huge_page_mask(h);
 
 		if (page)
 			page_cache_release(page);
@@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
 			break;
 	}
 out:
-	*ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
+	*ppos = ((loff_t)index << huge_page_shift(h)) + offset;
 	mutex_unlock(&inode->i_mutex);
 	return retval;
 }
@@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
 
 static void truncate_hugepages(struct inode *inode, loff_t lstart)
 {
+	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
-	const pgoff_t start = lstart >> HPAGE_SHIFT;
+	const pgoff_t start = lstart >> huge_page_shift(h);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
@@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
 {
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
+	struct hstate *h = hstate_inode(inode);
 
-	BUG_ON(offset & ~HPAGE_MASK);
+	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
@@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct hstate *h = hstate_inode(inode);
 	int error;
 	unsigned int ia_valid = attr->ia_valid;
 
@@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
 
 	if (ia_valid & ATTR_SIZE) {
 		error = -EINVAL;
-		if (!(attr->ia_size & ~HPAGE_MASK))
+		if (!(attr->ia_size & ~huge_page_mask(h)))
 			error = hugetlb_vmtruncate(inode, attr->ia_size);
 		if (error)
 			goto out;
@@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
+	struct hstate *h = hstate_inode(dentry->d_inode);
 
 	buf->f_type = HUGETLBFS_MAGIC;
-	buf->f_bsize = HPAGE_SIZE;
+	buf->f_bsize = huge_page_size(h);
 	if (sbinfo) {
 		spin_lock(&sbinfo->stat_lock);
 		/* If no limits set, just report 0 for max/free/used
@@ -942,7 +949,8 @@ struct file *hugetlb_file_setup(const ch
 		goto out_dentry;
 
 	error = -ENOMEM;
-	if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT, NULL))
+	if (hugetlb_reserve_pages(inode, 0,
+			size >> huge_page_shift(hstate_inode(inode)), NULL))
 		goto out_inode;
 
 	d_instantiate(dentry, inode);
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/ipc/shm.c	2008-06-03 19:53:30.000000000 +1000
@@ -562,7 +562,8 @@ static void shm_get_stat(struct ipc_name
 
 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+			struct hstate *h = hstate_file(shp->shm_file);
+			*rss += pages_per_huge_page(h) * mapping->nrpages;
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/mm/memory.c	2008-06-03 19:53:30.000000000 +1000
@@ -904,7 +904,7 @@ unsigned long unmap_vmas(struct mmu_gath
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				unmap_hugepage_range(vma, start, end, NULL);
 				zap_work -= (end - start) /
-						(HPAGE_SIZE / PAGE_SIZE);
+					pages_per_huge_page(hstate_vma(vma));
 				start = end;
 			} else
 				start = unmap_page_range(*tlbp, vma,
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/mm/mempolicy.c	2008-06-03 19:53:30.000000000 +1000
@@ -1477,7 +1477,7 @@ struct zonelist *huge_zonelist(struct vm
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
 		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
-						HPAGE_SHIFT), gfp_flags);
+				huge_page_shift(hstate_vma(vma))), gfp_flags);
 	} else {
 		zl = policy_zonelist(gfp_flags, *mpol);
 		if ((*mpol)->mode == MPOL_BIND)
@@ -2216,9 +2216,12 @@ static void check_huge_range(struct vm_a
 {
 	unsigned long addr;
 	struct page *page;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
 
-	for (addr = start; addr < end; addr += HPAGE_SIZE) {
-		pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
+	for (addr = start; addr < end; addr += sz) {
+		pte_t *ptep = huge_pte_offset(vma->vm_mm,
+						addr & huge_page_mask(h));
 		pte_t pte;
 
 		if (!ptep)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/mm/mmap.c	2008-06-03 19:53:30.000000000 +1000
@@ -1808,7 +1808,8 @@ int split_vma(struct mm_struct * mm, str
 	struct mempolicy *pol;
 	struct vm_area_struct *new;
 
-	if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
+	if (is_vm_hugetlb_page(vma) && (addr &
+					~(huge_page_mask(hstate_vma(vma)))))
 		return -EINVAL;
 
 	if (mm->map_count >= sysctl_max_map_count)
Index: linux-2.6/include/asm-ia64/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/asm-ia64/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -8,7 +8,8 @@ void hugetlb_free_pgd_range(struct mmu_g
 			    unsigned long end, unsigned long floor,
 			    unsigned long ceiling);
 
-int prepare_hugepage_range(unsigned long addr, unsigned long len);
+int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len);
 
 static inline int is_hugepage_only_range(struct mm_struct *mm,
 					 unsigned long addr,
Index: linux-2.6/include/asm-powerpc/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/asm-powerpc/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -21,7 +21,8 @@ pte_t huge_ptep_get_and_clear(struct mm_
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-s390/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-s390/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/asm-s390/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -22,7 +22,8 @@ void set_huge_pte_at(struct mm_struct *m
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-sh/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-sh/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/asm-sh/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -14,7 +14,8 @@ static inline int is_hugepage_only_range
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-sparc64/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/asm-sparc64/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -22,7 +22,8 @@ static inline int is_hugepage_only_range
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-x86/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hugetlb.h	2008-06-03 19:52:53.000000000 +1000
+++ linux-2.6/include/asm-x86/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
@@ -14,11 +14,13 @@ static inline int is_hugepage_only_range
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
-	if (len & ~HPAGE_MASK)
+	struct hstate *h = hstate_file(file);
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (addr & ~HPAGE_MASK)
+	if (addr & ~huge_page_mask(h))
 		return -EINVAL;
 	return 0;
 }
Index: linux-2.6/arch/s390/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/hugetlbpage.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/arch/s390/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
@@ -72,7 +72,8 @@ void arch_release_hugepage(struct page *
 	page[1].index = 0;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
 	pud_t *pudp;

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 03/21] hugetlb: multiple hstates
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
  2008-06-03  9:59 ` [patch 01/21] hugetlb: factor out prep_new_huge_page npiggin
  2008-06-03  9:59 ` [patch 02/21] hugetlb: modular state npiggin
@ 2008-06-03  9:59 ` npiggin
  2008-06-03 10:00 ` [patch 04/21] hugetlbfs: per mount hstates npiggin
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03  9:59 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: hugetlb-multiple-hstates.patch --]
[-- Type: text/plain, Size: 9224 bytes --]

Add basic support for more than one hstate in hugetlbfs

- Convert hstates to an array
- Add a first default entry covering the standard huge page size
- Add functions for architectures to register new hstates
- Add basic iterators over hstates
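
As a rough illustration only (not part of this patch; the example_* name and
the 1GB size are made up), an architecture that supports an extra huge page
size would use the new helpers roughly like this:

/*
 * Sketch: register a second huge page size from arch boot code and map
 * a size back to the hstate that describes it.
 */
static void __init example_register_1gb_hstate(void)
{
	hugetlb_add_hstate(30 - PAGE_SHIFT);	/* order of a 1GB page */

	/* any later caller can resolve a size to its hstate */
	BUG_ON(size_to_hstate(1UL << 30) == NULL);
}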

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/hugetlb.h |   16 +++++++
 mm/hugetlb.c            |   98 +++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 95 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:54:21.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:55:41.000000000 +1000
@@ -22,12 +22,19 @@
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-unsigned long max_huge_pages;
-unsigned long sysctl_overcommit_huge_pages;
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 
-struct hstate default_hstate;
+static int max_hstate = 0;
+unsigned int default_hstate_idx;
+struct hstate hstates[HUGE_MAX_HSTATE];
+
+/* for command line parsing */
+static struct hstate * __initdata parsed_hstate = NULL;
+static unsigned long __initdata default_hstate_max_huge_pages = 0;
+
+#define for_each_hstate(h) \
+	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
 
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -272,13 +279,24 @@ static void update_and_free_page(struct 
 	__free_pages(page, huge_page_order(h));
 }
 
+struct hstate *size_to_hstate(unsigned long size)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		if (huge_page_size(h) == size)
+			return h;
+	}
+	return NULL;
+}
+
 static void free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
 	 * compound page destructor.
 	 */
-	struct hstate *h = &default_hstate;
+	struct hstate *h = page_hstate(page);
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
@@ -812,39 +830,94 @@ static struct page *alloc_huge_page(stru
 	return page;
 }
 
-static int __init hugetlb_init(void)
+static void __init hugetlb_init_one_hstate(struct hstate *h)
 {
 	unsigned long i;
-	struct hstate *h = &default_hstate;
-
-	if (HPAGE_SHIFT == 0)
-		return 0;
-
-	if (!h->order) {
-		h->order = HPAGE_SHIFT - PAGE_SHIFT;
-		h->mask = HPAGE_MASK;
-	}
 
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
 
 	h->hugetlb_next_nid = first_node(node_online_map);
 
-	for (i = 0; i < max_huge_pages; ++i) {
+	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (!alloc_fresh_huge_page(h))
 			break;
 	}
-	max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
-	printk(KERN_INFO "Total HugeTLB memory allocated, %ld\n",
-			h->free_huge_pages);
+	h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+}
+
+static void __init hugetlb_init_hstates(void)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		hugetlb_init_one_hstate(h);
+	}
+}
+
+static void __init report_hugepages(void)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		printk(KERN_INFO "Total HugeTLB memory allocated, "
+				"%ld %dMB pages\n",
+				h->free_huge_pages,
+				1 << (h->order + PAGE_SHIFT - 20));
+	}
+}
+
+static int __init hugetlb_init(void)
+{
+	BUILD_BUG_ON(HPAGE_SHIFT == 0);
+
+	if (!size_to_hstate(HPAGE_SIZE)) {
+		hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
+		parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
+	}
+	default_hstate_idx = size_to_hstate(HPAGE_SIZE) - hstates;
+
+	hugetlb_init_hstates();
+
+	report_hugepages();
+
 	return 0;
 }
 module_init(hugetlb_init);
 
+/* Should be called on processing a hugepagesz=... option */
+void __init hugetlb_add_hstate(unsigned order)
+{
+	struct hstate *h;
+	if (size_to_hstate(PAGE_SIZE << order)) {
+		printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
+		return;
+	}
+	BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+	BUG_ON(order == 0);
+	h = &hstates[max_hstate++];
+	h->order = order;
+	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+	hugetlb_init_one_hstate(h);
+	parsed_hstate = h;
+}
+
 static int __init hugetlb_setup(char *s)
 {
-	if (sscanf(s, "%lu", &max_huge_pages) <= 0)
-		max_huge_pages = 0;
+	unsigned long *mhp;
+
+	/*
+	 * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+	 * so this hugepages= parameter goes to the "default hstate".
+	 */
+	if (!max_hstate)
+		mhp = &default_hstate_max_huge_pages;
+	else
+		mhp = &parsed_hstate->max_huge_pages;
+
+	if (sscanf(s, "%lu", mhp) <= 0)
+		*mhp = 0;
+
 	return 1;
 }
 __setup("hugepages=", hugetlb_setup);
@@ -875,7 +948,7 @@ static void try_to_free_low(struct hstat
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
-			update_and_free_page(page);
+			update_and_free_page(h, page);
 			h->free_huge_pages--;
 			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
@@ -888,10 +961,9 @@ static inline void try_to_free_low(struc
 #endif
 
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(unsigned long count)
+static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
-	struct hstate *h = &default_hstate;
 
 	/*
 	 * Increase the pool size
@@ -962,8 +1034,19 @@ int hugetlb_sysctl_handler(struct ctl_ta
 			   struct file *file, void __user *buffer,
 			   size_t *length, loff_t *ppos)
 {
+	struct hstate *h = &default_hstate;
+	unsigned long tmp;
+
+	if (!write)
+		tmp = h->max_huge_pages;
+
+	table->data = &tmp;
+	table->maxlen = sizeof(unsigned long);
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
-	max_huge_pages = set_max_huge_pages(max_huge_pages);
+
+	if (write)
+		h->max_huge_pages = set_max_huge_pages(h, tmp);
+
 	return 0;
 }
 
@@ -984,10 +1067,21 @@ int hugetlb_overcommit_handler(struct ct
 			size_t *length, loff_t *ppos)
 {
 	struct hstate *h = &default_hstate;
+	unsigned long tmp;
+
+	if (!write)
+		tmp = h->nr_overcommit_huge_pages;
+
+	table->data = &tmp;
+	table->maxlen = sizeof(unsigned long);
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
-	spin_lock(&hugetlb_lock);
-	h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
-	spin_unlock(&hugetlb_lock);
+
+	if (write) {
+		spin_lock(&hugetlb_lock);
+		h->nr_overcommit_huge_pages = tmp;
+		spin_unlock(&hugetlb_lock);
+	}
+
 	return 0;
 }
 
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 19:55:17.000000000 +1000
@@ -36,8 +36,6 @@ int hugetlb_reserve_pages(struct inode *
 						struct vm_area_struct *vma);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 
-extern unsigned long max_huge_pages;
-extern unsigned long sysctl_overcommit_huge_pages;
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
@@ -181,7 +179,17 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 };
 
-extern struct hstate default_hstate;
+void __init hugetlb_add_hstate(unsigned order);
+struct hstate *size_to_hstate(unsigned long size);
+
+#ifndef HUGE_MAX_HSTATE
+#define HUGE_MAX_HSTATE 1
+#endif
+
+extern struct hstate hstates[HUGE_MAX_HSTATE];
+extern unsigned int default_hstate_idx;
+
+#define default_hstate (hstates[default_hstate_idx])
 
 static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
 {
@@ -228,6 +236,11 @@ static inline unsigned int blocks_per_hu
 	return huge_page_size(h) / 512;
 }
 
+static inline struct hstate *page_hstate(struct page *page)
+{
+	return size_to_hstate(PAGE_SIZE << compound_order(page));
+}
+
 #else
 struct hstate {};
 #define hstate_file(f) NULL
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2008-06-03 19:52:52.000000000 +1000
+++ linux-2.6/kernel/sysctl.c	2008-06-03 19:55:17.000000000 +1000
@@ -915,7 +915,7 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	 {
 		.procname	= "nr_hugepages",
-		.data		= &max_huge_pages,
+		.data		= NULL,
 		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= &hugetlb_sysctl_handler,
@@ -941,10 +941,12 @@ static struct ctl_table vm_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "nr_overcommit_hugepages",
-		.data		= &sysctl_overcommit_huge_pages,
-		.maxlen		= sizeof(sysctl_overcommit_huge_pages),
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= &hugetlb_overcommit_handler,
+		.extra1		= (void *)&hugetlb_zero,
+		.extra2		= (void *)&hugetlb_infinity,
 	},
 #endif
 	{

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 04/21] hugetlbfs: per mount hstates
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (2 preceding siblings ...)
  2008-06-03  9:59 ` [patch 03/21] hugetlb: multiple hstates npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 05/21] hugetlb: new sysfs interface npiggin
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: hugetlbfs-per-mount-hstate.patch --]
[-- Type: text/plain, Size: 8059 bytes --]

Add support to have individual hstates for each hugetlbfs mount

- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- Set up pointers to a suitable hstate for the selected page size in
the super block, the inode and the vma.
- Change the hstate accessors to use this information
- Add code to the hstate init function to set parsed_hstate for command
line processing
- Handle duplicate hstate registrations to make the command line user-proof

[np: take hstate out of hugetlbfs inode and vma->vm_private_data]
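
Purely as a sketch (not code from this patch; the example_* name is made up),
callers that need the page size of a mapping now go through the
vma -> file -> inode -> super block chain instead of the global HPAGE_SIZE:

/* Per-mount page size lookup through the reworked accessors. */
static unsigned long example_mapping_huge_page_size(struct vm_area_struct *vma)
{
	struct hstate *h = hstate_vma(vma);	/* hstate of the backing mount */

	return huge_page_size(h);
}

With the pagesize= mount option, different mounts can therefore hand out
different page sizes to their mappings.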

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/hugetlbfs/inode.c    |   48 ++++++++++++++++++++++++++++++++++++++----------
 include/linux/hugetlb.h |   14 +++++++++-----
 mm/hugetlb.c            |   16 +++-------------
 mm/memory.c             |   18 ++++++++++++++++--
 4 files changed, 66 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 19:55:17.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 19:56:16.000000000 +1000
@@ -100,6 +100,7 @@ struct hugetlbfs_config {
 	umode_t mode;
 	long	nr_blocks;
 	long	nr_inodes;
+	struct hstate *hstate;
 };
 
 struct hugetlbfs_sb_info {
@@ -108,6 +109,7 @@ struct hugetlbfs_sb_info {
 	long	max_inodes;   /* inodes allowed */
 	long	free_inodes;  /* inodes free */
 	spinlock_t	stat_lock;
+	struct hstate *hstate;
 };
 
 
@@ -191,19 +193,21 @@ extern unsigned int default_hstate_idx;
 
 #define default_hstate (hstates[default_hstate_idx])
 
-static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+static inline struct hstate *hstate_inode(struct inode *i)
 {
-	return &default_hstate;
+	struct hugetlbfs_sb_info *hsb;
+	hsb = HUGETLBFS_SB(i->i_sb);
+	return hsb->hstate;
 }
 
 static inline struct hstate *hstate_file(struct file *f)
 {
-	return &default_hstate;
+	return hstate_inode(f->f_dentry->d_inode);
 }
 
-static inline struct hstate *hstate_inode(struct inode *i)
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
 {
-	return &default_hstate;
+	return hstate_file(vma->vm_file);
 }
 
 static inline unsigned long huge_page_size(struct hstate *h)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2008-06-03 19:54:53.000000000 +1000
+++ linux-2.6/fs/hugetlbfs/inode.c	2008-06-03 19:56:16.000000000 +1000
@@ -53,6 +53,7 @@ int sysctl_hugetlb_shm_group;
 enum {
 	Opt_size, Opt_nr_inodes,
 	Opt_mode, Opt_uid, Opt_gid,
+	Opt_pagesize,
 	Opt_err,
 };
 
@@ -62,6 +63,7 @@ static match_table_t tokens = {
 	{Opt_mode,	"mode=%o"},
 	{Opt_uid,	"uid=%u"},
 	{Opt_gid,	"gid=%u"},
+	{Opt_pagesize,	"pagesize=%s"},
 	{Opt_err,	NULL},
 };
 
@@ -750,6 +752,8 @@ hugetlbfs_parse_options(char *options, s
 	char *p, *rest;
 	substring_t args[MAX_OPT_ARGS];
 	int option;
+	unsigned long long size = 0;
+	enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE;
 
 	if (!options)
 		return 0;
@@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
 			break;
 
 		case Opt_size: {
- 			unsigned long long size;
 			/* memparse() will accept a K/M/G without a digit */
 			if (!isdigit(*args[0].from))
 				goto bad_val;
 			size = memparse(args[0].from, &rest);
-			if (*rest == '%') {
-				size <<= HPAGE_SHIFT;
-				size *= max_huge_pages;
-				do_div(size, 100);
-			}
-			pconfig->nr_blocks = (size >> HPAGE_SHIFT);
+			setsize = SIZE_STD;
+			if (*rest == '%')
+				setsize = SIZE_PERCENT;
 			break;
 		}
 
@@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
 			pconfig->nr_inodes = memparse(args[0].from, &rest);
 			break;
 
+		case Opt_pagesize: {
+			unsigned long ps;
+			ps = memparse(args[0].from, &rest);
+			pconfig->hstate = size_to_hstate(ps);
+			if (!pconfig->hstate) {
+				printk(KERN_ERR
+				"hugetlbfs: Unsupported page size %lu MB\n",
+					ps >> 20);
+				return -EINVAL;
+			}
+			break;
+		}
+
 		default:
 			printk(KERN_ERR "hugetlbfs: Bad mount option: \"%s\"\n",
 				 p);
@@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
 			break;
 		}
 	}
+
+	/* Do size after hstate is set up */
+	if (setsize > NO_SIZE) {
+		struct hstate *h = pconfig->hstate;
+		if (setsize == SIZE_PERCENT) {
+			size <<= huge_page_shift(h);
+			size *= h->max_huge_pages;
+			do_div(size, 100);
+		}
+		pconfig->nr_blocks = (size >> huge_page_shift(h));
+	}
+
 	return 0;
 
 bad_val:
@@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block 
 	config.uid = current->fsuid;
 	config.gid = current->fsgid;
 	config.mode = 0755;
+	config.hstate = &default_hstate;
 	ret = hugetlbfs_parse_options(data, &config);
 	if (ret)
 		return ret;
@@ -840,14 +866,15 @@ hugetlbfs_fill_super(struct super_block 
 	if (!sbinfo)
 		return -ENOMEM;
 	sb->s_fs_info = sbinfo;
+	sbinfo->hstate = config.hstate;
 	spin_lock_init(&sbinfo->stat_lock);
 	sbinfo->max_blocks = config.nr_blocks;
 	sbinfo->free_blocks = config.nr_blocks;
 	sbinfo->max_inodes = config.nr_inodes;
 	sbinfo->free_inodes = config.nr_inodes;
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = HPAGE_SIZE;
-	sb->s_blocksize_bits = HPAGE_SHIFT;
+	sb->s_blocksize = huge_page_size(config.hstate);
+	sb->s_blocksize_bits = huge_page_shift(config.hstate);
 	sb->s_magic = HUGETLBFS_MAGIC;
 	sb->s_op = &hugetlbfs_ops;
 	sb->s_time_gran = 1;
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:55:41.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:16.000000000 +1000
@@ -1334,19 +1334,9 @@ void __unmap_hugepage_range(struct vm_ar
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	/*
-	 * It is undesirable to test vma->vm_file as it should be non-null
-	 * for valid hugetlb area. However, vm_file will be NULL in the error
-	 * cleanup path of do_mmap_pgoff. When hugetlbfs ->mmap method fails,
-	 * do_mmap_pgoff() nullifies vma->vm_file before calling this function
-	 * to clean up. Since no pte has actually been setup, it is safe to
-	 * do nothing in this case.
-	 */
-	if (vma->vm_file) {
-		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
-		__unmap_hugepage_range(vma, start, end, ref_page);
-		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
-	}
+	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	__unmap_hugepage_range(vma, start, end, ref_page);
+	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 }
 
 /*
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/mm/memory.c	2008-06-03 19:56:16.000000000 +1000
@@ -902,9 +902,23 @@ unsigned long unmap_vmas(struct mmu_gath
 			}
 
 			if (unlikely(is_vm_hugetlb_page(vma))) {
-				unmap_hugepage_range(vma, start, end, NULL);
-				zap_work -= (end - start) /
+				/*
+				 * It is undesirable to test vma->vm_file as it
+				 * should be non-null for valid hugetlb area.
+				 * However, vm_file will be NULL in the error
+				 * cleanup path of do_mmap_pgoff. When
+				 * hugetlbfs ->mmap method fails,
+				 * do_mmap_pgoff() nullifies vma->vm_file
+				 * before calling this function to clean up.
+				 * Since no pte has actually been setup, it is
+				 * safe to do nothing in this case.
+				 */
+				if (vma->vm_file) {
+					unmap_hugepage_range(vma, start, end, NULL);
+					zap_work -= (end - start) /
 					pages_per_huge_page(hstate_vma(vma));
+				}
+
 				start = end;
 			} else
 				start = unmap_page_range(*tlbp, vma,

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 05/21] hugetlb: new sysfs interface
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (3 preceding siblings ...)
  2008-06-03 10:00 ` [patch 04/21] hugetlbfs: per mount hstates npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 06/21] hugetlb: abstract numa round robin selection npiggin
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Greg Kroah-Hartman, kniht, andi,
	abh, joachim.deguara

[-- Attachment #1: hugetlb-sysfs-reporting.patch --]
[-- Type: text/plain, Size: 10685 bytes --]

Provide new hugepages user APIs that are more suited to multiple hstates in
sysfs. There is a new directory, /sys/kernel/hugepages. Underneath that
directory there will be a directory per supported hugepage size, e.g.:

/sys/kernel/hugepages/hugepages-64kB
/sys/kernel/hugepages/hugepages-16384kB
/sys/kernel/hugepages/hugepages-16777216kB

corresponding to 64k, 16m and 16g respectively. Within each
hugepages-size directory there are a number of files, corresponding to
the tracked counters in the hstate, e.g.:

/sys/kernel/hugepages/hugepages-64kB/nr_hugepages
/sys/kernel/hugepages/hugepages-64kB/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-64kB/free_hugepages
/sys/kernel/hugepages/hugepages-64kB/resv_hugepages
/sys/kernel/hugepages/hugepages-64kB/surplus_hugepages

Of these files, the first two are read-write and the latter three are
read-only. The size of the hugepage being manipulated is trivially
deducible from the enclosing directory and is always expressed in kB (to
match meminfo).
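
For illustration only (not part of the patch; "example_hugepages" is an
invented name), a further read-only per-hstate counter would follow the same
pattern as the attributes added below:

static ssize_t example_hugepages_show(struct kobject *kobj,
					struct kobj_attribute *attr, char *buf)
{
	struct hstate *h = kobj_to_hstate(kobj);
	return sprintf(buf, "%lu\n", h->nr_huge_pages);
}
HSTATE_ATTR_RO(example_hugepages);
/* ...plus an &example_hugepages_attr.attr entry in hstate_attrs[] */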

Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 19:56:16.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 19:56:43.000000000 +1000
@@ -164,6 +164,7 @@ unsigned long hugetlb_get_unmapped_area(
 
 #ifdef CONFIG_HUGETLB_PAGE
 
+#define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
 	int hugetlb_next_nid;
@@ -179,6 +180,7 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+	char name[HSTATE_NAME_LEN];
 };
 
 void __init hugetlb_add_hstate(unsigned order);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:16.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:43.000000000 +1000
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/sysfs.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -867,72 +868,6 @@ static void __init report_hugepages(void
 	}
 }
 
-static int __init hugetlb_init(void)
-{
-	BUILD_BUG_ON(HPAGE_SHIFT == 0);
-
-	if (!size_to_hstate(HPAGE_SIZE)) {
-		hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
-		parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
-	}
-	default_hstate_idx = size_to_hstate(HPAGE_SIZE) - hstates;
-
-	hugetlb_init_hstates();
-
-	report_hugepages();
-
-	return 0;
-}
-module_init(hugetlb_init);
-
-/* Should be called on processing a hugepagesz=... option */
-void __init hugetlb_add_hstate(unsigned order)
-{
-	struct hstate *h;
-	if (size_to_hstate(PAGE_SIZE << order)) {
-		printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
-		return;
-	}
-	BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
-	BUG_ON(order == 0);
-	h = &hstates[max_hstate++];
-	h->order = order;
-	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
-	hugetlb_init_one_hstate(h);
-	parsed_hstate = h;
-}
-
-static int __init hugetlb_setup(char *s)
-{
-	unsigned long *mhp;
-
-	/*
-	 * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
-	 * so this hugepages= parameter goes to the "default hstate".
-	 */
-	if (!max_hstate)
-		mhp = &default_hstate_max_huge_pages;
-	else
-		mhp = &parsed_hstate->max_huge_pages;
-
-	if (sscanf(s, "%lu", mhp) <= 0)
-		*mhp = 0;
-
-	return 1;
-}
-__setup("hugepages=", hugetlb_setup);
-
-static unsigned int cpuset_mems_nr(unsigned int *array)
-{
-	int node;
-	unsigned int nr = 0;
-
-	for_each_node_mask(node, cpuset_current_mems_allowed)
-		nr += array[node];
-
-	return nr;
-}
-
 #ifdef CONFIG_SYSCTL
 #ifdef CONFIG_HIGHMEM
 static void try_to_free_low(struct hstate *h, unsigned long count)
@@ -1030,6 +965,233 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+#define HSTATE_ATTR_RO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+#define HSTATE_ATTR(_name) \
+	static struct kobj_attribute _name##_attr = \
+		__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct kobject *hugepages_kobj;
+static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj)
+{
+	int i;
+	for (i = 0; i < HUGE_MAX_HSTATE; i++)
+		if (hstate_kobjs[i] == kobj)
+			return &hstates[i];
+	BUG();
+	return NULL;
+}
+
+static ssize_t nr_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	return sprintf(buf, "%lu\n", h->nr_huge_pages);
+}
+static ssize_t nr_hugepages_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int err;
+	unsigned long input;
+	struct hstate *h = kobj_to_hstate(kobj);
+
+	err = strict_strtoul(buf, 10, &input);
+	if (err)
+		return 0;
+
+	h->max_huge_pages = set_max_huge_pages(h, input);
+
+	return count;
+}
+HSTATE_ATTR(nr_hugepages);
+
+static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
+}
+static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int err;
+	unsigned long input;
+	struct hstate *h = kobj_to_hstate(kobj);
+
+	err = strict_strtoul(buf, 10, &input);
+	if (err)
+		return 0;
+
+	spin_lock(&hugetlb_lock);
+	h->nr_overcommit_huge_pages = input;
+	spin_unlock(&hugetlb_lock);
+
+	return count;
+}
+HSTATE_ATTR(nr_overcommit_hugepages);
+
+static ssize_t free_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	return sprintf(buf, "%lu\n", h->free_huge_pages);
+}
+HSTATE_ATTR_RO(free_hugepages);
+
+static ssize_t resv_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	return sprintf(buf, "%lu\n", h->resv_huge_pages);
+}
+HSTATE_ATTR_RO(resv_hugepages);
+
+static ssize_t surplus_hugepages_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj);
+	return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+}
+HSTATE_ATTR_RO(surplus_hugepages);
+
+static struct attribute *hstate_attrs[] = {
+	&nr_hugepages_attr.attr,
+	&nr_overcommit_hugepages_attr.attr,
+	&free_hugepages_attr.attr,
+	&resv_hugepages_attr.attr,
+	&surplus_hugepages_attr.attr,
+	NULL,
+};
+
+static struct attribute_group hstate_attr_group = {
+	.attrs = hstate_attrs,
+};
+
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+{
+	int retval;
+
+	hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
+							hugepages_kobj);
+	if (!hstate_kobjs[h - hstates])
+		return -ENOMEM;
+
+	retval = sysfs_create_group(hstate_kobjs[h - hstates],
+							&hstate_attr_group);
+	if (retval)
+		kobject_put(hstate_kobjs[h - hstates]);
+
+	return retval;
+}
+
+static void __init hugetlb_sysfs_init(void)
+{
+	struct hstate *h;
+	int err;
+
+	hugepages_kobj = kobject_create_and_add("hugepages", kernel_kobj);
+	if (!hugepages_kobj)
+		return;
+
+	for_each_hstate(h) {
+		err = hugetlb_sysfs_add_hstate(h);
+		if (err)
+			printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
+								h->name);
+	}
+}
+#else
+static void __init hugetlb_sysfs_init(void)
+{
+}
+#endif
+
+static int __init hugetlb_init(void)
+{
+	BUILD_BUG_ON(HPAGE_SHIFT == 0);
+
+	if (!size_to_hstate(HPAGE_SIZE)) {
+		hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
+		parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
+	}
+	default_hstate_idx = size_to_hstate(HPAGE_SIZE) - hstates;
+
+	hugetlb_init_hstates();
+
+	report_hugepages();
+
+	hugetlb_sysfs_init();
+
+	return 0;
+}
+module_init(hugetlb_init);
+
+static void __exit hugetlb_exit(void)
+{
+	struct hstate *h;
+
+	for_each_hstate(h) {
+		kobject_put(hstate_kobjs[h - hstates]);
+	}
+
+	kobject_put(hugepages_kobj);
+}
+module_exit(hugetlb_exit);
+
+/* Should be called on processing a hugepagesz=... option */
+void __init hugetlb_add_hstate(unsigned order)
+{
+	struct hstate *h;
+	if (size_to_hstate(PAGE_SIZE << order)) {
+		printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
+		return;
+	}
+	BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+	BUG_ON(order == 0);
+	h = &hstates[max_hstate++];
+	h->order = order;
+	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
+					huge_page_size(h)/1024);
+	hugetlb_init_one_hstate(h);
+	parsed_hstate = h;
+}
+
+static int __init hugetlb_setup(char *s)
+{
+	unsigned long *mhp;
+
+	/*
+	 * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+	 * so this hugepages= parameter goes to the "default hstate".
+	 */
+	if (!max_hstate)
+		mhp = &default_hstate_max_huge_pages;
+	else
+		mhp = &parsed_hstate->max_huge_pages;
+
+	if (sscanf(s, "%lu", mhp) <= 0)
+		*mhp = 0;
+
+	return 1;
+}
+__setup("hugepages=", hugetlb_setup);
+
+static unsigned int cpuset_mems_nr(unsigned int *array)
+{
+	int node;
+	unsigned int nr = 0;
+
+	for_each_node_mask(node, cpuset_current_mems_allowed)
+		nr += array[node];
+
+	return nr;
+}
+
 int hugetlb_sysctl_handler(struct ctl_table *table, int write,
 			   struct file *file, void __user *buffer,
 			   size_t *length, loff_t *ppos)
Index: linux-2.6/Documentation/ABI/testing/sysfs-kernel-hugepages
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/ABI/testing/sysfs-kernel-hugepages	2008-06-03 19:56:43.000000000 +1000
@@ -0,0 +1,14 @@
+What:		/sys/kernel/hugepages/
+Date:		June 2008
+Contact:	Nishanth Aravamudan <nacc@us.ibm.com>, hugetlb maintainers
+Description:
+		/sys/kernel/hugepages/ contains a number of subdirectories
+		of the form hugepages-<size>kb, where <size> is the page size
+		of the hugepages supported by the kernel/CPU combination.
+
+		Under these directories are a number of files:
+		nr_hugepages - minimum number of hugepages reserved
+		nr_overcommit_hugepages - maximum number that can be allocated
+		free_hugepages - number of hugepages free
+		surplus_hugepages -
+		resv_hugepages -

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 06/21] hugetlb: abstract numa round robin selection
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (4 preceding siblings ...)
  2008-06-03 10:00 ` [patch 05/21] hugetlb: new sysfs interface npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 07/21] mm: introduce non panic alloc_bootmem npiggin
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: hugetlb-abstract-numa-rr.patch --]
[-- Type: text/plain, Size: 2698 bytes --]

Need this as a separate function for a future patch.

No behaviour change.

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/hugetlb.c |   37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:43.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:45.000000000 +1000
@@ -383,6 +383,27 @@ static struct page *alloc_fresh_huge_pag
 	return page;
 }
 
+/*
+ * Use a helper variable to find the next node and then
+ * copy it back to hugetlb_next_nid afterwards:
+ * otherwise there's a window in which a racer might
+ * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * But we don't need to use a spin_lock here: it really
+ * doesn't matter if occasionally a racer chooses the
+ * same nid as we do.  Move nid forward in the mask even
+ * if we just successfully allocated a hugepage so that
+ * the next caller gets hugepages on the next node.
+ */
+static int hstate_next_node(struct hstate *h)
+{
+	int next_nid;
+	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(node_online_map);
+	h->hugetlb_next_nid = next_nid;
+	return next_nid;
+}
+
 static int alloc_fresh_huge_page(struct hstate *h)
 {
 	struct page *page;
@@ -396,21 +417,7 @@ static int alloc_fresh_huge_page(struct 
 		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
 		if (page)
 			ret = 1;
-		/*
-		 * Use a helper variable to find the next node and then
-		 * copy it back to hugetlb_next_nid afterwards:
-		 * otherwise there's a window in which a racer might
-		 * pass invalid nid MAX_NUMNODES to alloc_pages_node.
-		 * But we don't need to use a spin_lock here: it really
-		 * doesn't matter if occasionally a racer chooses the
-		 * same nid as we do.  Move nid forward in the mask even
-		 * if we just successfully allocated a hugepage so that
-		 * the next caller gets hugepages on the next node.
-		 */
-		next_nid = next_node(h->hugetlb_next_nid, node_online_map);
-		if (next_nid == MAX_NUMNODES)
-			next_nid = first_node(node_online_map);
-		h->hugetlb_next_nid = next_nid;
+		next_nid = hstate_next_node(h);
 	} while (!page && h->hugetlb_next_nid != start_nid);
 
 	if (ret)

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 07/21] mm: introduce non panic alloc_bootmem
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (5 preceding siblings ...)
  2008-06-03 10:00 ` [patch 06/21] hugetlb: abstract numa round robin selection npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 08/21] mm: export prep_compound_page to mm npiggin
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

[-- Attachment #1: __alloc_bootmem_node_nopanic.patch --]
[-- Type: text/plain, Size: 1921 bytes --]

Straightforward variant of the existing __alloc_bootmem_node; the only
difference is that it doesn't panic on failure.
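
A minimal caller sketch (not part of this patch; the function and its
arguments are hypothetical) showing the NULL check that replaces the panic:

static void * __init example_try_node_bootmem(int nid, unsigned long size,
					      unsigned long align)
{
	void *addr;

	addr = __alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align, 0);
	if (!addr)
		return NULL;	/* caller can fall back instead of panicking */
	return addr;
}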

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/bootmem.h |    4 ++++
 mm/bootmem.c            |   12 ++++++++++++
 2 files changed, 16 insertions(+)

Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c	2008-06-03 19:52:50.000000000 +1000
+++ linux-2.6/mm/bootmem.c	2008-06-03 19:56:46.000000000 +1000
@@ -576,6 +576,18 @@ void * __init alloc_bootmem_section(unsi
 }
 #endif
 
+void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
+				   unsigned long align, unsigned long goal)
+{
+	void *ptr;
+
+	ptr = alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+	if (ptr)
+		return ptr;
+
+	return __alloc_bootmem_nopanic(size, align, goal);
+}
+
 #ifndef ARCH_LOW_ADDRESS_LIMIT
 #define ARCH_LOW_ADDRESS_LIMIT	0xffffffffUL
 #endif
Index: linux-2.6/include/linux/bootmem.h
===================================================================
--- linux-2.6.orig/include/linux/bootmem.h	2008-06-03 19:52:50.000000000 +1000
+++ linux-2.6/include/linux/bootmem.h	2008-06-03 19:56:46.000000000 +1000
@@ -87,6 +87,10 @@ extern void *__alloc_bootmem_node(pg_dat
 				  unsigned long size,
 				  unsigned long align,
 				  unsigned long goal);
+extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
+				  unsigned long size,
+				  unsigned long align,
+				  unsigned long goal);
 extern unsigned long init_bootmem_node(pg_data_t *pgdat,
 				       unsigned long freepfn,
 				       unsigned long startpfn,

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 08/21] mm: export prep_compound_page to mm
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (6 preceding siblings ...)
  2008-06-03 10:00 ` [patch 07/21] mm: introduce non panic alloc_bootmem npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 09/21] hugetlb: support larger than MAX_ORDER npiggin
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: mm-export-prep_compound_page.patch --]
[-- Type: text/plain, Size: 1711 bytes --]

hugetlb will need to get compound pages from bootmem to handle the case
where they are greater than or equal to MAX_ORDER. Export the constructor function
needed for this.

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/internal.h   |    2 ++
 mm/page_alloc.c |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h	2008-06-03 19:52:49.000000000 +1000
+++ linux-2.6/mm/internal.h	2008-06-03 19:56:48.000000000 +1000
@@ -16,6 +16,8 @@
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+extern void prep_compound_page(struct page *page, unsigned long order);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2008-06-03 19:52:49.000000000 +1000
+++ linux-2.6/mm/page_alloc.c	2008-06-03 19:56:48.000000000 +1000
@@ -288,7 +288,7 @@ static void free_compound_page(struct pa
 	__free_pages_ok(page, compound_order(page));
 }
 
-static void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned long order)
 {
 	int i;
 	int nr_pages = 1 << order;

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 09/21] hugetlb: support larger than MAX_ORDER
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (7 preceding siblings ...)
  2008-06-03 10:00 ` [patch 08/21] mm: export prep_compound_page to mm npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 10/21] hugetlb: support boot allocate different sizes npiggin
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, Andrew Hastings, kniht, andi

[-- Attachment #1: hugetlb-unlimited-order.patch --]
[-- Type: text/plain, Size: 5445 bytes --]

This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
not practical to enlarge MAX_ORDER to cover 1GB pages.

Instead, the 1GB pages are allocated only at boot, from the bootmem
allocator, via the hugepages=... option.

These 1G bootmem pages are never freed. In theory it would be possible
to implement that with some complications, but since it would be a one-way
street (>= MAX_ORDER pages cannot be allocated later) I decided not to
for now.

The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very big
and the ifdef ugliness did not seem worth it.

Known problems: /proc/meminfo and "free" do not display the memory
allocated for GB pages in "Total". This is a little confusing for the
user.
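
A minimal sketch of the guard used throughout the patch (the helper name is
invented; the patch open-codes the comparison at each call site):

/*
 * Gigantic hstates cannot go through the buddy allocator, so runtime
 * allocate/free paths bail out for them; only the boot-time bootmem
 * pages exist and they are never returned.
 */
static inline int example_hstate_is_gigantic(struct hstate *h)
{
	return huge_page_order(h) >= MAX_ORDER;
}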

Acked-by: Andrew Hastings <abh@cray.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/hugetlb.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:45.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:50.000000000 +1000
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/bootmem.h>
 #include <linux/sysfs.h>
 
 #include <asm/page.h>
@@ -307,7 +308,7 @@ static void free_huge_page(struct page *
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
-	if (h->surplus_huge_pages_node[nid]) {
+	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
 		update_and_free_page(h, page);
 		h->surplus_huge_pages--;
 		h->surplus_huge_pages_node[nid]--;
@@ -368,6 +369,9 @@ static struct page *alloc_fresh_huge_pag
 {
 	struct page *page;
 
+	if (h->order >= MAX_ORDER)
+		return NULL;
+
 	page = alloc_pages_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
@@ -434,6 +438,9 @@ static struct page *alloc_buddy_huge_pag
 	struct page *page;
 	unsigned int nid;
 
+	if (h->order >= MAX_ORDER)
+		return NULL;
+
 	/*
 	 * Assume we will successfully allocate the surplus page to
 	 * prevent racing processes from causing the surplus to exceed
@@ -610,6 +617,10 @@ static void return_unused_surplus_pages(
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
+	/* Cannot return gigantic pages currently */
+	if (h->order >= MAX_ORDER)
+		return;
+
 	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
 
 	while (remaining_iterations-- && nr_pages) {
@@ -838,6 +849,63 @@ static struct page *alloc_huge_page(stru
 	return page;
 }
 
+static __initdata LIST_HEAD(huge_boot_pages);
+
+struct huge_bootmem_page {
+	struct list_head list;
+	struct hstate *hstate;
+};
+
+static int __init alloc_bootmem_huge_page(struct hstate *h)
+{
+	struct huge_bootmem_page *m;
+	int nr_nodes = nodes_weight(node_online_map);
+
+	while (nr_nodes) {
+		void *addr;
+
+		addr = __alloc_bootmem_node_nopanic(
+				NODE_DATA(h->hugetlb_next_nid),
+				huge_page_size(h), huge_page_size(h), 0);
+
+		if (addr) {
+			/*
+			 * Use the beginning of the huge page to store the
+			 * huge_bootmem_page struct (until gather_bootmem
+			 * puts them into the mem_map).
+			 */
+			m = addr;
+			if (m)
+				goto found;
+		}
+		hstate_next_node(h);
+		nr_nodes--;
+	}
+	return 0;
+
+found:
+	BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+	/* Put them into a private list first because mem_map is not up yet */
+	list_add(&m->list, &huge_boot_pages);
+	m->hstate = h;
+	return 1;
+}
+
+/* Put bootmem huge pages into the standard lists after mem_map is up */
+static void __init gather_bootmem_prealloc(void)
+{
+	struct huge_bootmem_page *m;
+
+	list_for_each_entry(m, &huge_boot_pages, list) {
+		struct page *page = virt_to_page(m);
+		struct hstate *h = m->hstate;
+		__ClearPageReserved(page);
+		WARN_ON(page_count(page) != 1);
+		prep_compound_page(page, h->order);
+		prep_new_huge_page(h, page, page_to_nid(page));
+	}
+}
+
 static void __init hugetlb_init_one_hstate(struct hstate *h)
 {
 	unsigned long i;
@@ -848,7 +916,10 @@ static void __init hugetlb_init_one_hsta
 	h->hugetlb_next_nid = first_node(node_online_map);
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
-		if (!alloc_fresh_huge_page(h))
+		if (h->order >= MAX_ORDER) {
+			if (!alloc_bootmem_huge_page(h))
+				break;
+		} else if (!alloc_fresh_huge_page(h))
 			break;
 	}
 	h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
@@ -881,6 +952,9 @@ static void try_to_free_low(struct hstat
 {
 	int i;
 
+	if (h->order >= MAX_ORDER)
+		return;
+
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
@@ -907,6 +981,9 @@ static unsigned long set_max_huge_pages(
 {
 	unsigned long min_count, ret;
 
+	if (h->order >= MAX_ORDER)
+		return h->max_huge_pages;
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1129,6 +1206,8 @@ static int __init hugetlb_init(void)
 
 	hugetlb_init_hstates();
 
+	gather_bootmem_prealloc();
+
 	report_hugepages();
 
 	hugetlb_sysfs_init();

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 10/21] hugetlb: support boot allocate different sizes
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (8 preceding siblings ...)
  2008-06-03 10:00 ` [patch 09/21] hugetlb: support larger than MAX_ORDER npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 11/21] hugetlb: printk cleanup npiggin
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, Andrew Hastings, kniht,
	andi

[-- Attachment #1: hugetlb-different-page-sizes.patch --]
[-- Type: text/plain, Size: 3748 bytes --]

Make some infrastructure changes to allow boot-time allocation of different
hugepage sizes.

- move all basic hstate initialisation into hugetlb_add_hstate
- create a new function hugetlb_hstate_alloc_pages() to do the
  actual initial page allocations. Call this function early in
  order to allocate giant pages from bootmem.
- Check for multiple hugepages= parameters
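
As an illustration only (the command line is hypothetical, and hugepagesz=
itself is wired up by a later arch patch), parsing a command line such as
"hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=512" effectively
performs, in order:

	hugetlb_add_hstate(30 - PAGE_SHIFT);		/* hugepagesz=1G */
	parsed_hstate->max_huge_pages = 2;		/* hugepages=2 */
	hugetlb_hstate_alloc_pages(parsed_hstate);	/* >= MAX_ORDER: bootmem now */
	hugetlb_add_hstate(21 - PAGE_SHIFT);		/* hugepagesz=2M */
	parsed_hstate->max_huge_pages = 512;		/* allocated later in hugetlb_init */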

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Andrew Hastings <abh@cray.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/hugetlb.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:50.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:51.000000000 +1000
@@ -906,15 +906,10 @@ static void __init gather_bootmem_preall
 	}
 }
 
-static void __init hugetlb_init_one_hstate(struct hstate *h)
+static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
 
-	for (i = 0; i < MAX_NUMNODES; ++i)
-		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
-
-	h->hugetlb_next_nid = first_node(node_online_map);
-
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (h->order >= MAX_ORDER) {
 			if (!alloc_bootmem_huge_page(h))
@@ -922,7 +917,7 @@ static void __init hugetlb_init_one_hsta
 		} else if (!alloc_fresh_huge_page(h))
 			break;
 	}
-	h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+	h->max_huge_pages = i;
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -930,7 +925,9 @@ static void __init hugetlb_init_hstates(
 	struct hstate *h;
 
 	for_each_hstate(h) {
-		hugetlb_init_one_hstate(h);
+		/* oversize hugepages were init'ed in early boot */
+		if (h->order < MAX_ORDER)
+			hugetlb_hstate_alloc_pages(h);
 	}
 }
 
@@ -1232,6 +1229,8 @@ module_exit(hugetlb_exit);
 void __init hugetlb_add_hstate(unsigned order)
 {
 	struct hstate *h;
+	unsigned long i;
+
 	if (size_to_hstate(PAGE_SIZE << order)) {
 		printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
 		return;
@@ -1241,15 +1240,21 @@ void __init hugetlb_add_hstate(unsigned 
 	h = &hstates[max_hstate++];
 	h->order = order;
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+	h->nr_huge_pages = 0;
+	h->free_huge_pages = 0;
+	for (i = 0; i < MAX_NUMNODES; ++i)
+		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+	h->hugetlb_next_nid = first_node(node_online_map);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
-	hugetlb_init_one_hstate(h);
+
 	parsed_hstate = h;
 }
 
 static int __init hugetlb_setup(char *s)
 {
 	unsigned long *mhp;
+	static unsigned long *last_mhp;
 
 	/*
 	 * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
@@ -1260,9 +1265,25 @@ static int __init hugetlb_setup(char *s)
 	else
 		mhp = &parsed_hstate->max_huge_pages;
 
+	if (mhp == last_mhp) {
+		printk(KERN_WARNING "hugepages= specified twice without "
+			"interleaving hugepagesz=, ignoring\n");
+		return 1;
+	}
+
 	if (sscanf(s, "%lu", mhp) <= 0)
 		*mhp = 0;
 
+	/*
+	 * Global state is always initialized later in hugetlb_init.
+	 * But we need to allocate >= MAX_ORDER hstates here early to still
+	 * use the bootmem allocator.
+	 */
+	if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+		hugetlb_hstate_alloc_pages(parsed_hstate);
+
+	last_mhp = mhp;
+
 	return 1;
 }
 __setup("hugepages=", hugetlb_setup);

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 11/21] hugetlb: printk cleanup
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (9 preceding siblings ...)
  2008-06-03 10:00 ` [patch 10/21] hugetlb: support boot allocate different sizes npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 12/21] hugetlb: introduce pud_huge npiggin
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: hugetlb-printk-cleanup.patch --]
[-- Type: text/plain, Size: 1634 bytes --]

- Reword sentence to clarify meaning with multiple options
- Add support for using GB prefixes for the page size
- Add extra printk to delayed > MAX_ORDER allocation code

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/hugetlb.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:51.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:53.000000000 +1000
@@ -931,15 +931,27 @@ static void __init hugetlb_init_hstates(
 	}
 }
 
+static char * __init memfmt(char *buf, unsigned long n)
+{
+	if (n >= (1UL << 30))
+		sprintf(buf, "%lu GB", n >> 30);
+	else if (n >= (1UL << 20))
+		sprintf(buf, "%lu MB", n >> 20);
+	else
+		sprintf(buf, "%lu KB", n >> 10);
+	return buf;
+}
+
 static void __init report_hugepages(void)
 {
 	struct hstate *h;
 
 	for_each_hstate(h) {
-		printk(KERN_INFO "Total HugeTLB memory allocated, "
-				"%ld %dMB pages\n",
-				h->free_huge_pages,
-				1 << (h->order + PAGE_SHIFT - 20));
+		char buf[32];
+		printk(KERN_INFO "HugeTLB registered %s page size, "
+				 "pre-allocated %ld pages\n",
+			memfmt(buf, huge_page_size(h)),
+			h->free_huge_pages);
 	}
 }
 

-- 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 12/21] hugetlb: introduce pud_huge
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (10 preceding siblings ...)
  2008-06-03 10:00 ` [patch 11/21] hugetlb: printk cleanup npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 13/21] x86: support GB hugepages on 64-bit npiggin
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

[-- Attachment #1: hugetlbfs-huge_pud.patch --]
[-- Type: text/plain, Size: 7091 bytes --]

Straightforward extensions for huge pages located at the PUD level
instead of the PMD level.
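
A rough sketch (not lifted from the patch; the example_* wrapper is
hypothetical) of how a page table walker can use the new hook:

/* Short-circuit the walk at the PUD level, mirroring the pmd_huge() path. */
static struct page *example_follow_pud(struct mm_struct *mm, pgd_t *pgd,
					unsigned long address, int write)
{
	pud_t *pud = pud_offset(pgd, address);

	if (pud_none(*pud))
		return NULL;
	if (pud_huge(*pud))
		return follow_huge_pud(mm, address, pud, write);
	/* otherwise the walk descends to the PMD level as before */
	return NULL;
}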

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 arch/ia64/mm/hugetlbpage.c    |    6 ++++++
 arch/powerpc/mm/hugetlbpage.c |    5 +++++
 arch/sh/mm/hugetlbpage.c      |    5 +++++
 arch/sparc64/mm/hugetlbpage.c |    5 +++++
 arch/x86/mm/hugetlbpage.c     |   25 ++++++++++++++++++++++++-
 include/linux/hugetlb.h       |    5 +++++
 mm/hugetlb.c                  |    9 +++++++++
 mm/memory.c                   |   10 +++++++++-
 8 files changed, 68 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 19:56:43.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 19:56:55.000000000 +1000
@@ -50,7 +50,10 @@ struct page *follow_huge_addr(struct mm_
 			      int write);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 				pmd_t *pmd, int write);
+struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
+				pud_t *pud, int write);
 int pmd_huge(pmd_t pmd);
+int pud_huge(pud_t pmd);
 void hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
@@ -78,8 +81,10 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_report_meminfo(buf)		0
 #define hugetlb_report_node_meminfo(n, buf)	0
 #define follow_huge_pmd(mm, addr, pmd, write)	NULL
+#define follow_huge_pud(mm, addr, pud, write)	NULL
 #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
+#define pud_huge(x)	0
 #define is_hugepage_only_range(mm, addr, len)	0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, write)	({ BUG(); 0; })
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
@@ -107,6 +107,12 @@ int pmd_huge(pmd_t pmd)
 {
 	return 0;
 }
+
+int pud_huge(pud_t pud)
+{
+	return 0;
+}
+
 struct page *
 follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
 {
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
@@ -369,6 +369,11 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 
+int pud_huge(pud_t pud)
+{
+	return 0;
+}
+
 struct page *
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd, int write)
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/arch/sh/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
@@ -79,6 +79,11 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 
+int pud_huge(pud_t pud)
+{
+	return 0;
+}
+
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 			     pmd_t *pmd, int write)
 {
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
@@ -295,6 +295,11 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 
+int pud_huge(pud_t pud)
+{
+	return 0;
+}
+
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 			     pmd_t *pmd, int write)
 {
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
@@ -189,6 +189,11 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 
+int pud_huge(pud_t pud)
+{
+	return 0;
+}
+
 struct page *
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd, int write)
@@ -209,6 +214,11 @@ int pmd_huge(pmd_t pmd)
 	return !!(pmd_val(pmd) & _PAGE_PSE);
 }
 
+int pud_huge(pud_t pud)
+{
+	return 0;
+}
+
 struct page *
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd, int write)
@@ -217,9 +227,22 @@ follow_huge_pmd(struct mm_struct *mm, un
 
 	page = pte_page(*(pte_t *)pmd);
 	if (page)
-		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
 	return page;
 }
+
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+		pud_t *pud, int write)
+{
+	struct page *page;
+
+	page = pte_page(*(pte_t *)pud);
+	if (page)
+		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
+	return page;
+}
+
 #endif
 
 /* x86_64 also uses this file */
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:53.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:55.000000000 +1000
@@ -1896,6 +1896,15 @@ int hugetlb_fault(struct mm_struct *mm, 
 	return ret;
 }
 
+/* Can be overriden by architectures */
+__attribute__((weak)) struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+	       pud_t *pud, int write)
+{
+	BUG();
+	return NULL;
+}
+
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
 			unsigned long *position, int *length, int i,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-06-03 19:56:16.000000000 +1000
+++ linux-2.6/mm/memory.c	2008-06-03 19:56:55.000000000 +1000
@@ -999,19 +999,24 @@ struct page *follow_page(struct vm_area_
 		goto no_page_table;
 
 	pud = pud_offset(pgd, address);
-	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+	if (pud_none(*pud))
+		goto no_page_table;
+	if (pud_huge(*pud)) {
+		BUG_ON(flags & FOLL_GET);
+		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
+		goto out;
+	}
+	if (unlikely(pud_bad(*pud)))
 		goto no_page_table;
-	
+
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-
 	if (pmd_huge(*pmd)) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
-
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -1542,6 +1547,8 @@ static int apply_to_pmd_range(struct mm_
 	unsigned long next;
 	int err;
 
+	BUG_ON(pud_huge(*pud));
+
 	pmd = pmd_alloc(mm, pud, addr);
 	if (!pmd)
 		return -ENOMEM;

-- 


* [patch 13/21] x86: support GB hugepages on 64-bit
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (11 preceding siblings ...)
  2008-06-03 10:00 ` [patch 12/21] hugetlb: introduce pud_huge npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 14/21] x86: add hugepagesz option " npiggin
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: x86-support-GB-hugetlb-pages.patch --]
[-- Type: text/plain, Size: 3969 bytes --]

---
 arch/x86/mm/hugetlbpage.c |   33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-06-03 19:56:56.000000000 +1000
@@ -134,9 +134,14 @@ pte_t *huge_pte_alloc(struct mm_struct *
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
 	if (pud) {
-		if (pud_none(*pud))
-			huge_pmd_share(mm, addr, pud);
-		pte = (pte_t *) pmd_alloc(mm, pud, addr);
+		if (sz == PUD_SIZE) {
+			pte = (pte_t *)pud;
+		} else {
+			BUG_ON(sz != PMD_SIZE);
+			if (pud_none(*pud))
+				huge_pmd_share(mm, addr, pud);
+			pte = (pte_t *) pmd_alloc(mm, pud, addr);
+		}
 	}
 	BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
 
@@ -152,8 +157,11 @@ pte_t *huge_pte_offset(struct mm_struct 
 	pgd = pgd_offset(mm, addr);
 	if (pgd_present(*pgd)) {
 		pud = pud_offset(pgd, addr);
-		if (pud_present(*pud))
+		if (pud_present(*pud)) {
+			if (pud_large(*pud))
+				return (pte_t *)pud;
 			pmd = pmd_offset(pud, addr);
+		}
 	}
 	return (pte_t *) pmd;
 }
@@ -216,7 +224,7 @@ int pmd_huge(pmd_t pmd)
 
 int pud_huge(pud_t pud)
 {
-	return 0;
+	return !!(pud_val(pud) & _PAGE_PSE);
 }
 
 struct page *
@@ -252,6 +260,7 @@ static unsigned long hugetlb_get_unmappe
 		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
+	struct hstate *h = hstate_file(file);
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long start_addr;
@@ -264,7 +273,7 @@ static unsigned long hugetlb_get_unmappe
 	}
 
 full_search:
-	addr = ALIGN(start_addr, HPAGE_SIZE);
+	addr = ALIGN(start_addr, huge_page_size(h));
 
 	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
@@ -286,7 +295,7 @@ full_search:
 		}
 		if (addr + mm->cached_hole_size < vma->vm_start)
 		        mm->cached_hole_size = vma->vm_start - addr;
-		addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+		addr = ALIGN(vma->vm_end, huge_page_size(h));
 	}
 }
 
@@ -294,6 +303,7 @@ static unsigned long hugetlb_get_unmappe
 		unsigned long addr0, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
+	struct hstate *h = hstate_file(file);
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev_vma;
 	unsigned long base = mm->mmap_base, addr = addr0;
@@ -314,7 +324,7 @@ try_again:
 		goto fail;
 
 	/* either no address requested or cant fit in requested address hole */
-	addr = (mm->free_area_cache - len) & HPAGE_MASK;
+	addr = (mm->free_area_cache - len) & huge_page_mask(h);
 	do {
 		/*
 		 * Lookup failure means no vma is above this address,
@@ -345,7 +355,7 @@ try_again:
 		        largest_hole = vma->vm_start - addr;
 
 		/* try just below the current vma->vm_start */
-		addr = (vma->vm_start - len) & HPAGE_MASK;
+		addr = (vma->vm_start - len) & huge_page_mask(h);
 	} while (len <= vma->vm_start);
 
 fail:
@@ -383,10 +393,11 @@ unsigned long
 hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags)
 {
+	struct hstate *h = hstate_file(file);
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 
-	if (len & ~HPAGE_MASK)
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -398,7 +409,7 @@ hugetlb_get_unmapped_area(struct file *f
 	}
 
 	if (addr) {
-		addr = ALIGN(addr, HPAGE_SIZE);
+		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
 		if (TASK_SIZE - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))

-- 


* [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (12 preceding siblings ...)
  2008-06-03 10:00 ` [patch 13/21] x86: support GB hugepages on 64-bit npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 17:48   ` Dave Hansen
  2008-06-03 10:00 ` [patch 15/21] hugetlb: override default huge page size npiggin
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

[-- Attachment #1: x86-64-implement-hugepagesz.patch --]
[-- Type: text/plain, Size: 3247 bytes --]

Add a hugepagesz=... option to x86-64, similar to IA64, PPC, etc.

This finally allows selecting GB pages for hugetlbfs on x86, now
that all the infrastructure is in place.
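
As an illustration (the numbers are arbitrary, not taken from this patch),
the option can then be interleaved with hugepages= on the boot command line
to reserve several sizes at once, as the documentation hunk below describes:

	hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=512

The first pair reserves four 1GB pages (possible only at boot time), the
second reserves 512 2MB pages.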

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 Documentation/kernel-parameters.txt |   11 +++++++++--
 arch/x86/mm/hugetlbpage.c           |   17 +++++++++++++++++
 include/asm-x86/page.h              |    2 ++
 3 files changed, 28 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-06-03 19:56:56.000000000 +1000
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-06-03 19:56:57.000000000 +1000
@@ -425,3 +425,20 @@ hugetlb_get_unmapped_area(struct file *f
 
 #endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
 
+#ifdef CONFIG_X86_64
+static __init int setup_hugepagesz(char *opt)
+{
+	unsigned long ps = memparse(opt, &opt);
+	if (ps == PMD_SIZE) {
+		hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+	} else if (ps == PUD_SIZE && cpu_has_gbpages) {
+		hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+	} else {
+		printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
+			ps >> 20);
+		return 0;
+	}
+	return 1;
+}
+__setup("hugepagesz=", setup_hugepagesz);
+#endif
Index: linux-2.6/include/asm-x86/page.h
===================================================================
--- linux-2.6.orig/include/asm-x86/page.h	2008-06-03 19:52:47.000000000 +1000
+++ linux-2.6/include/asm-x86/page.h	2008-06-03 19:56:57.000000000 +1000
@@ -29,6 +29,8 @@
 #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
 
+#define HUGE_MAX_HSTATE 2
+
 /* to align the pointer to the (next) page boundary */
 #define PAGE_ALIGN(addr)	(((addr)+PAGE_SIZE-1)&PAGE_MASK)
 
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt	2008-06-03 19:52:47.000000000 +1000
+++ linux-2.6/Documentation/kernel-parameters.txt	2008-06-03 19:56:58.000000000 +1000
@@ -765,8 +765,15 @@ and is between 256 and 4096 characters. 
 	hisax=		[HW,ISDN]
 			See Documentation/isdn/README.HiSax.
 
-	hugepages=	[HW,X86-32,IA-64] Maximal number of HugeTLB pages.
-	hugepagesz=	[HW,IA-64,PPC] The size of the HugeTLB pages.
+	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+			On x86 this option can be specified multiple times
+			interleaved with hugepages= to reserve huge pages
+			of different sizes. Valid pages sizes on x86-64
+			are 2M (when the CPU supports "pse") and 1G (when the
+			CPU supports the "pdpe1gb" cpuinfo flag)
+			Note that 1GB pages can only be allocated at boot time
+			using hugepages= and not freed afterwards.
 
 	i8042.direct	[HW] Put keyboard port into non-translated mode
 	i8042.dumbkbd	[HW] Pretend that controller can only read data from

-- 


* [patch 15/21] hugetlb: override default huge page size
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (13 preceding siblings ...)
  2008-06-03 10:00 ` [patch 14/21] x86: add hugepagesz option " npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 16/21] hugetlb: allow arch override of hugepage allocation npiggin
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

[-- Attachment #1: hugetlb-non-default-hstate.patch --]
[-- Type: text/plain, Size: 3649 bytes --]

Allow configurations where the default huge page size differs from the
traditional HPAGE_SIZE. The default huge page size is the one
represented in the legacy /proc ABIs, used for SHM, and used by default
when mounting hugetlbfs filesystems.

This is implemented with a new kernel option default_hugepagesz=, which
defaults to HPAGE_SIZE if not specified.
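
For illustration (sizes chosen arbitrarily, and 1G assumes a CPU that
supports it), booting with

	default_hugepagesz=1G hugepagesz=1G hugepages=16

would make the legacy /proc interfaces, SHM, and hugetlbfs mounts that do
not request a particular size all operate on 1GB pages rather than
HPAGE_SIZE.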

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/hugetlbfs/inode.c |    2 ++
 mm/hugetlb.c         |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:55.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:56:59.000000000 +1000
@@ -34,6 +34,7 @@ struct hstate hstates[HUGE_MAX_HSTATE];
 /* for command line parsing */
 static struct hstate * __initdata parsed_hstate = NULL;
 static unsigned long __initdata default_hstate_max_huge_pages = 0;
+static unsigned long __initdata default_hstate_size = HPAGE_SIZE;
 
 #define for_each_hstate(h) \
 	for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
@@ -1207,11 +1208,14 @@ static int __init hugetlb_init(void)
 {
 	BUILD_BUG_ON(HPAGE_SHIFT == 0);
 
-	if (!size_to_hstate(HPAGE_SIZE)) {
-		hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
-		parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
-	}
-	default_hstate_idx = size_to_hstate(HPAGE_SIZE) - hstates;
+	if (!size_to_hstate(default_hstate_size)) {
+		default_hstate_size = HPAGE_SIZE;
+		if (!size_to_hstate(default_hstate_size))
+			hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
+	}
+	default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
+	if (default_hstate_max_huge_pages)
+		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
 
 	hugetlb_init_hstates();
 
@@ -1263,7 +1267,7 @@ void __init hugetlb_add_hstate(unsigned 
 	parsed_hstate = h;
 }
 
-static int __init hugetlb_setup(char *s)
+static int __init hugetlb_nrpages_setup(char *s)
 {
 	unsigned long *mhp;
 	static unsigned long *last_mhp;
@@ -1298,7 +1302,14 @@ static int __init hugetlb_setup(char *s)
 
 	return 1;
 }
-__setup("hugepages=", hugetlb_setup);
+__setup("hugepages=", hugetlb_nrpages_setup);
+
+static int __init hugetlb_default_setup(char *s)
+{
+	default_hstate_size = memparse(s, &s);
+	return 1;
+}
+__setup("default_hugepagesz=", hugetlb_default_setup);
 
 static unsigned int cpuset_mems_nr(unsigned int *array)
 {
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt	2008-06-03 19:56:58.000000000 +1000
+++ linux-2.6/Documentation/kernel-parameters.txt	2008-06-03 19:56:59.000000000 +1000
@@ -774,6 +774,13 @@ and is between 256 and 4096 characters. 
 			CPU supports the "pdpe1gb" cpuinfo flag)
 			Note that 1GB pages can only be allocated at boot time
 			using hugepages= and not freed afterwards.
+	default_hugepagesz=
+			[same as hugepagesz=] The size of the default
+			HugeTLB page size. This is the size represented by
+			the legacy /proc/ hugepages APIs, used for SHM, and
+			default size when mounting hugetlbfs filesystems.
+			Defaults to the default architecture's huge page size
+			if not specified.
 
 	i8042.direct	[HW] Put keyboard port into non-translated mode
 	i8042.dumbkbd	[HW] Pretend that controller can only read data from

-- 


* [patch 16/21] hugetlb: allow arch override of hugepage allocation
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (14 preceding siblings ...)
  2008-06-03 10:00 ` [patch 15/21] hugetlb: override default huge page size npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 17/21] powerpc: function to allocate gigantic hugepages npiggin
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, Jon Tollefson, kniht,
	andi, abh, joachim.deguara

[-- Attachment #1: hugetlb-allow-arch-override-hugepage-allocation.patch --]
[-- Type: text/plain, Size: 2880 bytes --]

Allow alloc_bootmem_huge_page() to be overridden by architectures that can't
always use bootmem. This requires huge_boot_pages to be available for
use by this function. The 16G pages on ppc64 have to be reserved prior
to boot. The locations of these pages are indicated in the device
tree.
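
For context, a minimal userspace sketch of the weak-symbol pattern this
relies on (the function name and behaviour are illustrative only, not the
kernel code): the weak definition is used unless some other object file in
the link supplies a strong definition of the same symbol.

#include <stdio.h>

/* Generic fallback, analogous to the weak alloc_bootmem_huge_page() here. */
__attribute__((weak)) int alloc_boot_page(void)
{
	printf("generic bootmem-style allocation\n");
	return 0;
}

/*
 * An architecture that cannot use the generic path would provide a strong
 * (non-weak) definition of alloc_boot_page() in its own file; the linker
 * then picks that definition and the weak default above is dropped.
 */

int main(void)
{
	return alloc_boot_page();
}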

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 include/linux/hugetlb.h |   10 ++++++++++
 mm/hugetlb.c            |   12 ++++--------
 2 files changed, 14 insertions(+), 8 deletions(-)


Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 19:56:55.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 19:57:00.000000000 +1000
@@ -39,6 +39,7 @@ void hugetlb_unreserve_pages(struct inod
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
+extern struct list_head huge_boot_pages;
 
 /* arch callbacks */
 
@@ -188,6 +189,14 @@ struct hstate {
 	char name[HSTATE_NAME_LEN];
 };
 
+struct huge_bootmem_page {
+	struct list_head list;
+	struct hstate *hstate;
+};
+
+/* arch callback */
+int __init alloc_bootmem_huge_page(struct hstate *h);
+
 void __init hugetlb_add_hstate(unsigned order);
 struct hstate *size_to_hstate(unsigned long size);
 
@@ -254,6 +263,7 @@ static inline struct hstate *page_hstate
 
 #else
 struct hstate {};
+#define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
 #define hstate_inode(i) NULL
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 19:56:59.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 19:57:00.000000000 +1000
@@ -31,6 +31,8 @@ static int max_hstate = 0;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
+__initdata LIST_HEAD(huge_boot_pages);
+
 /* for command line parsing */
 static struct hstate * __initdata parsed_hstate = NULL;
 static unsigned long __initdata default_hstate_max_huge_pages = 0;
@@ -850,14 +852,7 @@ static struct page *alloc_huge_page(stru
 	return page;
 }
 
-static __initdata LIST_HEAD(huge_boot_pages);
-
-struct huge_bootmem_page {
-	struct list_head list;
-	struct hstate *hstate;
-};
-
-static int __init alloc_bootmem_huge_page(struct hstate *h)
+__attribute__((weak)) int alloc_bootmem_huge_page(struct hstate *h)
 {
 	struct huge_bootmem_page *m;
 	int nr_nodes = nodes_weight(node_online_map);

-- 


* [patch 17/21] powerpc: function to allocate gigantic hugepages
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (15 preceding siblings ...)
  2008-06-03 10:00 ` [patch 16/21] hugetlb: allow arch override of hugepage allocation npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 18/21] powerpc: scan device tree for gigantic pages npiggin
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, Jon Tollefson, kniht,
	andi, abh, joachim.deguara

[-- Attachment #1: powerpc-function-for-gigantic-hugepage-allocation.patch --]
[-- Type: text/plain, Size: 1912 bytes --]

The 16G page locations have been saved during early boot in an array.
The alloc_bootmem_huge_page() function adds a page from here to the
huge_boot_pages list.

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 arch/powerpc/mm/hugetlbpage.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:56:55.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:03.000000000 +1000
@@ -29,6 +29,12 @@
 
 #define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
 #define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
+#define MAX_NUMBER_GPAGES	1024
+
+/* Tracks the 16G pages after the device tree is scanned and before the
+ * huge_boot_pages list is ready.  */
+static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
+static unsigned nr_gpages;
 
 unsigned int hugepte_shift;
 #define PTRS_PER_HUGEPTE	(1 << hugepte_shift)
@@ -104,6 +110,21 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, 
 }
 #endif
 
+/* Moves the gigantic page addresses from the temporary list to the
+ * huge_boot_pages list.  */
+int alloc_bootmem_huge_page(struct hstate *h)
+{
+	struct huge_bootmem_page *m;
+	if (nr_gpages == 0)
+		return 0;
+	m = phys_to_virt(gpage_freearray[--nr_gpages]);
+	gpage_freearray[nr_gpages] = 0;
+	list_add(&m->list, &huge_boot_pages);
+	m->hstate = h;
+	return 1;
+}
+
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {

-- 


* [patch 18/21] powerpc: scan device tree for gigantic pages
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (16 preceding siblings ...)
  2008-06-03 10:00 ` [patch 17/21] powerpc: function to allocate gigantic hugepages npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 19/21] powerpc: define support for 16G hugepages npiggin
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, Jon Tollefson, kniht,
	andi, abh, joachim.deguara

[-- Attachment #1: powerpc-scan-device-tree-and-save-gigantic-page-locations.patch --]
[-- Type: text/plain, Size: 4735 bytes --]

The 16G huge pages have to be reserved in the HMC prior to boot. The
locations of the pages are recorded in the device tree. This patch adds
code to scan the device tree during very early boot and save these page
locations until hugetlbfs is ready for them.
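
To put illustrative numbers on it: a memory node whose ibm,expected#pages
property is 2 describes 1 << 2 = 4 such pages, so with the mandatory 16G
block size the scan reserves 64G via lmb_reserve() and hands four page
addresses to add_gpage().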

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 arch/powerpc/mm/hash_utils_64.c  |   44 ++++++++++++++++++++++++++++++++++++++-
 arch/powerpc/mm/hugetlbpage.c    |   16 ++++++++++++++
 include/asm-powerpc/mmu-hash64.h |    2 +
 3 files changed, 61 insertions(+), 1 deletion(-)



Index: linux-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2008-06-03 19:52:46.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hash_utils_64.c	2008-06-03 19:57:04.000000000 +1000
@@ -68,6 +68,7 @@
 
 #define KB (1024)
 #define MB (1024*KB)
+#define GB (1024L*MB)
 
 /*
  * Note:  pte   --> Linux PTE
@@ -329,6 +330,44 @@ static int __init htab_dt_scan_page_size
 	return 0;
 }
 
+/* Scan for 16G memory blocks that have been set aside for huge pages
+ * and reserve those blocks for 16G huge pages.
+ */
+static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
+					const char *uname, int depth,
+					void *data) {
+	char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+	unsigned long *addr_prop;
+	u32 *page_count_prop;
+	unsigned int expected_pages;
+	long unsigned int phys_addr;
+	long unsigned int block_size;
+
+	/* We are scanning "memory" nodes only */
+	if (type == NULL || strcmp(type, "memory") != 0)
+		return 0;
+
+	/* This property is the log base 2 of the number of virtual pages that
+	 * will represent this memory block. */
+	page_count_prop = of_get_flat_dt_prop(node, "ibm,expected#pages", NULL);
+	if (page_count_prop == NULL)
+		return 0;
+	expected_pages = (1 << page_count_prop[0]);
+	addr_prop = of_get_flat_dt_prop(node, "reg", NULL);
+	if (addr_prop == NULL)
+		return 0;
+	phys_addr = addr_prop[0];
+	block_size = addr_prop[1];
+	if (block_size != (16 * GB))
+		return 0;
+	printk(KERN_INFO "Huge page(16GB) memory: "
+			"addr = 0x%lX size = 0x%lX pages = %d\n",
+			phys_addr, block_size, expected_pages);
+	lmb_reserve(phys_addr, block_size * expected_pages);
+	add_gpage(phys_addr, block_size, expected_pages);
+	return 0;
+}
+
 static void __init htab_init_page_sizes(void)
 {
 	int rc;
@@ -418,7 +457,10 @@ static void __init htab_init_page_sizes(
 	       );
 
 #ifdef CONFIG_HUGETLB_PAGE
-	/* Init large page size. Currently, we pick 16M or 1M depending
+	/* Reserve 16G huge page memory sections for huge pages */
+	of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+
+/* Init large page size. Currently, we pick 16M or 1M depending
 	 * on what is available
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:03.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:04.000000000 +1000
@@ -110,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, 
 }
 #endif
 
+/* Build list of addresses of gigantic pages.  This function is used in early
+ * boot before the buddy or bootmem allocator is setup.
+ */
+void add_gpage(unsigned long addr, unsigned long page_size,
+	unsigned long number_of_pages)
+{
+	if (!addr)
+		return;
+	while (number_of_pages > 0) {
+		gpage_freearray[nr_gpages] = addr;
+		nr_gpages++;
+		number_of_pages--;
+		addr += page_size;
+	}
+}
+
 /* Moves the gigantic page addresses from the temporary list to the
  * huge_boot_pages list.  */
 int alloc_bootmem_huge_page(struct hstate *h)
Index: linux-2.6/include/asm-powerpc/mmu-hash64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/mmu-hash64.h	2008-06-03 19:52:46.000000000 +1000
+++ linux-2.6/include/asm-powerpc/mmu-hash64.h	2008-06-03 19:57:04.000000000 +1000
@@ -281,6 +281,8 @@ extern int htab_bolt_mapping(unsigned lo
 			     unsigned long pstart, unsigned long mode,
 			     int psize, int ssize);
 extern void set_huge_psize(int psize);
+extern void add_gpage(unsigned long addr, unsigned long page_size,
+			  unsigned long number_of_pages);
 extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);
 
 extern void htab_initialize(void);

-- 


* [patch 19/21] powerpc: define support for 16G hugepages
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (17 preceding siblings ...)
  2008-06-03 10:00 ` [patch 18/21] powerpc: scan device tree for gigantic pages npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 20/21] fs: check for statfs overflow npiggin
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Jon Tollefson, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: powerpc-define-page-support-for-16g-hugepages.patch --]
[-- Type: text/plain, Size: 5636 bytes --]

The huge page size is defined for 16G pages. If hugepagesz=16G is
specified at boot time, it becomes the huge page size instead of the
default 16M.

The change in pgtable-64K.h is to the pte_iterate_hashed_subpages
macro: the increment applied to va (the 1 being shifted) is made a long
so that it is not shifted down to 0. Otherwise the loop would never
terminate when the shift corresponds to a 16G page (with a 64K base
page size).
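
An illustrative userspace snippet of the width problem (not kernel code;
34 is simply the shift for a 16G page):

#include <stdio.h>

int main(void)
{
	unsigned int shift = 34;	/* 16G page */

	/*
	 * "1 << shift" shifts an int past its width, which is undefined and
	 * on the affected build came out as 0, so va never advanced.  Making
	 * the constant a long (assuming a 64-bit long, as on ppc64) keeps
	 * the intended 16G step.
	 */
	unsigned long step = 1UL << shift;

	printf("per-iteration increment: %#lx\n", step);
	return 0;
}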

Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 arch/powerpc/mm/hugetlbpage.c     |   62 ++++++++++++++++++++++++++------------
 include/asm-powerpc/pgtable-64k.h |    2 -
 2 files changed, 45 insertions(+), 19 deletions(-)

Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:04.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:06.000000000 +1000
@@ -24,8 +24,9 @@
 #include <asm/cputable.h>
 #include <asm/spu.h>
 
-#define HPAGE_SHIFT_64K	16
-#define HPAGE_SHIFT_16M	24
+#define PAGE_SHIFT_64K	16
+#define PAGE_SHIFT_16M	24
+#define PAGE_SHIFT_16G	34
 
 #define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
 #define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
@@ -95,7 +96,7 @@ static int __hugepte_alloc(struct mm_str
 static inline
 pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
 {
-	if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+	if (HPAGE_SHIFT == PAGE_SHIFT_64K)
 		return pmd_offset(pud, addr);
 	else
 		return (pmd_t *) pud;
@@ -103,7 +104,7 @@ pmd_t *hpmd_offset(pud_t *pud, unsigned 
 static inline
 pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
 {
-	if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+	if (HPAGE_SHIFT == PAGE_SHIFT_64K)
 		return pmd_alloc(mm, pud, addr);
 	else
 		return (pmd_t *) pud;
@@ -260,7 +261,7 @@ static void hugetlb_free_pud_range(struc
 			continue;
 		hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
 #else
-		if (HPAGE_SHIFT == HPAGE_SHIFT_64K) {
+		if (HPAGE_SHIFT == PAGE_SHIFT_64K) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -591,20 +592,40 @@ void set_huge_psize(int psize)
 {
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable limits. */
-	if (mmu_psize_defs[psize].shift && mmu_psize_defs[psize].shift < SID_SHIFT &&
+	if (mmu_psize_defs[psize].shift &&
+		mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
 		(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-			mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K)) {
+		 mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
+		 mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
+		/* Return if huge page size is the same as the
+		 * base page size. */
+		if (mmu_psize_defs[psize].shift == PAGE_SHIFT)
+			return;
+
 		HPAGE_SHIFT = mmu_psize_defs[psize].shift;
 		mmu_huge_psize = psize;
-#ifdef CONFIG_PPC_64K_PAGES
-		hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
-#else
-		if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
-			hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
-		else
-			hugepte_shift = (PUD_SHIFT-HPAGE_SHIFT);
-#endif
 
+		switch (HPAGE_SHIFT) {
+		case PAGE_SHIFT_64K:
+		    /* We only allow 64k hpages with 4k base page,
+		     * which was checked above, and always put them
+		     * at the PMD */
+		    hugepte_shift = PMD_SHIFT;
+		    break;
+		case PAGE_SHIFT_16M:
+		    /* 16M pages can be at two different levels
+		     * of pagestables based on base page size */
+		    if (PAGE_SHIFT == PAGE_SHIFT_64K)
+			    hugepte_shift = PMD_SHIFT;
+		    else /* 4k base page */
+			    hugepte_shift = PUD_SHIFT;
+		    break;
+		case PAGE_SHIFT_16G:
+		    /* 16G pages are always at PGD level */
+		    hugepte_shift = PGDIR_SHIFT;
+		    break;
+		}
+		hugepte_shift -= HPAGE_SHIFT;
 	} else
 		HPAGE_SHIFT = 0;
 }
@@ -620,17 +641,22 @@ static int __init hugepage_setup_sz(char
 	shift = __ffs(size);
 	switch (shift) {
 #ifndef CONFIG_PPC_64K_PAGES
-	case HPAGE_SHIFT_64K:
+	case PAGE_SHIFT_64K:
 		mmu_psize = MMU_PAGE_64K;
 		break;
 #endif
-	case HPAGE_SHIFT_16M:
+	case PAGE_SHIFT_16M:
 		mmu_psize = MMU_PAGE_16M;
 		break;
+	case PAGE_SHIFT_16G:
+		mmu_psize = MMU_PAGE_16G;
+		break;
 	}
 
-	if (mmu_psize >=0 && mmu_psize_defs[mmu_psize].shift)
+	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift) {
 		set_huge_psize(mmu_psize);
+		hugetlb_add_hstate(shift - PAGE_SHIFT);
+	}
 	else
 		printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
 
Index: linux-2.6/include/asm-powerpc/pgtable-64k.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-64k.h	2008-06-03 19:52:45.000000000 +1000
+++ linux-2.6/include/asm-powerpc/pgtable-64k.h	2008-06-03 19:57:06.000000000 +1000
@@ -125,7 +125,7 @@ static inline struct subpage_prot_table 
                 unsigned __split = (psize == MMU_PAGE_4K ||                 \
 				    psize == MMU_PAGE_64K_AP);              \
                 shift = mmu_psize_defs[psize].shift;                        \
-	        for (index = 0; va < __end; index++, va += (1 << shift)) {  \
+		for (index = 0; va < __end; index++, va += (1L << shift)) { \
 		        if (!__split || __rpte_sub_valid(rpte, index)) do { \
 
 #define pte_iterate_hashed_end() } while(0); } } while(0)

-- 


* [patch 20/21] fs: check for statfs overflow
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (18 preceding siblings ...)
  2008-06-03 10:00 ` [patch 19/21] powerpc: define support for 16G hugepages npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:00 ` [patch 21/21] powerpc: support multiple hugepage sizes npiggin
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Jon Tollefson, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: fs-check-for-statfs-overflow.patch --]
[-- Type: text/plain, Size: 2327 bytes --]

Add a check for overflow of the filesystem block size, so that statfs()
on a hugetlbfs filesystem with 16G pages, called from a 32-bit binary,
reports EOVERFLOW instead of a size of 0.
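
To make the overflow concrete: with 16G huge pages f_bsize is 1 << 34 =
17179869184, which sets bits above the low 32, so a 32-bit statfs buffer
cannot hold it; the extended check catches this (and an oversized f_frsize)
and returns EOVERFLOW instead of silently truncating to 0.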

Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 fs/compat.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


Index: linux-2.6/fs/compat.c
===================================================================
--- linux-2.6.orig/fs/compat.c	2008-06-03 19:52:45.000000000 +1000
+++ linux-2.6/fs/compat.c	2008-06-03 19:57:08.000000000 +1000
@@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
 {
 	
 	if (sizeof ubuf->f_blocks == 4) {
-		if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
-		    0xffffffff00000000ULL)
+		if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
+		     kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
 			return -EOVERFLOW;
 		/* f_files and f_ffree may be -1; it's okay
 		 * to stuff that into 32 bits */
@@ -271,8 +271,8 @@ out:
 static int put_compat_statfs64(struct compat_statfs64 __user *ubuf, struct kstatfs *kbuf)
 {
 	if (sizeof ubuf->f_blocks == 4) {
-		if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
-		    0xffffffff00000000ULL)
+		if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
+		     kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
 			return -EOVERFLOW;
 		/* f_files and f_ffree may be -1; it's okay
 		 * to stuff that into 32 bits */
Index: linux-2.6/fs/open.c
===================================================================
--- linux-2.6.orig/fs/open.c	2008-06-03 19:52:45.000000000 +1000
+++ linux-2.6/fs/open.c	2008-06-03 19:57:08.000000000 +1000
@@ -63,7 +63,8 @@ static int vfs_statfs_native(struct dent
 		memcpy(buf, &st, sizeof(st));
 	else {
 		if (sizeof buf->f_blocks == 4) {
-			if ((st.f_blocks | st.f_bfree | st.f_bavail) &
+			if ((st.f_blocks | st.f_bfree | st.f_bavail |
+			     st.f_bsize | st.f_frsize) &
 			    0xffffffff00000000ULL)
 				return -EOVERFLOW;
 			/*

-- 


* [patch 21/21] powerpc: support multiple hugepage sizes
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (19 preceding siblings ...)
  2008-06-03 10:00 ` [patch 20/21] fs: check for statfs overflow npiggin
@ 2008-06-03 10:00 ` npiggin
  2008-06-03 10:29 ` [patch 1/1] x86: get_user_pages_lockless support 1GB hugepages Nick Piggin
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-03 10:00 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Jon Tollefson, kniht, andi, abh,
	joachim.deguara

[-- Attachment #1: powerpc-support-multiple-hugepage-sizes.patch --]
[-- Type: text/plain, Size: 26545 bytes --]

Instead of using the variable mmu_huge_psize to keep track of the huge
page size, we use an array indexed by MMU_PAGE_* values. For each
supported huge page size we need to know the hugepte_shift value and
have a pgtable_cache. The hstate or an mmu_huge_psizes index is passed
to functions so that they know which huge page size they should use.

The huge page sizes 16M and 64K are set up (if available on the
hardware) so that they don't have to be specified on the boot command
line in order to be used. The number of 16G pages has to be specified
at boot time, though (e.g. hugepagesz=16G hugepages=5).
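
A minimal sketch of the data-structure shape this moves to (illustrative
userspace code with an abridged set of size indices and made-up shift
values, not the kernel implementation): validity and hugepte_shift live in
a single array indexed by the MMU_PAGE_* page-size number.

#include <stdio.h>

enum { MMU_PAGE_4K, MMU_PAGE_64K, MMU_PAGE_16M, MMU_PAGE_16G, MMU_PAGE_COUNT };

/* A non-zero entry means the size is enabled; the value is its hugepte_shift. */
static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];

int main(void)
{
	mmu_huge_psizes[MMU_PAGE_16M] = 12;	/* made-up values */
	mmu_huge_psizes[MMU_PAGE_16G] = 2;

	for (int psize = 0; psize < MMU_PAGE_COUNT; psize++)
		if (mmu_huge_psizes[psize])
			printf("psize %d enabled, hugepte_shift %u\n",
			       psize, mmu_huge_psizes[psize]);
	return 0;
}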

Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 arch/powerpc/mm/hash_utils_64.c  |    9 -
 arch/powerpc/mm/hugetlbpage.c    |  272 +++++++++++++++++++++++++--------------
 arch/powerpc/mm/init_64.c        |    8 -
 arch/powerpc/mm/tlb_64.c         |    2 
 include/asm-powerpc/hugetlb.h    |    5 
 include/asm-powerpc/mmu-hash64.h |    4 
 include/asm-powerpc/page_64.h    |    1 
 include/asm-powerpc/pgalloc-64.h |    4 
 8 files changed, 192 insertions(+), 113 deletions(-)


Index: linux-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2008-06-03 19:57:04.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hash_utils_64.c	2008-06-03 19:57:10.000000000 +1000
@@ -103,7 +103,6 @@ int mmu_kernel_ssize = MMU_SEGSIZE_256M;
 int mmu_highuser_ssize = MMU_SEGSIZE_256M;
 u16 mmu_slb_size = 64;
 #ifdef CONFIG_HUGETLB_PAGE
-int mmu_huge_psize = MMU_PAGE_16M;
 unsigned int HPAGE_SHIFT;
 #endif
 #ifdef CONFIG_PPC_64K_PAGES
@@ -460,15 +459,15 @@ static void __init htab_init_page_sizes(
 	/* Reserve 16G huge page memory sections for huge pages */
 	of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
 
-/* Init large page size. Currently, we pick 16M or 1M depending
+/* Set default large page size. Currently, we pick 16M or 1M depending
 	 * on what is available
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
-		set_huge_psize(MMU_PAGE_16M);
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
 	/* With 4k/4level pagetables, we can't (for now) cope with a
 	 * huge page size < PMD_SIZE */
 	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-		set_huge_psize(MMU_PAGE_1M);
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 
@@ -871,7 +870,7 @@ int hash_page(unsigned long ea, unsigned
 
 #ifdef CONFIG_HUGETLB_PAGE
 	/* Handle hugepage regions */
-	if (HPAGE_SHIFT && psize == mmu_huge_psize) {
+	if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
 		DBG_LOW(" -> huge page !\n");
 		return hash_huge_page(mm, access, ea, vsid, local, trap);
 	}
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:06.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 19:57:27.000000000 +1000
@@ -37,15 +37,30 @@
 static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
 static unsigned nr_gpages;
 
-unsigned int hugepte_shift;
-#define PTRS_PER_HUGEPTE	(1 << hugepte_shift)
-#define HUGEPTE_TABLE_SIZE	(sizeof(pte_t) << hugepte_shift)
-
-#define HUGEPD_SHIFT		(HPAGE_SHIFT + hugepte_shift)
-#define HUGEPD_SIZE		(1UL << HUGEPD_SHIFT)
-#define HUGEPD_MASK		(~(HUGEPD_SIZE-1))
+/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
+ * stored for the huge page sizes that are valid.
+ */
+unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
-#define huge_pgtable_cache	(pgtable_cache[HUGEPTE_CACHE_NUM])
+#define hugepte_shift			mmu_huge_psizes
+#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
+#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+
+#define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
+						+ hugepte_shift[psize])
+#define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
+#define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
+
+/* Subtract one from array size because we don't need a cache for 4K since
+ * is not a huge page size */
+#define huge_pgtable_cache(psize)	(pgtable_cache[HUGEPTE_CACHE_NUM \
+							+ psize-1])
+#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
+
+static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
+	"unused_4K", "hugepte_cache_64K", "unused_64K_AP",
+	"hugepte_cache_1M", "hugepte_cache_16M", "hugepte_cache_16G"
+};
 
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
@@ -56,24 +71,49 @@ typedef struct { unsigned long pd; } hug
 
 #define hugepd_none(hpd)	((hpd).pd == 0)
 
+static inline int shift_to_mmu_psize(unsigned int shift)
+{
+	switch (shift) {
+#ifndef CONFIG_PPC_64K_PAGES
+	case PAGE_SHIFT_64K:
+	    return MMU_PAGE_64K;
+#endif
+	case PAGE_SHIFT_16M:
+	    return MMU_PAGE_16M;
+	case PAGE_SHIFT_16G:
+	    return MMU_PAGE_16G;
+	}
+	return -1;
+}
+
+static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
+{
+	if (mmu_psize_defs[mmu_psize].shift)
+		return mmu_psize_defs[mmu_psize].shift;
+	BUG();
+}
+
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
 	BUG_ON(!(hpd.pd & HUGEPD_OK));
 	return (pte_t *)(hpd.pd & ~HUGEPD_OK);
 }
 
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr)
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
+				    struct hstate *hstate)
 {
-	unsigned long idx = ((addr >> HPAGE_SHIFT) & (PTRS_PER_HUGEPTE-1));
+	unsigned int shift = huge_page_shift(hstate);
+	int psize = shift_to_mmu_psize(shift);
+	unsigned long idx = ((addr >> shift) & (PTRS_PER_HUGEPTE(psize)-1));
 	pte_t *dir = hugepd_page(*hpdp);
 
 	return dir + idx;
 }
 
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address)
+			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_alloc(huge_pgtable_cache,
+	pte_t *new = kmem_cache_alloc(huge_pgtable_cache(psize),
 				      GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
@@ -81,7 +121,7 @@ static int __hugepte_alloc(struct mm_str
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(huge_pgtable_cache, new);
+		kmem_cache_free(huge_pgtable_cache(psize), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -90,21 +130,22 @@ static int __hugepte_alloc(struct mm_str
 
 /* Base page size affects how we walk hugetlb page tables */
 #ifdef CONFIG_PPC_64K_PAGES
-#define hpmd_offset(pud, addr)		pmd_offset(pud, addr)
-#define hpmd_alloc(mm, pud, addr)	pmd_alloc(mm, pud, addr)
+#define hpmd_offset(pud, addr, h)	pmd_offset(pud, addr)
+#define hpmd_alloc(mm, pud, addr, h)	pmd_alloc(mm, pud, addr)
 #else
 static inline
-pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
+pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
 {
-	if (HPAGE_SHIFT == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
 		return pmd_offset(pud, addr);
 	else
 		return (pmd_t *) pud;
 }
 static inline
-pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
+		  struct hstate *hstate)
 {
-	if (HPAGE_SHIFT == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
 		return pmd_alloc(mm, pud, addr);
 	else
 		return (pmd_t *) pud;
@@ -128,8 +169,9 @@ void add_gpage(unsigned long addr, unsig
 }
 
 /* Moves the gigantic page addresses from the temporary list to the
- * huge_boot_pages list.  */
-int alloc_bootmem_huge_page(struct hstate *h)
+ * huge_boot_pages list.
+ */
+int alloc_bootmem_huge_page(struct hstate *hstate)
 {
 	struct huge_bootmem_page *m;
 	if (nr_gpages == 0)
@@ -137,7 +179,7 @@ int alloc_bootmem_huge_page(struct hstat
 	m = phys_to_virt(gpage_freearray[--nr_gpages]);
 	gpage_freearray[nr_gpages] = 0;
 	list_add(&m->list, &huge_boot_pages);
-	m->hstate = h;
+	m->hstate = hstate;
 	return 1;
 }
 
@@ -149,17 +191,25 @@ pte_t *huge_pte_offset(struct mm_struct 
 	pud_t *pu;
 	pmd_t *pm;
 
-	BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
+	unsigned int psize;
+	unsigned int shift;
+	unsigned long sz;
+	struct hstate *hstate;
+	psize = get_slice_psize(mm, addr);
+	shift = mmu_psize_to_shift(psize);
+	sz = ((1UL) << shift);
+	hstate = size_to_hstate(sz);
 
-	addr &= HPAGE_MASK;
+	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
 	if (!pgd_none(*pg)) {
 		pu = pud_offset(pg, addr);
 		if (!pud_none(*pu)) {
-			pm = hpmd_offset(pu, addr);
+			pm = hpmd_offset(pu, addr, hstate);
 			if (!pmd_none(*pm))
-				return hugepte_offset((hugepd_t *)pm, addr);
+				return hugepte_offset((hugepd_t *)pm, addr,
+						      hstate);
 		}
 	}
 
@@ -173,16 +223,20 @@ pte_t *huge_pte_alloc(struct mm_struct *
 	pud_t *pu;
 	pmd_t *pm;
 	hugepd_t *hpdp = NULL;
+	struct hstate *hstate;
+	unsigned int psize;
+	hstate = size_to_hstate(sz);
 
-	BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
+	psize = get_slice_psize(mm, addr);
+	BUG_ON(!mmu_huge_psizes[psize]);
 
-	addr &= HPAGE_MASK;
+	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
 	pu = pud_alloc(mm, pg, addr);
 
 	if (pu) {
-		pm = hpmd_alloc(mm, pu, addr);
+		pm = hpmd_alloc(mm, pu, addr, hstate);
 		if (pm)
 			hpdp = (hugepd_t *)pm;
 	}
@@ -190,10 +244,10 @@ pte_t *huge_pte_alloc(struct mm_struct *
 	if (! hpdp)
 		return NULL;
 
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr))
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
 		return NULL;
 
-	return hugepte_offset(hpdp, addr);
+	return hugepte_offset(hpdp, addr, hstate);
 }
 
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
@@ -201,19 +255,22 @@ int huge_pmd_unshare(struct mm_struct *m
 	return 0;
 }
 
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp)
+static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
+			       unsigned int psize)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte, HUGEPTE_CACHE_NUM,
+	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
+						 HUGEPTE_CACHE_NUM+psize-1,
 						 PGF_CACHENUM_MASK));
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling)
+				   unsigned long floor, unsigned long ceiling,
+				   unsigned int psize)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -225,7 +282,7 @@ static void hugetlb_free_pmd_range(struc
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd))
 			continue;
-		free_hugepte_range(tlb, (hugepd_t *)pmd);
+		free_hugepte_range(tlb, (hugepd_t *)pmd, psize);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
@@ -251,6 +308,9 @@ static void hugetlb_free_pud_range(struc
 	pud_t *pud;
 	unsigned long next;
 	unsigned long start;
+	unsigned int shift;
+	unsigned int psize = get_slice_psize(tlb->mm, addr);
+	shift = mmu_psize_to_shift(psize);
 
 	start = addr;
 	pud = pud_offset(pgd, addr);
@@ -259,16 +319,18 @@ static void hugetlb_free_pud_range(struc
 #ifdef CONFIG_PPC_64K_PAGES
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+		hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling,
+				       psize);
 #else
-		if (HPAGE_SHIFT == PAGE_SHIFT_64K) {
+		if (shift == PAGE_SHIFT_64K) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
-			hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
+					       ceiling, psize);
 		} else {
 			if (pud_none(*pud))
 				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pud);
+			free_hugepte_range(tlb, (hugepd_t *)pud, psize);
 		}
 #endif
 	} while (pud++, addr = next, addr != end);
@@ -336,27 +398,29 @@ void hugetlb_free_pgd_range(struct mmu_g
 	 * now has no other vmas using it, so can be freed, we don't
 	 * bother to round floor or end up - the tests don't need that.
 	 */
+	unsigned int psize = get_slice_psize(tlb->mm, addr);
 
-	addr &= HUGEPD_MASK;
+	addr &= HUGEPD_MASK(psize);
 	if (addr < floor) {
-		addr += HUGEPD_SIZE;
+		addr += HUGEPD_SIZE(psize);
 		if (!addr)
 			return;
 	}
 	if (ceiling) {
-		ceiling &= HUGEPD_MASK;
+		ceiling &= HUGEPD_MASK(psize);
 		if (!ceiling)
 			return;
 	}
 	if (end - 1 > ceiling - 1)
-		end -= HUGEPD_SIZE;
+		end -= HUGEPD_SIZE(psize);
 	if (addr > end - 1)
 		return;
 
 	start = addr;
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		BUG_ON(get_slice_psize(tlb->mm, addr) != mmu_huge_psize);
+		psize = get_slice_psize(tlb->mm, addr);
+		BUG_ON(!mmu_huge_psizes[psize]);
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
@@ -373,7 +437,11 @@ void set_huge_pte_at(struct mm_struct *m
 		 * necessary anymore if we make hpte_need_flush() get the
 		 * page size from the slices
 		 */
-		pte_update(mm, addr & HPAGE_MASK, ptep, ~0UL, 1);
+		unsigned int psize = get_slice_psize(mm, addr);
+		unsigned int shift = mmu_psize_to_shift(psize);
+		unsigned long sz = ((1UL) << shift);
+		struct hstate *hstate = size_to_hstate(sz);
+		pte_update(mm, addr & hstate->mask, ptep, ~0UL, 1);
 	}
 	*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
 }
@@ -390,14 +458,19 @@ follow_huge_addr(struct mm_struct *mm, u
 {
 	pte_t *ptep;
 	struct page *page;
+	unsigned int mmu_psize = get_slice_psize(mm, address);
 
-	if (get_slice_psize(mm, address) != mmu_huge_psize)
+	/* Verify it is a huge page else bail. */
+	if (!mmu_huge_psizes[mmu_psize])
 		return ERR_PTR(-EINVAL);
 
 	ptep = huge_pte_offset(mm, address);
 	page = pte_page(*ptep);
-	if (page)
-		page += (address % HPAGE_SIZE) / PAGE_SIZE;
+	if (page) {
+		unsigned int shift = mmu_psize_to_shift(mmu_psize);
+		unsigned long sz = ((1UL) << shift);
+		page += (address % sz) / PAGE_SIZE;
+	}
 
 	return page;
 }
@@ -425,15 +498,16 @@ unsigned long hugetlb_get_unmapped_area(
 					unsigned long len, unsigned long pgoff,
 					unsigned long flags)
 {
-	return slice_get_unmapped_area(addr, len, flags,
-				       mmu_huge_psize, 1, 0);
+	struct hstate *hstate = hstate_file(file);
+	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
+	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
 }
 
 /*
  * Called by asm hashtable.S for doing lazy icache flush
  */
 static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
-						  pte_t pte, int trap)
+					pte_t pte, int trap, unsigned long sz)
 {
 	struct page *page;
 	int i;
@@ -446,7 +520,7 @@ static unsigned int hash_huge_page_do_la
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-			for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++)
+			for (i = 0; i < (sz / PAGE_SIZE); i++)
 				__flush_dcache_icache(page_address(page+i));
 			set_bit(PG_arch_1, &page->flags);
 		} else {
@@ -462,11 +536,16 @@ int hash_huge_page(struct mm_struct *mm,
 {
 	pte_t *ptep;
 	unsigned long old_pte, new_pte;
-	unsigned long va, rflags, pa;
+	unsigned long va, rflags, pa, sz;
 	long slot;
 	int err = 1;
 	int ssize = user_segment_size(ea);
+	unsigned int mmu_psize;
+	int shift;
+	mmu_psize = get_slice_psize(mm, ea);
 
+	if (!mmu_huge_psizes[mmu_psize])
+		goto out;
 	ptep = huge_pte_offset(mm, ea);
 
 	/* Search the Linux page table for a match with va */
@@ -510,30 +589,32 @@ int hash_huge_page(struct mm_struct *mm,
 	rflags = 0x2 | (!(new_pte & _PAGE_RW));
  	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
 	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
+	shift = mmu_psize_to_shift(mmu_psize);
+	sz = ((1UL) << shift);
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
 		/* No CPU has hugepages but lacks no execute, so we
 		 * don't need to worry about that case */
 		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
-						       trap);
+						       trap, sz);
 
 	/* Check if pte already has an hpte (case 2) */
 	if (unlikely(old_pte & _PAGE_HASHPTE)) {
 		/* There MIGHT be an HPTE for this pte */
 		unsigned long hash, slot;
 
-		hash = hpt_hash(va, HPAGE_SHIFT, ssize);
+		hash = hpt_hash(va, shift, ssize);
 		if (old_pte & _PAGE_F_SECOND)
 			hash = ~hash;
 		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
 		slot += (old_pte & _PAGE_F_GIX) >> 12;
 
-		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_huge_psize,
+		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
 					 ssize, local) == -1)
 			old_pte &= ~_PAGE_HPTEFLAGS;
 	}
 
 	if (likely(!(old_pte & _PAGE_HASHPTE))) {
-		unsigned long hash = hpt_hash(va, HPAGE_SHIFT, ssize);
+		unsigned long hash = hpt_hash(va, shift, ssize);
 		unsigned long hpte_group;
 
 		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
@@ -552,7 +633,7 @@ repeat:
 
 		/* Insert into the hash table, primary slot */
 		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
-					  mmu_huge_psize, ssize);
+					  mmu_psize, ssize);
 
 		/* Primary is full, try the secondary */
 		if (unlikely(slot == -1)) {
@@ -560,7 +641,7 @@ repeat:
 				      HPTES_PER_GROUP) & ~0x7UL; 
 			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
 						  HPTE_V_SECONDARY,
-						  mmu_huge_psize, ssize);
+						  mmu_psize, ssize);
 			if (slot == -1) {
 				if (mftb() & 0x1)
 					hpte_group = ((hash & htab_hash_mask) *
@@ -597,66 +678,50 @@ void set_huge_psize(int psize)
 		(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
 		 mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
 		 mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
-		/* Return if huge page size is the same as the
-		 * base page size. */
-		if (mmu_psize_defs[psize].shift == PAGE_SHIFT)
+		/* Return if huge page size has already been setup or is the
+		 * same as the base page size. */
+		if (mmu_huge_psizes[psize] ||
+		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
+		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
-		HPAGE_SHIFT = mmu_psize_defs[psize].shift;
-		mmu_huge_psize = psize;
-
-		switch (HPAGE_SHIFT) {
+		switch (mmu_psize_defs[psize].shift) {
 		case PAGE_SHIFT_64K:
 		    /* We only allow 64k hpages with 4k base page,
 		     * which was checked above, and always put them
 		     * at the PMD */
-		    hugepte_shift = PMD_SHIFT;
+		    hugepte_shift[psize] = PMD_SHIFT;
 		    break;
 		case PAGE_SHIFT_16M:
 		    /* 16M pages can be at two different levels
 		     * of pagestables based on base page size */
 		    if (PAGE_SHIFT == PAGE_SHIFT_64K)
-			    hugepte_shift = PMD_SHIFT;
+			    hugepte_shift[psize] = PMD_SHIFT;
 		    else /* 4k base page */
-			    hugepte_shift = PUD_SHIFT;
+			    hugepte_shift[psize] = PUD_SHIFT;
 		    break;
 		case PAGE_SHIFT_16G:
 		    /* 16G pages are always at PGD level */
-		    hugepte_shift = PGDIR_SHIFT;
+		    hugepte_shift[psize] = PGDIR_SHIFT;
 		    break;
 		}
-		hugepte_shift -= HPAGE_SHIFT;
+		hugepte_shift[psize] -= mmu_psize_defs[psize].shift;
 	} else
-		HPAGE_SHIFT = 0;
+		hugepte_shift[psize] = 0;
 }
 
 static int __init hugepage_setup_sz(char *str)
 {
 	unsigned long long size;
-	int mmu_psize = -1;
+	int mmu_psize;
 	int shift;
 
 	size = memparse(str, &str);
 
 	shift = __ffs(size);
-	switch (shift) {
-#ifndef CONFIG_PPC_64K_PAGES
-	case PAGE_SHIFT_64K:
-		mmu_psize = MMU_PAGE_64K;
-		break;
-#endif
-	case PAGE_SHIFT_16M:
-		mmu_psize = MMU_PAGE_16M;
-		break;
-	case PAGE_SHIFT_16G:
-		mmu_psize = MMU_PAGE_16G;
-		break;
-	}
-
-	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift) {
+	mmu_psize = shift_to_mmu_psize(shift);
+	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)
 		set_huge_psize(mmu_psize);
-		hugetlb_add_hstate(shift - PAGE_SHIFT);
-	}
 	else
 		printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
 
@@ -671,16 +736,31 @@ static void zero_ctor(struct kmem_cache 
 
 static int __init hugetlbpage_init(void)
 {
+	unsigned int psize;
+
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
-
-	huge_pgtable_cache = kmem_cache_create("hugepte_cache",
-					       HUGEPTE_TABLE_SIZE,
-					       HUGEPTE_TABLE_SIZE,
-					       0,
-					       zero_ctor);
-	if (! huge_pgtable_cache)
-		panic("hugetlbpage_init(): could not create hugepte cache\n");
+	/* Add supported huge page sizes.  Need to change HUGE_MAX_HSTATE
+	 * and adjust PTE_NONCACHE_NUM if the number of supported huge page
+	 * sizes changes.
+	 */
+	set_huge_psize(MMU_PAGE_16M);
+	set_huge_psize(MMU_PAGE_64K);
+	set_huge_psize(MMU_PAGE_16G);
+
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+		if (mmu_huge_psizes[psize]) {
+			huge_pgtable_cache(psize) = kmem_cache_create(
+						HUGEPTE_CACHE_NAME(psize),
+						HUGEPTE_TABLE_SIZE(psize),
+						HUGEPTE_TABLE_SIZE(psize),
+						0,
+						zero_ctor);
+			if (!huge_pgtable_cache(psize))
+				panic("hugetlbpage_init(): could not create %s"\
+				      "\n", HUGEPTE_CACHE_NAME(psize));
+		}
+	}
 
 	return 0;
 }
Index: linux-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/init_64.c	2008-06-03 19:52:44.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/init_64.c	2008-06-03 19:57:10.000000000 +1000
@@ -153,10 +153,10 @@ static const char *pgtable_cache_name[AR
 };
 
 #ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need one extra cache, initialized in hugetlbpage.c.  We
- * can't put into the tables above, because HPAGE_SHIFT is not compile
- * time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+1];
+/* Hugepages need an extra cache per hugepagesize, initialized in
+ * hugetlbpage.c.  We can't put it into the tables above, because HPAGE_SHIFT
+ * is not a compile time constant. */
+struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
 #else
 struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
 #endif
Index: linux-2.6/arch/powerpc/mm/tlb_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_64.c	2008-06-03 19:52:44.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/tlb_64.c	2008-06-03 19:57:10.000000000 +1000
@@ -147,7 +147,7 @@ void hpte_need_flush(struct mm_struct *m
 	 */
 	if (huge) {
 #ifdef CONFIG_HUGETLB_PAGE
-		psize = mmu_huge_psize;
+		psize = get_slice_psize(mm, addr);
 #else
 		BUG();
 		psize = pte_pagesize_index(mm, addr, pte); /* shutup gcc */
Index: linux-2.6/include/asm-powerpc/mmu-hash64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/mmu-hash64.h	2008-06-03 19:57:04.000000000 +1000
+++ linux-2.6/include/asm-powerpc/mmu-hash64.h	2008-06-03 19:57:10.000000000 +1000
@@ -194,9 +194,9 @@ extern int mmu_ci_restrictions;
 
 #ifdef CONFIG_HUGETLB_PAGE
 /*
- * The page size index of the huge pages for use by hugetlbfs
+ * The page size indexes of the huge pages for use by hugetlbfs
  */
-extern int mmu_huge_psize;
+extern unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
Index: linux-2.6/include/asm-powerpc/page_64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/page_64.h	2008-06-03 19:52:44.000000000 +1000
+++ linux-2.6/include/asm-powerpc/page_64.h	2008-06-03 19:57:10.000000000 +1000
@@ -90,6 +90,7 @@ extern unsigned int HPAGE_SHIFT;
 #define HPAGE_SIZE		((1UL) << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE		3
 
 #endif /* __ASSEMBLY__ */
 
Index: linux-2.6/include/asm-powerpc/pgalloc-64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgalloc-64.h	2008-06-03 19:52:44.000000000 +1000
+++ linux-2.6/include/asm-powerpc/pgalloc-64.h	2008-06-03 19:57:10.000000000 +1000
@@ -22,7 +22,7 @@ extern struct kmem_cache *pgtable_cache[
 #define PUD_CACHE_NUM		1
 #define PMD_CACHE_NUM		1
 #define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	3  /* from GFP rather than kmem_cache */
+#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
@@ -119,7 +119,7 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-#define PGF_CACHENUM_MASK	0x3
+#define PGF_CACHENUM_MASK	0x7
 
 typedef struct pgtable_free {
 	unsigned long val;
Index: linux-2.6/include/asm-powerpc/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/hugetlb.h	2008-06-03 19:53:30.000000000 +1000
+++ linux-2.6/include/asm-powerpc/hugetlb.h	2008-06-03 19:57:10.000000000 +1000
@@ -24,9 +24,10 @@ pte_t huge_ptep_get_and_clear(struct mm_
 static inline int prepare_hugepage_range(struct file *file,
 			unsigned long addr, unsigned long len)
 {
-	if (len & ~HPAGE_MASK)
+	struct hstate *h = hstate_file(file);
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (addr & ~HPAGE_MASK)
+	if (addr & ~huge_page_mask(h))
 		return -EINVAL;
 	return 0;
 }
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt	2008-06-03 19:56:59.000000000 +1000
+++ linux-2.6/Documentation/kernel-parameters.txt	2008-06-03 19:57:10.000000000 +1000
@@ -767,11 +767,11 @@ and is between 256 and 4096 characters. 
 
 	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
 	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
-			On x86 this option can be specified multiple times
-			interleaved with hugepages= to reserve huge pages
-			of different sizes. Valid pages sizes on x86-64
-			are 2M (when the CPU supports "pse") and 1G (when the
-			CPU supports the "pdpe1gb" cpuinfo flag)
+			On x86-64 and powerpc, this option can be specified
+			multiple times interleaved with hugepages= to reserve
+			huge pages of different sizes. Valid page sizes on
+			x86-64 are 2M (when the CPU supports "pse") and 1G
+			(when the CPU supports the "pdpe1gb" cpuinfo flag)
 			Note that 1GB pages can only be allocated at boot time
 			using hugepages= and not freed afterwards.
 	default_hugepagesz=
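
(Example, not part of the patch: with an x86-64 CPU that has both "pse"
and "pdpe1gb", the interleaved form described above could look like

	hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=512

on the boot command line, reserving two 1GB pages and 512 2MB pages;
per the note above, only the 2MB pool can be resized again after boot.)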

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 1/1] x86: get_user_pages_lockless support 1GB hugepages
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (20 preceding siblings ...)
  2008-06-03 10:00 ` [patch 21/21] powerpc: support multiple hugepage sizes npiggin
@ 2008-06-03 10:29 ` Nick Piggin
  2008-06-03 10:57 ` [patch 00/21] hugetlb multi size, giant hugetlb support, etc Nick Piggin
  2008-06-04  8:29 ` Andrew Morton
  23 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2008-06-03 10:29 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara,
	Dave Kleikamp, Andy Whitcroft, Ingo Molnar, Thomas Gleixner,
	Badari Pulavarty, Zach Brown, Jens Axboe

Hi,

This patch couples lockless get_user_pages with 1GB hugepage support on
x86: whichever of the two is merged upstream first, this patch has to be
merged together with the other.
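
As a rough illustration (not part of the patch; the mount point, file
names and sizes are made up, and it assumes a hugetlbfs mount backed by
1GB pages), the kind of user code that can end up in the new
gup_huge_pud() path is a direct-I/O read into a buffer backed by a 1GB
page:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define ONE_GB	(1UL << 30)

int read_into_giant_page(void)
{
	int hfd = open("/mnt/huge-1g/buf", O_CREAT | O_RDWR, 0600);
	int dfd = open("/dev/sdb", O_RDONLY | O_DIRECT);
	void *buf;

	/* one 1GB huge page backs the whole mapping */
	buf = mmap(NULL, ONE_GB, PROT_READ | PROT_WRITE, MAP_SHARED, hfd, 0);

	/* the O_DIRECT read pins the buffer's pages via get_user_pages */
	return read(dfd, buf, 1UL << 20);
}

(Error handling omitted; the only point is that the pinned buffer now
sits under a pud-sized mapping.)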

---

x86: support 1GB hugepages with get_user_pages_lockless()

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -122,7 +122,7 @@ static noinline int gup_huge_pmd(pmd_t p
 
 	refs = 0;
 	head = pte_page(pte);
-	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -160,6 +160,38 @@ static int gup_pmd_range(pud_t pud, unsi
 	return 1;
 }
 
+static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t pte = *(pte_t *)&pud;
+	struct page *head, *page;
+	int refs;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+	/* hugepages are never "special" */
+	VM_BUG_ON(pte_val(pte) & _PAGE_SPECIAL);
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
 static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long next;
@@ -172,8 +204,13 @@ static int gup_pud_range(pgd_t pgd, unsi
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
 			return 0;
-		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
-			return 0;
+		if (unlikely(pud_large(pud))) {
+			if (!gup_huge_pud(pud, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+				return 0;
+		}
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (21 preceding siblings ...)
  2008-06-03 10:29 ` [patch 1/1] x86: get_user_pages_lockless support 1GB hugepages Nick Piggin
@ 2008-06-03 10:57 ` Nick Piggin
  2008-06-06 17:12   ` Andy Whitcroft
  2008-06-04  8:29 ` Andrew Morton
  23 siblings, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2008-06-03 10:57 UTC (permalink / raw)
  To: akpm; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

On Tue, Jun 03, 2008 at 07:59:56PM +1000, npiggin@suse.de wrote:
> Hi,
> 
> Here is my submission to be merged in -mm. Given the amount of hunks this
> patchset has, and the recent flurry of hugetlb development work, I'd hope to
> get this merged up provided there aren't major issues (I would prefer to fix
> minor ones with incremental patches). It's just a lot of error prone work to
> track -mm when multiple concurrent development is happening.
> 
> Patch against latest mmotm.

Ah, missed a couple of things required to compile with hugetlbfs configured
out. Here is a patch against mmotm, I will also send out a rediffed patch
in the series.

---
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 20:44:40.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 20:45:07.000000000 +1000
@@ -76,7 +76,7 @@ static inline unsigned long hugetlb_tota
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
 #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
-#define unmap_hugepage_range(vma, start, end)	BUG()
+#define unmap_hugepage_range(vma, start, end, page)	BUG()
 #define hugetlb_report_meminfo(buf)		0
 #define hugetlb_report_node_meminfo(n, buf)	0
 #define follow_huge_pmd(mm, addr, pmd, write)	NULL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 02/21] hugetlb: modular state (take 2)
  2008-06-03  9:59 ` [patch 02/21] hugetlb: modular state npiggin
@ 2008-06-03 10:58   ` Nick Piggin
  0 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2008-06-03 10:58 UTC (permalink / raw)
  To: akpm
  Cc: Nishanth Aravamudan, linux-mm, Adam Litke, kniht, andi, abh,
	joachim.deguara

Updated version which compiles without hugetlbfs configured.
---

hugetlb: modular state

Large, but rather mechanical patch that converts most of the hugetlb.c
globals into structure members and passes them around.

Right now there is only a single global hstate structure, but 
most of the infrastructure to extend it is there.

Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
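
Illustration only, not part of the patch: the pattern of the conversion
is to look up the mapping's hstate and use its accessors in place of the
HPAGE_* constants.  A minimal sketch using the helpers added below
(walk_huge_vma() is a made-up name):

static void walk_huge_vma(struct vm_area_struct *vma)
{
	struct hstate *h = hstate_vma(vma);	/* always &default_hstate for now */
	unsigned long sz = huge_page_size(h);	/* was: HPAGE_SIZE */
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz)
		;	/* per-hugepage work goes here */
}
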
 arch/ia64/mm/hugetlbpage.c    |    6 
 arch/powerpc/mm/hugetlbpage.c |    2 
 arch/sh/mm/hugetlbpage.c      |    2 
 arch/sparc64/mm/hugetlbpage.c |    4 
 arch/x86/mm/hugetlbpage.c     |    4 
 fs/hugetlbfs/inode.c          |   49 +++---
 include/asm-ia64/hugetlb.h    |    2 
 include/asm-powerpc/hugetlb.h |    2 
 include/asm-s390/hugetlb.h    |    2 
 include/asm-sh/hugetlb.h      |    2 
 include/asm-sparc64/hugetlb.h |    2 
 include/asm-x86/hugetlb.h     |    7 
 include/linux/hugetlb.h       |   81 +++++++++-
 ipc/shm.c                     |    3 
 mm/hugetlb.c                  |  321 ++++++++++++++++++++++--------------------
 mm/memory.c                   |    2 
 mm/mempolicy.c                |    9 -
 mm/mmap.c                     |    3 
 18 files changed, 308 insertions(+), 195 deletions(-)

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-06-03 20:45:14.000000000 +1000
+++ linux-2.6/mm/hugetlb.c	2008-06-03 20:45:19.000000000 +1000
@@ -22,18 +22,12 @@
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
 unsigned long max_huge_pages;
 unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate default_hstate;
 
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -55,11 +49,11 @@ static pgoff_t vma_page_offset(struct vm
  * Convert the address within this vma to the page offset within
  * the mapping, in pagecache page units; huge pages here.
  */
-static pgoff_t vma_pagecache_offset(struct vm_area_struct *vma,
-					unsigned long address)
+static pgoff_t vma_pagecache_offset(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
 {
-	return ((address - vma->vm_start) >> HPAGE_SHIFT) +
-			(vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	return ((address - vma->vm_start) >> huge_page_shift(h)) +
+			(vma->vm_pgoff >> huge_page_order(h));
 }
 
 #define HPAGE_RESV_OWNER    (1UL << (BITS_PER_LONG - 1))
@@ -120,14 +114,15 @@ static int is_vma_resv_set(struct vm_are
 }
 
 /* Decrement the reserved pages in the hugepage pool by one */
-static void decrement_hugepage_resv_vma(struct vm_area_struct *vma)
+static void decrement_hugepage_resv_vma(struct hstate *h,
+			struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_NORESERVE)
 		return;
 
 	if (vma->vm_flags & VM_SHARED) {
 		/* Shared mappings always use reserves */
-		resv_huge_pages--;
+		h->resv_huge_pages--;
 	} else {
 		/*
 		 * Only the process that called mmap() has reserves for
@@ -135,7 +130,7 @@ static void decrement_hugepage_resv_vma(
 		 */
 		if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 			unsigned long flags, reserve;
-			resv_huge_pages--;
+			h->resv_huge_pages--;
 			flags = (unsigned long)vma->vm_private_data &
 							HPAGE_RESV_MASK;
 			reserve = (unsigned long)vma->vm_private_data - 1;
@@ -162,12 +157,13 @@ static int vma_has_private_reserves(stru
 	return 1;
 }
 
-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page,
+			unsigned long addr, unsigned long sz)
 {
 	int i;
 
 	might_sleep();
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+	for (i = 0; i < sz/PAGE_SIZE; i++) {
 		cond_resched();
 		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
 	}
@@ -177,41 +173,43 @@ static void copy_huge_page(struct page *
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
+	struct hstate *h = hstate_vma(vma);
 
 	might_sleep();
-	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+	for (i = 0; i < pages_per_huge_page(h); i++) {
 		cond_resched();
 		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
 	}
 }
 
-static void enqueue_huge_page(struct page *page)
+static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_add(&page->lru, &hugepage_freelists[nid]);
-	free_huge_pages++;
-	free_huge_pages_node[nid]++;
+	list_add(&page->lru, &h->hugepage_freelists[nid]);
+	h->free_huge_pages++;
+	h->free_huge_pages_node[nid]++;
 }
 
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct hstate *h)
 {
 	int nid;
 	struct page *page = NULL;
 
 	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
-		if (!list_empty(&hugepage_freelists[nid])) {
-			page = list_entry(hugepage_freelists[nid].next,
+		if (!list_empty(&h->hugepage_freelists[nid])) {
+			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
-			free_huge_pages--;
-			free_huge_pages_node[nid]--;
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
 			break;
 		}
 	}
 	return page;
 }
 
-static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma,
+static struct page *dequeue_huge_page_vma(struct hstate *h,
+				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
 	int nid;
@@ -229,26 +227,26 @@ static struct page *dequeue_huge_page_vm
 	 * not "stolen". The child may still get SIGKILLed
 	 */
 	if (!vma_has_private_reserves(vma) &&
-			free_huge_pages - resv_huge_pages == 0)
+			h->free_huge_pages - h->resv_huge_pages == 0)
 		return NULL;
 
 	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && free_huge_pages - resv_huge_pages == 0)
+	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		return NULL;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
 		nid = zone_to_nid(zone);
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&hugepage_freelists[nid])) {
-			page = list_entry(hugepage_freelists[nid].next,
+		    !list_empty(&h->hugepage_freelists[nid])) {
+			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
-			free_huge_pages--;
-			free_huge_pages_node[nid]--;
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
 
 			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(vma);
+				decrement_hugepage_resv_vma(h, vma);
 
 			break;
 		}
@@ -257,12 +255,13 @@ static struct page *dequeue_huge_page_vm
 	return page;
 }
 
-static void update_and_free_page(struct page *page)
+static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
-	nr_huge_pages--;
-	nr_huge_pages_node[page_to_nid(page)]--;
-	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+
+	h->nr_huge_pages--;
+	h->nr_huge_pages_node[page_to_nid(page)]--;
+	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
 				1 << PG_private | 1<< PG_writeback);
@@ -270,11 +269,16 @@ static void update_and_free_page(struct 
 	set_compound_page_dtor(page, NULL);
 	set_page_refcounted(page);
 	arch_release_hugepage(page);
-	__free_pages(page, HUGETLB_PAGE_ORDER);
+	__free_pages(page, huge_page_order(h));
 }
 
 static void free_huge_page(struct page *page)
 {
+	/*
+	 * Can't pass hstate in here because it is called from the
+	 * compound page destructor.
+	 */
+	struct hstate *h = &default_hstate;
 	int nid = page_to_nid(page);
 	struct address_space *mapping;
 
@@ -284,12 +288,12 @@ static void free_huge_page(struct page *
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
-	if (surplus_huge_pages_node[nid]) {
-		update_and_free_page(page);
-		surplus_huge_pages--;
-		surplus_huge_pages_node[nid]--;
+	if (h->surplus_huge_pages_node[nid]) {
+		update_and_free_page(h, page);
+		h->surplus_huge_pages--;
+		h->surplus_huge_pages_node[nid]--;
 	} else {
-		enqueue_huge_page(page);
+		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
 	if (mapping)
@@ -301,7 +305,7 @@ static void free_huge_page(struct page *
  * balanced by operating on them in a round-robin fashion.
  * Returns 1 if an adjustment was made.
  */
-static int adjust_pool_surplus(int delta)
+static int adjust_pool_surplus(struct hstate *h, int delta)
 {
 	static int prev_nid;
 	int nid = prev_nid;
@@ -314,15 +318,15 @@ static int adjust_pool_surplus(int delta
 			nid = first_node(node_online_map);
 
 		/* To shrink on this node, there must be a surplus page */
-		if (delta < 0 && !surplus_huge_pages_node[nid])
+		if (delta < 0 && !h->surplus_huge_pages_node[nid])
 			continue;
 		/* Surplus cannot exceed the total number of pages */
-		if (delta > 0 && surplus_huge_pages_node[nid] >=
-						nr_huge_pages_node[nid])
+		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+						h->nr_huge_pages_node[nid])
 			continue;
 
-		surplus_huge_pages += delta;
-		surplus_huge_pages_node[nid] += delta;
+		h->surplus_huge_pages += delta;
+		h->surplus_huge_pages_node[nid] += delta;
 		ret = 1;
 		break;
 	} while (nid != prev_nid);
@@ -331,46 +335,46 @@ static int adjust_pool_surplus(int delta
 	return ret;
 }
 
-static void prep_new_huge_page(struct page *page, int nid)
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	set_compound_page_dtor(page, free_huge_page);
 	spin_lock(&hugetlb_lock);
-	nr_huge_pages++;
-	nr_huge_pages_node[nid]++;
+	h->nr_huge_pages++;
+	h->nr_huge_pages_node[nid]++;
 	spin_unlock(&hugetlb_lock);
 	put_page(page); /* free it into the hugepage allocator */
 }
 
-static struct page *alloc_fresh_huge_page_node(int nid)
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
 	page = alloc_pages_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
-		HUGETLB_PAGE_ORDER);
+		huge_page_order(h));
 	if (page) {
 		if (arch_prepare_hugepage(page)) {
 			__free_pages(page, HUGETLB_PAGE_ORDER);
 			return NULL;
 		}
-		prep_new_huge_page(page, nid);
+		prep_new_huge_page(h, page, nid);
 	}
 
 	return page;
 }
 
-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct hstate *h)
 {
 	struct page *page;
 	int start_nid;
 	int next_nid;
 	int ret = 0;
 
-	start_nid = hugetlb_next_nid;
+	start_nid = h->hugetlb_next_nid;
 
 	do {
-		page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
 		if (page)
 			ret = 1;
 		/*
@@ -384,11 +388,11 @@ static int alloc_fresh_huge_page(void)
 		 * if we just successfully allocated a hugepage so that
 		 * the next caller gets hugepages on the next node.
 		 */
-		next_nid = next_node(hugetlb_next_nid, node_online_map);
+		next_nid = next_node(h->hugetlb_next_nid, node_online_map);
 		if (next_nid == MAX_NUMNODES)
 			next_nid = first_node(node_online_map);
-		hugetlb_next_nid = next_nid;
-	} while (!page && hugetlb_next_nid != start_nid);
+		h->hugetlb_next_nid = next_nid;
+	} while (!page && h->hugetlb_next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -398,8 +402,8 @@ static int alloc_fresh_huge_page(void)
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
-						unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
 {
 	struct page *page;
 	unsigned int nid;
@@ -428,18 +432,18 @@ static struct page *alloc_buddy_huge_pag
 	 * per-node value is checked there.
 	 */
 	spin_lock(&hugetlb_lock);
-	if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		spin_unlock(&hugetlb_lock);
 		return NULL;
 	} else {
-		nr_huge_pages++;
-		surplus_huge_pages++;
+		h->nr_huge_pages++;
+		h->surplus_huge_pages++;
 	}
 	spin_unlock(&hugetlb_lock);
 
 	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
 					__GFP_REPEAT|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
+					huge_page_order(h));
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
@@ -454,12 +458,12 @@ static struct page *alloc_buddy_huge_pag
 		/*
 		 * We incremented the global counters already
 		 */
-		nr_huge_pages_node[nid]++;
-		surplus_huge_pages_node[nid]++;
+		h->nr_huge_pages_node[nid]++;
+		h->surplus_huge_pages_node[nid]++;
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
-		nr_huge_pages--;
-		surplus_huge_pages--;
+		h->nr_huge_pages--;
+		h->surplus_huge_pages--;
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 	}
 	spin_unlock(&hugetlb_lock);
@@ -471,16 +475,16 @@ static struct page *alloc_buddy_huge_pag
  * Increase the hugetlb pool such that it can accomodate a reservation
  * of size 'delta'.
  */
-static int gather_surplus_pages(int delta)
+static int gather_surplus_pages(struct hstate *h, int delta)
 {
 	struct list_head surplus_list;
 	struct page *page, *tmp;
 	int ret, i;
 	int needed, allocated;
 
-	needed = (resv_huge_pages + delta) - free_huge_pages;
+	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
 	if (needed <= 0) {
-		resv_huge_pages += delta;
+		h->resv_huge_pages += delta;
 		return 0;
 	}
 
@@ -491,7 +495,7 @@ static int gather_surplus_pages(int delt
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(NULL, 0);
+		page = alloc_buddy_huge_page(h, NULL, 0);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -512,7 +516,8 @@ retry:
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
 	spin_lock(&hugetlb_lock);
-	needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+	needed = (h->resv_huge_pages + delta) -
+			(h->free_huge_pages + allocated);
 	if (needed > 0)
 		goto retry;
 
@@ -525,7 +530,7 @@ retry:
 	 * before they are reserved.
 	 */
 	needed += allocated;
-	resv_huge_pages += delta;
+	h->resv_huge_pages += delta;
 	ret = 0;
 free:
 	/* Free the needed pages to the hugetlb pool */
@@ -533,7 +538,7 @@ free:
 		if ((--needed) < 0)
 			break;
 		list_del(&page->lru);
-		enqueue_huge_page(page);
+		enqueue_huge_page(h, page);
 	}
 
 	/* Free unnecessary surplus pages to the buddy allocator */
@@ -561,7 +566,8 @@ free:
  * allocated to satisfy the reservation must be explicitly freed if they were
  * never used.
  */
-static void return_unused_surplus_pages(unsigned long unused_resv_pages)
+static void return_unused_surplus_pages(struct hstate *h,
+					unsigned long unused_resv_pages)
 {
 	static int nid = -1;
 	struct page *page;
@@ -576,27 +582,27 @@ static void return_unused_surplus_pages(
 	unsigned long remaining_iterations = num_online_nodes();
 
 	/* Uncommit the reservation */
-	resv_huge_pages -= unused_resv_pages;
+	h->resv_huge_pages -= unused_resv_pages;
 
-	nr_pages = min(unused_resv_pages, surplus_huge_pages);
+	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
 
 	while (remaining_iterations-- && nr_pages) {
 		nid = next_node(nid, node_online_map);
 		if (nid == MAX_NUMNODES)
 			nid = first_node(node_online_map);
 
-		if (!surplus_huge_pages_node[nid])
+		if (!h->surplus_huge_pages_node[nid])
 			continue;
 
-		if (!list_empty(&hugepage_freelists[nid])) {
-			page = list_entry(hugepage_freelists[nid].next,
+		if (!list_empty(&h->hugepage_freelists[nid])) {
+			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
-			update_and_free_page(page);
-			free_huge_pages--;
-			free_huge_pages_node[nid]--;
-			surplus_huge_pages--;
-			surplus_huge_pages_node[nid]--;
+			update_and_free_page(h, page);
+			h->free_huge_pages--;
+			h->free_huge_pages_node[nid]--;
+			h->surplus_huge_pages--;
+			h->surplus_huge_pages_node[nid]--;
 			nr_pages--;
 			remaining_iterations = num_online_nodes();
 		}
@@ -733,13 +739,14 @@ static long region_truncate(struct list_
  * an instantiated the change should be committed via vma_commit_reservation.
  * No action is required on failure.
  */
-static int vma_needs_reservation(struct vm_area_struct *vma, unsigned long addr)
+static int vma_needs_reservation(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_SHARED) {
-		pgoff_t idx = vma_pagecache_offset(vma, addr);
+		pgoff_t idx = vma_pagecache_offset(h, vma, addr);
 		return region_chg(&inode->i_mapping->private_list,
 							idx, idx + 1);
 
@@ -750,14 +757,14 @@ static int vma_needs_reservation(struct 
 
 	return 0;
 }
-static void vma_commit_reservation(struct vm_area_struct *vma,
-							unsigned long addr)
+static void vma_commit_reservation(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long addr)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
 	if (vma->vm_flags & VM_SHARED) {
-		pgoff_t idx = vma_pagecache_offset(vma, addr);
+		pgoff_t idx = vma_pagecache_offset(h, vma, addr);
 		region_add(&inode->i_mapping->private_list, idx, idx + 1);
 	}
 }
@@ -765,6 +772,7 @@ static void vma_commit_reservation(struc
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
+	struct hstate *h = hstate_vma(vma);
 	struct page *page;
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
@@ -777,7 +785,7 @@ static struct page *alloc_huge_page(stru
 	 * MAP_NORESERVE mappings may also need pages and quota allocated
 	 * if no reserve mapping overlaps.
 	 */
-	chg = vma_needs_reservation(vma, addr);
+	chg = vma_needs_reservation(h, vma, addr);
 	if (chg < 0)
 		return ERR_PTR(chg);
 	if (chg)
@@ -785,11 +793,11 @@ static struct page *alloc_huge_page(stru
 			return ERR_PTR(-ENOSPC);
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_vma(vma, addr, avoid_reserve);
+	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(vma, addr);
+		page = alloc_buddy_huge_page(h, vma, addr);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_OOM);
@@ -799,7 +807,7 @@ static struct page *alloc_huge_page(stru
 	set_page_refcounted(page);
 	set_page_private(page, (unsigned long) mapping);
 
-	vma_commit_reservation(vma, addr);
+	vma_commit_reservation(h, vma, addr);
 
 	return page;
 }
@@ -807,21 +815,28 @@ static struct page *alloc_huge_page(stru
 static int __init hugetlb_init(void)
 {
 	unsigned long i;
+	struct hstate *h = &default_hstate;
 
 	if (HPAGE_SHIFT == 0)
 		return 0;
 
+	if (!h->order) {
+		h->order = HPAGE_SHIFT - PAGE_SHIFT;
+		h->mask = HPAGE_MASK;
+	}
+
 	for (i = 0; i < MAX_NUMNODES; ++i)
-		INIT_LIST_HEAD(&hugepage_freelists[i]);
+		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
 
-	hugetlb_next_nid = first_node(node_online_map);
+	h->hugetlb_next_nid = first_node(node_online_map);
 
 	for (i = 0; i < max_huge_pages; ++i) {
-		if (!alloc_fresh_huge_page())
+		if (!alloc_fresh_huge_page(h))
 			break;
 	}
-	max_huge_pages = free_huge_pages = nr_huge_pages = i;
-	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+	max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+	printk(KERN_INFO "Total HugeTLB memory allocated, %ld\n",
+			h->free_huge_pages);
 	return 0;
 }
 module_init(hugetlb_init);
@@ -847,34 +862,36 @@ static unsigned int cpuset_mems_nr(unsig
 
 #ifdef CONFIG_SYSCTL
 #ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count)
 {
 	int i;
 
 	for (i = 0; i < MAX_NUMNODES; ++i) {
 		struct page *page, *next;
-		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
-			if (count >= nr_huge_pages)
+		struct list_head *freel = &h->hugepage_freelists[i];
+		list_for_each_entry_safe(page, next, freel, lru) {
+			if (count >= h->nr_huge_pages)
 				return;
 			if (PageHighMem(page))
 				continue;
 			list_del(&page->lru);
-			update_and_free_page(page);
+			update_and_free_page(h, page);
-			free_huge_pages--;
-			free_huge_pages_node[page_to_nid(page)]--;
+			h->free_huge_pages--;
+			h->free_huge_pages_node[page_to_nid(page)]--;
 		}
 	}
 }
 #else
-static inline void try_to_free_low(unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count)
 {
 }
 #endif
 
-#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
+#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static unsigned long set_max_huge_pages(unsigned long count)
 {
 	unsigned long min_count, ret;
+	struct hstate *h = &default_hstate;
 
 	/*
 	 * Increase the pool size
@@ -888,19 +905,19 @@ static unsigned long set_max_huge_pages(
 	 * within all the constraints specified by the sysctls.
 	 */
 	spin_lock(&hugetlb_lock);
-	while (surplus_huge_pages && count > persistent_huge_pages) {
-		if (!adjust_pool_surplus(-1))
+	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
+		if (!adjust_pool_surplus(h, -1))
 			break;
 	}
 
-	while (count > persistent_huge_pages) {
+	while (count > persistent_huge_pages(h)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_page will handle it by freeing the page
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page();
+		ret = alloc_fresh_huge_page(h);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -922,21 +939,21 @@ static unsigned long set_max_huge_pages(
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
 	 */
-	min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
+	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 	min_count = max(count, min_count);
-	try_to_free_low(min_count);
-	while (min_count < persistent_huge_pages) {
-		struct page *page = dequeue_huge_page();
+	try_to_free_low(h, min_count);
+	while (min_count < persistent_huge_pages(h)) {
+		struct page *page = dequeue_huge_page(h);
 		if (!page)
 			break;
-		update_and_free_page(page);
+		update_and_free_page(h, page);
 	}
-	while (count < persistent_huge_pages) {
-		if (!adjust_pool_surplus(1))
+	while (count < persistent_huge_pages(h)) {
+		if (!adjust_pool_surplus(h, 1))
 			break;
 	}
 out:
-	ret = persistent_huge_pages;
+	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
 	return ret;
 }
@@ -966,9 +983,10 @@ int hugetlb_overcommit_handler(struct ct
 			struct file *file, void __user *buffer,
 			size_t *length, loff_t *ppos)
 {
+	struct hstate *h = &default_hstate;
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
 	spin_lock(&hugetlb_lock);
-	nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+	h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
 	spin_unlock(&hugetlb_lock);
 	return 0;
 }
@@ -977,37 +995,40 @@ int hugetlb_overcommit_handler(struct ct
 
 int hugetlb_report_meminfo(char *buf)
 {
+	struct hstate *h = &default_hstate;
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
 			"HugePages_Rsvd:  %5lu\n"
 			"HugePages_Surp:  %5lu\n"
 			"Hugepagesize:    %5lu kB\n",
-			nr_huge_pages,
-			free_huge_pages,
-			resv_huge_pages,
-			surplus_huge_pages,
-			HPAGE_SIZE/1024);
+			h->nr_huge_pages,
+			h->free_huge_pages,
+			h->resv_huge_pages,
+			h->surplus_huge_pages,
+			1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
 }
 
 int hugetlb_report_node_meminfo(int nid, char *buf)
 {
+	struct hstate *h = &default_hstate;
 	return sprintf(buf,
 		"Node %d HugePages_Total: %5u\n"
 		"Node %d HugePages_Free:  %5u\n"
 		"Node %d HugePages_Surp:  %5u\n",
-		nid, nr_huge_pages_node[nid],
-		nid, free_huge_pages_node[nid],
-		nid, surplus_huge_pages_node[nid]);
+		nid, h->nr_huge_pages_node[nid],
+		nid, h->free_huge_pages_node[nid],
+		nid, h->surplus_huge_pages_node[nid]);
 }
 
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {
-	return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+	struct hstate *h = &default_hstate;
+	return h->nr_huge_pages * pages_per_huge_page(h);
 }
 
-static int hugetlb_acct_memory(long delta)
+static int hugetlb_acct_memory(struct hstate *h, long delta)
 {
 	int ret = -ENOMEM;
 
@@ -1030,18 +1051,18 @@ static int hugetlb_acct_memory(long delt
 	 * semantics that cpuset has.
 	 */
 	if (delta > 0) {
-		if (gather_surplus_pages(delta) < 0)
+		if (gather_surplus_pages(h, delta) < 0)
 			goto out;
 
-		if (delta > cpuset_mems_nr(free_huge_pages_node)) {
-			return_unused_surplus_pages(delta);
+		if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+			return_unused_surplus_pages(h, delta);
 			goto out;
 		}
 	}
 
 	ret = 0;
 	if (delta < 0)
-		return_unused_surplus_pages((unsigned long) -delta);
+		return_unused_surplus_pages(h, (unsigned long) -delta);
 
 out:
 	spin_unlock(&hugetlb_lock);
@@ -1050,9 +1071,11 @@ out:
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
+	struct hstate *h = hstate_vma(vma);
 	unsigned long reserve = vma_resv_huge_pages(vma);
+
 	if (reserve)
-		hugetlb_acct_memory(-reserve);
+		hugetlb_acct_memory(h, -reserve);
 }
 
 /*
@@ -1108,14 +1131,16 @@ int copy_hugetlb_page_range(struct mm_st
 	struct page *ptepage;
 	unsigned long addr;
 	int cow;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, addr);
+		dst_pte = huge_pte_alloc(dst, addr, sz);
 		if (!dst_pte)
 			goto nomem;
 
@@ -1151,6 +1176,9 @@ void __unmap_hugepage_range(struct vm_ar
 	pte_t pte;
 	struct page *page;
 	struct page *tmp;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
+
 	/*
 	 * A page gathering list, protected by per file i_mmap_lock. The
 	 * lock is used to avoid list corruption from multiple unmapping
@@ -1159,11 +1187,11 @@ void __unmap_hugepage_range(struct vm_ar
 	LIST_HEAD(page_list);
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
-	BUG_ON(start & ~HPAGE_MASK);
-	BUG_ON(end & ~HPAGE_MASK);
+	BUG_ON(start & ~huge_page_mask(h));
+	BUG_ON(end & ~huge_page_mask(h));
 
 	spin_lock(&mm->page_table_lock);
-	for (address = start; address < end; address += HPAGE_SIZE) {
+	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
@@ -1276,6 +1304,7 @@ static int hugetlb_cow(struct mm_struct 
 			unsigned long address, pte_t *ptep, pte_t pte,
 			struct page *pagecache_page)
 {
+	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int avoidcopy;
 	int outside_reserve = 0;
@@ -1336,7 +1365,7 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 	spin_lock(&mm->page_table_lock);
 
-	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
@@ -1351,14 +1380,14 @@ retry_avoidcopy:
 }
 
 /* Return the pagecache page at a given address within a VMA */
-static struct page *hugetlbfs_pagecache_page(struct vm_area_struct *vma,
-			unsigned long address)
+static struct page *hugetlbfs_pagecache_page(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
 {
 	struct address_space *mapping;
 	pgoff_t idx;
 
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_pagecache_offset(vma, address);
+	idx = vma_pagecache_offset(h, vma, address);
 
 	return find_lock_page(mapping, idx);
 }
@@ -1366,6 +1395,7 @@ static struct page *hugetlbfs_pagecache_
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, int write_access)
 {
+	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
 	pgoff_t idx;
 	unsigned long size;
@@ -1386,7 +1416,7 @@ static int hugetlb_no_page(struct mm_str
 	}
 
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_pagecache_offset(vma, address);
+	idx = vma_pagecache_offset(h, vma, address);
 
 	/*
 	 * Use page lock to guard against racing truncation
@@ -1395,7 +1425,7 @@ static int hugetlb_no_page(struct mm_str
 retry:
 	page = find_lock_page(mapping, idx);
 	if (!page) {
-		size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
 		page = alloc_huge_page(vma, address, 0);
@@ -1403,7 +1433,7 @@ retry:
 			ret = -PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, address);
+		clear_huge_page(page, address, huge_page_size(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_SHARED) {
@@ -1419,14 +1449,14 @@ retry:
 			}
 
 			spin_lock(&inode->i_lock);
-			inode->i_blocks += BLOCKS_PER_HUGEPAGE;
+			inode->i_blocks += blocks_per_huge_page(h);
 			spin_unlock(&inode->i_lock);
 		} else
 			lock_page(page);
 	}
 
 	spin_lock(&mm->page_table_lock);
-	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
 		goto backout;
 
@@ -1462,8 +1492,9 @@ int hugetlb_fault(struct mm_struct *mm, 
 	pte_t entry;
 	int ret;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	struct hstate *h = hstate_vma(vma);
 
-	ptep = huge_pte_alloc(mm, address);
+	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
 
@@ -1487,7 +1518,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
 		if (write_access && !pte_write(entry)) {
 			struct page *page;
-			page = hugetlbfs_pagecache_page(vma, address);
+			page = hugetlbfs_pagecache_page(h, vma, address);
 			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
 			if (page) {
 				unlock_page(page);
@@ -1508,6 +1539,7 @@ int follow_hugetlb_page(struct mm_struct
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
 	int remainder = *length;
+	struct hstate *h = hstate_vma(vma);
 
 	spin_lock(&mm->page_table_lock);
 	while (vaddr < vma->vm_end && remainder) {
@@ -1519,7 +1551,7 @@ int follow_hugetlb_page(struct mm_struct
 		 * each hugepage.  We have to make * sure we get the
 		 * first, for the page indexing below to work.
 		 */
-		pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
 
 		if (!pte || huge_pte_none(huge_ptep_get(pte)) ||
 		    (write && !pte_write(huge_ptep_get(pte)))) {
@@ -1537,7 +1569,7 @@ int follow_hugetlb_page(struct mm_struct
 			break;
 		}
 
-		pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
+		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 same_page:
 		if (pages) {
@@ -1553,7 +1585,7 @@ same_page:
 		--remainder;
 		++i;
 		if (vaddr < vma->vm_end && remainder &&
-				pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+				pfn_offset < pages_per_huge_page(h)) {
 			/*
 			 * We use pfn_offset to avoid touching the pageframes
 			 * of this compound page.
@@ -1575,13 +1607,14 @@ void hugetlb_change_protection(struct vm
 	unsigned long start = address;
 	pte_t *ptep;
 	pte_t pte;
+	struct hstate *h = hstate_vma(vma);
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
 	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
-	for (; address < end; address += HPAGE_SIZE) {
+	for (; address < end; address += huge_page_size(h)) {
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
@@ -1604,6 +1637,7 @@ int hugetlb_reserve_pages(struct inode *
 					struct vm_area_struct *vma)
 {
 	long ret, chg;
+	struct hstate *h = hstate_inode(inode);
 
 	if (vma && vma->vm_flags & VM_NORESERVE)
 		return 0;
@@ -1627,7 +1661,7 @@ int hugetlb_reserve_pages(struct inode *
 
 	if (hugetlb_get_quota(inode->i_mapping, chg))
 		return -ENOSPC;
-	ret = hugetlb_acct_memory(chg);
+	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0) {
 		hugetlb_put_quota(inode->i_mapping, chg);
 		return ret;
@@ -1639,12 +1673,13 @@ int hugetlb_reserve_pages(struct inode *
 
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 {
+	struct hstate *h = hstate_inode(inode);
 	long chg = region_truncate(&inode->i_mapping->private_list, offset);
 
 	spin_lock(&inode->i_lock);
-	inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
+	inode->i_blocks -= blocks_per_huge_page(h) * freed;
 	spin_unlock(&inode->i_lock);
 
 	hugetlb_put_quota(inode->i_mapping, (chg - freed));
-	hugetlb_acct_memory(-(chg - freed));
+	hugetlb_acct_memory(h, -(chg - freed));
 }
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-03 20:45:19.000000000 +1000
@@ -128,7 +128,8 @@ pte_t *huge_pte_offset(struct mm_struct 
 	return NULL;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pg;
 	pud_t *pu;
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c	2008-06-03 20:45:19.000000000 +1000
@@ -175,7 +175,7 @@ hugetlb_get_unmapped_area(struct file *f
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
@@ -195,7 +195,8 @@ hugetlb_get_unmapped_area(struct file *f
 				pgoff, flags);
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/arch/sh/mm/hugetlbpage.c	2008-06-03 20:45:19.000000000 +1000
@@ -22,7 +22,8 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c	2008-06-03 20:45:19.000000000 +1000
@@ -24,7 +24,7 @@
 unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
 
 pte_t *
-huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
+huge_pte_alloc (struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
 	pgd_t *pgd;
@@ -75,7 +75,8 @@ int huge_pmd_unshare(struct mm_struct *m
  * Don't actually need to do any preparation, but need to make sure
  * the address is in the right region.
  */
-int prepare_hugepage_range(unsigned long addr, unsigned long len)
+int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
@@ -149,7 +150,7 @@ unsigned long hugetlb_get_unmapped_area(
 
 	/* Handle MAP_FIXED */
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-06-03 20:45:19.000000000 +1000
@@ -124,7 +124,8 @@ int huge_pmd_unshare(struct mm_struct *m
 	return 1;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -368,7 +369,7 @@ hugetlb_get_unmapped_area(struct file *f
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 20:45:07.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 20:48:41.000000000 +1000
@@ -8,7 +8,6 @@
 #include <linux/mempolicy.h>
 #include <linux/shm.h>
 #include <asm/tlbflush.h>
-#include <asm/hugetlb.h>
 
 struct ctl_table;
 
@@ -45,7 +44,8 @@ extern int sysctl_hugetlb_shm_group;
 
 /* arch callbacks */
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -80,7 +80,7 @@ static inline unsigned long hugetlb_tota
 #define hugetlb_report_meminfo(buf)		0
 #define hugetlb_report_node_meminfo(n, buf)	0
 #define follow_huge_pmd(mm, addr, pmd, write)	NULL
-#define prepare_hugepage_range(addr,len)	(-EINVAL)
+#define prepare_hugepage_range(file, addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define is_hugepage_only_range(mm, addr, len)	0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
@@ -134,8 +134,6 @@ struct file *hugetlb_file_setup(const ch
 int hugetlb_get_quota(struct address_space *mapping, long delta);
 void hugetlb_put_quota(struct address_space *mapping, long delta);
 
-#define BLOCKS_PER_HUGEPAGE	(HPAGE_SIZE / 512)
-
 static inline int is_file_hugepages(struct file *file)
 {
 	if (file->f_op == &hugetlbfs_file_operations)
@@ -164,4 +162,84 @@ unsigned long hugetlb_get_unmapped_area(
 					unsigned long flags);
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Defines one hugetlb page size */
+struct hstate {
+	int hugetlb_next_nid;
+	unsigned int order;
+	unsigned long mask;
+	unsigned long max_huge_pages;
+	unsigned long nr_huge_pages;
+	unsigned long free_huge_pages;
+	unsigned long resv_huge_pages;
+	unsigned long surplus_huge_pages;
+	unsigned long nr_overcommit_huge_pages;
+	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int nr_huge_pages_node[MAX_NUMNODES];
+	unsigned int free_huge_pages_node[MAX_NUMNODES];
+	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+};
+
+extern struct hstate default_hstate;
+
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+{
+	return &default_hstate;
+}
+
+static inline struct hstate *hstate_file(struct file *f)
+{
+	return &default_hstate;
+}
+
+static inline struct hstate *hstate_inode(struct inode *i)
+{
+	return &default_hstate;
+}
+
+static inline unsigned long huge_page_size(struct hstate *h)
+{
+	return (unsigned long)PAGE_SIZE << h->order;
+}
+
+static inline unsigned long huge_page_mask(struct hstate *h)
+{
+	return h->mask;
+}
+
+static inline unsigned int huge_page_order(struct hstate *h)
+{
+	return h->order;
+}
+
+static inline unsigned huge_page_shift(struct hstate *h)
+{
+	return h->order + PAGE_SHIFT;
+}
+
+static inline unsigned int pages_per_huge_page(struct hstate *h)
+{
+	return 1 << h->order;
+}
+
+static inline unsigned int blocks_per_huge_page(struct hstate *h)
+{
+	return huge_page_size(h) / 512;
+}
+
+#include <asm/hugetlb.h>
+
+#else
+struct hstate {};
+#define hstate_file(f) NULL
+#define hstate_vma(v) NULL
+#define hstate_inode(i) NULL
+#define huge_page_size(h) PAGE_SIZE
+#define huge_page_mask(h) PAGE_MASK
+#define huge_page_order(h) 0
+#define huge_page_shift(h) PAGE_SHIFT
+#define pages_per_huge_page(h) 1
+#endif
+
 #endif /* _LINUX_HUGETLB_H */
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/fs/hugetlbfs/inode.c	2008-06-03 20:45:19.000000000 +1000
@@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
 	struct inode *inode = file->f_path.dentry->d_inode;
 	loff_t len, vma_len;
 	int ret;
+	struct hstate *h = hstate_file(file);
 
 	/*
 	 * vma address alignment (but not the pgoff alignment) has
@@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
 
-	if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
+	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
 		return -EINVAL;
 
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
 	if (hugetlb_reserve_pages(inode,
-				vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
-				len >> HPAGE_SHIFT, vma))
+				vma->vm_pgoff >> huge_page_order(h),
+				len >> huge_page_shift(h), vma))
 		goto out;
 
 	ret = 0;
@@ -130,20 +131,21 @@ hugetlb_get_unmapped_area(struct file *f
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long start_addr;
+	struct hstate *h = hstate_file(file);
 
-	if (len & ~HPAGE_MASK)
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
-		if (prepare_hugepage_range(addr, len))
+		if (prepare_hugepage_range(file, addr, len))
 			return -EINVAL;
 		return addr;
 	}
 
 	if (addr) {
-		addr = ALIGN(addr, HPAGE_SIZE);
+		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
 		if (TASK_SIZE - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
@@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
 		start_addr = TASK_UNMAPPED_BASE;
 
 full_search:
-	addr = ALIGN(start_addr, HPAGE_SIZE);
+	addr = ALIGN(start_addr, huge_page_size(h));
 
 	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
@@ -174,7 +176,7 @@ full_search:
 
 		if (!vma || addr + len <= vma->vm_start)
 			return addr;
-		addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+		addr = ALIGN(vma->vm_end, huge_page_size(h));
 	}
 }
 #endif
@@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page, 
 static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
 			      size_t len, loff_t *ppos)
 {
+	struct hstate *h = hstate_file(filp);
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-	unsigned long index = *ppos >> HPAGE_SHIFT;
-	unsigned long offset = *ppos & ~HPAGE_MASK;
+	unsigned long index = *ppos >> huge_page_shift(h);
+	unsigned long offset = *ppos & ~huge_page_mask(h);
 	unsigned long end_index;
 	loff_t isize;
 	ssize_t retval = 0;
@@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
 	if (!isize)
 		goto out;
 
-	end_index = (isize - 1) >> HPAGE_SHIFT;
+	end_index = (isize - 1) >> huge_page_shift(h);
 	for (;;) {
 		struct page *page;
-		int nr, ret;
+		unsigned long nr, ret;
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = HPAGE_SIZE;
+		nr = huge_page_size(h);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
-			nr = ((isize - 1) & ~HPAGE_MASK) + 1;
+			nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
 			if (nr <= offset) {
 				goto out;
 			}
@@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
 		offset += ret;
 		retval += ret;
 		len -= ret;
-		index += offset >> HPAGE_SHIFT;
-		offset &= ~HPAGE_MASK;
+		index += offset >> huge_page_shift(h);
+		offset &= ~huge_page_mask(h);
 
 		if (page)
 			page_cache_release(page);
@@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
 			break;
 	}
 out:
-	*ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
+	*ppos = ((loff_t)index << huge_page_shift(h)) + offset;
 	mutex_unlock(&inode->i_mutex);
 	return retval;
 }
@@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
 
 static void truncate_hugepages(struct inode *inode, loff_t lstart)
 {
+	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
-	const pgoff_t start = lstart >> HPAGE_SHIFT;
+	const pgoff_t start = lstart >> huge_page_shift(h);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i, freed = 0;
@@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
 {
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
+	struct hstate *h = hstate_inode(inode);
 
-	BUG_ON(offset & ~HPAGE_MASK);
+	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
@@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct hstate *h = hstate_inode(inode);
 	int error;
 	unsigned int ia_valid = attr->ia_valid;
 
@@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
 
 	if (ia_valid & ATTR_SIZE) {
 		error = -EINVAL;
-		if (!(attr->ia_size & ~HPAGE_MASK))
+		if (!(attr->ia_size & ~huge_page_mask(h)))
 			error = hugetlb_vmtruncate(inode, attr->ia_size);
 		if (error)
 			goto out;
@@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
+	struct hstate *h = hstate_inode(dentry->d_inode);
 
 	buf->f_type = HUGETLBFS_MAGIC;
-	buf->f_bsize = HPAGE_SIZE;
+	buf->f_bsize = huge_page_size(h);
 	if (sbinfo) {
 		spin_lock(&sbinfo->stat_lock);
 		/* If no limits set, just report 0 for max/free/used
@@ -942,7 +949,8 @@ struct file *hugetlb_file_setup(const ch
 		goto out_dentry;
 
 	error = -ENOMEM;
-	if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT, NULL))
+	if (hugetlb_reserve_pages(inode, 0,
+			size >> huge_page_shift(hstate_inode(inode)), NULL))
 		goto out_inode;
 
 	d_instantiate(dentry, inode);
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/ipc/shm.c	2008-06-03 20:45:19.000000000 +1000
@@ -562,7 +562,8 @@ static void shm_get_stat(struct ipc_name
 
 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+			struct hstate *h = hstate_file(shp->shm_file);
+			*rss += pages_per_huge_page(h) * mapping->nrpages;
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/mm/memory.c	2008-06-03 20:45:19.000000000 +1000
@@ -904,7 +904,7 @@ unsigned long unmap_vmas(struct mmu_gath
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				unmap_hugepage_range(vma, start, end, NULL);
 				zap_work -= (end - start) /
-						(HPAGE_SIZE / PAGE_SIZE);
+					pages_per_huge_page(hstate_vma(vma));
 				start = end;
 			} else
 				start = unmap_page_range(*tlbp, vma,
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/mm/mempolicy.c	2008-06-03 20:45:19.000000000 +1000
@@ -1477,7 +1477,7 @@ struct zonelist *huge_zonelist(struct vm
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
 		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
-						HPAGE_SHIFT), gfp_flags);
+				huge_page_shift(hstate_vma(vma))), gfp_flags);
 	} else {
 		zl = policy_zonelist(gfp_flags, *mpol);
 		if ((*mpol)->mode == MPOL_BIND)
@@ -2216,9 +2216,12 @@ static void check_huge_range(struct vm_a
 {
 	unsigned long addr;
 	struct page *page;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long sz = huge_page_size(h);
 
-	for (addr = start; addr < end; addr += HPAGE_SIZE) {
-		pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
+	for (addr = start; addr < end; addr += sz) {
+		pte_t *ptep = huge_pte_offset(vma->vm_mm,
+						addr & huge_page_mask(h));
 		pte_t pte;
 
 		if (!ptep)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/mm/mmap.c	2008-06-03 20:45:19.000000000 +1000
@@ -1808,7 +1808,8 @@ int split_vma(struct mm_struct * mm, str
 	struct mempolicy *pol;
 	struct vm_area_struct *new;
 
-	if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
+	if (is_vm_hugetlb_page(vma) && (addr &
+					~(huge_page_mask(hstate_vma(vma)))))
 		return -EINVAL;
 
 	if (mm->map_count >= sysctl_max_map_count)
Index: linux-2.6/include/asm-ia64/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/hugetlb.h	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/include/asm-ia64/hugetlb.h	2008-06-03 20:45:19.000000000 +1000
@@ -8,7 +8,8 @@ void hugetlb_free_pgd_range(struct mmu_g
 			    unsigned long end, unsigned long floor,
 			    unsigned long ceiling);
 
-int prepare_hugepage_range(unsigned long addr, unsigned long len);
+int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len);
 
 static inline int is_hugepage_only_range(struct mm_struct *mm,
 					 unsigned long addr,
Index: linux-2.6/include/asm-powerpc/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/hugetlb.h	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/include/asm-powerpc/hugetlb.h	2008-06-03 20:45:19.000000000 +1000
@@ -21,7 +21,8 @@ pte_t huge_ptep_get_and_clear(struct mm_
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-s390/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-s390/hugetlb.h	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/include/asm-s390/hugetlb.h	2008-06-03 20:45:19.000000000 +1000
@@ -22,7 +22,8 @@ void set_huge_pte_at(struct mm_struct *m
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-sh/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-sh/hugetlb.h	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/include/asm-sh/hugetlb.h	2008-06-03 20:45:19.000000000 +1000
@@ -14,7 +14,8 @@ static inline int is_hugepage_only_range
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-sparc64/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/hugetlb.h	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/include/asm-sparc64/hugetlb.h	2008-06-03 20:45:19.000000000 +1000
@@ -22,7 +22,8 @@ static inline int is_hugepage_only_range
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
Index: linux-2.6/include/asm-x86/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hugetlb.h	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/include/asm-x86/hugetlb.h	2008-06-03 20:45:19.000000000 +1000
@@ -14,11 +14,13 @@ static inline int is_hugepage_only_range
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
  */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file,
+			unsigned long addr, unsigned long len)
 {
-	if (len & ~HPAGE_MASK)
+	struct hstate *h = hstate_file(file);
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (addr & ~HPAGE_MASK)
+	if (addr & ~huge_page_mask(h))
 		return -EINVAL;
 	return 0;
 }
Index: linux-2.6/arch/s390/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/hugetlbpage.c	2008-06-03 20:42:43.000000000 +1000
+++ linux-2.6/arch/s390/mm/hugetlbpage.c	2008-06-03 20:45:19.000000000 +1000
@@ -72,7 +72,8 @@ void arch_release_hugepage(struct page *
 	page[1].index = 0;
 }
 
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm,
+			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgdp;
 	pud_t *pudp;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 10:00 ` [patch 14/21] x86: add hugepagesz option " npiggin
@ 2008-06-03 17:48   ` Dave Hansen
  2008-06-03 18:24     ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-06-03 17:48 UTC (permalink / raw)
  To: npiggin
  Cc: akpm, Nishanth Aravamudan, linux-mm, kniht, andi, abh,
	joachim.deguara

On Tue, 2008-06-03 at 20:00 +1000, npiggin@suse.de wrote:
> +static __init int setup_hugepagesz(char *opt)
> +{
> +       unsigned long ps = memparse(opt, &opt);
> +       if (ps == PMD_SIZE) {
> +               hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
> +       } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> +               hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
> +       } else {
> +               printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
> +                       ps >> 20);
> +               return 0;
> +       }
> +       return 1;
> +}
> +__setup("hugepagesz=", setup_hugepagesz);
> +#endif

Hi Nick,

I was talking to Nish a bit about these bits.  I'm worried that this
setup isn't going to be very user friendly.

First of all, it seems a bit silly to require that users spell out all
of the huge page sizes at boot.  Shouldn't we allow the small sizes to
be runtime-added as well?

This also requires that users know at boot time which page sizes are
supported, and that might not be horribly feasible if you wanted to have
a common setup among a bunch of different machines, or if the supported
sizes can change with things as insignificant as firmware revisions
(they can on ppc).

So, here's what I propose.  At boot, the architecture calls
hugetlb_add_hstate() for every hardware-supported huge page size.  This,
of course, gets enumerated in Nish's  new sysfs interface.

Then, give the boot-time large page reservations either to hugepages= or
a new boot option.  But, instead of doing it in number of hpages, do it
in sizes like hugepages=10G.  Bootmem-alloc that area, and make it
available to the first hugetlbfs users that come along, regardless of
their hpage size.  Do whatever is simplest here when breaking down the
large area.

I *think* this could all be done on top of what you have here.
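
A minimal sketch of the registration half of this idea, reusing only the
helpers visible in the hugepagesz= hunk quoted above (illustrative, not code
from the patchset): the arch registers every size it supports,
unconditionally, at boot.

static void __init x86_register_huge_page_sizes(void)
{
	/* 2MB pages are always available on x86-64 */
	hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
	/* 1GB pages only where the CPU supports them */
	if (cpu_has_gbpages)
		hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
}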

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 17:48   ` Dave Hansen
@ 2008-06-03 18:24     ` Andi Kleen
  2008-06-03 18:59       ` Dave Hansen
  2008-06-03 19:00       ` Dave Hansen
  0 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-03 18:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: npiggin, akpm, Nishanth Aravamudan, linux-mm, kniht, andi, abh,
	joachim.deguara

On Tue, Jun 03, 2008 at 10:48:02AM -0700, Dave Hansen wrote:
> First of all, it seems a bit silly to require that users spell out all
> of the huge page sizes at boot.  Shouldn't we allow the small sizes to
> be runtime-added as well

They are already for most systems where you have only two
hpage sizes. That is because the legacy hpage size is always 
added and you can still allocate pages for it using the sysctl. And if
you want to prereserve at boot you'll have to spell the size out
explicitly anyway.

> 
> Then, give the boot-time large page reservations either to hugepages= or
> a new boot option.  But, instead of doing it in number of hpages, do it
> in sizes like hugepages=10G.  Bootmem-alloc that area, and make it

That assumes you can allocate all 10GB contiguously.  That might not be true:
e.g. consider a 16GB x86 box with its 1GB PCI memory hole at 4GB where the
user wants 14GB worth of hugepages. The allocation would need to span the
hole, which is not possible in your scheme.

The bootmem allocator always needs to know the size to be able to split
things up.
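
A rough sketch of that point (illustrative only, not patchset code;
huge_page_size() appears earlier in this series, the pool handling is
elided): reserving huge pages one hugepage-sized bootmem allocation at a
time lets individual allocations land around holes like the PCI window, but
every call has to be told the page size.

static int __init reserve_boot_huge_pages(struct hstate *h, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		/* one hugepage-sized, hugepage-aligned area per page */
		void *addr = __alloc_bootmem_nopanic(huge_page_size(h),
						     huge_page_size(h), 0);
		if (!addr)
			return -ENOMEM;	/* no suitably sized area left */
		/* queue addr for the huge page pool once it is up (elided) */
	}
	return 0;
}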

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 18:24     ` Andi Kleen
@ 2008-06-03 18:59       ` Dave Hansen
  2008-06-03 20:57         ` Andi Kleen
  2008-06-03 19:00       ` Dave Hansen
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-06-03 18:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: npiggin, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Tue, 2008-06-03 at 20:24 +0200, Andi Kleen wrote:
> > Then, give the boot-time large page reservations either to hugepages= or
> > a new boot option.  But, instead of doing it in number of hpages, do it
> > in sizes like hugepages=10G.  Bootmem-alloc that area, and make it
> 
> That assumes you can allocate all 10GB continuously.  Might be not true.
> e.g. consider a 16GB x86 with its 1GB PCI memory hole at 4GB and user
> wants 14GB worth of hugepages. It would need to allocate over the hole
> which is not possible in your scheme.
> 
> The bootmem allocator always needs to know the size to be able to split
> up.

Yeah, that's very true.  But, it shouldn't be too hard to work around.
We could try to allocate a set of bootmem areas that are, at most, the
largest available hsize, then fall back to smaller areas as
needed.

The downside of something like this is that you have yet another data
structure to manage.  Andi, do you think something like this would be
workable?

int alloc_boot_hpages(unsigned long needed)
{
	while (needed > 0) {
		unsigned long alloc = 0;
		/* would need to be sorted descending hsize */
		for_each(hsize) {
			void *area = alloc_bootmem(hsize);
			if (area) {
				add_to_hpage_pool(area, hsize);
				/* account for the size we got, not the address */
				alloc = hsize;
				break;
			}
		}
		if (!alloc) {
			WARN();
			break;
		}
		needed -= alloc;
	}
	return needed ? -ENOMEM : 0;
}

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 18:24     ` Andi Kleen
  2008-06-03 18:59       ` Dave Hansen
@ 2008-06-03 19:00       ` Dave Hansen
  1 sibling, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2008-06-03 19:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: npiggin, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Tue, 2008-06-03 at 20:24 +0200, Andi Kleen wrote:
> On Tue, Jun 03, 2008 at 10:48:02AM -0700, Dave Hansen wrote:
> > First of all, it seems a bit silly to require that users spell out all
> > of the huge page sizes at boot.  Shouldn't we allow the small sizes to
> > be runtime-added as well
> 
> They are already for most systems where you have only two
> hpage sizes. That is because the legacy hpage size is always 
> added and you can still allocate pages for it using the sysctl. And if
> you want to prereserve at boot you'll have to spell the size out
> explicitely anyways.

Yeah, it doesn't make much sense if you're looking at x86 with 2/4MB and
1GB.  But, for ppc or ia64, you might have a realistic chance of allocating
huge pages at runtime, even for non-default sizes.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 18:59       ` Dave Hansen
@ 2008-06-03 20:57         ` Andi Kleen
  2008-06-03 21:27           ` Dave Hansen
  2008-06-04  1:10           ` Nick Piggin
  0 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-03 20:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, npiggin, akpm, Nishanth Aravamudan, linux-mm, kniht,
	abh, joachim.deguara

> The downside of something like this is that you have yet another data
> structure to manage.  Andi, do you think something like this would be
> workable?

The reason I don't like your proposal is that it only makes sense
with a lot of hugepage sizes being active at the same time. But the
API (one mount per size) doesn't really scale to that anyway.
It should support two (as on x86), three if you stretch it, but
anything beyond would be difficult.
If you really wanted to support a zillion sizes you would at least
first need a different, flexible interface that completely hides page
sizes.
Otherwise you would drive both sysadmins and programmers crazy, and
overlong command lines would be the smallest of their problems.
With only two or even three sizes the whole thing is not needed and my
original scheme works fine IMHO.

That is why I was also sceptical of the newly proposed sysfs interfaces. 
For two or three numbers you don't need a sysfs interface.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 20:57         ` Andi Kleen
@ 2008-06-03 21:27           ` Dave Hansen
  2008-06-04  0:06             ` Andi Kleen
  2008-06-04  1:10           ` Nick Piggin
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-06-03 21:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: npiggin, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Tue, 2008-06-03 at 22:57 +0200, Andi Kleen wrote:
> > The downside of something like this is that you have yet another data
> > structure to manage.  Andi, do you think something like this would be
> > workable?
> 
> The reason I don't like your proposal is that it makes only sense
> with a lot of hugepage sizes being active at the same time. But the
> API (one mount per size) doesn't really scale to that anyways.
> It should support two (as on x86), three if you stretch it, but
> anything beyond would be difficult.
> If you really wanted to support a zillion sizes you would at least
> first need a different flexible interface that completely hides page
> sizes.
> Otherwise you would drive both sysadmins and programmers crazy and 
> overlong command lines would be the smallest of their problems
> With two or even three sizes only the whole thing is not needed and my original
> scheme works fine IMHO.
> 
> That is why I was also sceptical of the newly proposed sysfs interfaces. 
> For two or three numbers you don't need a sysfs interface.

So, I think what I've proposed can be useful even for only two page
sizes.  This boot parameter is really one of the very few places users
will actually interact with huge pages and what I've suggested is simply
the most intuitive interface for them to use.  Either a fixed number of
megabytes or a percentage of total RAM is likely to be how people
actually think about it.  I hate getting out calculators to figure out
my kernel cmdline. :)

Also, as I said, users don't really know what the OS or hardware will
support and can't tell until the OS is up.  Firmware revisions,
different kernel versions, and different hypervisors can change this
underneath them.

We're expecting users to predict, properly balance, and commit to
their huge page allocation choices before boot.  Why do this when we
don't have to?

I don't think this adds any complexity at *all* for sysadmins or
programmers.  The legacy hugepages= command and sysctl interfaces can
stay as-is, and they get an optional and much easier-to-use interface.
I think sysadmins seriously appreciate things that aren't carved in
stone at boot.  I'm not sure how this makes it more complex for
programmers.  Could you elaborate on that?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 21:27           ` Dave Hansen
@ 2008-06-04  0:06             ` Andi Kleen
  2008-06-04  1:04               ` Nick Piggin
  2008-06-05 23:15               ` Nishanth Aravamudan
  0 siblings, 2 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-04  0:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: npiggin, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

> Also, as I said, users doesn't really know what the OS or hardware will
> support 

The normal Linux expectation is that these kinds of users will not
use huge pages at all.  Or rather, if everybody was supposed to use
them, then all the interfaces would need to be greatly improved, any
kind of boot parameter would be out, and they would need to be
100% integrated with the standard VM.

Hugepages are strictly a harder-to-use optimization for specific people
who love to tweak (e.g. database administrators or benchmarkers). From
what I have heard so far, these people like to have more control, not less.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-04  0:06             ` Andi Kleen
@ 2008-06-04  1:04               ` Nick Piggin
  2008-06-04 16:01                 ` Dave Hansen
  2008-06-05 23:15               ` Nishanth Aravamudan
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2008-06-04  1:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Wed, Jun 04, 2008 at 02:06:10AM +0200, Andi Kleen wrote:
> 
> > Also, as I said, users doesn't really know what the OS or hardware will
> > support 
> 
> The normal Linux expectation is that these kinds of users will not
> use huge pages at all.  Or rather if everybody was supposed to use
> them then all the interfaces would need to be greatly improved and any
> kinds of boot parameters would be out and they would need to be
> 100% integrated with the standard VM.
> 
> Hugepages are strictly an harder-to-use optimization for specific people
> who love to tweak (e.g. database administrators or benchmarkers). From
> what I heard so far these people like to have more control, not less.

Hi,

Well, I think you both raise some valid points. Especially with 1GB
hugepages, I agree with Andi that they are quite limiting and really
need to be specified at boot anyway.

However, I'm really not the right person to arbitrate or decide on the
exact final way the user will interact with these things, because I've
honestly never used hugepages for a serious app, nor am I involved with
libhugetlbfs etc.

Dave raises a good point about smaller hugepage sizes. They are
potentially much more usable than even 16M pages in a long-running
system, so it probably doesn't hurt to have them available.

So I won't oppose this being tinkered with once it is in -mm or upstream.
So long as we try to make changes carefully. For example, there should
be no reason why we can't subsequently have a patch to register all
huge page sizes on boot, or if it is really important somebody might
write a patch to return the 1GB pages to the buddy allocator etc.

I'm basically just trying to follow the path of least resistance ;) So
I'm hoping that nobody is too upset with the current set of patches,
and from there I am very happy for people to submit incremental patches
to the user apis..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-03 20:57         ` Andi Kleen
  2008-06-03 21:27           ` Dave Hansen
@ 2008-06-04  1:10           ` Nick Piggin
  2008-06-05 23:12             ` Nishanth Aravamudan
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2008-06-04  1:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Tue, Jun 03, 2008 at 10:57:52PM +0200, Andi Kleen wrote:
> > The downside of something like this is that you have yet another data
> > structure to manage.  Andi, do you think something like this would be
> > workable?
> 
> The reason I don't like your proposal is that it makes only sense
> with a lot of hugepage sizes being active at the same time. But the
> API (one mount per size) doesn't really scale to that anyways.
> It should support two (as on x86), three if you stretch it, but
> anything beyond would be difficult.
> If you really wanted to support a zillion sizes you would at least
> first need a different flexible interface that completely hides page
> sizes.
> Otherwise you would drive both sysadmins and programmers crazy and 
> overlong command lines would be the smallest of their problems
> With two or even three sizes only the whole thing is not needed and my original
> scheme works fine IMHO.
> 
> That is why I was also sceptical of the newly proposed sysfs interfaces. 
> For two or three numbers you don't need a sysfs interface.

I do think your proc enhancements are clever, and you're right that for
the current setup they are pretty workable. The reason I haven't submitted
them in this round is because they do cause libhugetlbfs failures...
maybe that's just because the regression suite does really dumb parsing,
and nothing important will break, but it is the only thing I have to go on
so I have to give it some credit ;)

With the sysfs API, we have a way to control the other hstates, so it
takes a little importance off the proc interface.

sysfs doesn't appear to give a huge improvement yet (although I still
think it is nicer), but I think the hugetlbfs guys want to have control
over which nodes things get allocated on, etc., so I think proc really
was going to run out of steam at some point.
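
For what it's worth, the per-hstate sysfs attributes come down to very
little code.  A minimal sketch (kobj_to_hstate() is a hypothetical helper
here; the real attribute layout is Nish's and is not reproduced from his
patch):

static ssize_t nr_hugepages_show(struct kobject *kobj,
				 struct kobj_attribute *attr, char *buf)
{
	/* map the per-size sysfs directory back to its hstate */
	struct hstate *h = kobj_to_hstate(kobj);

	return sprintf(buf, "%lu\n", h->nr_huge_pages);
}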

Anyway, my point is just that sysfs is the way forward, but I'm also not
against people tinkering with /proc if it isn't going to cause problems.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
                   ` (22 preceding siblings ...)
  2008-06-03 10:57 ` [patch 00/21] hugetlb multi size, giant hugetlb support, etc Nick Piggin
@ 2008-06-04  8:29 ` Andrew Morton
  2008-06-04  9:35   ` Nick Piggin
  2008-06-04 11:57   ` Andi Kleen
  23 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-04  8:29 UTC (permalink / raw)
  To: npiggin; +Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

On Tue, 03 Jun 2008 19:59:56 +1000 npiggin@suse.de wrote:

> Hi,
> 
> Here is my submission to be merged in -mm. Given the amount of hunks this
> patchset has, and the recent flurry of hugetlb development work, I'd hope to
> get this merged up provided there aren't major issues (I would prefer to fix
> minor ones with incremental patches). It's just a lot of error prone work to
> track -mm when multiple concurrent development is happening.
> 
> Patch against latest mmotm.
> 
> 
> What I have done for the user API issues with this release is to take the
> safe way out and just maintain the existing hugepages user interfaces
> unchanged. I've integrated Nish's sysfs API to control the other huge
> page sizes.
> 
> I had initially opted to drop the /proc/sys/vm/* parts of the changes, but
> I found that the libhugetlbfs suite continued to have failures, so I
> decided to revert the multi column /proc/meminfo changes too, for now.
> 
> I say for now, because it is very easy to subsequently agree on some
> extention to the API, but it is much harder to revert such an extention once
> it has been made. I also think the main thing at this point is to get the
> existing patchset merged. User API changes I really don't want to worry
> with at the moment... point is: the infrastructure changes are a lot of
> work to code, but not so hard to get right; the user API changes are
> easy to code but harder to get right.
> 
> New to this patchset: I have implemented a default_hugepagesz= boot option
> (defaulting to the arch's HPAGE_SIZE if unspecified), which can be used to
> specify the default hugepage size for all /proc/* files, SHM, and default
> hugetlbfs mount size. This is the best compromise I could find to keep back
> compatibility while allowing the possibility to try different sizes with
> legacy code.
> 
> One thing I worry about is whether the sysfs API is going to be foward
> compatible with NUMA allocation changes that might be in the pipe.
> This need not hold up a merge into -mm, but I'd like some reassurances
> that thought is put in before it goes upstream.
> 
> Lastly, embarassingly, I'm not the best source of information for the
> sysfs tunables, so incremental patches against Documentation/ABI would
> be welcome :P
> 

I think I'll duck this iteration.  Partly because I was unable to work
out how nacky the feedback for 14/21 was, but mainly because I don't
know what it all does, because none of the above explains this.

Can't review it if I don't know what it's all trying to do.

Things like this:

: Large, but rather mechanical patch that converts most of the hugetlb.c
: globals into structure members and passes them around.
: 
: Right now there is only a single global hstate structure, but 
: most of the infrastructure to extend it is there.

OK, but it didn't tell us why we want multiple hstate structures.

: Add basic support for more than one hstate in hugetlbfs
: 
: - Convert hstates to an array
: - Add a first default entry covering the standard huge page size
: - Add functions for architectures to register new hstates
: - Add basic iterators over hstates

And neither did that.

One for each hugepage size, I'd guess.

: Add support to have individual hstates for each hugetlbfs mount
: 
: - Add a new pagesize= option to the hugetlbfs mount that allows setting
: the page size
: - Set up pointers to a suitable hstate for the set page size option
: to the super block and the inode and the vma.
: - Change the hstate accessors to use this information
: - Add code to the hstate init function to set parsed_hstate for command
: line processing
: - Handle duplicated hstate registrations to the make command line user proof

Nope, wrong guess.  It's one per mountpoint.

So now I'm seeing (I think) that the patchset does indeed implement
multiple page-size hugepages, and that it does this by making an
entire hugetlb mountpoint have a single-but-settable pagesize.

All pretty straightforward stuff, unless I'm missing something.  But
please do spell it out because surely there's stuff in here which I
will miss from the implementation and the skimpy changelog.

Please don't think I'm being anal here - changelogging matters.  It
makes review more effective and it allows reviewers to find problems
which they would otherwise have overlooked.  btdt, lots of times.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-04  8:29 ` Andrew Morton
@ 2008-06-04  9:35   ` Nick Piggin
  2008-06-04  9:46     ` Andrew Morton
  2008-06-04 11:57   ` Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2008-06-04  9:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

On Wed, Jun 04, 2008 at 01:29:38AM -0700, Andrew Morton wrote:
> On Tue, 03 Jun 2008 19:59:56 +1000 npiggin@suse.de wrote:
> > 
> > Lastly, embarassingly, I'm not the best source of information for the
> > sysfs tunables, so incremental patches against Documentation/ABI would
> > be welcome :P
> > 
> 
> I think I'll duck this iteration.  Partly because I was unable to work
> out how nacky the feedback for 14/21 was,

I don't think it was at all. At least, any of the suggestions Dave was
making could be implemented in a backwards-compatible way even
after it goes upstream. Basically I'm taking the easy way out when it
comes to user API changes, changing none of the existing interfaces
and just providing very basic boot options.


> but mainly because I don't
> know what it all does, because none of the above explains this.
> 
> Can't review it if I don't know what it's all trying to do.

Fair enough.

 
> Things like this:
> 
> : Large, but rather mechanical patch that converts most of the hugetlb.c
> : globals into structure members and passes them around.
> : 
> : Right now there is only a single global hstate structure, but 
> : most of the infrastructure to extend it is there.
> 
> OK, but it didn't tell us why we want multiple hstate structures.

OK.

 
> : Add basic support for more than one hstate in hugetlbfs
> : 
> : - Convert hstates to an array
> : - Add a first default entry covering the standard huge page size
> : - Add functions for architectures to register new hstates
> : - Add basic iterators over hstates
> 
> And neither did that.
> 
> One for each hugepage size, I'd guess.

Yes.

 
> : Add support to have individual hstates for each hugetlbfs mount
> : 
> : - Add a new pagesize= option to the hugetlbfs mount that allows setting
> : the page size
> : - Set up pointers to a suitable hstate for the set page size option
> : to the super block and the inode and the vma.
> : - Change the hstate accessors to use this information
> : - Add code to the hstate init function to set parsed_hstate for command
> : line processing
> : - Handle duplicated hstate registrations to the make command line user proof
> 
> Nope, wrong guess.  It's one per mountpoint.
 
It is per hugepage size, but each mount point has a pointer to one
of the hstates. An hstate is basically a gathering of most of the
currently global hugetlb attributes and state into one structure.
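
As a rough illustration (simplified; for_each_hstate() stands in for
whichever iterator the series adds, huge_page_size() appears earlier in
this series, and the function name here is purely illustrative), the
pagesize= mount option essentially just selects one of those hstates:

static struct hstate *hstate_for_size(unsigned long size)
{
	struct hstate *h;

	for_each_hstate(h) {
		if (huge_page_size(h) == size)
			return h;
	}
	return NULL;	/* no such huge page size registered */
}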


> So now I'm seeing (I think) that the patchset does indeed implement
> multiple page-size hugepages, and that it it does this by making an
> entire hugetlb mountpoint have a single-but-settable pagesize.

Yes.
 

> All pretty straightforward stuff, unless I'm missing something.  But
> please do spell it out because surely there's stuff in here which I
> will miss from the implementation and the skimpy changelog.
> 
> Please don't think I'm being anal here - changelogging matters.  It
> makes review more effective and it allows reviewers to find problems
> which they would otherwise have overlooked.  btdt, lots of times.

OK, well I'm keen to get it into mm so it's not holding up (or being
busted by) other work... can I just try replying with improved
changelogs? Or do you want me to resend the full series?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-04  9:35   ` Nick Piggin
@ 2008-06-04  9:46     ` Andrew Morton
  2008-06-04 11:04       ` Nick Piggin
  2008-06-04 11:33       ` Nick Piggin
  0 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-04  9:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

On Wed, 4 Jun 2008 11:35:17 +0200 Nick Piggin <npiggin@suse.de> wrote:

> OK, well I'm keen to get it into mm so it's not holding up (or being
> busted by) other work... can I just try replying with improved
> changelogs? Or do you want me to resend the full series?

Full resend please - if I'm going to spend an hour with my nose stuck
in this, there's not much point in getting all confused about what goes
with what.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-04  9:46     ` Andrew Morton
@ 2008-06-04 11:04       ` Nick Piggin
  2008-06-04 11:33       ` Nick Piggin
  1 sibling, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2008-06-04 11:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

On Wed, Jun 04, 2008 at 02:46:48AM -0700, Andrew Morton wrote:
> On Wed, 4 Jun 2008 11:35:17 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > OK, well I'm keen to get it into mm so it's not holding up (or being
> > busted by) other work... can I just try replying with improved
> > changelogs? Or do you want me to resend the full series?
> 
> Full resend please - if I'm going to spend an hour with my nose stuck
> in this, there's not much point in getting all confused about what goes
> with what.

OK, coming right up...

Firstly, you need to fix hugetlb building with CONFIG_HUGETLB_PAGE off...

---
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 20:44:40.000000000 +1000
+++ linux-2.6/include/linux/hugetlb.h	2008-06-03 20:45:07.000000000 +1000
@@ -76,7 +76,7 @@ static inline unsigned long hugetlb_tota
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
 #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
-#define unmap_hugepage_range(vma, start, end)	BUG()
+#define unmap_hugepage_range(vma, start, end, page)	BUG()
 #define hugetlb_report_meminfo(buf)		0
 #define hugetlb_report_node_meminfo(n, buf)	0
 #define follow_huge_pmd(mm, addr, pmd, write)	NULL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [patch 18/21] powerpc: scan device tree for gigantic pages
  2008-06-04 11:29 [patch 00/21] hugetlb patches resend npiggin
@ 2008-06-04 11:29 ` npiggin
  0 siblings, 0 replies; 49+ messages in thread
From: npiggin @ 2008-06-04 11:29 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm

[-- Attachment #1: powerpc-scan-device-tree-and-save-gigantic-page-locations.patch --]
[-- Type: text/plain, Size: 4735 bytes --]

The 16G huge pages have to be reserved in the HMC prior to boot. The
locations of the pages are placed in the device tree. This patch adds
code to scan the device tree during very early boot and save these page
locations until hugetlbfs is ready for them.

Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---

 arch/powerpc/mm/hash_utils_64.c  |   44 ++++++++++++++++++++++++++++++++++++++-
 arch/powerpc/mm/hugetlbpage.c    |   16 ++++++++++++++
 include/asm-powerpc/mmu-hash64.h |    2 +
 3 files changed, 61 insertions(+), 1 deletion(-)



Index: linux-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2008-06-04 20:47:32.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hash_utils_64.c	2008-06-04 20:51:26.000000000 +1000
@@ -68,6 +68,7 @@
 
 #define KB (1024)
 #define MB (1024*KB)
+#define GB (1024L*MB)
 
 /*
  * Note:  pte   --> Linux PTE
@@ -329,6 +330,44 @@ static int __init htab_dt_scan_page_size
 	return 0;
 }
 
+/* Scan for 16G memory blocks that have been set aside for huge pages
+ * and reserve those blocks for 16G huge pages.
+ */
+static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
+					const char *uname, int depth,
+					void *data) {
+	char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+	unsigned long *addr_prop;
+	u32 *page_count_prop;
+	unsigned int expected_pages;
+	long unsigned int phys_addr;
+	long unsigned int block_size;
+
+	/* We are scanning "memory" nodes only */
+	if (type == NULL || strcmp(type, "memory") != 0)
+		return 0;
+
+	/* This property is the log base 2 of the number of virtual pages that
+	 * will represent this memory block. */
+	page_count_prop = of_get_flat_dt_prop(node, "ibm,expected#pages", NULL);
+	if (page_count_prop == NULL)
+		return 0;
+	expected_pages = (1 << page_count_prop[0]);
+	addr_prop = of_get_flat_dt_prop(node, "reg", NULL);
+	if (addr_prop == NULL)
+		return 0;
+	phys_addr = addr_prop[0];
+	block_size = addr_prop[1];
+	if (block_size != (16 * GB))
+		return 0;
+	printk(KERN_INFO "Huge page(16GB) memory: "
+			"addr = 0x%lX size = 0x%lX pages = %d\n",
+			phys_addr, block_size, expected_pages);
+	lmb_reserve(phys_addr, block_size * expected_pages);
+	add_gpage(phys_addr, block_size, expected_pages);
+	return 0;
+}
+
 static void __init htab_init_page_sizes(void)
 {
 	int rc;
@@ -418,7 +457,10 @@ static void __init htab_init_page_sizes(
 	       );
 
 #ifdef CONFIG_HUGETLB_PAGE
-	/* Init large page size. Currently, we pick 16M or 1M depending
+	/* Reserve 16G huge page memory sections for huge pages */
+	of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+
+/* Init large page size. Currently, we pick 16M or 1M depending
 	 * on what is available
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-06-04 20:51:26.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c	2008-06-04 20:51:26.000000000 +1000
@@ -110,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, 
 }
 #endif
 
+/* Build list of addresses of gigantic pages.  This function is used in early
+ * boot before the buddy or bootmem allocator is setup.
+ */
+void add_gpage(unsigned long addr, unsigned long page_size,
+	unsigned long number_of_pages)
+{
+	if (!addr)
+		return;
+	while (number_of_pages > 0) {
+		gpage_freearray[nr_gpages] = addr;
+		nr_gpages++;
+		number_of_pages--;
+		addr += page_size;
+	}
+}
+
 /* Moves the gigantic page addresses from the temporary list to the
  * huge_boot_pages list.  */
 int alloc_bootmem_huge_page(struct hstate *h)
Index: linux-2.6/include/asm-powerpc/mmu-hash64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/mmu-hash64.h	2008-06-04 20:47:32.000000000 +1000
+++ linux-2.6/include/asm-powerpc/mmu-hash64.h	2008-06-04 20:51:26.000000000 +1000
@@ -281,6 +281,8 @@ extern int htab_bolt_mapping(unsigned lo
 			     unsigned long pstart, unsigned long mode,
 			     int psize, int ssize);
 extern void set_huge_psize(int psize);
+extern void add_gpage(unsigned long addr, unsigned long page_size,
+			  unsigned long number_of_pages);
 extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);
 
 extern void htab_initialize(void);

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-04  9:46     ` Andrew Morton
  2008-06-04 11:04       ` Nick Piggin
@ 2008-06-04 11:33       ` Nick Piggin
  1 sibling, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2008-06-04 11:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nishanth Aravamudan, linux-mm, kniht, andi, abh, joachim.deguara

On Wed, Jun 04, 2008 at 02:46:48AM -0700, Andrew Morton wrote:
> On Wed, 4 Jun 2008 11:35:17 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > OK, well I'm keen to get it into mm so it's not holding up (or being
> > busted by) other work... can I just try replying with improved
> > changelogs? Or do you want me to resend the full series?
> 
> Full resend please - if I'm going to spend an hour with my nose stuck
> in this, there's not much point in getting all confused about what goes
> with what.

Am resending, only ccing linux-mm this time, btw.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-04  8:29 ` Andrew Morton
  2008-06-04  9:35   ` Nick Piggin
@ 2008-06-04 11:57   ` Andi Kleen
  2008-06-04 18:39     ` Andrew Morton
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2008-06-04 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: npiggin, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

> : Large, but rather mechanical patch that converts most of the hugetlb.c
> : globals into structure members and passes them around.
> : 
> : Right now there is only a single global hstate structure, but 
> : most of the infrastructure to extend it is there.
> 
> OK, but it didn't tell us why we want multiple hstate structures.
> 
> : Add basic support for more than one hstate in hugetlbfs
> : 
> : - Convert hstates to an array
> : - Add a first default entry covering the standard huge page size
> : - Add functions for architectures to register new hstates
> : - Add basic iterators over hstates
> 
> And neither did that.
> 
> One for each hugepage size, I'd guess.
> 
> : Add support to have individual hstates for each hugetlbfs mount
> : 
> : - Add a new pagesize= option to the hugetlbfs mount that allows setting
> : the page size
> : - Set up pointers to a suitable hstate for the set page size option
> : to the super block and the inode and the vma.
> : - Change the hstate accessors to use this information
> : - Add code to the hstate init function to set parsed_hstate for command
> : line processing
> : - Handle duplicated hstate registrations to the make command line user proof
> 
> Nope, wrong guess.  It's one per mountpoint.

No, the initial guess was correct. It's one per size.

> 
> So now I'm seeing (I think) that the patchset does indeed implement
> multiple page-size hugepages, and that it it does this by making an
> entire hugetlb mountpoint have a single-but-settable pagesize.

Correct.

> 
> All pretty straightforward stuff, unless I'm missing something.  But
> please do spell it out because surely there's stuff in here which I
> will miss from the implementation and the skimpy changelog.

It was spelled out in the original 0/0

Here's a copy
ftp://ftp.firstfloor.org/pub/ak/gbpages/patches/intro

> Please don't think I'm being anal here - changelogging matters.  It
> makes review more effective and it allows reviewers to find problems
> which they would otherwise have overlooked.  btdt, lots of times.

Hmm, perhaps we need dummy commits for 0/0s. I guess the intro could
be added to the changelog of the first patch.

-Andi



* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-04  1:04               ` Nick Piggin
@ 2008-06-04 16:01                 ` Dave Hansen
  2008-06-06 16:09                   ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2008-06-04 16:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Wed, 2008-06-04 at 03:04 +0200, Nick Piggin wrote:
> So I won't oppose this being tinkered with once it is in -mm or upstream.
> So long as we try to make changes carefully. For example, there should
> be no reason why we can't subsequently have a patch to register all
> huge page sizes on boot, or if it is really important somebody might
> write a patch to return the 1GB pages to the buddy allocator etc.
> 
> I'm basically just trying to follow the path of least resistance ;) So
> I'm hoping that nobody is too upset with the current set of patches,
> and from there I am very happy for people to submit incremental patches
> to the user apis..

That sounds like a good plan to me.  Let's see how the patches look on
top of what you have here.

-- Dave


* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-04 11:57   ` Andi Kleen
@ 2008-06-04 18:39     ` Andrew Morton
  0 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2008-06-04 18:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: npiggin, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Wed, 04 Jun 2008 13:57:55 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > 
> > All pretty straightforward stuff, unless I'm missing something.  But
> > please do spell it out because surely there's stuff in here which I
> > will miss from the implementation and the skimpy changelog.
> 
> It was spelled out in the original 0/0
> 
> Here's a copy
> ftp://ftp.firstfloor.org/pub/ak/gbpages/patches/intro

yeah, like that.

> > Please don't think I'm being anal here - changelogging matters.  It
> > makes review more effective and it allows reviewers to find problems
> > which they would otherwise have overlooked.  btdt, lots of times.
> 
> Hmm, perhaps we need dummy commits for 0/0s. I guess the intro could
> be added to the changelog of the first patch.

Yup.  I always copy the 0/n text into 1/n.  For some reason I often forget
to do it until after I've committed the 1/n.


* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-04  1:10           ` Nick Piggin
@ 2008-06-05 23:12             ` Nishanth Aravamudan
  2008-06-05 23:23               ` Nishanth Aravamudan
  0 siblings, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2008-06-05 23:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Dave Hansen, akpm, linux-mm, kniht, abh,
	joachim.deguara

On 04.06.2008 [03:10:16 +0200], Nick Piggin wrote:
> On Tue, Jun 03, 2008 at 10:57:52PM +0200, Andi Kleen wrote:
> > > The downside of something like this is that you have yet another data
> > > structure to manage.  Andi, do you think something like this would be
> > > workable?
> > 
> > The reason I don't like your proposal is that it only makes sense
> > with a lot of hugepage sizes being active at the same time. But the
> > API (one mount per size) doesn't really scale to that anyways.
> > It should support two (as on x86), three if you stretch it, but
> > anything beyond would be difficult.
> > If you really wanted to support a zillion sizes you would at least
> > first need a different flexible interface that completely hides page
> > sizes.
> > Otherwise you would drive both sysadmins and programmers crazy, and
> > overlong command lines would be the smallest of their problems.
> > With only two or even three sizes the whole thing is not needed, and my
> > original scheme works fine IMHO.
> > 
> > That is why I was also sceptical of the newly proposed sysfs interfaces. 
> > For two or three numbers you don't need a sysfs interface.
> 
> I do think your proc enhancements are clever, and you're right that
> for the current setup they are pretty workable. The reason I haven't
> submitted them in this round is because they do cause libhugetlbfs
> failures...  maybe that's just because the regression suite does
> really dumb parsing, and nothing important will break, but it is the
> only thing I have to go on so I have to give it some credit ;)

Will chime in here that yes, regardless of anything we do here,
libhugetlbfs will need to be updated to leverage multiple hugepage sizes
available at run-time. And I think it is sane to make sure that the
parser we have is either fixed, if it has a bug that is causing the
failures, or updated, if the failures indicate a userspace interface
change :)

> With the sysfs API, we have a way to control the other hstates, so it
> takes a little importance off the proc interface.
> 
> sysfs doesn't appear to give a huge improvement yet (although I still
> think it is nicer), but I think the hugetlbfs guys want to have control
> over which nodes things get allocated on etc. so I think proc really
> was going to run out of steam at some point.

Well, I know Lee S. really wants it and it could help on large NUMA
systems using cpusets or other process restriction methods to be able to
specify which nodes the hugepages get allocated on.

Thanks,
Nish
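
As an illustration of why a format change to /proc/meminfo trips up simple
tooling, here is a deliberately naive reader in the style such a test suite
might use. It assumes exactly one value per line, so any added column would
be misread or ignored; this is only a sketch, not libhugetlbfs code:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        unsigned long total;

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* Naive parsing: one number per "HugePages_Total:" line. */
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "HugePages_Total: %lu", &total) == 1) {
                        printf("HugePages_Total = %lu\n", total);
                        break;
                }
        }
        fclose(f);
        return 0;
}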


* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-04  0:06             ` Andi Kleen
  2008-06-04  1:04               ` Nick Piggin
@ 2008-06-05 23:15               ` Nishanth Aravamudan
  2008-06-06  0:29                 ` Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Nishanth Aravamudan @ 2008-06-05 23:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, npiggin, akpm, linux-mm, kniht, abh, joachim.deguara

On 04.06.2008 [02:06:10 +0200], Andi Kleen wrote:
> 
> > Also, as I said, users don't really know what the OS or hardware will
> > support
> 
> The normal Linux expectation is that these kinds of users will not
> use huge pages at all.  Or rather if everybody was supposed to use
> them then all the interfaces would need to be greatly improved and any
> kinds of boot parameters would be out and they would need to be
> 100% integrated with the standard VM.
> 
> Hugepages are strictly a harder-to-use optimization for specific people
> who love to tweak (e.g. database administrators or benchmarkers). From
> what I heard so far these people like to have more control, not less.

I really don't want to get involved in this discussion, but let me just
say: "Hugepages are *right now* strictly a harder-to-use optimization".
libhugetlbfs helps quite a bit (in my opinion) as far as making
hugepages easier to use. And Adam's dynamic pool work does as well. It's
progress, slow and steady as it may be. I don't really appreciate the
entire hugepage area being pigeon-holed into what you've experienced.

Thanks,
Nish


* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-05 23:12             ` Nishanth Aravamudan
@ 2008-06-05 23:23               ` Nishanth Aravamudan
  0 siblings, 0 replies; 49+ messages in thread
From: Nishanth Aravamudan @ 2008-06-05 23:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Dave Hansen, akpm, linux-mm, kniht, abh,
	joachim.deguara, lee.schermerhorn

On 05.06.2008 [17:12:47 -0600], Nishanth Aravamudan wrote:
> On 04.06.2008 [03:10:16 +0200], Nick Piggin wrote:
> > On Tue, Jun 03, 2008 at 10:57:52PM +0200, Andi Kleen wrote:
> > > > The downside of something like this is that you have yet another data
> > > > structure to manage.  Andi, do you think something like this would be
> > > > workable?
> > > 
> > > The reason I don't like your proposal is that it only makes sense
> > > with a lot of hugepage sizes being active at the same time. But the
> > > API (one mount per size) doesn't really scale to that anyways.
> > > It should support two (as on x86), three if you stretch it, but
> > > anything beyond would be difficult.
> > > If you really wanted to support a zillion sizes you would at least
> > > first need a different flexible interface that completely hides page
> > > sizes.
> > > Otherwise you would drive both sysadmins and programmers crazy, and
> > > overlong command lines would be the smallest of their problems.
> > > With only two or even three sizes the whole thing is not needed, and my
> > > original scheme works fine IMHO.
> > > 
> > > That is why I was also sceptical of the newly proposed sysfs interfaces. 
> > > For two or three numbers you don't need a sysfs interface.
> > 
> > I do think your proc enhancements are clever, and you're right that
> > for the current setup they are pretty workable. The reason I haven't
> > submitted them in this round is because they do cause libhugetlbfs
> > failures...  maybe that's just because the regression suite does
> > really dumb parsing, and nothing important will break, but it is the
> > only thing I have to go on so I have to give it some credit ;)
> 
> Will chime in here that yes, regardless of anything we do here,
> libhugetlbfs will need to be updated to leverage multiple hugepage sizes
> available at run-time. And I think it is sane to make sure that the
> parser we have is either fixed, if it has a bug that is causing the
> failures, or updated, if the failures indicate a userspace interface
> change :)
> 
> > With the sysfs API, we have a way to control the other hstates, so it
> > takes a little importance off the proc interface.
> > 
> > sysfs doesn't appear to give a huge improvement yet (although I still
> > think it is nicer), but I think the hugetlbfs guys want to have control
> > over which nodes things get allocated on etc. so I think proc really
> > was going to run out of steam at some point.
> 
> Well, I know Lee S. really wants it and it could help on large NUMA
> systems using cpusets or other process restriction methods to be able to
> specify which nodes the hugepages get allocated on.

Oh, and I imagine the layout will be something like (on power):

/sys/kernel/hugepages/hugepages-64kB/nodeX/nr_hugepages
/sys/kernel/hugepages/hugepages-64kB/nodeX/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-64kB/nodeX/free_hugepages
/sys/kernel/hugepages/hugepages-64kB/nodeX/resv_hugepages
/sys/kernel/hugepages/hugepages-64kB/nodeX/surplus_hugepages
/sys/kernel/hugepages/hugepages-64kB/nr_hugepages
/sys/kernel/hugepages/hugepages-64kB/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-64kB/free_hugepages
/sys/kernel/hugepages/hugepages-64kB/resv_hugepages
/sys/kernel/hugepages/hugepages-64kB/surplus_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nodeX/nr_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nodeX/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nodeX/free_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nodeX/resv_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nodeX/surplus_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nr_hugepages
/sys/kernel/hugepages/hugepages-16384kB/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-16384kB/free_hugepages
/sys/kernel/hugepages/hugepages-16384kB/resv_hugepages
/sys/kernel/hugepages/hugepages-16384kB/surplus_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nodeX/nr_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nodeX/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nodeX/free_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nodeX/resv_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nodeX/surplus_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nr_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/free_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/resv_hugepages
/sys/kernel/hugepages/hugepages-16777216kB/surplus_hugepages

Where X varies over all possible nids. Does that seem reasonable? I'm
not entirely sure I like the amount of repetition of the topology, e.g.,
does it make more sense to have:

/sys/kernel/hugepages/hugepages-64kB
/sys/kernel/hugepages/hugepages-16384kB
/sys/kernel/hugepages/hugepages-16777216kB
/sys/kernel/hugepages/nodeX/hugepages-64kB
/sys/kernel/hugepages/nodeX/hugepages-16384kB
/sys/kernel/hugepages/nodeX/hugepages-16777216kB

?

Would definitely be a more compact representation...

Thanks,
Nish
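
Assuming the first layout proposed above (which is only a proposal and may
well change before anything is merged), a userspace consumer could enumerate
the per-size directories and read each counter roughly like this; the paths
below are taken from the proposal, not from an existing interface:

#include <stdio.h>
#include <glob.h>

int main(void)
{
        glob_t g;
        size_t i;

        /* Walk the proposed /sys/kernel/hugepages/hugepages-<size>kB/
         * directories and print nr_hugepages for each size. */
        if (glob("/sys/kernel/hugepages/hugepages-*/nr_hugepages",
                 0, NULL, &g) != 0) {
                fprintf(stderr, "no hugepage sysfs entries found\n");
                return 1;
        }
        for (i = 0; i < g.gl_pathc; i++) {
                FILE *f = fopen(g.gl_pathv[i], "r");
                unsigned long nr = 0;

                if (!f)
                        continue;
                if (fscanf(f, "%lu", &nr) == 1)
                        printf("%s: %lu\n", g.gl_pathv[i], nr);
                fclose(f);
        }
        globfree(&g);
        return 0;
}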


* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-05 23:15               ` Nishanth Aravamudan
@ 2008-06-06  0:29                 ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-06-06  0:29 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Dave Hansen, npiggin, akpm, linux-mm, kniht, abh, joachim.deguara

> I really don't want to get involved in this discussion, but let me just
> say: "Hugepages are *right now* strictly a harder-to-use optimization".
> libhugetlbfs helps quite a bit (in my opinion) as far as making
> hugepages easier to use. And Adam's dynamic pool work does as well. It's
> progress, slow and steady as it may be. I don't really appreciate the
> entire hugepage area being pigeon-holed into what you've experienced.

That is how Linus wanted hugepages to be done initially. Otherwise
we would have never gotten the separate hugetlbfs.

If you want a complete second VM that does everything in variable page
sizes then please redesign the original VM instead of recreating
all its code in mm/hugetlb.c. Thank you.

-Andi


* Re: [patch 14/21] x86: add hugepagesz option on 64-bit
  2008-06-04 16:01                 ` Dave Hansen
@ 2008-06-06 16:09                   ` Dave Hansen
  0 siblings, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2008-06-06 16:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, akpm, Nishanth Aravamudan, linux-mm, kniht, abh,
	joachim.deguara

On Wed, 2008-06-04 at 09:01 -0700, Dave Hansen wrote:
> On Wed, 2008-06-04 at 03:04 +0200, Nick Piggin wrote:
> > So I won't oppose this being tinkered with once it is in -mm or upstream.
> > So long as we try to make changes carefully. For example, there should
> > be no reason why we can't subsequently have a patch to register all
> > huge page sizes on boot, or if it is really important somebody might
> > write a patch to return the 1GB pages to the buddy allocator etc.
> > 
> > I'm basically just trying to follow the path of least resistance ;) So
> > I'm hoping that nobody is too upset with the current set of patches,
> > and from there I am very happy for people to submit incremental patches
> > to the user apis..
> 
> That sounds like a good plan to me.  Let's see how the patches look on
> top of what you have here.

I managed to implement most of this, but I'm not happy with how it came out.
It's not quite functional, yet.

I don't think it is horribly worth doing unless it can simplify some of
what is there already, which this can't.  I basically ended up having to
write another little allocator to break down the large pages into
smaller ones.  That's sure to have bugs.

http://userweb.kernel.org/~daveh/boot-time-hugetlb-reservations.patch

-- Dave
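
For what it's worth, the "little allocator" idea can be illustrated in
isolation: carve one large boot-time region into huge-page-sized pieces with
a trivial bump allocator. This is a standalone userspace sketch under those
assumptions, not the patch linked above; real code would also have to care
about alignment and NUMA placement:

#include <stdio.h>
#include <stdlib.h>

#define GIGANTIC_SIZE   (1UL << 30)     /* one 1GB region, for illustration */
#define HUGE_SIZE       (2UL << 20)     /* carve it into 2MB pieces */

static char *region;
static unsigned long offset;

/* Trivial bump allocator over one large region: hand out HUGE_SIZE
 * chunks, measured from the start of the region, until it is exhausted. */
static void *grab_huge_chunk(void)
{
        void *p;

        if (offset + HUGE_SIZE > GIGANTIC_SIZE)
                return NULL;
        p = region + offset;
        offset += HUGE_SIZE;
        return p;
}

int main(void)
{
        unsigned long n = 0;

        region = malloc(GIGANTIC_SIZE);
        if (!region) {
                perror("malloc");
                return 1;
        }
        while (grab_huge_chunk())
                n++;
        printf("carved %lu huge-page-sized chunks out of the region\n", n);
        free(region);
        return 0;
}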


* Re: [patch 00/21] hugetlb multi size, giant hugetlb support, etc
  2008-06-03 10:57 ` [patch 00/21] hugetlb multi size, giant hugetlb support, etc Nick Piggin
@ 2008-06-06 17:12   ` Andy Whitcroft
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Whitcroft @ 2008-06-06 17:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: akpm, Nishanth Aravamudan, linux-mm, kniht, andi, abh,
	joachim.deguara

On Tue, Jun 03, 2008 at 12:57:21PM +0200, Nick Piggin wrote:
> On Tue, Jun 03, 2008 at 07:59:56PM +1000, npiggin@suse.de wrote:
> > Hi,
> > 
> > Here is my submission to be merged in -mm. Given the amount of hunks this
> > patchset has, and the recent flurry of hugetlb development work, I'd hope to
> > get this merged up provided there aren't major issues (I would prefer to fix
> > minor ones with incremental patches). It's just a lot of error prone work to
> > track -mm when multiple concurrent development is happening.
> > 
> > Patch against latest mmotm.
> 
> Ah, missed a couple of things required to compile with hugetlbfs configured
> out. Here is a patch against mmotm, I will also send out a rediffed patch
> in the series.
> 
> ---
> Index: linux-2.6/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/linux/hugetlb.h	2008-06-03 20:44:40.000000000 +1000
> +++ linux-2.6/include/linux/hugetlb.h	2008-06-03 20:45:07.000000000 +1000
> @@ -76,7 +76,7 @@ static inline unsigned long hugetlb_tota
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
>  #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
> -#define unmap_hugepage_range(vma, start, end)	BUG()
> +#define unmap_hugepage_range(vma, start, end, page)	BUG()
>  #define hugetlb_report_meminfo(buf)		0
>  #define hugetlb_report_node_meminfo(n, buf)	0
>  #define follow_huge_pmd(mm, addr, pmd, write)	NULL

Acked-by: Andy Whitcroft <apw@shadowen.org>

-apw


Thread overview: 49+ messages
2008-06-03  9:59 [patch 00/21] hugetlb multi size, giant hugetlb support, etc npiggin
2008-06-03  9:59 ` [patch 01/21] hugetlb: factor out prep_new_huge_page npiggin
2008-06-03  9:59 ` [patch 02/21] hugetlb: modular state npiggin
2008-06-03 10:58   ` [patch 02/21] hugetlb: modular state (take 2) Nick Piggin
2008-06-03  9:59 ` [patch 03/21] hugetlb: multiple hstates npiggin
2008-06-03 10:00 ` [patch 04/21] hugetlbfs: per mount hstates npiggin
2008-06-03 10:00 ` [patch 05/21] hugetlb: new sysfs interface npiggin
2008-06-03 10:00 ` [patch 06/21] hugetlb: abstract numa round robin selection npiggin
2008-06-03 10:00 ` [patch 07/21] mm: introduce non panic alloc_bootmem npiggin
2008-06-03 10:00 ` [patch 08/21] mm: export prep_compound_page to mm npiggin
2008-06-03 10:00 ` [patch 09/21] hugetlb: support larger than MAX_ORDER npiggin
2008-06-03 10:00 ` [patch 10/21] hugetlb: support boot allocate different sizes npiggin
2008-06-03 10:00 ` [patch 11/21] hugetlb: printk cleanup npiggin
2008-06-03 10:00 ` [patch 12/21] hugetlb: introduce pud_huge npiggin
2008-06-03 10:00 ` [patch 13/21] x86: support GB hugepages on 64-bit npiggin
2008-06-03 10:00 ` [patch 14/21] x86: add hugepagesz option " npiggin
2008-06-03 17:48   ` Dave Hansen
2008-06-03 18:24     ` Andi Kleen
2008-06-03 18:59       ` Dave Hansen
2008-06-03 20:57         ` Andi Kleen
2008-06-03 21:27           ` Dave Hansen
2008-06-04  0:06             ` Andi Kleen
2008-06-04  1:04               ` Nick Piggin
2008-06-04 16:01                 ` Dave Hansen
2008-06-06 16:09                   ` Dave Hansen
2008-06-05 23:15               ` Nishanth Aravamudan
2008-06-06  0:29                 ` Andi Kleen
2008-06-04  1:10           ` Nick Piggin
2008-06-05 23:12             ` Nishanth Aravamudan
2008-06-05 23:23               ` Nishanth Aravamudan
2008-06-03 19:00       ` Dave Hansen
2008-06-03 10:00 ` [patch 15/21] hugetlb: override default huge page size npiggin
2008-06-03 10:00 ` [patch 16/21] hugetlb: allow arch overried hugepage allocation npiggin
2008-06-03 10:00 ` [patch 17/21] powerpc: function to allocate gigantic hugepages npiggin
2008-06-03 10:00 ` [patch 18/21] powerpc: scan device tree for gigantic pages npiggin
2008-06-03 10:00 ` [patch 19/21] powerpc: define support for 16G hugepages npiggin
2008-06-03 10:00 ` [patch 20/21] fs: check for statfs overflow npiggin
2008-06-03 10:00 ` [patch 21/21] powerpc: support multiple hugepage sizes npiggin
2008-06-03 10:29 ` [patch 1/1] x86: get_user_pages_lockless support 1GB hugepages Nick Piggin
2008-06-03 10:57 ` [patch 00/21] hugetlb multi size, giant hugetlb support, etc Nick Piggin
2008-06-06 17:12   ` Andy Whitcroft
2008-06-04  8:29 ` Andrew Morton
2008-06-04  9:35   ` Nick Piggin
2008-06-04  9:46     ` Andrew Morton
2008-06-04 11:04       ` Nick Piggin
2008-06-04 11:33       ` Nick Piggin
2008-06-04 11:57   ` Andi Kleen
2008-06-04 18:39     ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2008-06-04 11:29 [patch 00/21] hugetlb patches resend npiggin
2008-06-04 11:29 ` [patch 18/21] powerpc: scan device tree for gigantic pages npiggin
