* [patch 01/23] hugetlb: fix lockdep error
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 16:30 ` Nishanth Aravamudan
2008-05-27 19:55 ` Adam Litke
2008-05-25 14:23 ` [patch 02/23] hugetlb: factor out huge_new_page npiggin
` (22 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara
[-- Attachment #1: hugetlb-copy-lockdep.patch --]
[-- Type: text/plain, Size: 793 bytes --]
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -785,7 +785,7 @@ int copy_hugetlb_page_range(struct mm_st
continue;
spin_lock(&dst->page_table_lock);
- spin_lock(&src->page_table_lock);
+ spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
if (!huge_pte_none(huge_ptep_get(src_pte))) {
if (cow)
huge_ptep_set_wrprotect(src, addr, src_pte);
--
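For reference, the problem being fixed: copy_hugetlb_page_range() (run at fork time) takes the page_table_lock of two different mm_structs at once, and both locks belong to the same lock class, so a plain spin_lock() on the second one makes lockdep report a false recursive-locking warning. spin_lock_nested() with SINGLE_DEPTH_NESTING tells lockdep that the second acquisition is a deliberate, bounded nesting. A minimal sketch of the pattern (illustrative only, not part of the patch):

	/*
	 * dst and src are always distinct mm_structs here, so the nesting
	 * cannot deadlock; the _nested annotation merely tells lockdep
	 * that taking two locks of the same class is intentional.
	 */
	spin_lock(&dst->page_table_lock);
	spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
	/* ... copy huge ptes from src to dst ... */
	spin_unlock(&src->page_table_lock);
	spin_unlock(&dst->page_table_lock);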
* Re: [patch 01/23] hugetlb: fix lockdep error
2008-05-25 14:23 ` [patch 01/23] hugetlb: fix lockdep error npiggin
@ 2008-05-27 16:30 ` Nishanth Aravamudan
2008-05-27 19:55 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 16:30 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 26.05.2008 [00:23:18 +1000], npiggin@suse.de wrote:
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
And can probably go upstream independent of the rest?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 01/23] hugetlb: fix lockdep error
2008-05-25 14:23 ` [patch 01/23] hugetlb: fix lockdep error npiggin
2008-05-27 16:30 ` Nishanth Aravamudan
@ 2008-05-27 19:55 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 19:55 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-copy-lockdep.patch)
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
> ---
> mm/hugetlb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -785,7 +785,7 @@ int copy_hugetlb_page_range(struct mm_st
> continue;
>
> spin_lock(&dst->page_table_lock);
> - spin_lock(&src->page_table_lock);
> + spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
> if (!huge_pte_none(huge_ptep_get(src_pte))) {
> if (cow)
> huge_ptep_set_wrprotect(src, addr, src_pte);
>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 02/23] hugetlb: factor out huge_new_page
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
2008-05-25 14:23 ` [patch 01/23] hugetlb: fix lockdep error npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 16:31 ` Nishanth Aravamudan
2008-05-27 20:03 ` Adam Litke
2008-05-25 14:23 ` [patch 03/23] hugetlb: modular state npiggin
` (21 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-factor-page-prep.patch --]
[-- Type: text/plain, Size: 1469 bytes --]
Needed to avoid code duplication in follow-up patches.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -194,6 +194,16 @@ static int adjust_pool_surplus(int delta
return ret;
}
+static void prep_new_huge_page(struct page *page, int nid)
+{
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ nr_huge_pages++;
+ nr_huge_pages_node[nid]++;
+ spin_unlock(&hugetlb_lock);
+ put_page(page); /* free it into the hugepage allocator */
+}
+
static struct page *alloc_fresh_huge_page_node(int nid)
{
struct page *page;
@@ -207,12 +217,7 @@ static struct page *alloc_fresh_huge_pag
__free_pages(page, HUGETLB_PAGE_ORDER);
return NULL;
}
- set_compound_page_dtor(page, free_huge_page);
- spin_lock(&hugetlb_lock);
- nr_huge_pages++;
- nr_huge_pages_node[nid]++;
- spin_unlock(&hugetlb_lock);
- put_page(page); /* free it into the hugepage allocator */
+ prep_new_huge_page(page, nid);
}
return page;
--
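The final put_page() in prep_new_huge_page() relies on the page coming back from alloc_pages_node() with a reference count of one, and on the compound destructor having just been set; a condensed sketch of the flow (illustrative, following this code):

	set_compound_page_dtor(page, free_huge_page);	/* dtor runs when refcount hits 0 */
	...
	put_page(page);	/* drops the last reference (1 -> 0), so the compound
			 * destructor free_huge_page() runs, which places the
			 * page on hugepage_freelists[nid] via enqueue_huge_page()
			 * instead of returning it to the buddy allocator */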
* Re: [patch 02/23] hugetlb: factor out huge_new_page
2008-05-25 14:23 ` [patch 02/23] hugetlb: factor out huge_new_page npiggin
@ 2008-05-27 16:31 ` Nishanth Aravamudan
2008-05-27 20:03 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 16:31 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 26.05.2008 [00:23:19 +1000], npiggin@suse.de wrote:
> Needed to avoid code duplication in follow-up patches.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Although the e-mail subject does not match the name of the function :)
And can probably be sent upstream without the other patches as well.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/23] hugetlb: factor out huge_new_page
2008-05-25 14:23 ` [patch 02/23] hugetlb: factor out huge_new_page npiggin
2008-05-27 16:31 ` Nishanth Aravamudan
@ 2008-05-27 20:03 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 20:03 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-factor-page-prep.patch)
> Needed to avoid code duplication in follow-up patches.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 03/23] hugetlb: modular state
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
2008-05-25 14:23 ` [patch 01/23] hugetlb: fix lockdep error npiggin
2008-05-25 14:23 ` [patch 02/23] hugetlb: factor out huge_new_page npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 16:44 ` Nishanth Aravamudan
2008-05-27 20:38 ` Adam Litke
2008-05-25 14:23 ` [patch 04/23] hugetlb: multiple hstates npiggin
` (20 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-modular-state.patch --]
[-- Type: text/plain, Size: 47591 bytes --]
Large, but rather mechanical patch that converts most of the hugetlb.c
globals into structure members and passes them around.
Right now there is only a single global hstate structure, but
most of the infrastructure to extend it is there.
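The pattern, condensed from the diff below: each file-scope global (nr_huge_pages, free_huge_pages_node[], hugetlb_next_nid, ...) becomes a member of struct hstate, each helper gains an explicit struct hstate * argument, and call sites look the state up with hstate_vma()/hstate_file()/hstate_inode(). For example:

	/* before */
	if (free_huge_pages > resv_huge_pages)
		page = dequeue_huge_page_vma(vma, addr);

	/* after */
	struct hstate *h = hstate_vma(vma);
	if (h->free_huge_pages > h->resv_huge_pages)
		page = dequeue_huge_page_vma(h, vma, addr);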
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/ia64/mm/hugetlbpage.c | 6
arch/powerpc/mm/hugetlbpage.c | 2
arch/sh/mm/hugetlbpage.c | 2
arch/sparc64/mm/hugetlbpage.c | 4
arch/x86/mm/hugetlbpage.c | 4
fs/hugetlbfs/inode.c | 49 +++---
include/asm-ia64/hugetlb.h | 2
include/asm-powerpc/hugetlb.h | 2
include/asm-s390/hugetlb.h | 2
include/asm-sh/hugetlb.h | 2
include/asm-sparc64/hugetlb.h | 2
include/asm-x86/hugetlb.h | 7
include/linux/hugetlb.h | 81 +++++++++-
ipc/shm.c | 3
mm/hugetlb.c | 321 ++++++++++++++++++++++--------------------
mm/memory.c | 2
mm/mempolicy.c | 9 -
mm/mmap.c | 3
18 files changed, 308 insertions(+), 195 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -22,30 +22,24 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
unsigned long max_huge_pages;
unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate global_hstate;
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
static DEFINE_SPINLOCK(hugetlb_lock);
-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page, unsigned long addr, unsigned long sz)
{
int i;
might_sleep();
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+ for (i = 0; i < sz/PAGE_SIZE; i++) {
cond_resched();
clear_user_highpage(page + i, addr + i * PAGE_SIZE);
}
@@ -55,42 +49,43 @@ static void copy_huge_page(struct page *
unsigned long addr, struct vm_area_struct *vma)
{
int i;
+ struct hstate *h = hstate_vma(vma);
might_sleep();
- for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+ for (i = 0; i < 1 << huge_page_order(h); i++) {
cond_resched();
copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
}
}
-static void enqueue_huge_page(struct page *page)
+static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &hugepage_freelists[nid]);
- free_huge_pages++;
- free_huge_pages_node[nid]++;
+ list_add(&page->lru, &h->hugepage_freelists[nid]);
+ h->free_huge_pages++;
+ h->free_huge_pages_node[nid]++;
}
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct hstate *h)
{
int nid;
struct page *page = NULL;
for (nid = 0; nid < MAX_NUMNODES; ++nid) {
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
break;
}
}
return page;
}
-static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma,
- unsigned long address)
+static struct page *dequeue_huge_page_vma(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long address)
{
int nid;
struct page *page = NULL;
@@ -105,14 +100,14 @@ static struct page *dequeue_huge_page_vm
MAX_NR_ZONES - 1, nodemask) {
nid = zone_to_nid(zone);
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
- !list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ !list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
if (vma && vma->vm_flags & VM_MAYSHARE)
- resv_huge_pages--;
+ h->resv_huge_pages--;
break;
}
}
@@ -120,12 +115,13 @@ static struct page *dequeue_huge_page_vm
return page;
}
-static void update_and_free_page(struct page *page)
+static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;
- nr_huge_pages--;
- nr_huge_pages_node[page_to_nid(page)]--;
- for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+
+ h->nr_huge_pages--;
+ h->nr_huge_pages_node[page_to_nid(page)]--;
+ for (i = 0; i < (1 << huge_page_order(h)); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1<< PG_writeback);
@@ -133,11 +129,16 @@ static void update_and_free_page(struct
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
arch_release_hugepage(page);
- __free_pages(page, HUGETLB_PAGE_ORDER);
+ __free_pages(page, huge_page_order(h));
}
static void free_huge_page(struct page *page)
{
+ /*
+ * Can't pass hstate in here because it is called from the
+ * compound page destructor.
+ */
+ struct hstate *h = &global_hstate;
int nid = page_to_nid(page);
struct address_space *mapping;
@@ -147,12 +148,12 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages_node[nid]) {
- update_and_free_page(page);
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ if (h->surplus_huge_pages_node[nid]) {
+ update_and_free_page(h, page);
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
} else {
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
if (mapping)
@@ -164,7 +165,7 @@ static void free_huge_page(struct page *
* balanced by operating on them in a round-robin fashion.
* Returns 1 if an adjustment was made.
*/
-static int adjust_pool_surplus(int delta)
+static int adjust_pool_surplus(struct hstate *h, int delta)
{
static int prev_nid;
int nid = prev_nid;
@@ -177,15 +178,15 @@ static int adjust_pool_surplus(int delta
nid = first_node(node_online_map);
/* To shrink on this node, there must be a surplus page */
- if (delta < 0 && !surplus_huge_pages_node[nid])
+ if (delta < 0 && !h->surplus_huge_pages_node[nid])
continue;
/* Surplus cannot exceed the total number of pages */
- if (delta > 0 && surplus_huge_pages_node[nid] >=
- nr_huge_pages_node[nid])
+ if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+ h->nr_huge_pages_node[nid])
continue;
- surplus_huge_pages += delta;
- surplus_huge_pages_node[nid] += delta;
+ h->surplus_huge_pages += delta;
+ h->surplus_huge_pages_node[nid] += delta;
ret = 1;
break;
} while (nid != prev_nid);
@@ -194,46 +195,46 @@ static int adjust_pool_surplus(int delta
return ret;
}
-static void prep_new_huge_page(struct page *page, int nid)
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
- nr_huge_pages++;
- nr_huge_pages_node[nid]++;
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
}
-static struct page *alloc_fresh_huge_page_node(int nid)
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
if (page) {
if (arch_prepare_hugepage(page)) {
- __free_pages(page, HUGETLB_PAGE_ORDER);
+ __free_pages(page, huge_page_order(h));
return NULL;
}
- prep_new_huge_page(page, nid);
+ prep_new_huge_page(h, page, nid);
}
return page;
}
-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
int start_nid;
int next_nid;
int ret = 0;
- start_nid = hugetlb_next_nid;
+ start_nid = h->hugetlb_next_nid;
do {
- page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+ page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
/*
@@ -247,11 +248,11 @@ static int alloc_fresh_huge_page(void)
* if we just successfully allocated a hugepage so that
* the next caller gets hugepages on the next node.
*/
- next_nid = next_node(hugetlb_next_nid, node_online_map);
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
if (next_nid == MAX_NUMNODES)
next_nid = first_node(node_online_map);
- hugetlb_next_nid = next_nid;
- } while (!page && hugetlb_next_nid != start_nid);
+ h->hugetlb_next_nid = next_nid;
+ } while (!page && h->hugetlb_next_nid != start_nid);
if (ret)
count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -261,8 +262,8 @@ static int alloc_fresh_huge_page(void)
return ret;
}
-static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
- unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long address)
{
struct page *page;
unsigned int nid;
@@ -291,18 +292,18 @@ static struct page *alloc_buddy_huge_pag
* per-node value is checked there.
*/
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+ if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
spin_unlock(&hugetlb_lock);
return NULL;
} else {
- nr_huge_pages++;
- surplus_huge_pages++;
+ h->nr_huge_pages++;
+ h->surplus_huge_pages++;
}
spin_unlock(&hugetlb_lock);
page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
__GFP_REPEAT|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
spin_lock(&hugetlb_lock);
if (page) {
@@ -317,12 +318,12 @@ static struct page *alloc_buddy_huge_pag
/*
* We incremented the global counters already
*/
- nr_huge_pages_node[nid]++;
- surplus_huge_pages_node[nid]++;
+ h->nr_huge_pages_node[nid]++;
+ h->surplus_huge_pages_node[nid]++;
__count_vm_event(HTLB_BUDDY_PGALLOC);
} else {
- nr_huge_pages--;
- surplus_huge_pages--;
+ h->nr_huge_pages--;
+ h->surplus_huge_pages--;
__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
}
spin_unlock(&hugetlb_lock);
@@ -334,16 +335,16 @@ static struct page *alloc_buddy_huge_pag
* Increase the hugetlb pool such that it can accomodate a reservation
* of size 'delta'.
*/
-static int gather_surplus_pages(int delta)
+static int gather_surplus_pages(struct hstate *h, int delta)
{
struct list_head surplus_list;
struct page *page, *tmp;
int ret, i;
int needed, allocated;
- needed = (resv_huge_pages + delta) - free_huge_pages;
+ needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
return 0;
}
@@ -354,7 +355,7 @@ static int gather_surplus_pages(int delt
retry:
spin_unlock(&hugetlb_lock);
for (i = 0; i < needed; i++) {
- page = alloc_buddy_huge_page(NULL, 0);
+ page = alloc_buddy_huge_page(h, NULL, 0);
if (!page) {
/*
* We were not able to allocate enough pages to
@@ -375,7 +376,8 @@ retry:
* because either resv_huge_pages or free_huge_pages may have changed.
*/
spin_lock(&hugetlb_lock);
- needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+ needed = (h->resv_huge_pages + delta) -
+ (h->free_huge_pages + allocated);
if (needed > 0)
goto retry;
@@ -388,7 +390,7 @@ retry:
* before they are reserved.
*/
needed += allocated;
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
ret = 0;
free:
/* Free the needed pages to the hugetlb pool */
@@ -396,7 +398,7 @@ free:
if ((--needed) < 0)
break;
list_del(&page->lru);
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
}
/* Free unnecessary surplus pages to the buddy allocator */
@@ -424,7 +426,8 @@ free:
* allocated to satisfy the reservation must be explicitly freed if they were
* never used.
*/
-static void return_unused_surplus_pages(unsigned long unused_resv_pages)
+static void return_unused_surplus_pages(struct hstate *h,
+ unsigned long unused_resv_pages)
{
static int nid = -1;
struct page *page;
@@ -439,27 +442,27 @@ static void return_unused_surplus_pages(
unsigned long remaining_iterations = num_online_nodes();
/* Uncommit the reservation */
- resv_huge_pages -= unused_resv_pages;
+ h->resv_huge_pages -= unused_resv_pages;
- nr_pages = min(unused_resv_pages, surplus_huge_pages);
+ nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
while (remaining_iterations-- && nr_pages) {
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
nid = first_node(node_online_map);
- if (!surplus_huge_pages_node[nid])
+ if (!h->surplus_huge_pages_node[nid])
continue;
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ update_and_free_page(h, page);
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
nr_pages--;
remaining_iterations = num_online_nodes();
}
@@ -471,9 +474,10 @@ static struct page *alloc_huge_page_shar
unsigned long addr)
{
struct page *page;
+ struct hstate *h = hstate_vma(vma);
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(vma, addr);
+ page = dequeue_huge_page_vma(h, vma, addr);
spin_unlock(&hugetlb_lock);
return page ? page : ERR_PTR(-VM_FAULT_OOM);
}
@@ -482,16 +486,17 @@ static struct page *alloc_huge_page_priv
unsigned long addr)
{
struct page *page = NULL;
+ struct hstate *h = hstate_vma(vma);
if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
- if (free_huge_pages > resv_huge_pages)
- page = dequeue_huge_page_vma(vma, addr);
+ if (h->free_huge_pages > h->resv_huge_pages)
+ page = dequeue_huge_page_vma(h, vma, addr);
spin_unlock(&hugetlb_lock);
if (!page) {
- page = alloc_buddy_huge_page(vma, addr);
+ page = alloc_buddy_huge_page(h, vma, addr);
if (!page) {
hugetlb_put_quota(vma->vm_file->f_mapping, 1);
return ERR_PTR(-VM_FAULT_OOM);
@@ -521,21 +526,27 @@ static struct page *alloc_huge_page(stru
static int __init hugetlb_init(void)
{
unsigned long i;
+ struct hstate *h = &global_hstate;
if (HPAGE_SHIFT == 0)
return 0;
+ if (!h->order) {
+ h->order = HPAGE_SHIFT - PAGE_SHIFT;
+ h->mask = HPAGE_MASK;
+ }
+
for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
for (i = 0; i < max_huge_pages; ++i) {
- if (!alloc_fresh_huge_page())
+ if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = free_huge_pages = nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+ max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
return 0;
}
module_init(hugetlb_init);
@@ -561,34 +572,36 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_SYSCTL
#ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count)
{
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
- list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
- if (count >= nr_huge_pages)
+ struct list_head *freel = &h->hugepage_freelists[i];
+ list_for_each_entry_safe(page, next, freel, lru) {
+ if (count >= h->nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
- update_and_free_page(page);
+ update_and_free_page(h, page);
- free_huge_pages--;
- free_huge_pages_node[page_to_nid(page)]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
#else
-static inline void try_to_free_low(unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count)
{
}
#endif
-#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
+#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
static unsigned long set_max_huge_pages(unsigned long count)
{
unsigned long min_count, ret;
+ struct hstate *h = &global_hstate;
/*
* Increase the pool size
@@ -602,12 +615,12 @@ static unsigned long set_max_huge_pages(
* within all the constraints specified by the sysctls.
*/
spin_lock(&hugetlb_lock);
- while (surplus_huge_pages && count > persistent_huge_pages) {
- if (!adjust_pool_surplus(-1))
+ while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, -1))
break;
}
- while (count > persistent_huge_pages) {
+ while (count > persistent_huge_pages(h)) {
int ret;
/*
* If this allocation races such that we no longer need the
@@ -615,7 +628,7 @@ static unsigned long set_max_huge_pages(
* and reducing the surplus.
*/
spin_unlock(&hugetlb_lock);
- ret = alloc_fresh_huge_page();
+ ret = alloc_fresh_huge_page(h);
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -637,21 +650,21 @@ static unsigned long set_max_huge_pages(
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
*/
- min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
+ min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
- try_to_free_low(min_count);
- while (min_count < persistent_huge_pages) {
- struct page *page = dequeue_huge_page();
+ try_to_free_low(h, min_count);
+ while (min_count < persistent_huge_pages(h)) {
+ struct page *page = dequeue_huge_page(h);
if (!page)
break;
- update_and_free_page(page);
+ update_and_free_page(h, page);
}
- while (count < persistent_huge_pages) {
- if (!adjust_pool_surplus(1))
+ while (count < persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, 1))
break;
}
out:
- ret = persistent_huge_pages;
+ ret = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
return ret;
}
@@ -681,9 +694,10 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ struct hstate *h = &global_hstate;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
spin_unlock(&hugetlb_lock);
return 0;
}
@@ -692,34 +706,37 @@ int hugetlb_overcommit_handler(struct ct
int hugetlb_report_meminfo(char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
"HugePages_Rsvd: %5lu\n"
"HugePages_Surp: %5lu\n"
"Hugepagesize: %5lu kB\n",
- nr_huge_pages,
- free_huge_pages,
- resv_huge_pages,
- surplus_huge_pages,
- HPAGE_SIZE/1024);
+ h->nr_huge_pages,
+ h->free_huge_pages,
+ h->resv_huge_pages,
+ h->surplus_huge_pages,
+ 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"Node %d HugePages_Total: %5u\n"
"Node %d HugePages_Free: %5u\n"
"Node %d HugePages_Surp: %5u\n",
- nid, nr_huge_pages_node[nid],
- nid, free_huge_pages_node[nid],
- nid, surplus_huge_pages_node[nid]);
+ nid, h->nr_huge_pages_node[nid],
+ nid, h->free_huge_pages_node[nid],
+ nid, h->surplus_huge_pages_node[nid]);
}
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+ struct hstate *h = &global_hstate;
+ return h->nr_huge_pages * (1 << huge_page_order(h));
}
/*
@@ -774,14 +791,16 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
- dst_pte = huge_pte_alloc(dst, addr);
+ dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte)
goto nomem;
@@ -817,6 +836,9 @@ void __unmap_hugepage_range(struct vm_ar
pte_t pte;
struct page *page;
struct page *tmp;
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
+
/*
* A page gathering list, protected by per file i_mmap_lock. The
* lock is used to avoid list corruption from multiple unmapping
@@ -825,11 +847,11 @@ void __unmap_hugepage_range(struct vm_ar
LIST_HEAD(page_list);
WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON(start & ~HPAGE_MASK);
- BUG_ON(end & ~HPAGE_MASK);
+ BUG_ON(start & ~huge_page_mask(h));
+ BUG_ON(end & ~huge_page_mask(h));
spin_lock(&mm->page_table_lock);
- for (address = start; address < end; address += HPAGE_SIZE) {
+ for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -877,6 +899,7 @@ static int hugetlb_cow(struct mm_struct
{
struct page *old_page, *new_page;
int avoidcopy;
+ struct hstate *h = hstate_vma(vma);
old_page = pte_page(pte);
@@ -901,7 +924,7 @@ static int hugetlb_cow(struct mm_struct
__SetPageUptodate(new_page);
spin_lock(&mm->page_table_lock);
- ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+ ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW */
huge_ptep_clear_flush(vma, address, ptep);
@@ -924,10 +947,11 @@ static int hugetlb_no_page(struct mm_str
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ struct hstate *h = hstate_vma(vma);
mapping = vma->vm_file->f_mapping;
- idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+ idx = ((address - vma->vm_start) >> huge_page_shift(h))
+ + (vma->vm_pgoff >> huge_page_order(h));
/*
* Use page lock to guard against racing truncation
@@ -936,7 +960,7 @@ static int hugetlb_no_page(struct mm_str
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
page = alloc_huge_page(vma, address);
@@ -944,7 +968,7 @@ retry:
ret = -PTR_ERR(page);
goto out;
}
- clear_huge_page(page, address);
+ clear_huge_page(page, address, huge_page_size(h));
__SetPageUptodate(page);
if (vma->vm_flags & VM_SHARED) {
@@ -960,14 +984,14 @@ retry:
}
spin_lock(&inode->i_lock);
- inode->i_blocks += BLOCKS_PER_HUGEPAGE;
+ inode->i_blocks += blocks_per_hugepage(h);
spin_unlock(&inode->i_lock);
} else
lock_page(page);
}
spin_lock(&mm->page_table_lock);
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;
@@ -1003,8 +1027,9 @@ int hugetlb_fault(struct mm_struct *mm,
pte_t entry;
int ret;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+ struct hstate *h = hstate_vma(vma);
- ptep = huge_pte_alloc(mm, address);
+ ptep = huge_pte_alloc(mm, address, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
@@ -1042,6 +1067,7 @@ int follow_hugetlb_page(struct mm_struct
unsigned long pfn_offset;
unsigned long vaddr = *position;
int remainder = *length;
+ struct hstate *h = hstate_vma(vma);
spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
@@ -1053,7 +1079,7 @@ int follow_hugetlb_page(struct mm_struct
* each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
*/
- pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
if (!pte || huge_pte_none(huge_ptep_get(pte)) ||
(write && !pte_write(huge_ptep_get(pte)))) {
@@ -1071,7 +1097,7 @@ int follow_hugetlb_page(struct mm_struct
break;
}
- pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
+ pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(huge_ptep_get(pte));
same_page:
if (pages) {
@@ -1087,7 +1113,7 @@ same_page:
--remainder;
++i;
if (vaddr < vma->vm_end && remainder &&
- pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+ pfn_offset < (1 << huge_page_order(h))) {
/*
* We use pfn_offset to avoid touching the pageframes
* of this compound page.
@@ -1109,13 +1135,14 @@ void hugetlb_change_protection(struct vm
unsigned long start = address;
pte_t *ptep;
pte_t pte;
+ struct hstate *h = hstate_vma(vma);
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
spin_lock(&mm->page_table_lock);
- for (; address < end; address += HPAGE_SIZE) {
+ for (; address < end; address += huge_page_size(h)) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -1254,7 +1281,7 @@ static long region_truncate(struct list_
return chg;
}
-static int hugetlb_acct_memory(long delta)
+static int hugetlb_acct_memory(struct hstate *h, long delta)
{
int ret = -ENOMEM;
@@ -1277,18 +1304,18 @@ static int hugetlb_acct_memory(long delt
* semantics that cpuset has.
*/
if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
+ if (gather_surplus_pages(h, delta) < 0)
goto out;
- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
+ if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+ return_unused_surplus_pages(h, delta);
goto out;
}
}
ret = 0;
if (delta < 0)
- return_unused_surplus_pages((unsigned long) -delta);
+ return_unused_surplus_pages(h, (unsigned long) -delta);
out:
spin_unlock(&hugetlb_lock);
@@ -1298,6 +1325,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
+ struct hstate *h = hstate_inode(inode);
chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1305,7 +1333,7 @@ int hugetlb_reserve_pages(struct inode *
if (hugetlb_get_quota(inode->i_mapping, chg))
return -ENOSPC;
- ret = hugetlb_acct_memory(chg);
+ ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
hugetlb_put_quota(inode->i_mapping, chg);
return ret;
@@ -1316,12 +1344,13 @@ int hugetlb_reserve_pages(struct inode *
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
+ struct hstate *h = hstate_inode(inode);
long chg = region_truncate(&inode->i_mapping->private_list, offset);
spin_lock(&inode->i_lock);
- inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
+ inode->i_blocks -= blocks_per_hugepage(h) * freed;
spin_unlock(&inode->i_lock);
hugetlb_put_quota(inode->i_mapping, (chg - freed));
- hugetlb_acct_memory(-(chg - freed));
+ hugetlb_acct_memory(h, -(chg - freed));
}
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
return NULL;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
{
pgd_t *pg;
pud_t *pu;
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -175,7 +175,7 @@ hugetlb_get_unmapped_area(struct file *f
return -ENOMEM;
if (flags & MAP_FIXED) {
- if (prepare_hugepage_range(addr, len))
+ if (prepare_hugepage_range(file, addr, len))
return -EINVAL;
return addr;
}
@@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
pgoff, flags);
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
+++ linux-2.6/arch/sh/mm/hugetlbpage.c
@@ -22,7 +22,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -24,7 +24,7 @@
unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
pte_t *
-huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
+huge_pte_alloc (struct mm_struct *mm, unsigned long addr, unsigned long sz)
{
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
@@ -75,7 +75,7 @@ int huge_pmd_unshare(struct mm_struct *m
* Don't actually need to do any preparation, but need to make sure
* the address is in the right region.
*/
-int prepare_hugepage_range(unsigned long addr, unsigned long len)
+int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
if (len & ~HPAGE_MASK)
return -EINVAL;
@@ -149,7 +149,7 @@ unsigned long hugetlb_get_unmapped_area(
/* Handle MAP_FIXED */
if (flags & MAP_FIXED) {
- if (prepare_hugepage_range(addr, len))
+ if (prepare_hugepage_range(file, addr, len))
return -EINVAL;
return addr;
}
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
return 1;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
pud_t *pud;
@@ -368,7 +368,7 @@ hugetlb_get_unmapped_area(struct file *f
return -ENOMEM;
if (flags & MAP_FIXED) {
- if (prepare_hugepage_range(addr, len))
+ if (prepare_hugepage_range(file, addr, len))
return -EINVAL;
return addr;
}
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -8,7 +8,6 @@
#include <linux/mempolicy.h>
#include <linux/shm.h>
#include <asm/tlbflush.h>
-#include <asm/hugetlb.h>
struct ctl_table;
@@ -41,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
/* arch callbacks */
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz);
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -71,7 +70,7 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_meminfo(buf) 0
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
-#define prepare_hugepage_range(addr,len) (-EINVAL)
+#define prepare_hugepage_range(file, addr, len) (-EINVAL)
#define pmd_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
@@ -125,8 +124,6 @@ struct file *hugetlb_file_setup(const ch
int hugetlb_get_quota(struct address_space *mapping, long delta);
void hugetlb_put_quota(struct address_space *mapping, long delta);
-#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
-
static inline int is_file_hugepages(struct file *file)
{
if (file->f_op == &hugetlbfs_file_operations)
@@ -155,4 +152,78 @@ unsigned long hugetlb_get_unmapped_area(
unsigned long flags);
#endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Defines one hugetlb page size */
+struct hstate {
+ int hugetlb_next_nid;
+ unsigned int order;
+ unsigned long mask;
+ unsigned long max_huge_pages;
+ unsigned long nr_huge_pages;
+ unsigned long free_huge_pages;
+ unsigned long resv_huge_pages;
+ unsigned long surplus_huge_pages;
+ unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_freelists[MAX_NUMNODES];
+ unsigned int nr_huge_pages_node[MAX_NUMNODES];
+ unsigned int free_huge_pages_node[MAX_NUMNODES];
+ unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+};
+
+extern struct hstate global_hstate;
+
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_file(struct file *f)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_inode(struct inode *i)
+{
+ return &global_hstate;
+}
+
+static inline unsigned long huge_page_size(struct hstate *h)
+{
+ return (unsigned long)PAGE_SIZE << h->order;
+}
+
+static inline unsigned long huge_page_mask(struct hstate *h)
+{
+ return h->mask;
+}
+
+static inline unsigned long huge_page_order(struct hstate *h)
+{
+ return h->order;
+}
+
+static inline unsigned huge_page_shift(struct hstate *h)
+{
+ return h->order + PAGE_SHIFT;
+}
+
+static inline unsigned int blocks_per_hugepage(struct hstate *h)
+{
+ return huge_page_size(h) / 512;
+}
+
+#else
+struct hstate {};
+#define hstate_file(f) NULL
+#define hstate_vma(v) NULL
+#define hstate_inode(i) NULL
+#define huge_page_size(h) PAGE_SIZE
+#define huge_page_mask(h) PAGE_MASK
+#define huge_page_order(h) 0
+#define huge_page_shift(h) PAGE_SHIFT
+#endif
+
+#include <asm/hugetlb.h>
+
#endif /* _LINUX_HUGETLB_H */
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
struct inode *inode = file->f_path.dentry->d_inode;
loff_t len, vma_len;
int ret;
+ struct hstate *h = hstate_file(file);
/*
* vma address alignment (but not the pgoff alignment) has
@@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
- if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
+ if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
if (vma->vm_flags & VM_MAYSHARE &&
- hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
- len >> HPAGE_SHIFT))
+ hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h),
+ len >> huge_page_shift(h)))
goto out;
ret = 0;
@@ -130,20 +131,21 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
+ struct hstate *h = hstate_file(file);
- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
if (flags & MAP_FIXED) {
- if (prepare_hugepage_range(addr, len))
+ if (prepare_hugepage_range(file, addr, len))
return -EINVAL;
return addr;
}
if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
@@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
start_addr = TASK_UNMAPPED_BASE;
full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -174,7 +176,7 @@ full_search:
if (!vma || addr + len <= vma->vm_start)
return addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
#endif
@@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page,
static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
size_t len, loff_t *ppos)
{
+ struct hstate *h = hstate_file(filp);
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
- unsigned long index = *ppos >> HPAGE_SHIFT;
- unsigned long offset = *ppos & ~HPAGE_MASK;
+ unsigned long index = *ppos >> huge_page_shift(h);
+ unsigned long offset = *ppos & ~huge_page_mask(h);
unsigned long end_index;
loff_t isize;
ssize_t retval = 0;
@@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
if (!isize)
goto out;
- end_index = (isize - 1) >> HPAGE_SHIFT;
+ end_index = (isize - 1) >> huge_page_shift(h);
for (;;) {
struct page *page;
- int nr, ret;
+ unsigned long nr, ret;
/* nr is the maximum number of bytes to copy from this page */
- nr = HPAGE_SIZE;
+ nr = huge_page_size(h);
if (index >= end_index) {
if (index > end_index)
goto out;
- nr = ((isize - 1) & ~HPAGE_MASK) + 1;
+ nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
if (nr <= offset) {
goto out;
}
@@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
offset += ret;
retval += ret;
len -= ret;
- index += offset >> HPAGE_SHIFT;
- offset &= ~HPAGE_MASK;
+ index += offset >> huge_page_shift(h);
+ offset &= ~huge_page_mask(h);
if (page)
page_cache_release(page);
@@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
break;
}
out:
- *ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
+ *ppos = ((loff_t)index << huge_page_shift(h)) + offset;
mutex_unlock(&inode->i_mutex);
return retval;
}
@@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
+ struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
- const pgoff_t start = lstart >> HPAGE_SHIFT;
+ const pgoff_t start = lstart >> huge_page_shift(h);
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
@@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
{
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
+ struct hstate *h = hstate_inode(inode);
- BUG_ON(offset & ~HPAGE_MASK);
+ BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, offset);
@@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
+ struct hstate *h = hstate_inode(inode);
int error;
unsigned int ia_valid = attr->ia_valid;
@@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
if (ia_valid & ATTR_SIZE) {
error = -EINVAL;
- if (!(attr->ia_size & ~HPAGE_MASK))
+ if (!(attr->ia_size & ~huge_page_mask(h)))
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
@@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
+ struct hstate *h = hstate_inode(dentry->d_inode);
buf->f_type = HUGETLBFS_MAGIC;
- buf->f_bsize = HPAGE_SIZE;
+ buf->f_bsize = huge_page_size(h);
if (sbinfo) {
spin_lock(&sbinfo->stat_lock);
/* If no limits set, just report 0 for max/free/used
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -577,7 +577,8 @@ static void shm_get_stat(struct ipc_name
if (is_file_hugepages(shp->shm_file)) {
struct address_space *mapping = inode->i_mapping;
- *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+ struct hstate *h = hstate_file(shp->shm_file);
+ *rss += (1 << huge_page_order(h)) * mapping->nrpages;
} else {
struct shmem_inode_info *info = SHMEM_I(inode);
spin_lock(&info->lock);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -901,7 +901,7 @@ unsigned long unmap_vmas(struct mmu_gath
if (unlikely(is_vm_hugetlb_page(vma))) {
unmap_hugepage_range(vma, start, end);
zap_work -= (end - start) /
- (HPAGE_SIZE / PAGE_SIZE);
+ (1 << huge_page_order(hstate_vma(vma)));
start = end;
} else
start = unmap_page_range(*tlbp, vma,
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -1477,7 +1477,7 @@ struct zonelist *huge_zonelist(struct vm
if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
zl = node_zonelist(interleave_nid(*mpol, vma, addr,
- HPAGE_SHIFT), gfp_flags);
+ huge_page_shift(hstate_vma(vma))), gfp_flags);
} else {
zl = policy_zonelist(gfp_flags, *mpol);
if ((*mpol)->mode == MPOL_BIND)
@@ -2216,9 +2216,12 @@ static void check_huge_range(struct vm_a
{
unsigned long addr;
struct page *page;
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
- for (addr = start; addr < end; addr += HPAGE_SIZE) {
- pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
+ for (addr = start; addr < end; addr += sz) {
+ pte_t *ptep = huge_pte_offset(vma->vm_mm,
+ addr & huge_page_mask(h));
pte_t pte;
if (!ptep)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1800,7 +1800,8 @@ int split_vma(struct mm_struct * mm, str
struct mempolicy *pol;
struct vm_area_struct *new;
- if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
+ if (is_vm_hugetlb_page(vma) && (addr &
+ ~(huge_page_mask(hstate_vma(vma)))))
return -EINVAL;
if (mm->map_count >= sysctl_max_map_count)
Index: linux-2.6/include/asm-ia64/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/hugetlb.h
+++ linux-2.6/include/asm-ia64/hugetlb.h
@@ -8,7 +8,7 @@ void hugetlb_free_pgd_range(struct mmu_g
unsigned long end, unsigned long floor,
unsigned long ceiling);
-int prepare_hugepage_range(unsigned long addr, unsigned long len);
+int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len);
static inline int is_hugepage_only_range(struct mm_struct *mm,
unsigned long addr,
Index: linux-2.6/include/asm-powerpc/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/hugetlb.h
+++ linux-2.6/include/asm-powerpc/hugetlb.h
@@ -21,7 +21,7 @@ pte_t huge_ptep_get_and_clear(struct mm_
* If the arch doesn't supply something else, assume that hugepage
* size aligned regions are ok without further preparation.
*/
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
if (len & ~HPAGE_MASK)
return -EINVAL;
Index: linux-2.6/include/asm-s390/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-s390/hugetlb.h
+++ linux-2.6/include/asm-s390/hugetlb.h
@@ -22,7 +22,7 @@ void set_huge_pte_at(struct mm_struct *m
* If the arch doesn't supply something else, assume that hugepage
* size aligned regions are ok without further preparation.
*/
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
if (len & ~HPAGE_MASK)
return -EINVAL;
Index: linux-2.6/include/asm-sh/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-sh/hugetlb.h
+++ linux-2.6/include/asm-sh/hugetlb.h
@@ -14,7 +14,7 @@ static inline int is_hugepage_only_range
* If the arch doesn't supply something else, assume that hugepage
* size aligned regions are ok without further preparation.
*/
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
if (len & ~HPAGE_MASK)
return -EINVAL;
Index: linux-2.6/include/asm-sparc64/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/hugetlb.h
+++ linux-2.6/include/asm-sparc64/hugetlb.h
@@ -22,7 +22,7 @@ static inline int is_hugepage_only_range
* If the arch doesn't supply something else, assume that hugepage
* size aligned regions are ok without further preparation.
*/
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
if (len & ~HPAGE_MASK)
return -EINVAL;
Index: linux-2.6/include/asm-x86/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hugetlb.h
+++ linux-2.6/include/asm-x86/hugetlb.h
@@ -14,11 +14,12 @@ static inline int is_hugepage_only_range
* If the arch doesn't supply something else, assume that hugepage
* size aligned regions are ok without further preparation.
*/
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
+static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
- if (len & ~HPAGE_MASK)
+ struct hstate *h = hstate_file(file);
+ if (len & ~huge_page_mask(h))
return -EINVAL;
- if (addr & ~HPAGE_MASK)
+ if (addr & ~huge_page_mask(h))
return -EINVAL;
return 0;
}
--
* Re: [patch 03/23] hugetlb: modular state
2008-05-25 14:23 ` [patch 03/23] hugetlb: modular state npiggin
@ 2008-05-27 16:44 ` Nishanth Aravamudan
2008-05-28 8:40 ` Nick Piggin
2008-05-27 20:38 ` Adam Litke
1 sibling, 1 reply; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 16:44 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On 26.05.2008 [00:23:20 +1000], npiggin@suse.de wrote:
> Large, but rather mechanical patch that converts most of the hugetlb.c
> globals into structure members and passes them around.
>
> Right now there is only a single global hstate structure, but
> most of the infrastructure to extend it is there.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> arch/ia64/mm/hugetlbpage.c | 6
> arch/powerpc/mm/hugetlbpage.c | 2
> arch/sh/mm/hugetlbpage.c | 2
> arch/sparc64/mm/hugetlbpage.c | 4
> arch/x86/mm/hugetlbpage.c | 4
> fs/hugetlbfs/inode.c | 49 +++---
> include/asm-ia64/hugetlb.h | 2
> include/asm-powerpc/hugetlb.h | 2
> include/asm-s390/hugetlb.h | 2
> include/asm-sh/hugetlb.h | 2
> include/asm-sparc64/hugetlb.h | 2
> include/asm-x86/hugetlb.h | 7
> include/linux/hugetlb.h | 81 +++++++++-
> ipc/shm.c | 3
> mm/hugetlb.c | 321 ++++++++++++++++++++++--------------------
> mm/memory.c | 2
> mm/mempolicy.c | 9 -
> mm/mmap.c | 3
> 18 files changed, 308 insertions(+), 195 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -22,30 +22,24 @@
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> -static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
> -static unsigned long surplus_huge_pages;
> -static unsigned long nr_overcommit_huge_pages;
> unsigned long max_huge_pages;
> unsigned long sysctl_overcommit_huge_pages;
> -static struct list_head hugepage_freelists[MAX_NUMNODES];
> -static unsigned int nr_huge_pages_node[MAX_NUMNODES];
> -static unsigned int free_huge_pages_node[MAX_NUMNODES];
> -static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
> -static int hugetlb_next_nid;
> +
> +struct hstate global_hstate;
>
> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> -static void clear_huge_page(struct page *page, unsigned long addr)
> +static void clear_huge_page(struct page *page, unsigned long addr, unsigned long sz)
> {
> int i;
>
> might_sleep();
> - for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
> + for (i = 0; i < sz/PAGE_SIZE; i++) {
> cond_resched();
> clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> }
> @@ -55,42 +49,43 @@ static void copy_huge_page(struct page *
> unsigned long addr, struct vm_area_struct *vma)
> {
> int i;
> + struct hstate *h = hstate_vma(vma);
>
> might_sleep();
> - for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
> + for (i = 0; i < 1 << huge_page_order(h); i++) {
So it seems like most (not quite all) users of huge_page_order(h) don't
actually care about the order per se, but rather want some sense of the
underlying page size: either pages_per_huge_page() or huge_page_size().
So perhaps it would be sensible to have the helpers defined as such?
huge_page_size(h) -> size in bytes of huge page (corresponds to what was
HPAGE_SIZE), which is what I think you currently have
and
pages_per_huge_page(h) -> number of base pages per huge page
(corresponds to HPAGE_SIZE / PAGE_SIZE)
?
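Something like this trivial sketch is what I have in mind (the name and
exact form are just my suggestion, not something in your patch):

static inline unsigned int pages_per_huge_page(struct hstate *h)
{
	return 1 << huge_page_order(h);
}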
Also, I noticed that this caller has no parentheses around the shift,
but the other one does: (1 << huge_page_order(h)).
Neither is a huge issue, and the first can be a clean-up patch from me,
so
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 03/23] hugetlb: modular state
2008-05-27 16:44 ` Nishanth Aravamudan
@ 2008-05-28 8:40 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 8:40 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On Tue, May 27, 2008 at 09:44:26AM -0700, Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:20 +1000], npiggin@suse.de wrote:
> >
> > might_sleep();
> > - for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
> > + for (i = 0; i < 1 << huge_page_order(h); i++) {
>
> So it seems like most (not quite all) users of huge_page_order(h) don't
> actually care about the order per se, but rather want some sense of the
> underlying page size: either pages_per_huge_page() or huge_page_size().
>
> So perhaps it would be sensible to have the helpers defined as such?
>
> huge_page_size(h) -> size in bytes of huge page (corresponds to what was
> HPAGE_SIZE), which is what I think you currently have
>
> and
>
> pages_per_huge_page(h) -> number of base pages per huge page
> (corresponds to HPAGE_SIZE / PAGE_SIZE)
>
> ?
I think pages_per_huge_page would be reasonable, yes.
> Also, I noticed that this caller has no parentheses around the shift,
> but the other one does: (1 << huge_page_order(h)).
>
> Neither is a huge issue, and the first can be a clean-up patch from me,
> so
>
> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks... I'll do pages_per_huge_page(); it won't be much work.
--
* Re: [patch 03/23] hugetlb: modular state
2008-05-25 14:23 ` [patch 03/23] hugetlb: modular state npiggin
2008-05-27 16:44 ` Nishanth Aravamudan
@ 2008-05-27 20:38 ` Adam Litke
2008-05-28 9:13 ` Nick Piggin
1 sibling, 1 reply; 88+ messages in thread
From: Adam Litke @ 2008-05-27 20:38 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
Phew. At last I made it to the end of this one :) It seems okay to me
though. Have you done any performance testing on this patch series yet?
I don't expect the hstate structure to introduce any measurable
performance degradation, but it would be nice to have some numbers to
back up that educated guess.
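Something as simple as timing first-touch faults on a hugetlbfs file
would be a start. A rough sketch (the mount point, file name, and 2MB
page size are assumptions on my part; enough pages would need to be
reserved via nr_hugepages first):

/* time first-touch faults on a hugetlbfs mapping */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>

#define HPS	(2UL << 20)	/* assumed huge page size */
#define LEN	(64 * HPS)

int main(void)
{
	struct timeval t0, t1;
	unsigned long i;
	char *p;
	int fd = open("/mnt/huge/bench", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	gettimeofday(&t0, NULL);
	for (i = 0; i < LEN; i += HPS)	/* fault in each huge page */
		p[i] = 1;
	gettimeofday(&t1, NULL);
	printf("%lu faults in %ld usec\n", LEN / HPS,
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_usec - t0.tv_usec));
	munmap(p, LEN);
	close(fd);
	unlink("/mnt/huge/bench");
	return 0;
}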
Acked-by: Adam Litke <agl@us.ibm.com>
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-modular-state.patch)
> Large, but rather mechanical patch that converts most of the hugetlb.c
> globals into structure members and passes them around.
>
> Right now there is only a single global hstate structure, but
> most of the infrastructure to extend it is there.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> arch/ia64/mm/hugetlbpage.c | 6
> arch/powerpc/mm/hugetlbpage.c | 2
> arch/sh/mm/hugetlbpage.c | 2
> arch/sparc64/mm/hugetlbpage.c | 4
> arch/x86/mm/hugetlbpage.c | 4
> fs/hugetlbfs/inode.c | 49 +++---
> include/asm-ia64/hugetlb.h | 2
> include/asm-powerpc/hugetlb.h | 2
> include/asm-s390/hugetlb.h | 2
> include/asm-sh/hugetlb.h | 2
> include/asm-sparc64/hugetlb.h | 2
> include/asm-x86/hugetlb.h | 7
> include/linux/hugetlb.h | 81 +++++++++-
> ipc/shm.c | 3
> mm/hugetlb.c | 321 ++++++++++++++++++++++--------------------
> mm/memory.c | 2
> mm/mempolicy.c | 9 -
> mm/mmap.c | 3
> 18 files changed, 308 insertions(+), 195 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -22,30 +22,24 @@
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> -static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
> -static unsigned long surplus_huge_pages;
> -static unsigned long nr_overcommit_huge_pages;
> unsigned long max_huge_pages;
> unsigned long sysctl_overcommit_huge_pages;
> -static struct list_head hugepage_freelists[MAX_NUMNODES];
> -static unsigned int nr_huge_pages_node[MAX_NUMNODES];
> -static unsigned int free_huge_pages_node[MAX_NUMNODES];
> -static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
> -static int hugetlb_next_nid;
> +
> +struct hstate global_hstate;
>
> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> -static void clear_huge_page(struct page *page, unsigned long addr)
> +static void clear_huge_page(struct page *page, unsigned long addr, unsigned long sz)
> {
> int i;
>
> might_sleep();
> - for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
> + for (i = 0; i < sz/PAGE_SIZE; i++) {
> cond_resched();
> clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> }
> @@ -55,42 +49,43 @@ static void copy_huge_page(struct page *
> unsigned long addr, struct vm_area_struct *vma)
> {
> int i;
> + struct hstate *h = hstate_vma(vma);
>
> might_sleep();
> - for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
> + for (i = 0; i < 1 << huge_page_order(h); i++) {
> cond_resched();
> copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
> }
> }
>
> -static void enqueue_huge_page(struct page *page)
> +static void enqueue_huge_page(struct hstate *h, struct page *page)
> {
> int nid = page_to_nid(page);
> - list_add(&page->lru, &hugepage_freelists[nid]);
> - free_huge_pages++;
> - free_huge_pages_node[nid]++;
> + list_add(&page->lru, &h->hugepage_freelists[nid]);
> + h->free_huge_pages++;
> + h->free_huge_pages_node[nid]++;
> }
>
> -static struct page *dequeue_huge_page(void)
> +static struct page *dequeue_huge_page(struct hstate *h)
> {
> int nid;
> struct page *page = NULL;
>
> for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> - if (!list_empty(&hugepage_freelists[nid])) {
> - page = list_entry(hugepage_freelists[nid].next,
> + if (!list_empty(&h->hugepage_freelists[nid])) {
> + page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> - free_huge_pages--;
> - free_huge_pages_node[nid]--;
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> break;
> }
> }
> return page;
> }
>
> -static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma,
> - unsigned long address)
> +static struct page *dequeue_huge_page_vma(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long address)
> {
> int nid;
> struct page *page = NULL;
> @@ -105,14 +100,14 @@ static struct page *dequeue_huge_page_vm
> MAX_NR_ZONES - 1, nodemask) {
> nid = zone_to_nid(zone);
> if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
> - !list_empty(&hugepage_freelists[nid])) {
> - page = list_entry(hugepage_freelists[nid].next,
> + !list_empty(&h->hugepage_freelists[nid])) {
> + page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> - free_huge_pages--;
> - free_huge_pages_node[nid]--;
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> if (vma && vma->vm_flags & VM_MAYSHARE)
> - resv_huge_pages--;
> + h->resv_huge_pages--;
> break;
> }
> }
> @@ -120,12 +115,13 @@ static struct page *dequeue_huge_page_vm
> return page;
> }
>
> -static void update_and_free_page(struct page *page)
> +static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
> - nr_huge_pages--;
> - nr_huge_pages_node[page_to_nid(page)]--;
> - for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
> +
> + h->nr_huge_pages--;
> + h->nr_huge_pages_node[page_to_nid(page)]--;
> + for (i = 0; i < (1 << huge_page_order(h)); i++) {
> page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
> 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1<< PG_writeback);
> @@ -133,11 +129,16 @@ static void update_and_free_page(struct
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> arch_release_hugepage(page);
> - __free_pages(page, HUGETLB_PAGE_ORDER);
> + __free_pages(page, huge_page_order(h));
> }
>
> static void free_huge_page(struct page *page)
> {
> + /*
> + * Can't pass hstate in here because it is called from the
> + * compound page destructor.
> + */
> + struct hstate *h = &global_hstate;
> int nid = page_to_nid(page);
> struct address_space *mapping;
>
> @@ -147,12 +148,12 @@ static void free_huge_page(struct page *
> INIT_LIST_HEAD(&page->lru);
>
> spin_lock(&hugetlb_lock);
> - if (surplus_huge_pages_node[nid]) {
> - update_and_free_page(page);
> - surplus_huge_pages--;
> - surplus_huge_pages_node[nid]--;
> + if (h->surplus_huge_pages_node[nid]) {
> + update_and_free_page(h, page);
> + h->surplus_huge_pages--;
> + h->surplus_huge_pages_node[nid]--;
> } else {
> - enqueue_huge_page(page);
> + enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> if (mapping)
> @@ -164,7 +165,7 @@ static void free_huge_page(struct page *
> * balanced by operating on them in a round-robin fashion.
> * Returns 1 if an adjustment was made.
> */
> -static int adjust_pool_surplus(int delta)
> +static int adjust_pool_surplus(struct hstate *h, int delta)
> {
> static int prev_nid;
> int nid = prev_nid;
> @@ -177,15 +178,15 @@ static int adjust_pool_surplus(int delta
> nid = first_node(node_online_map);
>
> /* To shrink on this node, there must be a surplus page */
> - if (delta < 0 && !surplus_huge_pages_node[nid])
> + if (delta < 0 && !h->surplus_huge_pages_node[nid])
> continue;
> /* Surplus cannot exceed the total number of pages */
> - if (delta > 0 && surplus_huge_pages_node[nid] >=
> - nr_huge_pages_node[nid])
> + if (delta > 0 && h->surplus_huge_pages_node[nid] >=
> + h->nr_huge_pages_node[nid])
> continue;
>
> - surplus_huge_pages += delta;
> - surplus_huge_pages_node[nid] += delta;
> + h->surplus_huge_pages += delta;
> + h->surplus_huge_pages_node[nid] += delta;
> ret = 1;
> break;
> } while (nid != prev_nid);
> @@ -194,46 +195,46 @@ static int adjust_pool_surplus(int delta
> return ret;
> }
>
> -static void prep_new_huge_page(struct page *page, int nid)
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> {
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> - nr_huge_pages++;
> - nr_huge_pages_node[nid]++;
> + h->nr_huge_pages++;
> + h->nr_huge_pages_node[nid]++;
> spin_unlock(&hugetlb_lock);
> put_page(page); /* free it into the hugepage allocator */
> }
>
> -static struct page *alloc_fresh_huge_page_node(int nid)
> +static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;
>
> page = alloc_pages_node(nid,
> htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> __GFP_REPEAT|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));
> if (page) {
> if (arch_prepare_hugepage(page)) {
> -	__free_pages(page, HUGETLB_PAGE_ORDER);
> +	__free_pages(page, huge_page_order(h));
> return NULL;
> }
> - prep_new_huge_page(page, nid);
> + prep_new_huge_page(h, page, nid);
> }
>
> return page;
> }
>
> -static int alloc_fresh_huge_page(void)
> +static int alloc_fresh_huge_page(struct hstate *h)
> {
> struct page *page;
> int start_nid;
> int next_nid;
> int ret = 0;
>
> - start_nid = hugetlb_next_nid;
> + start_nid = h->hugetlb_next_nid;
>
> do {
> - page = alloc_fresh_huge_page_node(hugetlb_next_nid);
> + page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> if (page)
> ret = 1;
> /*
> @@ -247,11 +248,11 @@ static int alloc_fresh_huge_page(void)
> * if we just successfully allocated a hugepage so that
> * the next caller gets hugepages on the next node.
> */
> - next_nid = next_node(hugetlb_next_nid, node_online_map);
> + next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> if (next_nid == MAX_NUMNODES)
> next_nid = first_node(node_online_map);
> - hugetlb_next_nid = next_nid;
> - } while (!page && hugetlb_next_nid != start_nid);
> + h->hugetlb_next_nid = next_nid;
> + } while (!page && h->hugetlb_next_nid != start_nid);
>
> if (ret)
> count_vm_event(HTLB_BUDDY_PGALLOC);
> @@ -261,8 +262,8 @@ static int alloc_fresh_huge_page(void)
> return ret;
> }
>
> -static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> - unsigned long address)
> +static struct page *alloc_buddy_huge_page(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long address)
> {
> struct page *page;
> unsigned int nid;
> @@ -291,18 +292,18 @@ static struct page *alloc_buddy_huge_pag
> * per-node value is checked there.
> */
> spin_lock(&hugetlb_lock);
> - if (surplus_huge_pages >= nr_overcommit_huge_pages) {
> + if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> spin_unlock(&hugetlb_lock);
> return NULL;
> } else {
> - nr_huge_pages++;
> - surplus_huge_pages++;
> + h->nr_huge_pages++;
> + h->surplus_huge_pages++;
> }
> spin_unlock(&hugetlb_lock);
>
> page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> __GFP_REPEAT|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));
>
> spin_lock(&hugetlb_lock);
> if (page) {
> @@ -317,12 +318,12 @@ static struct page *alloc_buddy_huge_pag
> /*
> * We incremented the global counters already
> */
> - nr_huge_pages_node[nid]++;
> - surplus_huge_pages_node[nid]++;
> + h->nr_huge_pages_node[nid]++;
> + h->surplus_huge_pages_node[nid]++;
> __count_vm_event(HTLB_BUDDY_PGALLOC);
> } else {
> - nr_huge_pages--;
> - surplus_huge_pages--;
> + h->nr_huge_pages--;
> + h->surplus_huge_pages--;
> __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> }
> spin_unlock(&hugetlb_lock);
> @@ -334,16 +335,16 @@ static struct page *alloc_buddy_huge_pag
> * Increase the hugetlb pool such that it can accomodate a reservation
> * of size 'delta'.
> */
> -static int gather_surplus_pages(int delta)
> +static int gather_surplus_pages(struct hstate *h, int delta)
> {
> struct list_head surplus_list;
> struct page *page, *tmp;
> int ret, i;
> int needed, allocated;
>
> - needed = (resv_huge_pages + delta) - free_huge_pages;
> + needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
> if (needed <= 0) {
> - resv_huge_pages += delta;
> + h->resv_huge_pages += delta;
> return 0;
> }
>
> @@ -354,7 +355,7 @@ static int gather_surplus_pages(int delt
> retry:
> spin_unlock(&hugetlb_lock);
> for (i = 0; i < needed; i++) {
> - page = alloc_buddy_huge_page(NULL, 0);
> + page = alloc_buddy_huge_page(h, NULL, 0);
> if (!page) {
> /*
> * We were not able to allocate enough pages to
> @@ -375,7 +376,8 @@ retry:
> * because either resv_huge_pages or free_huge_pages may have changed.
> */
> spin_lock(&hugetlb_lock);
> - needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
> + needed = (h->resv_huge_pages + delta) -
> + (h->free_huge_pages + allocated);
> if (needed > 0)
> goto retry;
>
> @@ -388,7 +390,7 @@ retry:
> * before they are reserved.
> */
> needed += allocated;
> - resv_huge_pages += delta;
> + h->resv_huge_pages += delta;
> ret = 0;
> free:
> /* Free the needed pages to the hugetlb pool */
> @@ -396,7 +398,7 @@ free:
> if ((--needed) < 0)
> break;
> list_del(&page->lru);
> - enqueue_huge_page(page);
> + enqueue_huge_page(h, page);
> }
>
> /* Free unnecessary surplus pages to the buddy allocator */
> @@ -424,7 +426,8 @@ free:
> * allocated to satisfy the reservation must be explicitly freed if they were
> * never used.
> */
> -static void return_unused_surplus_pages(unsigned long unused_resv_pages)
> +static void return_unused_surplus_pages(struct hstate *h,
> + unsigned long unused_resv_pages)
> {
> static int nid = -1;
> struct page *page;
> @@ -439,27 +442,27 @@ static void return_unused_surplus_pages(
> unsigned long remaining_iterations = num_online_nodes();
>
> /* Uncommit the reservation */
> - resv_huge_pages -= unused_resv_pages;
> + h->resv_huge_pages -= unused_resv_pages;
>
> - nr_pages = min(unused_resv_pages, surplus_huge_pages);
> + nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
>
> while (remaining_iterations-- && nr_pages) {
> nid = next_node(nid, node_online_map);
> if (nid == MAX_NUMNODES)
> nid = first_node(node_online_map);
>
> - if (!surplus_huge_pages_node[nid])
> + if (!h->surplus_huge_pages_node[nid])
> continue;
>
> - if (!list_empty(&hugepage_freelists[nid])) {
> - page = list_entry(hugepage_freelists[nid].next,
> + if (!list_empty(&h->hugepage_freelists[nid])) {
> + page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> - update_and_free_page(page);
> - free_huge_pages--;
> - free_huge_pages_node[nid]--;
> - surplus_huge_pages--;
> - surplus_huge_pages_node[nid]--;
> + update_and_free_page(h, page);
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> + h->surplus_huge_pages--;
> + h->surplus_huge_pages_node[nid]--;
> nr_pages--;
> remaining_iterations = num_online_nodes();
> }
> @@ -471,9 +474,10 @@ static struct page *alloc_huge_page_shar
> unsigned long addr)
> {
> struct page *page;
> + struct hstate *h = hstate_vma(vma);
>
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page_vma(vma, addr);
> + page = dequeue_huge_page_vma(h, vma, addr);
> spin_unlock(&hugetlb_lock);
> return page ? page : ERR_PTR(-VM_FAULT_OOM);
> }
> @@ -482,16 +486,17 @@ static struct page *alloc_huge_page_priv
> unsigned long addr)
> {
> struct page *page = NULL;
> + struct hstate *h = hstate_vma(vma);
>
> if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
> return ERR_PTR(-VM_FAULT_SIGBUS);
>
> spin_lock(&hugetlb_lock);
> - if (free_huge_pages > resv_huge_pages)
> - page = dequeue_huge_page_vma(vma, addr);
> + if (h->free_huge_pages > h->resv_huge_pages)
> + page = dequeue_huge_page_vma(h, vma, addr);
> spin_unlock(&hugetlb_lock);
> if (!page) {
> - page = alloc_buddy_huge_page(vma, addr);
> + page = alloc_buddy_huge_page(h, vma, addr);
> if (!page) {
> hugetlb_put_quota(vma->vm_file->f_mapping, 1);
> return ERR_PTR(-VM_FAULT_OOM);
> @@ -521,21 +526,27 @@ static struct page *alloc_huge_page(stru
> static int __init hugetlb_init(void)
> {
> unsigned long i;
> + struct hstate *h = &global_hstate;
>
> if (HPAGE_SHIFT == 0)
> return 0;
>
> + if (!h->order) {
> + h->order = HPAGE_SHIFT - PAGE_SHIFT;
> + h->mask = HPAGE_MASK;
> + }
> +
> for (i = 0; i < MAX_NUMNODES; ++i)
> - INIT_LIST_HEAD(&hugepage_freelists[i]);
> + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>
> - hugetlb_next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = first_node(node_online_map);
>
> for (i = 0; i < max_huge_pages; ++i) {
> - if (!alloc_fresh_huge_page())
> + if (!alloc_fresh_huge_page(h))
> break;
> }
> - max_huge_pages = free_huge_pages = nr_huge_pages = i;
> - printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
> + max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> + printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
> return 0;
> }
> module_init(hugetlb_init);
> @@ -561,34 +572,36 @@ static unsigned int cpuset_mems_nr(unsig
>
> #ifdef CONFIG_SYSCTL
> #ifdef CONFIG_HIGHMEM
> -static void try_to_free_low(unsigned long count)
> +static void try_to_free_low(struct hstate *h, unsigned long count)
> {
> int i;
>
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> - list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> - if (count >= nr_huge_pages)
> + struct list_head *freel = &h->hugepage_freelists[i];
> + list_for_each_entry_safe(page, next, freel, lru) {
> + if (count >= h->nr_huge_pages)
> return;
> if (PageHighMem(page))
> continue;
> list_del(&page->lru);
> update_and_free_page(page);
> - free_huge_pages--;
> - free_huge_pages_node[page_to_nid(page)]--;
> + h->free_huge_pages--;
> + h->free_huge_pages_node[page_to_nid(page)]--;
> }
> }
> }
> #else
> -static inline void try_to_free_low(unsigned long count)
> +static inline void try_to_free_low(struct hstate *h, unsigned long count)
> {
> }
> #endif
>
> -#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
> +#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> static unsigned long set_max_huge_pages(unsigned long count)
> {
> unsigned long min_count, ret;
> + struct hstate *h = &global_hstate;
>
> /*
> * Increase the pool size
> @@ -602,12 +615,12 @@ static unsigned long set_max_huge_pages(
> * within all the constraints specified by the sysctls.
> */
> spin_lock(&hugetlb_lock);
> - while (surplus_huge_pages && count > persistent_huge_pages) {
> - if (!adjust_pool_surplus(-1))
> + while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> + if (!adjust_pool_surplus(h, -1))
> break;
> }
>
> - while (count > persistent_huge_pages) {
> + while (count > persistent_huge_pages(h)) {
> int ret;
> /*
> * If this allocation races such that we no longer need the
> @@ -615,7 +628,7 @@ static unsigned long set_max_huge_pages(
> * and reducing the surplus.
> */
> spin_unlock(&hugetlb_lock);
> - ret = alloc_fresh_huge_page();
> + ret = alloc_fresh_huge_page(h);
> spin_lock(&hugetlb_lock);
> if (!ret)
> goto out;
> @@ -637,21 +650,21 @@ static unsigned long set_max_huge_pages(
> * and won't grow the pool anywhere else. Not until one of the
> * sysctls are changed, or the surplus pages go out of use.
> */
> - min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
> + min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> min_count = max(count, min_count);
> - try_to_free_low(min_count);
> - while (min_count < persistent_huge_pages) {
> - struct page *page = dequeue_huge_page();
> + try_to_free_low(h, min_count);
> + while (min_count < persistent_huge_pages(h)) {
> + struct page *page = dequeue_huge_page(h);
> if (!page)
> break;
> - update_and_free_page(page);
> + update_and_free_page(h, page);
> }
> - while (count < persistent_huge_pages) {
> - if (!adjust_pool_surplus(1))
> + while (count < persistent_huge_pages(h)) {
> + if (!adjust_pool_surplus(h, 1))
> break;
> }
> out:
> - ret = persistent_huge_pages;
> + ret = persistent_huge_pages(h);
> spin_unlock(&hugetlb_lock);
> return ret;
> }
> @@ -681,9 +694,10 @@ int hugetlb_overcommit_handler(struct ct
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> + struct hstate *h = &global_hstate;
> proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> spin_lock(&hugetlb_lock);
> - nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
> + h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
> spin_unlock(&hugetlb_lock);
> return 0;
> }
> @@ -692,34 +706,37 @@ int hugetlb_overcommit_handler(struct ct
>
> int hugetlb_report_meminfo(char *buf)
> {
> + struct hstate *h = &global_hstate;
> return sprintf(buf,
> "HugePages_Total: %5lu\n"
> "HugePages_Free: %5lu\n"
> "HugePages_Rsvd: %5lu\n"
> "HugePages_Surp: %5lu\n"
> "Hugepagesize: %5lu kB\n",
> - nr_huge_pages,
> - free_huge_pages,
> - resv_huge_pages,
> - surplus_huge_pages,
> - HPAGE_SIZE/1024);
> + h->nr_huge_pages,
> + h->free_huge_pages,
> + h->resv_huge_pages,
> + h->surplus_huge_pages,
> + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
> }
>
> int hugetlb_report_node_meminfo(int nid, char *buf)
> {
> + struct hstate *h = &global_hstate;
> return sprintf(buf,
> "Node %d HugePages_Total: %5u\n"
> "Node %d HugePages_Free: %5u\n"
> "Node %d HugePages_Surp: %5u\n",
> - nid, nr_huge_pages_node[nid],
> - nid, free_huge_pages_node[nid],
> - nid, surplus_huge_pages_node[nid]);
> + nid, h->nr_huge_pages_node[nid],
> + nid, h->free_huge_pages_node[nid],
> + nid, h->surplus_huge_pages_node[nid]);
> }
>
> /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
> unsigned long hugetlb_total_pages(void)
> {
> - return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
> + struct hstate *h = &global_hstate;
> + return h->nr_huge_pages * (1 << huge_page_order(h));
> }
>
> /*
> @@ -774,14 +791,16 @@ int copy_hugetlb_page_range(struct mm_st
> struct page *ptepage;
> unsigned long addr;
> int cow;
> + struct hstate *h = hstate_vma(vma);
> + unsigned long sz = huge_page_size(h);
>
> cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>
> - for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> src_pte = huge_pte_offset(src, addr);
> if (!src_pte)
> continue;
> - dst_pte = huge_pte_alloc(dst, addr);
> + dst_pte = huge_pte_alloc(dst, addr, sz);
> if (!dst_pte)
> goto nomem;
>
> @@ -817,6 +836,9 @@ void __unmap_hugepage_range(struct vm_ar
> pte_t pte;
> struct page *page;
> struct page *tmp;
> + struct hstate *h = hstate_vma(vma);
> + unsigned long sz = huge_page_size(h);
> +
> /*
> * A page gathering list, protected by per file i_mmap_lock. The
> * lock is used to avoid list corruption from multiple unmapping
> @@ -825,11 +847,11 @@ void __unmap_hugepage_range(struct vm_ar
> LIST_HEAD(page_list);
>
> WARN_ON(!is_vm_hugetlb_page(vma));
> - BUG_ON(start & ~HPAGE_MASK);
> - BUG_ON(end & ~HPAGE_MASK);
> + BUG_ON(start & ~huge_page_mask(h));
> + BUG_ON(end & ~huge_page_mask(h));
>
> spin_lock(&mm->page_table_lock);
> - for (address = start; address < end; address += HPAGE_SIZE) {
> + for (address = start; address < end; address += sz) {
> ptep = huge_pte_offset(mm, address);
> if (!ptep)
> continue;
> @@ -877,6 +899,7 @@ static int hugetlb_cow(struct mm_struct
> {
> struct page *old_page, *new_page;
> int avoidcopy;
> + struct hstate *h = hstate_vma(vma);
>
> old_page = pte_page(pte);
>
> @@ -901,7 +924,7 @@ static int hugetlb_cow(struct mm_struct
> __SetPageUptodate(new_page);
> spin_lock(&mm->page_table_lock);
>
> - ptep = huge_pte_offset(mm, address & HPAGE_MASK);
> + ptep = huge_pte_offset(mm, address & huge_page_mask(h));
> if (likely(pte_same(huge_ptep_get(ptep), pte))) {
> /* Break COW */
> huge_ptep_clear_flush(vma, address, ptep);
> @@ -924,10 +947,11 @@ static int hugetlb_no_page(struct mm_str
> struct page *page;
> struct address_space *mapping;
> pte_t new_pte;
> + struct hstate *h = hstate_vma(vma);
>
> mapping = vma->vm_file->f_mapping;
> - idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
> - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
> + idx = ((address - vma->vm_start) >> huge_page_shift(h))
> + + (vma->vm_pgoff >> huge_page_order(h));
>
> /*
> * Use page lock to guard against racing truncation
> @@ -936,7 +960,7 @@ static int hugetlb_no_page(struct mm_str
> retry:
> page = find_lock_page(mapping, idx);
> if (!page) {
> - size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> + size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
> goto out;
> page = alloc_huge_page(vma, address);
> @@ -944,7 +968,7 @@ retry:
> ret = -PTR_ERR(page);
> goto out;
> }
> - clear_huge_page(page, address);
> + clear_huge_page(page, address, huge_page_size(h));
> __SetPageUptodate(page);
>
> if (vma->vm_flags & VM_SHARED) {
> @@ -960,14 +984,14 @@ retry:
> }
>
> spin_lock(&inode->i_lock);
> - inode->i_blocks += BLOCKS_PER_HUGEPAGE;
> + inode->i_blocks += blocks_per_hugepage(h);
> spin_unlock(&inode->i_lock);
> } else
> lock_page(page);
> }
>
> spin_lock(&mm->page_table_lock);
> - size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> + size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
> goto backout;
>
> @@ -1003,8 +1027,9 @@ int hugetlb_fault(struct mm_struct *mm,
> pte_t entry;
> int ret;
> static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> + struct hstate *h = hstate_vma(vma);
>
> - ptep = huge_pte_alloc(mm, address);
> + ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> if (!ptep)
> return VM_FAULT_OOM;
>
> @@ -1042,6 +1067,7 @@ int follow_hugetlb_page(struct mm_struct
> unsigned long pfn_offset;
> unsigned long vaddr = *position;
> int remainder = *length;
> + struct hstate *h = hstate_vma(vma);
>
> spin_lock(&mm->page_table_lock);
> while (vaddr < vma->vm_end && remainder) {
> @@ -1053,7 +1079,7 @@ int follow_hugetlb_page(struct mm_struct
> * each hugepage. We have to make * sure we get the
> * first, for the page indexing below to work.
> */
> - pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
> + pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
>
> if (!pte || huge_pte_none(huge_ptep_get(pte)) ||
> (write && !pte_write(huge_ptep_get(pte)))) {
> @@ -1071,7 +1097,7 @@ int follow_hugetlb_page(struct mm_struct
> break;
> }
>
> - pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
> + pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> page = pte_page(huge_ptep_get(pte));
> same_page:
> if (pages) {
> @@ -1087,7 +1113,7 @@ same_page:
> --remainder;
> ++i;
> if (vaddr < vma->vm_end && remainder &&
> - pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
> + pfn_offset < (1 << huge_page_order(h))) {
> /*
> * We use pfn_offset to avoid touching the pageframes
> * of this compound page.
> @@ -1109,13 +1135,14 @@ void hugetlb_change_protection(struct vm
> unsigned long start = address;
> pte_t *ptep;
> pte_t pte;
> + struct hstate *h = hstate_vma(vma);
>
> BUG_ON(address >= end);
> flush_cache_range(vma, address, end);
>
> spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> spin_lock(&mm->page_table_lock);
> - for (; address < end; address += HPAGE_SIZE) {
> + for (; address < end; address += huge_page_size(h)) {
> ptep = huge_pte_offset(mm, address);
> if (!ptep)
> continue;
> @@ -1254,7 +1281,7 @@ static long region_truncate(struct list_
> return chg;
> }
>
> -static int hugetlb_acct_memory(long delta)
> +static int hugetlb_acct_memory(struct hstate *h, long delta)
> {
> int ret = -ENOMEM;
>
> @@ -1277,18 +1304,18 @@ static int hugetlb_acct_memory(long delt
> * semantics that cpuset has.
> */
> if (delta > 0) {
> - if (gather_surplus_pages(delta) < 0)
> + if (gather_surplus_pages(h, delta) < 0)
> goto out;
>
> - if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> - return_unused_surplus_pages(delta);
> + if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> + return_unused_surplus_pages(h, delta);
> goto out;
> }
> }
>
> ret = 0;
> if (delta < 0)
> - return_unused_surplus_pages((unsigned long) -delta);
> + return_unused_surplus_pages(h, (unsigned long) -delta);
>
> out:
> spin_unlock(&hugetlb_lock);
> @@ -1298,6 +1325,7 @@ out:
> int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> {
> long ret, chg;
> + struct hstate *h = hstate_inode(inode);
>
> chg = region_chg(&inode->i_mapping->private_list, from, to);
> if (chg < 0)
> @@ -1305,7 +1333,7 @@ int hugetlb_reserve_pages(struct inode *
>
> if (hugetlb_get_quota(inode->i_mapping, chg))
> return -ENOSPC;
> - ret = hugetlb_acct_memory(chg);
> + ret = hugetlb_acct_memory(h, chg);
> if (ret < 0) {
> hugetlb_put_quota(inode->i_mapping, chg);
> return ret;
> @@ -1316,12 +1344,13 @@ int hugetlb_reserve_pages(struct inode *
>
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> {
> + struct hstate *h = hstate_inode(inode);
> long chg = region_truncate(&inode->i_mapping->private_list, offset);
>
> spin_lock(&inode->i_lock);
> - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
> + inode->i_blocks -= blocks_per_hugepage(h) * freed;
> spin_unlock(&inode->i_lock);
>
> hugetlb_put_quota(inode->i_mapping, (chg - freed));
> - hugetlb_acct_memory(-(chg - freed));
> + hugetlb_acct_memory(h, -(chg - freed));
> }
> Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
> +++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
> @@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
> return NULL;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> {
> pgd_t *pg;
> pud_t *pu;
> Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
> +++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
> @@ -175,7 +175,7 @@ hugetlb_get_unmapped_area(struct file *f
> return -ENOMEM;
>
> if (flags & MAP_FIXED) {
> - if (prepare_hugepage_range(addr, len))
> + if (prepare_hugepage_range(file, addr, len))
> return -EINVAL;
> return addr;
> }
> @@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
> pgoff, flags);
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux-2.6/arch/sh/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
> +++ linux-2.6/arch/sh/mm/hugetlbpage.c
> @@ -22,7 +22,7 @@
> #include <asm/tlbflush.h>
> #include <asm/cacheflush.h>
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
> +++ linux-2.6/arch/ia64/mm/hugetlbpage.c
> @@ -24,7 +24,7 @@
> unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
>
> pte_t *
> -huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
> +huge_pte_alloc (struct mm_struct *mm, unsigned long addr, unsigned long sz)
> {
> unsigned long taddr = htlbpage_to_page(addr);
> pgd_t *pgd;
> @@ -75,7 +75,7 @@ int huge_pmd_unshare(struct mm_struct *m
> * Don't actually need to do any preparation, but need to make sure
> * the address is in the right region.
> */
> -int prepare_hugepage_range(unsigned long addr, unsigned long len)
> +int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
> {
> if (len & ~HPAGE_MASK)
> return -EINVAL;
> @@ -149,7 +149,7 @@ unsigned long hugetlb_get_unmapped_area(
>
> /* Handle MAP_FIXED */
> if (flags & MAP_FIXED) {
> - if (prepare_hugepage_range(addr, len))
> + if (prepare_hugepage_range(file, addr, len))
> return -EINVAL;
> return addr;
> }
> Index: linux-2.6/arch/x86/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
> +++ linux-2.6/arch/x86/mm/hugetlbpage.c
> @@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
> return 1;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> @@ -368,7 +368,7 @@ hugetlb_get_unmapped_area(struct file *f
> return -ENOMEM;
>
> if (flags & MAP_FIXED) {
> - if (prepare_hugepage_range(addr, len))
> + if (prepare_hugepage_range(file, addr, len))
> return -EINVAL;
> return addr;
> }
> Index: linux-2.6/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/linux/hugetlb.h
> +++ linux-2.6/include/linux/hugetlb.h
> @@ -8,7 +8,6 @@
> #include <linux/mempolicy.h>
> #include <linux/shm.h>
> #include <asm/tlbflush.h>
> -#include <asm/hugetlb.h>
>
> struct ctl_table;
>
> @@ -41,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
>
> /* arch callbacks */
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz);
> pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
> int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> @@ -71,7 +70,7 @@ static inline unsigned long hugetlb_tota
> #define hugetlb_report_meminfo(buf) 0
> #define hugetlb_report_node_meminfo(n, buf) 0
> #define follow_huge_pmd(mm, addr, pmd, write) NULL
> -#define prepare_hugepage_range(addr,len) (-EINVAL)
> +#define prepare_hugepage_range(file, addr, len) (-EINVAL)
> #define pmd_huge(x) 0
> #define is_hugepage_only_range(mm, addr, len) 0
> #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
> @@ -125,8 +124,6 @@ struct file *hugetlb_file_setup(const ch
> int hugetlb_get_quota(struct address_space *mapping, long delta);
> void hugetlb_put_quota(struct address_space *mapping, long delta);
>
> -#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
> -
> static inline int is_file_hugepages(struct file *file)
> {
> if (file->f_op == &hugetlbfs_file_operations)
> @@ -155,4 +152,78 @@ unsigned long hugetlb_get_unmapped_area(
> unsigned long flags);
> #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +
> +/* Defines one hugetlb page size */
> +struct hstate {
> + int hugetlb_next_nid;
> + unsigned int order;
> + unsigned long mask;
> + unsigned long max_huge_pages;
> + unsigned long nr_huge_pages;
> + unsigned long free_huge_pages;
> + unsigned long resv_huge_pages;
> + unsigned long surplus_huge_pages;
> + unsigned long nr_overcommit_huge_pages;
> + struct list_head hugepage_freelists[MAX_NUMNODES];
> + unsigned int nr_huge_pages_node[MAX_NUMNODES];
> + unsigned int free_huge_pages_node[MAX_NUMNODES];
> + unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +};
> +
> +extern struct hstate global_hstate;
> +
> +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> +{
> + return &global_hstate;
> +}
> +
> +static inline struct hstate *hstate_file(struct file *f)
> +{
> + return &global_hstate;
> +}
> +
> +static inline struct hstate *hstate_inode(struct inode *i)
> +{
> + return &global_hstate;
> +}
> +
> +static inline unsigned long huge_page_size(struct hstate *h)
> +{
> + return (unsigned long)PAGE_SIZE << h->order;
> +}
> +
> +static inline unsigned long huge_page_mask(struct hstate *h)
> +{
> + return h->mask;
> +}
> +
> +static inline unsigned long huge_page_order(struct hstate *h)
> +{
> + return h->order;
> +}
> +
> +static inline unsigned huge_page_shift(struct hstate *h)
> +{
> + return h->order + PAGE_SHIFT;
> +}
> +
> +static inline unsigned int blocks_per_hugepage(struct hstate *h)
> +{
> + return huge_page_size(h) / 512;
> +}
> +
> +#else
> +struct hstate {};
> +#define hstate_file(f) NULL
> +#define hstate_vma(v) NULL
> +#define hstate_inode(i) NULL
> +#define huge_page_size(h) PAGE_SIZE
> +#define huge_page_mask(h) PAGE_MASK
> +#define huge_page_order(h) 0
> +#define huge_page_shift(h) PAGE_SHIFT
> +#endif
> +
> +#include <asm/hugetlb.h>
> +
> #endif /* _LINUX_HUGETLB_H */
> Index: linux-2.6/fs/hugetlbfs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/hugetlbfs/inode.c
> +++ linux-2.6/fs/hugetlbfs/inode.c
> @@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
> struct inode *inode = file->f_path.dentry->d_inode;
> loff_t len, vma_len;
> int ret;
> + struct hstate *h = hstate_file(file);
>
> /*
> * vma address alignment (but not the pgoff alignment) has
> @@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
> vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
> vma->vm_ops = &hugetlb_vm_ops;
>
> - if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
> + if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
> return -EINVAL;
>
> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> @@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
> len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> if (vma->vm_flags & VM_MAYSHARE &&
> - hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
> - len >> HPAGE_SHIFT))
> + hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h),
> + len >> huge_page_shift(h)))
> goto out;
>
> ret = 0;
> @@ -130,20 +131,21 @@ hugetlb_get_unmapped_area(struct file *f
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> unsigned long start_addr;
> + struct hstate *h = hstate_file(file);
>
> - if (len & ~HPAGE_MASK)
> + if (len & ~huge_page_mask(h))
> return -EINVAL;
> if (len > TASK_SIZE)
> return -ENOMEM;
>
> if (flags & MAP_FIXED) {
> - if (prepare_hugepage_range(addr, len))
> + if (prepare_hugepage_range(file, addr, len))
> return -EINVAL;
> return addr;
> }
>
> if (addr) {
> - addr = ALIGN(addr, HPAGE_SIZE);
> + addr = ALIGN(addr, huge_page_size(h));
> vma = find_vma(mm, addr);
> if (TASK_SIZE - len >= addr &&
> (!vma || addr + len <= vma->vm_start))
> @@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
> start_addr = TASK_UNMAPPED_BASE;
>
> full_search:
> - addr = ALIGN(start_addr, HPAGE_SIZE);
> + addr = ALIGN(start_addr, huge_page_size(h));
>
> for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
> /* At this point: (!vma || addr < vma->vm_end). */
> @@ -174,7 +176,7 @@ full_search:
>
> if (!vma || addr + len <= vma->vm_start)
> return addr;
> - addr = ALIGN(vma->vm_end, HPAGE_SIZE);
> + addr = ALIGN(vma->vm_end, huge_page_size(h));
> }
> }
> #endif
> @@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page,
> static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
> size_t len, loff_t *ppos)
> {
> + struct hstate *h = hstate_file(filp);
> struct address_space *mapping = filp->f_mapping;
> struct inode *inode = mapping->host;
> - unsigned long index = *ppos >> HPAGE_SHIFT;
> - unsigned long offset = *ppos & ~HPAGE_MASK;
> + unsigned long index = *ppos >> huge_page_shift(h);
> + unsigned long offset = *ppos & ~huge_page_mask(h);
> unsigned long end_index;
> loff_t isize;
> ssize_t retval = 0;
> @@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
> if (!isize)
> goto out;
>
> - end_index = (isize - 1) >> HPAGE_SHIFT;
> + end_index = (isize - 1) >> huge_page_shift(h);
> for (;;) {
> struct page *page;
> - int nr, ret;
> + unsigned long nr, ret;
>
> /* nr is the maximum number of bytes to copy from this page */
> - nr = HPAGE_SIZE;
> + nr = huge_page_size(h);
> if (index >= end_index) {
> if (index > end_index)
> goto out;
> - nr = ((isize - 1) & ~HPAGE_MASK) + 1;
> + nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
> if (nr <= offset) {
> goto out;
> }
> @@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
> offset += ret;
> retval += ret;
> len -= ret;
> - index += offset >> HPAGE_SHIFT;
> - offset &= ~HPAGE_MASK;
> + index += offset >> huge_page_shift(h);
> + offset &= ~huge_page_mask(h);
>
> if (page)
> page_cache_release(page);
> @@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
> break;
> }
> out:
> - *ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
> + *ppos = ((loff_t)index << huge_page_shift(h)) + offset;
> mutex_unlock(&inode->i_mutex);
> return retval;
> }
> @@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
>
> static void truncate_hugepages(struct inode *inode, loff_t lstart)
> {
> + struct hstate *h = hstate_inode(inode);
> struct address_space *mapping = &inode->i_data;
> - const pgoff_t start = lstart >> HPAGE_SHIFT;
> + const pgoff_t start = lstart >> huge_page_shift(h);
> struct pagevec pvec;
> pgoff_t next;
> int i, freed = 0;
> @@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
> {
> pgoff_t pgoff;
> struct address_space *mapping = inode->i_mapping;
> + struct hstate *h = hstate_inode(inode);
>
> - BUG_ON(offset & ~HPAGE_MASK);
> + BUG_ON(offset & ~huge_page_mask(h));
> pgoff = offset >> PAGE_SHIFT;
>
> i_size_write(inode, offset);
> @@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
> static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
> {
> struct inode *inode = dentry->d_inode;
> + struct hstate *h = hstate_inode(inode);
> int error;
> unsigned int ia_valid = attr->ia_valid;
>
> @@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
>
> if (ia_valid & ATTR_SIZE) {
> error = -EINVAL;
> - if (!(attr->ia_size & ~HPAGE_MASK))
> + if (!(attr->ia_size & ~huge_page_mask(h)))
> error = hugetlb_vmtruncate(inode, attr->ia_size);
> if (error)
> goto out;
> @@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
> static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> {
> struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> + struct hstate *h = hstate_inode(dentry->d_inode);
>
> buf->f_type = HUGETLBFS_MAGIC;
> - buf->f_bsize = HPAGE_SIZE;
> + buf->f_bsize = huge_page_size(h);
> if (sbinfo) {
> spin_lock(&sbinfo->stat_lock);
> /* If no limits set, just report 0 for max/free/used
> Index: linux-2.6/ipc/shm.c
> ===================================================================
> --- linux-2.6.orig/ipc/shm.c
> +++ linux-2.6/ipc/shm.c
> @@ -577,7 +577,8 @@ static void shm_get_stat(struct ipc_name
>
> if (is_file_hugepages(shp->shm_file)) {
> struct address_space *mapping = inode->i_mapping;
> - *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
> + struct hstate *h = hstate_file(shp->shm_file);
> + *rss += (1 << huge_page_order(h)) * mapping->nrpages;
> } else {
> struct shmem_inode_info *info = SHMEM_I(inode);
> spin_lock(&info->lock);
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -901,7 +901,7 @@ unsigned long unmap_vmas(struct mmu_gath
> if (unlikely(is_vm_hugetlb_page(vma))) {
> unmap_hugepage_range(vma, start, end);
> zap_work -= (end - start) /
> - (HPAGE_SIZE / PAGE_SIZE);
> + (1 << huge_page_order(hstate_vma(vma)));
> start = end;
> } else
> start = unmap_page_range(*tlbp, vma,
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c
> +++ linux-2.6/mm/mempolicy.c
> @@ -1477,7 +1477,7 @@ struct zonelist *huge_zonelist(struct vm
>
> if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
> zl = node_zonelist(interleave_nid(*mpol, vma, addr,
> - HPAGE_SHIFT), gfp_flags);
> + huge_page_shift(hstate_vma(vma))), gfp_flags);
> } else {
> zl = policy_zonelist(gfp_flags, *mpol);
> if ((*mpol)->mode == MPOL_BIND)
> @@ -2216,9 +2216,12 @@ static void check_huge_range(struct vm_a
> {
> unsigned long addr;
> struct page *page;
> + struct hstate *h = hstate_vma(vma);
> + unsigned long sz = huge_page_size(h);
>
> - for (addr = start; addr < end; addr += HPAGE_SIZE) {
> - pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
> + for (addr = start; addr < end; addr += sz) {
> + pte_t *ptep = huge_pte_offset(vma->vm_mm,
> + addr & huge_page_mask(h));
> pte_t pte;
>
> if (!ptep)
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -1800,7 +1800,8 @@ int split_vma(struct mm_struct * mm, str
> struct mempolicy *pol;
> struct vm_area_struct *new;
>
> - if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
> + if (is_vm_hugetlb_page(vma) && (addr &
> + ~(huge_page_mask(hstate_vma(vma)))))
> return -EINVAL;
>
> if (mm->map_count >= sysctl_max_map_count)
> Index: linux-2.6/include/asm-ia64/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-ia64/hugetlb.h
> +++ linux-2.6/include/asm-ia64/hugetlb.h
> @@ -8,7 +8,7 @@ void hugetlb_free_pgd_range(struct mmu_g
> unsigned long end, unsigned long floor,
> unsigned long ceiling);
>
> -int prepare_hugepage_range(unsigned long addr, unsigned long len);
> +int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len);
>
> static inline int is_hugepage_only_range(struct mm_struct *mm,
> unsigned long addr,
> Index: linux-2.6/include/asm-powerpc/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-powerpc/hugetlb.h
> +++ linux-2.6/include/asm-powerpc/hugetlb.h
> @@ -21,7 +21,7 @@ pte_t huge_ptep_get_and_clear(struct mm_
> * If the arch doesn't supply something else, assume that hugepage
> * size aligned regions are ok without further preparation.
> */
> -static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
> +static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
> {
> if (len & ~HPAGE_MASK)
> return -EINVAL;
> Index: linux-2.6/include/asm-s390/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-s390/hugetlb.h
> +++ linux-2.6/include/asm-s390/hugetlb.h
> @@ -22,7 +22,7 @@ void set_huge_pte_at(struct mm_struct *m
> * If the arch doesn't supply something else, assume that hugepage
> * size aligned regions are ok without further preparation.
> */
> -static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
> +static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
> {
> if (len & ~HPAGE_MASK)
> return -EINVAL;
> Index: linux-2.6/include/asm-sh/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-sh/hugetlb.h
> +++ linux-2.6/include/asm-sh/hugetlb.h
> @@ -14,7 +14,7 @@ static inline int is_hugepage_only_range
> * If the arch doesn't supply something else, assume that hugepage
> * size aligned regions are ok without further preparation.
> */
> -static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
> +static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
> {
> if (len & ~HPAGE_MASK)
> return -EINVAL;
> Index: linux-2.6/include/asm-sparc64/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-sparc64/hugetlb.h
> +++ linux-2.6/include/asm-sparc64/hugetlb.h
> @@ -22,7 +22,7 @@ static inline int is_hugepage_only_range
> * If the arch doesn't supply something else, assume that hugepage
> * size aligned regions are ok without further preparation.
> */
> -static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
> +static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
> {
> if (len & ~HPAGE_MASK)
> return -EINVAL;
> Index: linux-2.6/include/asm-x86/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/hugetlb.h
> +++ linux-2.6/include/asm-x86/hugetlb.h
> @@ -14,11 +14,12 @@ static inline int is_hugepage_only_range
> * If the arch doesn't supply something else, assume that hugepage
> * size aligned regions are ok without further preparation.
> */
> -static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
> +static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
> {
> - if (len & ~HPAGE_MASK)
> + struct hstate *h = hstate_file(file);
> + if (len & ~huge_page_mask(h))
> return -EINVAL;
> - if (addr & ~HPAGE_MASK)
> + if (addr & ~huge_page_mask(h))
> return -EINVAL;
> return 0;
> }
>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [patch 03/23] hugetlb: modular state
2008-05-27 20:38 ` Adam Litke
@ 2008-05-28 9:13 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 9:13 UTC (permalink / raw)
To: Adam Litke; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Tue, May 27, 2008 at 03:38:07PM -0500, Adam Litke wrote:
> Phew. At last I made it to the end of this one :) It seems okay to me
> though. Have you done any performance testing on this patch series yet?
> I don't expect the hstate structure to introduce any measurable
> performance degradation, but it would be nice to have some numbers to
> back up that educated guess.
Haven't seen any noticeable performance differences, but I don't know
that I'm doing particularly interesting testing. Would be nice to
get some results with HPC or databases or something that actually
tests the code.
I'd say with HUGE_MAX_HSTATE == 1, the compiler _should_ be able to
constant-fold much of it away. There would be a few more pointer
dereferences (e.g. to get the hstate from the inode or vma)... if that
_really_ matters, we could special-case HUGE_MAX_HSTATE == 1 in some
places to bring performance back up.
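To illustrate (a hypothetical sketch, not something in this series),
the single-hstate case could be folded to a compile-time constant:

static inline unsigned long huge_page_order(struct hstate *h)
{
#if HUGE_MAX_HSTATE == 1
	/* only one hstate: the order is a compile-time constant */
	return HPAGE_SHIFT - PAGE_SHIFT;
#else
	return h->order;
#endif
}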
>
> Acked-by: Adam Litke <agl@us.ibm.com>
>
> On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> > plain text document attachment (hugetlb-modular-state.patch)
> > Large, but rather mechanical patch that converts most of the hugetlb.c
> > globals into structure members and passes them around.
--
* [patch 04/23] hugetlb: multiple hstates
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (2 preceding siblings ...)
2008-05-25 14:23 ` [patch 03/23] hugetlb: modular state npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 16:52 ` Nishanth Aravamudan
2008-05-27 20:43 ` Adam Litke
2008-05-25 14:23 ` [patch 05/23] hugetlb: multi hstate proc files npiggin
` (19 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-multiple-hstates.patch --]
[-- Type: text/plain, Size: 5816 bytes --]
Add basic support for more than one hstate in hugetlbfs
- Convert hstates to an array
- Add a first default entry covering the standard huge page size
- Add functions for architectures to register new hstates
- Add basic iterators over hstates
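As an illustration of the registration hook (a sketch only; the real
x86 parser is added later in the series), an architecture's
"hugepagesz=" handler might look roughly like:

static int __init setup_hugepagesz(char *s)
{
	unsigned long long ps = memparse(s, &s);

	if (ps == PMD_SIZE)
		hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
	else
		printk(KERN_ERR "hugepagesz: unsupported size %llu\n", ps);
	return 1;
}
__setup("hugepagesz=", setup_hugepagesz);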
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/hugetlb.h | 16 +++++++
mm/hugetlb.c | 98 +++++++++++++++++++++++++++++++++++++++---------
2 files changed, 95 insertions(+), 19 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -27,7 +27,15 @@ unsigned long sysctl_overcommit_huge_pag
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-struct hstate global_hstate;
+static int max_hstate = 0;
+struct hstate hstates[HUGE_MAX_HSTATE];
+
+/* for command line parsing */
+static struct hstate * __initdata parsed_hstate = NULL;
+static unsigned long __initdata default_hstate_max_huge_pages = 0;
+
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -132,13 +140,24 @@ static void update_and_free_page(struct
__free_pages(page, huge_page_order(h));
}
+struct hstate *size_to_hstate(unsigned long size)
+{
+ struct hstate *h;
+
+ for_each_hstate (h) {
+ if (huge_page_size(h) == size)
+ return h;
+ }
+ return NULL;
+}
+
static void free_huge_page(struct page *page)
{
/*
* Can't pass hstate in here because it is called from the
* compound page destructor.
*/
- struct hstate *h = &global_hstate;
+ struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
struct address_space *mapping;
@@ -523,38 +542,80 @@ static struct page *alloc_huge_page(stru
return page;
}
-static int __init hugetlb_init(void)
+static void __init hugetlb_init_one_hstate(struct hstate *h)
{
unsigned long i;
- struct hstate *h = &global_hstate;
-
- if (HPAGE_SHIFT == 0)
- return 0;
-
- if (!h->order) {
- h->order = HPAGE_SHIFT - PAGE_SHIFT;
- h->mask = HPAGE_MASK;
- }
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
h->hugetlb_next_nid = first_node(node_online_map);
- for (i = 0; i < max_huge_pages; ++i) {
+ for (i = 0; i < h->max_huge_pages; ++i) {
if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
+ h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+}
+
+static void __init hugetlb_init_hstates(void)
+{
+ struct hstate *h;
+
+ for_each_hstate(h) {
+ hugetlb_init_one_hstate(h);
+ }
+}
+
+static void __init report_hugepages(void)
+{
+ struct hstate *h;
+
+ for_each_hstate(h) {
+ printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ h->free_huge_pages,
+ 1 << (h->order + PAGE_SHIFT - 20));
+ }
+}
+
+static int __init hugetlb_init(void)
+{
+ BUILD_BUG_ON(HPAGE_SHIFT == 0);
+
+ if (!size_to_hstate(HPAGE_SIZE)) {
+ hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
+ parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
+ }
+
+ hugetlb_init_hstates();
+
+ report_hugepages();
+
return 0;
}
module_init(hugetlb_init);
+/* Should be called on processing a hugepagesz=... option */
+void __init hugetlb_add_hstate(unsigned order)
+{
+ struct hstate *h;
+ if (size_to_hstate(PAGE_SIZE << order)) {
+ printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
+ return;
+ }
+ BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(order == 0);
+ h = &hstates[max_hstate++];
+ h->order = order;
+ h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+ hugetlb_init_one_hstate(h);
+ parsed_hstate = h;
+}
+
static int __init hugetlb_setup(char *s)
{
- if (sscanf(s, "%lu", &max_huge_pages) <= 0)
- max_huge_pages = 0;
+ if (sscanf(s, "%lu", &default_hstate_max_huge_pages) <= 0)
+ default_hstate_max_huge_pages = 0;
return 1;
}
__setup("hugepages=", hugetlb_setup);
@@ -585,7 +646,7 @@ static void try_to_free_low(struct hstat
if (PageHighMem(page))
continue;
list_del(&page->lru);
- update_and_free_page(page);
+ update_and_free_page(h, page);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
}
@@ -675,6 +736,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
{
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
max_huge_pages = set_max_huge_pages(max_huge_pages);
+ global_hstate.max_huge_pages = max_huge_pages;
return 0;
}
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -171,7 +171,16 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
};
-extern struct hstate global_hstate;
+void __init hugetlb_add_hstate(unsigned order);
+struct hstate *size_to_hstate(unsigned long size);
+
+#ifndef HUGE_MAX_HSTATE
+#define HUGE_MAX_HSTATE 1
+#endif
+
+extern struct hstate hstates[HUGE_MAX_HSTATE];
+
+#define global_hstate (hstates[0])
static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
@@ -213,6 +222,11 @@ static inline unsigned int blocks_per_hu
return huge_page_size(h) / 512;
}
+static inline struct hstate *page_hstate(struct page *page)
+{
+ return size_to_hstate(PAGE_SIZE << compound_order(page));
+}
+
#else
struct hstate {};
#define hstate_file(f) NULL
--
* Re: [patch 04/23] hugetlb: multiple hstates
2008-05-25 14:23 ` [patch 04/23] hugetlb: multiple hstates npiggin
@ 2008-05-27 16:52 ` Nishanth Aravamudan
2008-05-27 20:43 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 16:52 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 26.05.2008 [00:23:21 +1000], npiggin@suse.de wrote:
> Add basic support for more than one hstate in hugetlbfs
>
> - Convert hstates to an array
> - Add a first default entry covering the standard huge page size
> - Add functions for architectures to register new hstates
> - Add basic iterators over hstates
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 04/23] hugetlb: multiple hstates
2008-05-25 14:23 ` [patch 04/23] hugetlb: multiple hstates npiggin
2008-05-27 16:52 ` Nishanth Aravamudan
@ 2008-05-27 20:43 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 20:43 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-multiple-hstates.patch)
> Add basic support for more than one hstate in hugetlbfs
>
> - Convert hstates to an array
> - Add a first default entry covering the standard huge page size
> - Add functions for architectures to register new hstates
> - Add basic iterators over hstates
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 05/23] hugetlb: multi hstate proc files
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (3 preceding siblings ...)
2008-05-25 14:23 ` [patch 04/23] hugetlb: multiple hstates npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-29 5:07 ` Nishanth Aravamudan
2008-05-25 14:23 ` [patch 06/23] hugetlbfs: per mount hstates npiggin
` (18 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-proc-hstates.patch --]
[-- Type: text/plain, Size: 3367 bytes --]
Convert /proc output code over to report multiple hstates
I chose to just report the numbers in a row, in the hope
of minimizing breakage of existing software. The "compat" page size
is always the first number.
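With two page sizes configured (say 16m registered first, then 64k),
the meminfo section would come out roughly like this (illustrative
numbers only):

HugePages_Total:     8    64
HugePages_Free:      8    60
HugePages_Rsvd:      0     0
HugePages_Surp:      0     0
Hugepagesize:  16384    64 kB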
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 64 ++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 42 insertions(+), 22 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -766,39 +766,59 @@ int hugetlb_overcommit_handler(struct ct
#endif /* CONFIG_SYSCTL */
+static int dump_field(char *buf, unsigned field)
+{
+ int n = 0;
+ struct hstate *h;
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
+ buf[n++] = '\n';
+ return n;
+}
+
int hugetlb_report_meminfo(char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "HugePages_Total: %5lu\n"
- "HugePages_Free: %5lu\n"
- "HugePages_Rsvd: %5lu\n"
- "HugePages_Surp: %5lu\n"
- "Hugepagesize: %5lu kB\n",
- h->nr_huge_pages,
- h->free_huge_pages,
- h->resv_huge_pages,
- h->surplus_huge_pages,
- 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
+ struct hstate *h;
+ int n = 0;
+ n += sprintf(buf + 0, "HugePages_Total:");
+ n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
+ n += sprintf(buf + n, "HugePages_Free: ");
+ n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
+ n += sprintf(buf + n, "HugePages_Rsvd: ");
+ n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
+ n += sprintf(buf + n, "HugePages_Surp: ");
+ n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
+ n += sprintf(buf + n, "Hugepagesize: ");
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", huge_page_size(h) / 1024);
+ n += sprintf(buf + n, " kB\n");
+ return n;
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "Node %d HugePages_Total: %5u\n"
- "Node %d HugePages_Free: %5u\n"
- "Node %d HugePages_Surp: %5u\n",
- nid, h->nr_huge_pages_node[nid],
- nid, h->free_huge_pages_node[nid],
- nid, h->surplus_huge_pages_node[nid]);
+ int n = 0;
+ n += sprintf(buf, "Node %d HugePages_Total: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ nr_huge_pages_node[nid]));
+ n += sprintf(buf + n, "Node %d HugePages_Free: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ free_huge_pages_node[nid]));
+ n += sprintf(buf + n, "Node %d HugePages_Surp: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ surplus_huge_pages_node[nid]));
+ return n;
}
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- struct hstate *h = &global_hstate;
- return h->nr_huge_pages * (1 << huge_page_order(h));
+ long x = 0;
+ struct hstate *h;
+ for_each_hstate (h) {
+ x += h->nr_huge_pages * (1 << huge_page_order(h));
+ }
+ return x;
}
/*
--
* Re: [patch 05/23] hugetlb: multi hstate proc files
2008-05-25 14:23 ` [patch 05/23] hugetlb: multi hstate proc files npiggin
@ 2008-05-29 5:07 ` Nishanth Aravamudan
2008-05-29 5:44 ` Nishanth Aravamudan
2008-05-29 9:04 ` Nick Piggin
0 siblings, 2 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 5:07 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 26.05.2008 [00:23:22 +1000], npiggin@suse.de wrote:
> Convert /proc output code over to report multiple hstates
>
> I chose to just report the numbers in a row, in the hope
> of minimizing breakage of existing software. The "compat" page size
> is always the first number.
I'm assuming this is just copied from the old changelog, because as far
as I can tell, and from my quick testing just now with my sysfs patch,
hstates[0] is just whichever hugepage size is registered first. So that
either means by "compat" you meant the default on the current system
(which is only compatible with boots having the same order of boot-line
parameters) or we need to fix this patch to put HPAGE_SIZE (which we
haven't changed, per se) to be in hstates[0]. It might help to have a
helper macro called default_hstate (or a comment) [which I thought we
had in the beginning of the patchset, but I see one of the intervening
patches removed it] indicating which state is the default when none is
specified.
The reason I bring this up is that I have my sysfs patchset in two
parts. First, I add the sysfs interface and then I remove the
multi-valued proc files. But for the latter, I rely on hstates[0] to be
the one we want to be presenting in proc. If that's not the case, how
should I be determining which hstate is the default? If that is the
case, shall I make the reverting patch also put the "right" value in
hstates[0]?
At the same time, my testing of the sysfs code appears successful still
and I will hopefully be able to post them in the morning.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/23] hugetlb: multi hstate proc files
2008-05-29 5:07 ` Nishanth Aravamudan
@ 2008-05-29 5:44 ` Nishanth Aravamudan
2008-05-29 6:30 ` Nishanth Aravamudan
2008-05-29 9:04 ` Nick Piggin
1 sibling, 1 reply; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 5:44 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 28.05.2008 [22:07:03 -0700], Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:22 +1000], npiggin@suse.de wrote:
> > Convert /proc output code over to report multiple hstates
> >
> > I chose to just report the numbers in a row, in the hope
> > of minimizing breakage of existing software. The "compat" page size
> > is always the first number.
>
> I'm assuming this is just copied from the old changelog, because as far
> as I can tell, and from my quick testing just now with my sysfs patch,
> hstates[0] is just whichever hugepage size is registered first. So that
> either means by "compat" you meant the default on the current system
> (which is only compatible with boots having the same order of boot-line
> parameters) or we need to fix this patch to put HPAGE_SIZE (which we
> haven't changed, per se) to be in hstates[0]. It might help to have a
> helper macro called default_hstate (or a comment) [which I thought we
> had in the beginning of the patchset, but I see one of the intervening
> patches removed it] indicating which state is the default when none is
> specified.
>
> The reason I bring this up is that I have my sysfs patchset in two
> parts. First, I add the sysfs interface and then I remove the
> multi-valued proc files. But for the latter, I rely on hstates[0] to be
> the one we want to be presenting in proc. If that's not the case, how
> should I be determining which hstate is the default? If that is the
> case, shall I make the reverting patch also put the "right" value in
> hstates[0]?
Oh, I think I know what is going on now. It's because I hadn't changed
my test script between the old version of the stack and this one so it
was still putting "hugepagesz=64k hugepagesz=16m hugepagesz=16g" on the
kernel command-line, thus making 64k (the first hugepagesz specified) be
the default for the system. So, actually, using hstates[0] in this way
does work. I'm running one more test set without specifying any hugepagesz=
options on the kernel command-line to check that the default layout in proc
and sysfs is sane.
However, this nuance definitely should be documented in
vm/hugetlbpage.txt / kernel-parameters.txt. Administrators who wish to
preallocate hugepages should be aware that the order in which they
specify the preallocation affects which hugepage size is the default
(presuming my interpretation of my results is correct).
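For example (illustrative command lines): booting with
"hugepagesz=16m hugepagesz=64k" makes 16m the default size reported
first, while "hugepagesz=64k hugepagesz=16m" makes 64k the default.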
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/23] hugetlb: multi hstate proc files
2008-05-29 5:44 ` Nishanth Aravamudan
@ 2008-05-29 6:30 ` Nishanth Aravamudan
0 siblings, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 6:30 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 28.05.2008 [22:44:39 -0700], Nishanth Aravamudan wrote:
> On 28.05.2008 [22:07:03 -0700], Nishanth Aravamudan wrote:
> > On 26.05.2008 [00:23:22 +1000], npiggin@suse.de wrote:
> > > Convert /proc output code over to report multiple hstates
> > >
> > > I chose to just report the numbers in a row, in the hope
> > > of minimizing breakage of existing software. The "compat" page size
> > > is always the first number.
> >
> > I'm assuming this is just copied from the old changelog, because as far
> > as I can tell, and from my quick testing just now with my sysfs patch,
> > hstates[0] is just whichever hugepage size is registered first. So that
> > either means by "compat" you meant the default on the current system
> > (which is only compatible with boots having the same order of boot-line
> > parameters) or we need to fix this patch to put HPAGE_SIZE (which we
> > haven't changed, per se) to be in hstates[0]. It might help to have a
> > helper macro called default_hstate (or a comment) [which I thought we
> > had in the beginning of the patchset, but I see one of the intervening
> > patches removed it] indicating which state is the default when none is
> > specified.
> >
> > The reason I bring this up is that I have my sysfs patchset in two
> > parts. First, I add the sysfs interface and then I remove the
> > multi-valued proc files. But for the latter, I rely on hstates[0] to be
> > the one we want to be presenting in proc. If that's not the case, how
> > should I be determining which hstate is the default? If that is the
> > case, shall I make the reverting patch also put the "right" value in
> > hstates[0]?
>
> Oh, I think I know what is going on now. It's because I hadn't changed
> my test script between the old version of the stack and this one so it
> was still putting "hugepagesz=64k hugepagesz=16m hugepagesz=16g" on the
> kernel command-line, thus making 64k (the first hugepagesz specified) be
> the default for the system. So, actually, using hstates[0] in this way
> does work. I'm running one more test set without specifying any hugepagesz=
> options on the kernel command-line to check that the default layout in proc
> and sysfs is sane.
Confirmed that it's a result of my kernel command-line specifying 64k
first.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/23] hugetlb: multi hstate proc files
2008-05-29 5:07 ` Nishanth Aravamudan
2008-05-29 5:44 ` Nishanth Aravamudan
@ 2008-05-29 9:04 ` Nick Piggin
1 sibling, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-29 9:04 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On Wed, May 28, 2008 at 10:07:03PM -0700, Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:22 +1000], npiggin@suse.de wrote:
> > Convert /proc output code over to report multiple hstates
> >
> > I chose to just report the numbers in a row, in the hope
> > of minimizing breakage of existing software. The "compat" page size
> > is always the first number.
>
> I'm assuming this is just copied from the old changelog, because as far
> as I can tell, and from my quick testing just now with my sysfs patch,
> hstates[0] is just whichever hugepage size is registered first. So that
> either means by "compat" you meant the default on the current system
> (which is only compatible with boots having the same order of boot-line
> parameters) or we need to fix this patch to put HPAGE_SIZE (which we
> haven't changed, per se) to be in hstates[0]. It might help to have a
> helper macro called default_hstate (or a comment) [which I thought we
> had in the beginning of the patchset, but I see one of the intervening
> patches removed it] indicating which state is the default when none is
> specified.
>
> The reason I bring this up is that I have my sysfs patchset in two
> parts. First, I add the sysfs interface and then I remove the
> multi-valued proc files. But for the latter, I rely on hstates[0] to be
> the one we want to be presenting in proc. If that's not the case, how
> should I be determining which hstate is the default? If that is the
> case, shall I make the reverting patch also put the "right" value in
> hstates[0]?
No, it's just laziness on my part. I wanted to get the patchset out again
as quickly as I could (after the delay), so I kind of ignored the user
interface problems.
I can try to fix something up, or otherwise I guess you could do it in
your patchset? Eg. have an index which maps the default hstate from the
hstate array. This would default to the existing sizes but I guess it
would make sense to be able to override it at boot.
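Something along these lines, say (rough sketch only, invented names,
untested):

static int default_hstate_idx;
#define default_hstate (hstates[default_hstate_idx])

/* e.g. "default_hugepagesz=16m" on the command line */
static int __init default_hugepagesz_setup(char *s)
{
	struct hstate *h = size_to_hstate(memparse(s, &s));
	if (h)
		default_hstate_idx = h - hstates;
	return 1;
}
__setup("default_hugepagesz=", default_hugepagesz_setup);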
Do you want to do that, or should I?
> At the same time, my testing of the sysfs code appears successful still
> and I will hopefully be able to post them in the morning.
Oh good.
Thanks,
Nick
* [patch 06/23] hugetlbfs: per mount hstates
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (4 preceding siblings ...)
2008-05-25 14:23 ` [patch 05/23] hugetlb: multi hstate proc files npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 16:58 ` Nishanth Aravamudan
2008-05-27 20:50 ` Adam Litke
2008-05-25 14:23 ` [patch 07/23] hugetlb: multi hstate sysctls npiggin
` (17 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlbfs-per-mount-hstate.patch --]
[-- Type: text/plain, Size: 7995 bytes --]
Add support to have individual hstates for each hugetlbfs mount
- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- Set up pointers to a suitable hstate for the set page size option
to the super block and the inode and the vma.
- Change the hstate accessors to use this information
- Add code to the hstate init function to set parsed_hstate for command
line processing
- Handle duplicate hstate registrations to make the command line user-proof
[np: take hstate out of hugetlbfs inode and vma->vm_private_data]
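With this, mounting one filesystem per page size looks something like
(illustrative paths and sizes; memparse accepts the k/m/g suffixes):

	mount -t hugetlbfs -o pagesize=64k none /mnt/huge-64k
	mount -t hugetlbfs -o pagesize=16m none /mnt/huge-16m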
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
include/linux/hugetlb.h | 14 +++++++++-----
mm/hugetlb.c | 16 +++-------------
mm/memory.c | 18 ++++++++++++++++--
4 files changed, 66 insertions(+), 30 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -92,6 +92,7 @@ struct hugetlbfs_config {
umode_t mode;
long nr_blocks;
long nr_inodes;
+ struct hstate *hstate;
};
struct hugetlbfs_sb_info {
@@ -100,6 +101,7 @@ struct hugetlbfs_sb_info {
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
+ struct hstate *hstate;
};
@@ -182,19 +184,21 @@ extern struct hstate hstates[HUGE_MAX_HS
#define global_hstate (hstates[0])
-static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+static inline struct hstate *hstate_inode(struct inode *i)
{
- return &global_hstate;
+ struct hugetlbfs_sb_info *hsb;
+ hsb = HUGETLBFS_SB(i->i_sb);
+ return hsb->hstate;
}
static inline struct hstate *hstate_file(struct file *f)
{
- return &global_hstate;
+ return hstate_inode(f->f_dentry->d_inode);
}
-static inline struct hstate *hstate_inode(struct inode *i)
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
- return &global_hstate;
+ return hstate_file(vma->vm_file);
}
static inline unsigned long huge_page_size(struct hstate *h)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -53,6 +53,7 @@ int sysctl_hugetlb_shm_group;
enum {
Opt_size, Opt_nr_inodes,
Opt_mode, Opt_uid, Opt_gid,
+ Opt_pagesize,
Opt_err,
};
@@ -62,6 +63,7 @@ static match_table_t tokens = {
{Opt_mode, "mode=%o"},
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_pagesize, "pagesize=%s"},
{Opt_err, NULL},
};
@@ -750,6 +752,8 @@ hugetlbfs_parse_options(char *options, s
char *p, *rest;
substring_t args[MAX_OPT_ARGS];
int option;
+ unsigned long long size = 0;
+ enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE;
if (!options)
return 0;
@@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
break;
case Opt_size: {
- unsigned long long size;
/* memparse() will accept a K/M/G without a digit */
if (!isdigit(*args[0].from))
goto bad_val;
size = memparse(args[0].from, &rest);
- if (*rest == '%') {
- size <<= HPAGE_SHIFT;
- size *= max_huge_pages;
- do_div(size, 100);
- }
- pconfig->nr_blocks = (size >> HPAGE_SHIFT);
+ setsize = SIZE_STD;
+ if (*rest == '%')
+ setsize = SIZE_PERCENT;
break;
}
@@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
pconfig->nr_inodes = memparse(args[0].from, &rest);
break;
+ case Opt_pagesize: {
+ unsigned long ps;
+ ps = memparse(args[0].from, &rest);
+ pconfig->hstate = size_to_hstate(ps);
+ if (!pconfig->hstate) {
+ printk(KERN_ERR
+ "hugetlbfs: Unsupported page size %lu MB\n",
+ ps >> 20);
+ return -EINVAL;
+ }
+ break;
+ }
+
default:
printk(KERN_ERR "hugetlbfs: Bad mount option: \"%s\"\n",
p);
@@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
break;
}
}
+
+ /* Do size after hstate is set up */
+ if (setsize > NO_SIZE) {
+ struct hstate *h = pconfig->hstate;
+ if (setsize == SIZE_PERCENT) {
+ size <<= huge_page_shift(h);
+ size *= h->max_huge_pages;
+ do_div(size, 100);
+ }
+ pconfig->nr_blocks = (size >> huge_page_shift(h));
+ }
+
return 0;
bad_val:
@@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block
config.uid = current->fsuid;
config.gid = current->fsgid;
config.mode = 0755;
+ config.hstate = size_to_hstate(HPAGE_SIZE);
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
@@ -840,14 +866,15 @@ hugetlbfs_fill_super(struct super_block
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
+ sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
sbinfo->max_blocks = config.nr_blocks;
sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = HPAGE_SIZE;
- sb->s_blocksize_bits = HPAGE_SHIFT;
+ sb->s_blocksize = huge_page_size(config.hstate);
+ sb->s_blocksize_bits = huge_page_shift(config.hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
@@ -949,7 +976,8 @@ struct file *hugetlb_file_setup(const ch
goto out_dentry;
error = -ENOMEM;
- if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
+ if (hugetlb_reserve_pages(inode, 0,
+ size >> huge_page_shift(hstate_inode(inode))))
goto out_inode;
d_instantiate(dentry, inode);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -961,19 +961,9 @@ void __unmap_hugepage_range(struct vm_ar
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)
{
- /*
- * It is undesirable to test vma->vm_file as it should be non-null
- * for valid hugetlb area. However, vm_file will be NULL in the error
- * cleanup path of do_mmap_pgoff. When hugetlbfs ->mmap method fails,
- * do_mmap_pgoff() nullifies vma->vm_file before calling this function
- * to clean up. Since no pte has actually been setup, it is safe to
- * do nothing in this case.
- */
- if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
- __unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
- }
+ spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ __unmap_hugepage_range(vma, start, end);
+ spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -899,9 +899,23 @@ unsigned long unmap_vmas(struct mmu_gath
}
if (unlikely(is_vm_hugetlb_page(vma))) {
- unmap_hugepage_range(vma, start, end);
- zap_work -= (end - start) /
+ /*
+ * It is undesirable to test vma->vm_file as it
+ * should be non-null for valid hugetlb area.
+ * However, vm_file will be NULL in the error
+ * cleanup path of do_mmap_pgoff. When
+ * hugetlbfs ->mmap method fails,
+ * do_mmap_pgoff() nullifies vma->vm_file
+ * before calling this function to clean up.
+ * Since no pte has actually been setup, it is
+ * safe to do nothing in this case.
+ */
+ if (vma->vm_file) {
+ unmap_hugepage_range(vma, start, end);
+ zap_work -= (end - start) /
(1 << huge_page_order(hstate_vma(vma)));
+ }
+
start = end;
} else
start = unmap_page_range(*tlbp, vma,
--
* Re: [patch 06/23] hugetlbfs: per mount hstates
2008-05-25 14:23 ` [patch 06/23] hugetlbfs: per mount hstates npiggin
@ 2008-05-27 16:58 ` Nishanth Aravamudan
2008-05-27 20:50 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 16:58 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On 26.05.2008 [00:23:23 +1000], npiggin@suse.de wrote:
> Add support to have individual hstates for each hugetlbfs mount
>
> - Add a new pagesize= option to the hugetlbfs mount that allows setting
> the page size
> - Set up pointers to a suitable hstate for the set page size option
> to the super block and the inode and the vma.
> - Change the hstate accessors to use this information
> - Add code to the hstate init function to set parsed_hstate for command
> line processing
> - Handle duplicate hstate registrations to make the command line user-proof
>
> [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 06/23] hugetlbfs: per mount hstates
2008-05-25 14:23 ` [patch 06/23] hugetlbfs: per mount hstates npiggin
2008-05-27 16:58 ` Nishanth Aravamudan
@ 2008-05-27 20:50 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 20:50 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara,
Andi Kleen
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlbfs-per-mount-hstate.patch)
> Add support to have individual hstates for each hugetlbfs mount
>
> - Add a new pagesize= option to the hugetlbfs mount that allows setting
> the page size
> - Set up pointers to a suitable hstate for the set page size option
> to the super block and the inode and the vma.
> - Change the hstate accessors to use this information
> - Add code to the hstate init function to set parsed_hstate for command
> line processing
> - Handle duplicate hstate registrations to make the command line user-proof
>
> [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 07/23] hugetlb: multi hstate sysctls
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (5 preceding siblings ...)
2008-05-25 14:23 ` [patch 06/23] hugetlbfs: per mount hstates npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:00 ` Adam Litke
` (2 more replies)
2008-05-25 14:23 ` [patch 08/23] hugetlb: abstract numa round robin selection npiggin
` (16 subsequent siblings)
23 siblings, 3 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlbfs-sysctl-hstates.patch --]
[-- Type: text/plain, Size: 6051 bytes --]
Expand the hugetlbfs sysctls to handle arrays for all hstates. This
now allows the removal of global_hstate -- everything is now hstate
aware.
- I didn't bother with hugetlb_shm_group and treat_as_movable,
these are still single globals.
- Also improve error propagation for the sysctl handlers a bit
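Each of the array sysctls then takes one value per registered hstate,
in hstate order, e.g. (illustrative, 16m and 64k hstates):

	# cat /proc/sys/vm/nr_hugepages
	8	64
	# echo "16 64" > /proc/sys/vm/nr_hugepages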
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/hugetlb.h | 7 ++--
kernel/sysctl.c | 4 ++
mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++++++-----------
3 files changed, 61 insertions(+), 20 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -32,8 +32,6 @@ int hugetlb_fault(struct mm_struct *mm,
int hugetlb_reserve_pages(struct inode *inode, long from, long to);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-extern unsigned long max_huge_pages;
-extern unsigned long sysctl_overcommit_huge_pages;
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
@@ -182,8 +180,6 @@ struct hstate *size_to_hstate(unsigned l
extern struct hstate hstates[HUGE_MAX_HSTATE];
-#define global_hstate (hstates[0])
-
static inline struct hstate *hstate_inode(struct inode *i)
{
struct hugetlbfs_sb_info *hsb;
@@ -231,6 +227,9 @@ static inline struct hstate *page_hstate
return size_to_hstate(PAGE_SIZE << compound_order(page));
}
+extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+
#else
struct hstate {};
#define hstate_file(f) NULL
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -928,7 +928,7 @@ static struct ctl_table vm_table[] = {
{
.procname = "nr_hugepages",
.data = &max_huge_pages,
- .maxlen = sizeof(unsigned long),
+ .maxlen = sizeof(max_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_sysctl_handler,
.extra1 = (void *)&hugetlb_zero,
@@ -957,6 +957,8 @@ static struct ctl_table vm_table[] = {
.maxlen = sizeof(sysctl_overcommit_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_overcommit_handler,
+ .extra1 = (void *)&hugetlb_zero,
+ .extra2 = (void *)&hugetlb_infinity,
},
#endif
{
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -22,8 +22,8 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-unsigned long max_huge_pages;
-unsigned long sysctl_overcommit_huge_pages;
+unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
@@ -614,8 +614,16 @@ void __init hugetlb_add_hstate(unsigned
static int __init hugetlb_setup(char *s)
{
- if (sscanf(s, "%lu", &default_hstate_max_huge_pages) <= 0)
- default_hstate_max_huge_pages = 0;
+ unsigned long *mhp;
+
+ if (!max_hstate)
+ mhp = &default_hstate_max_huge_pages;
+ else
+ mhp = &parsed_hstate->max_huge_pages;
+
+ if (sscanf(s, "%lu", mhp) <= 0)
+ *mhp = 0;
+
return 1;
}
__setup("hugepages=", hugetlb_setup);
@@ -659,10 +667,12 @@ static inline void try_to_free_low(struc
#endif
#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(unsigned long count)
+static unsigned long
+set_max_huge_pages(struct hstate *h, unsigned long count, int *err)
{
unsigned long min_count, ret;
- struct hstate *h = &global_hstate;
+
+ *err = 0;
/*
* Increase the pool size
@@ -734,16 +744,33 @@ int hugetlb_sysctl_handler(struct ctl_ta
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
- max_huge_pages = set_max_huge_pages(max_huge_pages);
- global_hstate.max_huge_pages = max_huge_pages;
- return 0;
+ int err;
+
+ table->maxlen = max_hstate * sizeof(unsigned long);
+ err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+ if (err)
+ return err;
+
+ if (write) {
+ struct hstate *h;
+ for_each_hstate (h) {
+ int tmp;
+
+ h->max_huge_pages = set_max_huge_pages(h,
+ max_huge_pages[h - hstates], &tmp);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
+ if (tmp && !err)
+ err = tmp;
+ }
+ }
+ return err;
}
int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ table->maxlen = max_hstate * sizeof(int);
proc_dointvec(table, write, file, buffer, length, ppos);
if (hugepages_treat_as_movable)
htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
@@ -756,11 +783,24 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- struct hstate *h = &global_hstate;
- proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
- spin_lock(&hugetlb_lock);
- h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
- spin_unlock(&hugetlb_lock);
+ int err;
+
+ table->maxlen = max_hstate * sizeof(unsigned long);
+ err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+ if (err)
+ return err;
+
+ if (write) {
+ struct hstate *h;
+
+ spin_lock(&hugetlb_lock);
+ for_each_hstate (h) {
+ h->nr_overcommit_huge_pages =
+ sysctl_overcommit_huge_pages[h - hstates];
+ }
+ spin_unlock(&hugetlb_lock);
+ }
+
return 0;
}
--
* Re: [patch 07/23] hugetlb: multi hstate sysctls
2008-05-25 14:23 ` [patch 07/23] hugetlb: multi hstate sysctls npiggin
@ 2008-05-27 21:00 ` Adam Litke
2008-05-28 9:59 ` Nick Piggin
2008-05-29 4:59 ` Nishanth Aravamudan
2008-05-29 6:39 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Nishanth Aravamudan
2 siblings, 1 reply; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:00 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> @@ -614,8 +614,16 @@ void __init hugetlb_add_hstate(unsigned
>
> static int __init hugetlb_setup(char *s)
> {
> - if (sscanf(s, "%lu", &default_hstate_max_huge_pages) <= 0)
> - default_hstate_max_huge_pages = 0;
> + unsigned long *mhp;
> +
Perhaps a one-liner comment here to remind us that !max_hstate means we
currently have only one huge page size defined, and that it is
considered the default (or compat) size, and that it gets special
treatment by using default_hstate_max_huge_pages.
> + if (!max_hstate)
> + mhp = &default_hstate_max_huge_pages;
> + else
> + mhp = &parsed_hstate->max_huge_pages;
> +
> + if (sscanf(s, "%lu", mhp) <= 0)
> + *mhp = 0;
> +
> return 1;
> }
> __setup("hugepages=", hugetlb_setup);
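Something like this, perhaps (just one possible wording):

	/*
	 * !max_hstate means no hugepagesz= option has been parsed yet,
	 * so this hugepages= count applies to the default (compat) huge
	 * page size and is stashed in default_hstate_max_huge_pages.
	 */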
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [patch 07/23] hugetlb: multi hstate sysctls
2008-05-27 21:00 ` Adam Litke
@ 2008-05-28 9:59 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 9:59 UTC (permalink / raw)
To: Adam Litke; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Tue, May 27, 2008 at 04:00:31PM -0500, Adam Litke wrote:
> On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> > @@ -614,8 +614,16 @@ void __init hugetlb_add_hstate(unsigned
> >
> > static int __init hugetlb_setup(char *s)
> > {
> > - if (sscanf(s, "%lu", &default_hstate_max_huge_pages) <= 0)
> > - default_hstate_max_huge_pages = 0;
> > + unsigned long *mhp;
> > +
>
> Perhaps a one-liner comment here to remind us that !max_hstate means we
> currently have only one huge page size defined, and that it is
> considered the default (or compat) size, and that it gets special
> treatment by using default_hstate_max_huge_pages.
Sure.
* Re: [patch 07/23] hugetlb: multi hstate sysctls
2008-05-25 14:23 ` [patch 07/23] hugetlb: multi hstate sysctls npiggin
2008-05-27 21:00 ` Adam Litke
@ 2008-05-29 4:59 ` Nishanth Aravamudan
2008-05-29 5:36 ` Nishanth Aravamudan
2008-05-29 8:59 ` Nick Piggin
2008-05-29 6:39 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Nishanth Aravamudan
2 siblings, 2 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 4:59 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On 26.05.2008 [00:23:24 +1000], npiggin@suse.de wrote:
> Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> now allows the removal of global_hstate -- everything is now hstate
> aware.
>
> - I didn't bother with hugetlb_shm_group and treat_as_movable,
> these are still single globals.
> - Also improve error propagation for the sysctl handlers a bit
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
<snip>
> int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> + table->maxlen = max_hstate * sizeof(int);
Are you sure this is correct? I was just testing my sysfs patch (and the
removal of the multi-valued proc files) and noticed that
/proc/sys/vm/hugepages_treat_as_movable was multi-valued (3 values,
corresponding to the three page sizes on this machine), and the last
value was garbage. And, in any case, this change seems to conflict with
the changelog?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 07/23] hugetlb: multi hstate sysctls
2008-05-29 4:59 ` Nishanth Aravamudan
@ 2008-05-29 5:36 ` Nishanth Aravamudan
2008-05-29 8:59 ` Nick Piggin
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 5:36 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On 28.05.2008 [21:59:19 -0700], Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:24 +1000], npiggin@suse.de wrote:
> > Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> > now allows the removal of global_hstate -- everything is now hstate
> > aware.
> >
> > - I didn't bother with hugetlb_shm_group and treat_as_movable,
> > these are still single globals.
> > - Also improve error propagation for the sysctl handlers a bit
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> <snip>
>
> > int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
> > struct file *file, void __user *buffer,
> > size_t *length, loff_t *ppos)
> > {
> > + table->maxlen = max_hstate * sizeof(int);
>
> Are you sure this is correct? I was just testing my sysfs patch (and the
> removal of the multi-valued proc files) and noticed that
> /proc/sys/vm/hugepages_treat_as_movable was multi-valued (3 values,
> corresponding to the three page sizes on this machine), and the last
> value was garbage. And, in any case, this change seems to conflict with
> the changelog?
Confirmed that with just your patches, I see
# cat /proc/sys/vm/hugepages_treat_as_movable
0 0 -1073741824
which is hopefully bogus :) So I'd say this is a bad part of this particular
change?
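Presumably the fix is just dropping the stray assignment from
hugetlb_treat_movable_handler, i.e. something like:

	-	table->maxlen = max_hstate * sizeof(int);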
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 07/23] hugetlb: multi hstate sysctls
2008-05-29 4:59 ` Nishanth Aravamudan
2008-05-29 5:36 ` Nishanth Aravamudan
@ 2008-05-29 8:59 ` Nick Piggin
1 sibling, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-29 8:59 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On Wed, May 28, 2008 at 09:59:19PM -0700, Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:24 +1000], npiggin@suse.de wrote:
> > Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> > now allows the removal of global_hstate -- everything is now hstate
> > aware.
> >
> > - I didn't bother with hugetlb_shm_group and treat_as_movable,
> > these are still single globals.
> > - Also improve error propagation for the sysctl handlers a bit
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> <snip>
>
> > int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
> > struct file *file, void __user *buffer,
> > size_t *length, loff_t *ppos)
> > {
> > + table->maxlen = max_hstate * sizeof(int);
>
> Are you sure this is correct? I was just testing my sysfs patch (and the
> removal of the multi-valued proc files) and noticed that
> /proc/sys/vm/hugepages_treat_as_movable was multi-valued (3 values,
> corresponding to the three page sizes on this machine), and the last
> value was garbage. And, in any case, this change seems to conflict with
> the changelog?
Hmm, might have slipped in during a merge. I'll fix it.
* [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-25 14:23 ` [patch 07/23] hugetlb: multi hstate sysctls npiggin
2008-05-27 21:00 ` Adam Litke
2008-05-29 4:59 ` Nishanth Aravamudan
@ 2008-05-29 6:39 ` Nishanth Aravamudan
2008-05-29 6:42 ` [RFC][PATCH 2/2] hugetlb: remove multi-valued proc files Nishanth Aravamudan
2008-05-30 2:58 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Greg KH
2 siblings, 2 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 6:39 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, greg
While the procfs presentation of the hstate counters has tried to be as
backwards compatible as possible, I do not believe trying to maintain
all of the information in the same files is a good long-term plan. This
particularly matters for architectures that can support many hugepage
sizes (sparc64 might be one). Even with the three potential pagesizes on
power (64k, 16m and 16g), I found the proc interface to be a little
awkward.
Instead, migrate the information to sysfs in a new directory,
/sys/kernel/hugepages. Underneath that directory there will be a
directory per-supported hugepage size, e.g.:
/sys/kernel/hugepages/hugepages-64
/sys/kernel/hugepages/hugepages-16384
/sys/kernel/hugepages/hugepages-16777216
corresponding to 64k, 16m and 16g respectively. Within each
hugepages-size directory there are a number of files, corresponding to
the tracked counters in the hstate, e.g.:
/sys/kernel/hugepages/hugepages-64/nr_hugepages
/sys/kernel/hugepages/hugepages-64/nr_overcommit_hugepages
/sys/kernel/hugepages/hugepages-64/free_hugepages
/sys/kernel/hugepages/hugepages-64/resv_hugepages
/sys/kernel/hugepages/hugepages-64/surplus_hugepages
Of these files, the first two are read-write and the latter three are
read-only. The size of the hugepage being manipulated is trivially
deducible from the enclosing directory and is always expressed in kB (to
match meminfo).
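Typical usage is then, e.g. (illustrative, 16m pool):

	# echo 8 > /sys/kernel/hugepages/hugepages-16384/nr_hugepages
	# cat /sys/kernel/hugepages/hugepages-16384/free_hugepages
	8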
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
Nick, I tested this patch and the following one at this point in the
series, that is, between patches 7 and 8. This does require a few compile
fixes/patch modifications in the later parts of the series. If we decide
that 2/2 is undesirable, there will be fewer of those and 1/2 could also
apply at the end, with less work. I can send you that diff, if you'd
prefer.
Greg, I didn't hear back from you on the last posting of this patch. Not
intended as a complaint, just an indication of why I didn't make any
changes relative to that version. Does this seem like a reasonable
patch as far as using the sysfs API? I realize a follow-on patch will be
needed to update Documentation/ABI.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36624f1..3fe461d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -169,6 +169,7 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+ char name[32];
};
void __init hugetlb_add_hstate(unsigned order);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 080f17a..da7a4aa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/sysfs.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -578,67 +579,6 @@ static void __init report_hugepages(void)
}
}
-static int __init hugetlb_init(void)
-{
- BUILD_BUG_ON(HPAGE_SHIFT == 0);
-
- if (!size_to_hstate(HPAGE_SIZE)) {
- hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
- parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
- }
-
- hugetlb_init_hstates();
-
- report_hugepages();
-
- return 0;
-}
-module_init(hugetlb_init);
-
-/* Should be called on processing a hugepagesz=... option */
-void __init hugetlb_add_hstate(unsigned order)
-{
- struct hstate *h;
- if (size_to_hstate(PAGE_SIZE << order)) {
- printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
- return;
- }
- BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
- BUG_ON(order == 0);
- h = &hstates[max_hstate++];
- h->order = order;
- h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
- hugetlb_init_one_hstate(h);
- parsed_hstate = h;
-}
-
-static int __init hugetlb_setup(char *s)
-{
- unsigned long *mhp;
-
- if (!max_hstate)
- mhp = &default_hstate_max_huge_pages;
- else
- mhp = &parsed_hstate->max_huge_pages;
-
- if (sscanf(s, "%lu", mhp) <= 0)
- *mhp = 0;
-
- return 1;
-}
-__setup("hugepages=", hugetlb_setup);
-
-static unsigned int cpuset_mems_nr(unsigned int *array)
-{
- int node;
- unsigned int nr = 0;
-
- for_each_node_mask(node, cpuset_current_mems_allowed)
- nr += array[node];
-
- return nr;
-}
-
#ifdef CONFIG_SYSCTL
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(struct hstate *h, unsigned long count)
@@ -740,6 +680,229 @@ out:
return ret;
}
+#ifdef CONFIG_SYSFS
+#define HSTATE_ATTR_RO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+#define HSTATE_ATTR(_name) \
+ static struct kobj_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct kobject *hugepages_kobj;
+static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
+
+static struct hstate *kobj_to_hstate(struct kobject *kobj)
+{
+ int i;
+ for (i = 0; i < HUGE_MAX_HSTATE; i++)
+ if (hstate_kobjs[i] == kobj)
+ return &hstates[i];
+ BUG();
+ return NULL;
+}
+
+static ssize_t nr_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h = kobj_to_hstate(kobj);
+ return sprintf(buf, "%lu\n", h->nr_huge_pages);
+}
+static ssize_t nr_hugepages_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ int tmp, err;
+ unsigned long input;
+ struct hstate *h = kobj_to_hstate(kobj);
+
+ err = strict_strtoul(buf, 10, &input);
+ if (err)
+ return 0;
+
+ h->max_huge_pages = set_max_huge_pages(h, input, &tmp);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
+
+ return count;
+}
+HSTATE_ATTR(nr_hugepages);
+
+static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h = kobj_to_hstate(kobj);
+ return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
+}
+static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ int err;
+ unsigned long input;
+ struct hstate *h = kobj_to_hstate(kobj);
+
+ err = strict_strtoul(buf, 10, &input);
+ if (err)
+ return 0;
+
+ spin_lock(&hugetlb_lock);
+ h->nr_overcommit_huge_pages = input;
+ sysctl_overcommit_huge_pages[h - hstates] = h->nr_overcommit_huge_pages;
+ spin_unlock(&hugetlb_lock);
+
+ return count;
+}
+HSTATE_ATTR(nr_overcommit_hugepages);
+
+static ssize_t free_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h = kobj_to_hstate(kobj);
+ return sprintf(buf, "%lu\n", h->free_huge_pages);
+}
+HSTATE_ATTR_RO(free_hugepages);
+
+static ssize_t resv_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h = kobj_to_hstate(kobj);
+ return sprintf(buf, "%lu\n", h->resv_huge_pages);
+}
+HSTATE_ATTR_RO(resv_hugepages);
+
+static ssize_t surplus_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h = kobj_to_hstate(kobj);
+ return sprintf(buf, "%lu\n", h->surplus_huge_pages);
+}
+HSTATE_ATTR_RO(surplus_hugepages);
+
+static struct attribute *hstate_attrs[] = {
+ &nr_hugepages_attr.attr,
+ &nr_overcommit_hugepages_attr.attr,
+ &free_hugepages_attr.attr,
+ &resv_hugepages_attr.attr,
+ &surplus_hugepages_attr.attr,
+ NULL,
+};
+
+static struct attribute_group hstate_attr_group = {
+ .attrs = hstate_attrs,
+};
+
+static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
+{
+ int retval;
+
+ hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
+ hugepages_kobj);
+ if (!hstate_kobjs[h - hstates])
+ return -ENOMEM;
+
+ retval = sysfs_create_group(hstate_kobjs[h - hstates],
+ &hstate_attr_group);
+ if (retval)
+ kobject_put(hstate_kobjs[h - hstates]);
+
+ return retval;
+}
+
+static void __init hugetlb_sysfs_init(void)
+{
+ struct hstate *h;
+ int err;
+
+ hugepages_kobj = kobject_create_and_add("hugepages", kernel_kobj);
+ if (!hugepages_kobj)
+ return;
+
+ for_each_hstate(h) {
+ err = hugetlb_sysfs_add_hstate(h);
+ if (err)
+ printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
+ h->name);
+ }
+}
+#else
+static void __init hugetlb_sysfs_init(void)
+{
+}
+#endif
+
+static int __init hugetlb_init(void)
+{
+ BUILD_BUG_ON(HPAGE_SHIFT == 0);
+
+ if (!size_to_hstate(HPAGE_SIZE)) {
+ hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
+ parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
+ }
+
+ hugetlb_init_hstates();
+
+ report_hugepages();
+
+ hugetlb_sysfs_init();
+
+ return 0;
+}
+module_init(hugetlb_init);
+
+static void __exit hugetlb_exit(void)
+{
+ struct hstate *h;
+
+ for_each_hstate(h) {
+ kobject_put(hstate_kobjs[h - hstates]);
+ }
+
+ kobject_put(hugepages_kobj);
+}
+module_exit(hugetlb_exit);
+
+/* Called when a hugepagesz=... command-line option is parsed */
+void __init hugetlb_add_hstate(unsigned order)
+{
+ struct hstate *h;
+ if (size_to_hstate(PAGE_SIZE << order)) {
+ printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
+ return;
+ }
+ BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(order == 0);
+ h = &hstates[max_hstate++];
+ h->order = order;
+ h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+ snprintf(h->name, 32, "hugepages-%lu", huge_page_size(h)/1024);
+ hugetlb_init_one_hstate(h);
+ parsed_hstate = h;
+}
+
+static int __init hugetlb_setup(char *s)
+{
+ unsigned long *mhp;
+
+ if (!max_hstate)
+ mhp = &default_hstate_max_huge_pages;
+ else
+ mhp = &parsed_hstate->max_huge_pages;
+
+ if (sscanf(s, "%lu", mhp) <= 0)
+ *mhp = 0;
+
+ return 1;
+}
+__setup("hugepages=", hugetlb_setup);
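+
+/*
+ * Example, assuming an architecture that accepts these sizes on its
+ * command line (the sizes themselves are illustrative):
+ *
+ *	hugepagesz=1g hugepages=2 hugepagesz=64k hugepages=128
+ *
+ * Each hugepages= count applies to the hstate added by the preceding
+ * hugepagesz=, while a bare leading hugepages=N sets
+ * default_hstate_max_huge_pages for the architecture default size.
+ */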
+
+static unsigned int cpuset_mems_nr(unsigned int *array)
+{
+ int node;
+ unsigned int nr = 0;
+
+ for_each_node_mask(node, cpuset_current_mems_allowed)
+ nr += array[node];
+
+ return nr;
+}
+
int hugetlb_sysctl_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
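(For reference while reading the attribute table above: HSTATE_ATTR and
HSTATE_ATTR_RO come from an earlier hunk of this patch that is not
quoted here. A minimal sketch of what such macros would look like --
treat the exact form as an assumption, not a quote from the patch:

#define HSTATE_ATTR_RO(_name) \
	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)

#define HSTATE_ATTR(_name) \
	static struct kobj_attribute _name##_attr = \
		__ATTR(_name, 0644, _name##_show, _name##_store)

Each resulting kobj_attribute lands in hstate_attrs[] and is registered
with one sysfs_create_group() call per hstate directory.)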
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* [RFC][PATCH 2/2] hugetlb: remove multi-valued proc files.
2008-05-29 6:39 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Nishanth Aravamudan
@ 2008-05-29 6:42 ` Nishanth Aravamudan
2008-05-30 3:51 ` Nick Piggin
2008-05-30 2:58 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Greg KH
1 sibling, 1 reply; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-29 6:42 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
Now that we present the same information in a cleaner way in sysfs, we
can remove the duplicate information and interfaces from procfs (and
consider them to be the legacy interface). The proc interface only
controls the default hugepage size, which is either
a) the first one specified via hugepagesz= on the kernel command-line, if any
b) the legacy huge page size, otherwise
All other hugepage size pool manipulations can occur through sysfs.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
Note, this does end up making the manipulation and validation of
multiple hstates impossible without sysfs enabled and mounted. As such,
I'm not sure if this is the right approach and perhaps we should be
leaving the multi-valued proc files in place (but not as the preferred
interface). Or we could present the values in procfs only if SYSFS is
not enabled in the kernel? I imagine (but am not 100% sure) that the
only current architecture where this might be important is SUPERH?
Nick, this includes the fix to make hugepages_treat_as_movable
single-valued again, which presumably will get thrown up as a merge
conflict if it's fixed at the right place in the stack.
Realistically, this patch shouldn't need to exist in the upstream
patchset, if we decide to not extend the proc files, as we can add the
sysfs files as a new patch 5 and drop the current patches 5 and 7. I can
work out how the patch should look if that is what we decide to do (or
`git-rebase -i` can :).
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3fe461d..fb7ef81 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -228,8 +228,8 @@ static inline struct hstate *page_hstate(struct page *page)
return size_to_hstate(PAGE_SIZE << compound_order(page));
}
-extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
-extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+extern unsigned long max_huge_pages;
+extern unsigned long sysctl_overcommit_huge_pages;
#else
struct hstate {};
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index da7a4aa..15b25f0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -23,8 +23,8 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-unsigned long max_huge_pages[HUGE_MAX_HSTATE];
-unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+unsigned long max_huge_pages;
+unsigned long sysctl_overcommit_huge_pages;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
@@ -719,7 +719,8 @@ static ssize_t nr_hugepages_store(struct kobject *kobj,
return err;
h->max_huge_pages = set_max_huge_pages(h, input, &tmp);
- max_huge_pages[h - hstates] = h->max_huge_pages;
+ if (h == hstates)
+ max_huge_pages = h->max_huge_pages;
return count;
}
@@ -744,7 +745,8 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
spin_lock(&hugetlb_lock);
h->nr_overcommit_huge_pages = input;
- sysctl_overcommit_huge_pages[h - hstates] = h->nr_overcommit_huge_pages;
+ if (h == hstates)
+ sysctl_overcommit_huge_pages = h->nr_overcommit_huge_pages;
spin_unlock(&hugetlb_lock);
return count;
@@ -909,22 +911,18 @@ int hugetlb_sysctl_handler(struct ctl_table *table, int write,
{
int err;
- table->maxlen = max_hstate * sizeof(unsigned long);
err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
if (err)
return err;
if (write) {
- struct hstate *h;
- for_each_hstate (h) {
- int tmp;
-
- h->max_huge_pages = set_max_huge_pages(h,
- max_huge_pages[h - hstates], &tmp);
- max_huge_pages[h - hstates] = h->max_huge_pages;
- if (tmp && !err)
- err = tmp;
- }
+ struct hstate *h = hstates;
+ int tmp;
+
+ h->max_huge_pages = set_max_huge_pages(h, max_huge_pages, &tmp);
+ max_huge_pages = h->max_huge_pages;
+ if (tmp && !err)
+ err = tmp;
}
return err;
}
@@ -933,7 +931,6 @@ int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- table->maxlen = max_hstate * sizeof(int);
proc_dointvec(table, write, file, buffer, length, ppos);
if (hugepages_treat_as_movable)
htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
@@ -948,19 +945,15 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
{
int err;
- table->maxlen = max_hstate * sizeof(unsigned long);
err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
if (err)
return err;
if (write) {
- struct hstate *h;
+ struct hstate *h = hstates;
spin_lock(&hugetlb_lock);
- for_each_hstate (h) {
- h->nr_overcommit_huge_pages =
- sysctl_overcommit_huge_pages[h - hstates];
- }
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
spin_unlock(&hugetlb_lock);
}
@@ -969,48 +962,32 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
#endif /* CONFIG_SYSCTL */
-static int dump_field(char *buf, unsigned field)
-{
- int n = 0;
- struct hstate *h;
- for_each_hstate (h)
- n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
- buf[n++] = '\n';
- return n;
-}
-
int hugetlb_report_meminfo(char *buf)
{
- struct hstate *h;
- int n = 0;
- n += sprintf(buf + 0, "HugePages_Total:");
- n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
- n += sprintf(buf + n, "HugePages_Free: ");
- n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
- n += sprintf(buf + n, "HugePages_Rsvd: ");
- n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
- n += sprintf(buf + n, "HugePages_Surp: ");
- n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
- n += sprintf(buf + n, "Hugepagesize: ");
- for_each_hstate (h)
- n += sprintf(buf + n, " %5lu", huge_page_size(h) / 1024);
- n += sprintf(buf + n, " kB\n");
- return n;
+ struct hstate *h = hstates;
+ return sprintf(buf,
+ "HugePages_Total: %5lu\n"
+ "HugePages_Free: %5lu\n"
+ "HugePages_Rsvd: %5lu\n"
+ "HugePages_Surp: %5lu\n"
+ "Hugepagesize: %5lu kB\n",
+ h->nr_huge_pages,
+ h->free_huge_pages,
+ h->resv_huge_pages,
+ h->surplus_huge_pages,
+ huge_page_size(h) / 1024);
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
- int n = 0;
- n += sprintf(buf, "Node %d HugePages_Total: ", nid);
- n += dump_field(buf + n, offsetof(struct hstate,
- nr_huge_pages_node[nid]));
- n += sprintf(buf + n, "Node %d HugePages_Free: ", nid);
- n += dump_field(buf + n, offsetof(struct hstate,
- free_huge_pages_node[nid]));
- n += sprintf(buf + n, "Node %d HugePages_Surp: ", nid);
- n += dump_field(buf + n, offsetof(struct hstate,
- surplus_huge_pages_node[nid]));
- return n;
+ struct hstate *h = hstates;
+ return sprintf(buf,
+ "HugePages_Total: %5u\n"
+ "HugePages_Free: %5u\n"
+ "HugePages_Surp: %5u\n",
+ h->nr_huge_pages_node[nid],
+ h->free_huge_pages_node[nid],
+ h->surplus_huge_pages_node[nid]);
}
/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
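(Illustration, not part of the patch: with a single default 16MB hstate
holding two free pages, the rewritten hugetlb_report_meminfo() above
would emit something like

HugePages_Total:     2
HugePages_Free:      2
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize: 16384 kB

i.e. one value per line again, instead of one column per hstate as in
the multi-valued form being removed.)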
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 2/2] hugetlb: remove multi-valued proc files.
2008-05-29 6:42 ` [RFC][PATCH 2/2] hugetlb: remove multi-valued proc files Nishanth Aravamudan
@ 2008-05-30 3:51 ` Nick Piggin
2008-05-30 7:43 ` Nishanth Aravamudan
0 siblings, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2008-05-30 3:51 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On Wed, May 28, 2008 at 11:42:42PM -0700, Nishanth Aravamudan wrote:
> Now that we present the same information in a cleaner way in sysfs, we
> can remove the duplicate information and interfaces from procfs (and
> consider them to be the legacy interface). The proc interface only
> controls the default hugepage size, which is either
>
> a) the first one specified via hugepagesz= on the kernel command-line, if any
> b) the legacy huge page size, otherwise
>
> All other hugepage size pool manipulations can occur through sysfs.
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
>
> ---
> Note, this does end up making the manipulation and validation of
> multiple hstates impossible without sysfs enabled and mounted. As such,
I don't think that's such a problem. The overlap between users with
no sysfs and those that use multiple hugepages won't be large. And
if any exist, they can specify at boot or come up with their own
custom solution.
> I'm not sure if this is the right approach and perhaps we should be
> leaving the multi-valued proc files in place (but not as the preferred
> interface). Or we could present the values in procfs only if SYSFS is
> not enabled in the kernel? I imagine (but am not 100% sure) that the
> only current architecture where this might be important is SUPERH?
I wouldn't worry too much. I think /proc/sys/vm/nr_hugepages etc.
is better as one (the compat) value now that we have the sysfs
stuff. However, /proc/meminfo is a little more tricky. Of course
the information does exist in sysfs too, but meminfo is also for
user reporting, so maybe it would be better to leave it
multi-column?
> Nick, this includes the fix to make hugepages_treat_as_movable
> single-valued again, which presumably will get thrown up as a merge
> conflict if it's fixed at the right place in the stack.
Thanks.
* Re: [RFC][PATCH 2/2] hugetlb: remove multi-valued proc files.
2008-05-30 3:51 ` Nick Piggin
@ 2008-05-30 7:43 ` Nishanth Aravamudan
0 siblings, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-30 7:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 30.05.2008 [05:51:23 +0200], Nick Piggin wrote:
> On Wed, May 28, 2008 at 11:42:42PM -0700, Nishanth Aravamudan wrote:
> > Now that we present the same information in a cleaner way in sysfs, we
> > can remove the duplicate information and interfaces from procfs (and
> > consider them to be the legacy interface). The proc interface only
> > controls the default hugepage size, which is either
> >
> > a) the first one specified via hugepagesz= on the kernel command-line, if any
> > b) the legacy huge page size, otherwise
> >
> > All other hugepage size pool manipulations can occur through sysfs.
> >
> > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> >
> > ---
> > Note, this does end up making the manipulation and validation of
> > multiple hstates impossible without sysfs enabled and mounted. As such,
>
> I don't think that's such a problem. The overlap between users with
> no sysfs and those that use multiple hugepages won't be large. And
> if any exist, they can specify at boot or come up with their own
> customer solution.
Yeah, like I said, I imagine the only ones that might care are sh folks
and even there, I don't know their MMU well enough to know how big of a
deal it is.
> > I'm not sure if this is the right approach and perhaps we should be
> > leaving the multi-valued proc files in place (but not as the preferred
> > interface). Or we could present the values in procfs only if SYSFS is
> > not enabled in the kernel? I imagine (but am not 100% sure) that the
> > only current architecture where this might be important is SUPERH?
>
> I wouldn't worry too much. I think /proc/sys/vm/nr_hugepages etc
> is better as one (the compat) value after we now have the sysfs
> stuff. However /proc/meminfo is a little more tricky. Of course
> the information does exist in sysfs too, but meminfo is also for
> user reporting, so maybe it will be better to leave it multi
> column?
Yeah, I suppose it could be either way. I definitely agree the writable
interfaces are cleaner single-valued.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-29 6:39 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Nishanth Aravamudan
2008-05-29 6:42 ` [RFC][PATCH 2/2] hugetlb: remove multi-valued proc files Nishanth Aravamudan
@ 2008-05-30 2:58 ` Greg KH
2008-05-30 3:37 ` Nick Piggin
2008-05-30 7:39 ` Nishanth Aravamudan
1 sibling, 2 replies; 88+ messages in thread
From: Greg KH @ 2008-05-30 2:58 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: npiggin, linux-mm, kniht, andi, agl, abh, joachim.deguara
On Wed, May 28, 2008 at 11:39:15PM -0700, Nishanth Aravamudan wrote:
> While the procfs presentation of the hstate counters has tried to be as
> backwards compatible as possible, I do not believe trying to maintain
> all of the information in the same files is a good long-term plan. This
> particularly matters for architectures that can support many hugepage
> sizes (sparc64 might be one). Even with the three potential pagesizes on
> power (64k, 16m and 16g), I found the proc interface to be a little
> awkward.
>
> Instead, migrate the information to sysfs in a new directory,
> /sys/kernel/hugepages. Underneath that directory there will be a
> directory per-supported hugepage size, e.g.:
>
> /sys/kernel/hugepages/hugepages-64
> /sys/kernel/hugepages/hugepages-16384
> /sys/kernel/hugepages/hugepages-16777216
>
> corresponding to 64k, 16m and 16g respectively. Within each
> hugepages-size directory there are a number of files, corresponding to
> the tracked counters in the hstate, e.g.:
>
> /sys/kernel/hugepages/hugepages-64/nr_hugepages
> /sys/kernel/hugepages/hugepages-64/nr_overcommit_hugepages
> /sys/kernel/hugepages/hugepages-64/free_hugepages
> /sys/kernel/hugepages/hugepages-64/resv_hugepages
> /sys/kernel/hugepages/hugepages-64/surplus_hugepages
>
> Of these files, the first two are read-write and the latter three are
> read-only. The size of the hugepage being manipulated is trivially
> deducible from the enclosing directory and is always expressed in kB (to
> match meminfo).
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
>
> ---
> Nick, I tested this patch and the following one at this point the
> series, that is between patches 7 and 8. This does require a few compile
> fixes/patch modifications in the later parts of the series. If we decide
> that 2/2 is undesirable, there will be fewer of those and 1/2 could also
> apply at the end, with less work. I can send you that diff, if you'd
> prefer.
>
> Greg, I didn't hear back from you on the last posting of this patch. Not
> intended as a complaint, just an indication of why I didn't make any
> changes relative to that version. Does this seem like a reasonable
> patch as far as using the sysfs API? I realize a follow-on patch will be
> needed to update Documentation/ABI.
I'm sorry, it got lost in the bowels of my inbox, my apologies.
This looks fine to me, nice job. And yes, I do want to see the ABI
addition as well :)
If you add that, feel free to add an:
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
to the patch.
thanks,
greg k-h
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 2:58 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Greg KH
@ 2008-05-30 3:37 ` Nick Piggin
2008-05-30 4:21 ` Greg KH
` (2 more replies)
2008-05-30 7:39 ` Nishanth Aravamudan
1 sibling, 3 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-30 3:37 UTC (permalink / raw)
To: Greg KH
Cc: Nishanth Aravamudan, linux-mm, kniht, andi, agl, abh,
joachim.deguara
On Thu, May 29, 2008 at 07:58:46PM -0700, Greg KH wrote:
> [...]
>
> I'm sorry, it got lost in the bowels of my inbox, my apologies.
>
> This looks fine to me, nice job. And yes, I do want to see the ABI
> addition as well :)
>
> If you add that, feel free to add an:
> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
> to the patch.
Thanks Greg. Nish will be away for a few weeks but I'm picking up his patch
and so I can add the Documentation/ABI change.
I agree the interface looks nice, so thanks to everyone for the input and
discussion. A minor nit: is there any point specifying units in the
hugepages directory names? hugepages-64K hugepages-16M hugepages-16G?
Or perhaps for easier parsing, they could be the same unit but still
specified? hugepages-64K hugepages-16384K etc?
But it's just a very minor point, I'll not make the change myself...
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 3:37 ` Nick Piggin
@ 2008-05-30 4:21 ` Greg KH
2008-05-30 4:28 ` Nick Piggin
2008-05-30 7:44 ` Nishanth Aravamudan
2008-05-30 7:41 ` Nishanth Aravamudan
2008-05-30 13:40 ` Adam Litke
2 siblings, 2 replies; 88+ messages in thread
From: Greg KH @ 2008-05-30 4:21 UTC (permalink / raw)
To: Nick Piggin
Cc: Nishanth Aravamudan, linux-mm, kniht, andi, agl, abh,
joachim.deguara
On Fri, May 30, 2008 at 05:37:49AM +0200, Nick Piggin wrote:
> [...]
>
> Thanks Greg. Nish will be away for a few weeks but I'm picking up his patch
> and so I can add the Documentation/ABI change.
>
> I agree the interface looks nice, so thanks to everyone for the input and
> discussion. A minor nit: is there any point specifying units in the
> hugepages directory names? hugepages-64K hugepages-16M hugepages-16G?
>
> Or perhaps for easier parsing, they could be the same unit but still
> specified? hugepages-64K hugepages-16384K etc?
I don't care, nothing is going to parse the directory names, they are
pretty much fixed, right? Just pick a unit and stick with it :)
thanks,
greg k-h
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 4:21 ` Greg KH
@ 2008-05-30 4:28 ` Nick Piggin
2008-05-30 7:44 ` Nishanth Aravamudan
1 sibling, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-30 4:28 UTC (permalink / raw)
To: Greg KH
Cc: Nishanth Aravamudan, linux-mm, kniht, andi, agl, abh,
joachim.deguara
On Thu, May 29, 2008 at 09:21:07PM -0700, Greg KH wrote:
> On Fri, May 30, 2008 at 05:37:49AM +0200, Nick Piggin wrote:
> >
> > Thanks Greg. Nish will be away for a few weeks but I'm picking up his patch
> > and so I can add the Documentation/ABI change.
> >
> > I agree the interface looks nice, so thanks to everyone for the input and
> > discussion. A minor nit: is there any point specifying units in the
> > hugepages directory names? hugepages-64K hugepages-16M hugepages-16G?
> >
> > Or perhaps for easier parsing, they could be the same unit but still
> > specificied? hugepages-64K hugepages-16384K etc?
>
> I don't care, nothing is going to parse the directory names, they are
> pretty much fixed, right? Just pick a unit and stick with it :)
I can imagine a cross-platform app or library parsing them to find,
e.g., the largest one available that fits the required size and
alignment. Even within the same platform, there could be many different
sizes (e.g. ia64).
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 4:21 ` Greg KH
2008-05-30 4:28 ` Nick Piggin
@ 2008-05-30 7:44 ` Nishanth Aravamudan
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-30 7:44 UTC (permalink / raw)
To: Greg KH; +Cc: Nick Piggin, linux-mm, kniht, andi, agl, abh, joachim.deguara
On 29.05.2008 [21:21:07 -0700], Greg KH wrote:
> On Fri, May 30, 2008 at 05:37:49AM +0200, Nick Piggin wrote:
> > On Thu, May 29, 2008 at 07:58:46PM -0700, Greg KH wrote:
> > > [...]
> >
> > Thanks Greg. Nish will be away for a few weeks but I'm picking up his patch
> > and so I can add the Documentation/ABI change.
> >
> > I agree the interface looks nice, so thanks to everyone for the input and
> > discussion. A minor nit: is there any point specifying units in the
> > hugepages directory names? hugepages-64K hugepages-16M hugepages-16G?
> >
> > Or perhaps for easier parsing, they could be the same unit but still
> > specified? hugepages-64K hugepages-16384K etc?
>
> I don't care, nothing is going to parse the directory names, they are
> pretty much fixed, right? Just pick a unit and stick with it :)
Well, sort of. libhuge will either parse sysfs or meminfo to see what
the supported hugepage sizes are. Really, it doesn't matter too much,
though, if meminfo contains the same information.
And they are only fixed per-arch, and only until someone needs the next
ginormous huge page size :)
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 3:37 ` Nick Piggin
2008-05-30 4:21 ` Greg KH
@ 2008-05-30 7:41 ` Nishanth Aravamudan
2008-05-30 13:40 ` Adam Litke
2 siblings, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-30 7:41 UTC (permalink / raw)
To: Nick Piggin; +Cc: Greg KH, linux-mm, kniht, andi, agl, abh, joachim.deguara
On 30.05.2008 [05:37:49 +0200], Nick Piggin wrote:
> [...]
>
> Thanks Greg. Nish will be away for a few weeks but I'm picking up his patch
> and so I can add the Documentation/ABI change.
>
> I agree the interface looks nice, so thanks to everyone for the input
> and discussion. A minor nit: is there any point specifying units in
> the hugepages directory names? hugepages-64K hugepages-16M
> hugepages-16G?
>
> Or perhaps for easier parsing, they could be the same unit but still
> specified? hugepages-64K hugepages-16384K etc?
Basically, I left it in kilobytes just to avoid any extra work on the
kernel name side (e.g., using memfmt to dynamically create the string
with the appropriate suffix) and because that's the size we report in
meminfo. Userspace can presumably do the parsing to figure out what the
underlying size might look prettier as, but I wanted to avoid it (and
feel like that may have come up in one of Greg's reviews of a previous
version of the patch).
> But it's just a very minor point, I'll not make the change myself...
Well, if it's at all a concern, we should make sure we're all happy,
sysfs being an ABI and all :)
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 3:37 ` Nick Piggin
2008-05-30 4:21 ` Greg KH
2008-05-30 7:41 ` Nishanth Aravamudan
@ 2008-05-30 13:40 ` Adam Litke
2 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-30 13:40 UTC (permalink / raw)
To: Nick Piggin
Cc: Greg KH, Nishanth Aravamudan, linux-mm, kniht, andi, abh,
joachim.deguara
On Fri, 2008-05-30 at 05:37 +0200, Nick Piggin wrote:
> [...]
>
> Thanks Greg. Nish will be away for a few weeks but I'm picking up his patch
> and so I can add the Documentation/ABI change.
>
> I agree the interface looks nice, so thanks to everyone for the input and
> discussion. A minor nit: is there any point specifying units in the
> hugepages directory names? hugepages-64K hugepages-16M hugepages-16G?
>
> Or perhaps for easier parsing, they could be the same unit but still
> specified? hugepages-64K hugepages-16384K etc?
Just my two cents, but I would prefer to either leave them as-is, or to
append the K suffix to all values. I don't think mixing the K/M/G units
buys enough user-friendliness to justify the extra complexity on the
kernel side and in programs that will work with these directories.
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2] hugetlb: present information in sysfs
2008-05-30 2:58 ` [RFC][PATCH 1/2] hugetlb: present information in sysfs Greg KH
2008-05-30 3:37 ` Nick Piggin
@ 2008-05-30 7:39 ` Nishanth Aravamudan
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-30 7:39 UTC (permalink / raw)
To: Greg KH; +Cc: npiggin, linux-mm, kniht, andi, agl, abh, joachim.deguara
On 29.05.2008 [19:58:46 -0700], Greg KH wrote:
> [...]
>
> I'm sorry, it got lost in the bowels of my inbox, my apologies.
No need to apologize, you get a lot more e-mail than I ever will :) I
appreciate all your help with this stuff as well as the excellent
examples in samples and the Documentation.
> This looks fine to me, nice job. And yes, I do want to see the ABI
> addition as well :)
>
> If you add that, feel free to add an:
> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
> to the patch.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* [patch 08/23] hugetlb: abstract numa round robin selection
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (6 preceding siblings ...)
2008-05-25 14:23 ` [patch 07/23] hugetlb: multi hstate sysctls npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 17:01 ` Nishanth Aravamudan
2008-05-27 21:02 ` Adam Litke
2008-05-25 14:23 ` [patch 09/23] mm: introduce non panic alloc_bootmem npiggin
` (15 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-abstract-numa-rr.patch --]
[-- Type: text/plain, Size: 2540 bytes --]
Need this as a separate function for a future patch.
No behaviour change.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -243,6 +243,27 @@ static struct page *alloc_fresh_huge_pag
return page;
}
+/*
+ * Use a helper variable to find the next node and then
+ * copy it back to hugetlb_next_nid afterwards:
+ * otherwise there's a window in which a racer might
+ * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * But we don't need to use a spin_lock here: it really
+ * doesn't matter if occasionally a racer chooses the
+ * same nid as we do. Move nid forward in the mask even
+ * if we just successfully allocated a hugepage so that
+ * the next caller gets hugepages on the next node.
+ */
+static int hstate_next_node(struct hstate *h)
+{
+ int next_nid;
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = next_nid;
+ return next_nid;
+}
+
static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
@@ -256,21 +277,7 @@ static int alloc_fresh_huge_page(struct
page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
- /*
- * Use a helper variable to find the next node and then
- * copy it back to hugetlb_next_nid afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do. Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
- */
- next_nid = next_node(h->hugetlb_next_nid, node_online_map);
- if (next_nid == MAX_NUMNODES)
- next_nid = first_node(node_online_map);
- h->hugetlb_next_nid = next_nid;
+ next_nid = hstate_next_node(h);
} while (!page && h->hugetlb_next_nid != start_nid);
if (ret)
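(Aside: a worked example of the wrap-around in hstate_next_node(): with
nodes {0,2} online and h->hugetlb_next_nid == 2, next_node() returns
MAX_NUMNODES, so the helper restarts at first_node(node_online_map),
i.e. node 0. A minimal sketch of a caller visiting each online node at
most once -- try_alloc_on_node() is a hypothetical stand-in for the
allocation step:

	int ret = 0, start_nid = h->hugetlb_next_nid;

	do {
		ret = try_alloc_on_node(h, h->hugetlb_next_nid);
		/* advance even on success, per the comment above */
		hstate_next_node(h);
	} while (!ret && h->hugetlb_next_nid != start_nid);
)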
--
* Re: [patch 08/23] hugetlb: abstract numa round robin selection
2008-05-25 14:23 ` [patch 08/23] hugetlb: abstract numa round robin selection npiggin
@ 2008-05-27 17:01 ` Nishanth Aravamudan
2008-05-27 21:02 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 17:01 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On 26.05.2008 [00:23:25 +1000], npiggin@suse.de wrote:
> Need this as a separate function for a future patch.
>
> No behaviour change.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 08/23] hugetlb: abstract numa round robin selection
2008-05-25 14:23 ` [patch 08/23] hugetlb: abstract numa round robin selection npiggin
2008-05-27 17:01 ` Nishanth Aravamudan
@ 2008-05-27 21:02 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:02 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-abstract-numa-rr.patch)
> Need this as a separate function for a future patch.
>
> No behaviour change.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 09/23] mm: introduce non panic alloc_bootmem
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (7 preceding siblings ...)
2008-05-25 14:23 ` [patch 08/23] hugetlb: abstract numa round robin selection npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-25 14:23 ` [patch 10/23] mm: export prep_compound_page to mm npiggin
` (14 subsequent siblings)
23 siblings, 0 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: __alloc_bootmem_node_nopanic.patch --]
[-- Type: text/plain, Size: 1779 bytes --]
Straightforward variant of the existing __alloc_bootmem_node; the only
difference is that it doesn't panic on failure.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/bootmem.h | 4 ++++
mm/bootmem.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -576,6 +576,18 @@ void * __init alloc_bootmem_section(unsi
}
#endif
+void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
+ unsigned long align, unsigned long goal)
+{
+ void *ptr;
+
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+ if (ptr)
+ return ptr;
+
+ return __alloc_bootmem_nopanic(size, align, goal);
+}
+
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
Index: linux-2.6/include/linux/bootmem.h
===================================================================
--- linux-2.6.orig/include/linux/bootmem.h
+++ linux-2.6/include/linux/bootmem.h
@@ -90,6 +90,10 @@ extern void *__alloc_bootmem_node(pg_dat
unsigned long size,
unsigned long align,
unsigned long goal);
+extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
+ unsigned long size,
+ unsigned long align,
+ unsigned long goal);
extern unsigned long init_bootmem_node(pg_data_t *pgdat,
unsigned long freepfn,
unsigned long startpfn,
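(Usage sketch, not taken from the patch -- nid, size, align and goal
stand for the caller's values. The point of the _nopanic variant is
that early-boot callers can tolerate failure:

	void *buf;

	buf = __alloc_bootmem_node_nopanic(NODE_DATA(nid), size,
					   align, goal);
	if (!buf)
		return -ENOMEM;	/* skip or fall back instead of panicking */

Note the function above already falls back from the requested node to
any node via __alloc_bootmem_nopanic() before giving up.)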
--
* [patch 10/23] mm: export prep_compound_page to mm
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (8 preceding siblings ...)
2008-05-25 14:23 ` [patch 09/23] mm: introduce non panic alloc_bootmem npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:05 ` Adam Litke
2008-05-25 14:23 ` [patch 11/23] hugetlb: support larger than MAX_ORDER npiggin
` (13 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: mm-export-prep_compound_page.patch --]
[-- Type: text/plain, Size: 1428 bytes --]
hugetlb will need to get compound pages from bootmem to handle the case
where their order is greater than or equal to MAX_ORDER. Export the
constructor function needed for this.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/internal.h | 2 ++
mm/page_alloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -13,6 +13,8 @@
#include <linux/mm.h>
+extern void prep_compound_page(struct page *page, unsigned long order);
+
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -273,7 +273,7 @@ static void free_compound_page(struct pa
__free_pages_ok(page, compound_order(page));
}
-static void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;
--
* Re: [patch 10/23] mm: export prep_compound_page to mm
2008-05-25 14:23 ` [patch 10/23] mm: export prep_compound_page to mm npiggin
@ 2008-05-27 21:05 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:05 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (mm-export-prep_compound_page.patch)
> hugetlb will need to get compound pages from bootmem to handle the case
> where their order is greater than or equal to MAX_ORDER. Export the
> constructor function needed for this.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 11/23] hugetlb: support larger than MAX_ORDER
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (9 preceding siblings ...)
2008-05-25 14:23 ` [patch 10/23] mm: export prep_compound_page to mm npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:23 ` Adam Litke
2008-05-25 14:23 ` [patch 12/23] hugetlb: support boot allocate different sizes npiggin
` (12 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-unlimited-order.patch --]
[-- Type: text/plain, Size: 5182 bytes --]
This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
not practical to enlarge MAX_ORDER to 1GB.
Instead the 1GB pages are only allocated at boot using the bootmem
allocator using the hugepages=... option.
These 1G bootmem pages are never freed. In theory it would be possible
to implement that with some complications, but since it would be a one-way
street (>= MAX_ORDER pages cannot be allocated later) I decided not to
for now.
The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very big
and the ifdef ugliness did not seem worth it.
Known problems: /proc/meminfo and "free" do not display the memory
allocated for GB pages in "Total". This is a little confusing for the
user.
Acked-by: Andrew Hastings <abh@cray.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 72 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/bootmem.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -167,7 +168,7 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (h->surplus_huge_pages_node[nid]) {
+ if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -228,6 +229,9 @@ static struct page *alloc_fresh_huge_pag
{
struct page *page;
+ if (h->order >= MAX_ORDER)
+ return NULL;
+
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
@@ -294,6 +298,9 @@ static struct page *alloc_buddy_huge_pag
struct page *page;
unsigned int nid;
+ if (h->order >= MAX_ORDER)
+ return NULL;
+
/*
* Assume we will successfully allocate the surplus page to
* prevent racing processes from causing the surplus to exceed
@@ -470,6 +477,10 @@ static void return_unused_surplus_pages(
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
+ /* Cannot return gigantic pages currently */
+ if (h->order >= MAX_ORDER)
+ return;
+
nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
while (remaining_iterations-- && nr_pages) {
@@ -549,6 +560,51 @@ static struct page *alloc_huge_page(stru
return page;
}
+static __initdata LIST_HEAD(huge_boot_pages);
+
+struct huge_bootmem_page {
+ struct list_head list;
+ struct hstate *hstate;
+};
+
+static int __init alloc_bootmem_huge_page(struct hstate *h)
+{
+ struct huge_bootmem_page *m;
+ int nr_nodes = nodes_weight(node_online_map);
+
+ while (nr_nodes) {
+ m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
+ huge_page_size(h), huge_page_size(h),
+ 0);
+ if (m)
+ goto found;
+ hstate_next_node(h);
+ nr_nodes--;
+ }
+ return 0;
+
+found:
+ BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+ /* Put them into a private list first because mem_map is not up yet */
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ return 1;
+}
+
+/* Put bootmem huge pages into the standard lists after mem_map is up */
+static void __init gather_bootmem_prealloc(void)
+{
+ struct huge_bootmem_page *m;
+ list_for_each_entry (m, &huge_boot_pages, list) {
+ struct page *page = virt_to_page(m);
+ struct hstate *h = m->hstate;
+ __ClearPageReserved(page);
+ WARN_ON(page_count(page) != 1);
+ prep_compound_page(page, h->order);
+ prep_new_huge_page(h, page, page_to_nid(page));
+ }
+}
+
static void __init hugetlb_init_one_hstate(struct hstate *h)
{
unsigned long i;
@@ -559,7 +615,10 @@ static void __init hugetlb_init_one_hsta
h->hugetlb_next_nid = first_node(node_online_map);
for (i = 0; i < h->max_huge_pages; ++i) {
- if (!alloc_fresh_huge_page(h))
+ if (h->order >= MAX_ORDER) {
+ if (!alloc_bootmem_huge_page(h))
+ break;
+ } else if (!alloc_fresh_huge_page(h))
break;
}
h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
@@ -596,6 +655,8 @@ static int __init hugetlb_init(void)
hugetlb_init_hstates();
+ gather_bootmem_prealloc();
+
report_hugepages();
return 0;
@@ -652,6 +713,9 @@ static void try_to_free_low(struct hstat
{
int i;
+ if (h->order >= MAX_ORDER)
+ return;
+
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
@@ -681,6 +745,12 @@ set_max_huge_pages(struct hstate *h, uns
*err = 0;
+ if (h->order >= MAX_ORDER) {
+ if (count != h->max_huge_pages)
+ *err = -EINVAL;
+ return h->max_huge_pages;
+ }
+
/*
* Increase the pool size
* First take pages out of surplus state. Then make up the
--
* Re: [patch 11/23] hugetlb: support larger than MAX_ORDER
2008-05-25 14:23 ` [patch 11/23] hugetlb: support larger than MAX_ORDER npiggin
@ 2008-05-27 21:23 ` Adam Litke
2008-05-28 10:22 ` Nick Piggin
0 siblings, 1 reply; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:23 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> @@ -549,6 +560,51 @@ static struct page *alloc_huge_page(stru
> return page;
> }
>
> +static __initdata LIST_HEAD(huge_boot_pages);
> +
> +struct huge_bootmem_page {
> + struct list_head list;
> + struct hstate *hstate;
> +};
> +
> +static int __init alloc_bootmem_huge_page(struct hstate *h)
> +{
> + struct huge_bootmem_page *m;
> + int nr_nodes = nodes_weight(node_online_map);
> +
> + while (nr_nodes) {
> + m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
> + huge_page_size(h), huge_page_size(h),
> + 0);
> + if (m)
> + goto found;
> + hstate_next_node(h);
> + nr_nodes--;
> + }
> + return 0;
> +
> +found:
> + BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
> + /* Put them into a private list first because mem_map is not up yet */
> + list_add(&m->list, &huge_boot_pages);
> + m->hstate = h;
> + return 1;
> +}
At first I was pretty confused by how you are directly using the
newly-allocated bootmem page to create a temporary list until the mem
map comes up. Clever. I bet I would have understood right away if it
were written like the following:
void *vaddr;
struct huge_bootmem_page *m;
vaddr = __alloc_bootmem_node_nopanic(...);
if (vaddr) {
/*
* Use the beginning of this block to store some temporary
* meta-data until the mem_map comes up.
*/
m = (struct huge_bootmem_page *) vaddr;
goto found;
}
If you don't like that level of verbosity, could we add a comment just
to make it immediately clear to the reader?
> +/* Put bootmem huge pages into the standard lists after mem_map is up */
> +static void __init gather_bootmem_prealloc(void)
> +{
> + struct huge_bootmem_page *m;
> + list_for_each_entry (m, &huge_boot_pages, list) {
> + struct page *page = virt_to_page(m);
> + struct hstate *h = m->hstate;
> + __ClearPageReserved(page);
> + WARN_ON(page_count(page) != 1);
> + prep_compound_page(page, h->order);
> + prep_new_huge_page(h, page, page_to_nid(page));
> + }
> +}
> +
> static void __init hugetlb_init_one_hstate(struct hstate *h)
> {
> unsigned long i;
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [patch 11/23] hugetlb: support larger than MAX_ORDER
2008-05-27 21:23 ` Adam Litke
@ 2008-05-28 10:22 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 10:22 UTC (permalink / raw)
To: Adam Litke; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Tue, May 27, 2008 at 04:23:38PM -0500, Adam Litke wrote:
> On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> > @@ -549,6 +560,51 @@ static struct page *alloc_huge_page(stru
> > return page;
> > }
> >
> > +static __initdata LIST_HEAD(huge_boot_pages);
> > +
> > +struct huge_bootmem_page {
> > + struct list_head list;
> > + struct hstate *hstate;
> > +};
> > +
> > +static int __init alloc_bootmem_huge_page(struct hstate *h)
> > +{
> > + struct huge_bootmem_page *m;
> > + int nr_nodes = nodes_weight(node_online_map);
> > +
> > + while (nr_nodes) {
> > + m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
> > + huge_page_size(h), huge_page_size(h),
> > + 0);
> > + if (m)
> > + goto found;
> > + hstate_next_node(h);
> > + nr_nodes--;
> > + }
> > + return 0;
> > +
> > +found:
> > + BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
> > + /* Put them into a private list first because mem_map is not up yet */
> > + list_add(&m->list, &huge_boot_pages);
> > + m->hstate = h;
> > + return 1;
> > +}
>
> At first I was pretty confused by how you are directly using the
> newly-allocated bootmem page to create a temporary list until the mem
> map comes up. Clever. I bet I would have understood right away if it
Just a note that Andi wrote it.
> were written like the following:
>
> void *vaddr;
> struct huge_bootmem_page *m;
>
> vaddr = __alloc_bootmem_node_nopanic(...);
> if (vaddr) {
> /*
> * Use the beginning of this block to store some temporary
> * meta-data until the mem_map comes up.
> */
> m = (struct huge_bootmem_page *) vaddr;
> goto found;
> }
>
> If you don't like that level of verbosity, could we add a comment just
> to make it immediately clear to the reader?
Yeah OK.
* [patch 12/23] hugetlb: support boot allocate different sizes
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (10 preceding siblings ...)
2008-05-25 14:23 ` [patch 11/23] hugetlb: support larger than MAX_ORDER npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 17:04 ` Nishanth Aravamudan
2008-05-27 21:28 ` Adam Litke
2008-05-25 14:23 ` [patch 13/23] hugetlb: printk cleanup npiggin
` (11 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-different-page-sizes.patch --]
[-- Type: text/plain, Size: 2061 bytes --]
Acked-by: Andrew Hastings <abh@cray.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -609,10 +609,13 @@ static void __init hugetlb_init_one_hsta
{
unsigned long i;
- for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ /* Don't reinitialize lists if they have been already init'ed */
+ if (!h->hugepage_freelists[0].next) {
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- h->hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
+ }
for (i = 0; i < h->max_huge_pages; ++i) {
if (h->order >= MAX_ORDER) {
@@ -621,7 +624,7 @@ static void __init hugetlb_init_one_hsta
} else if (!alloc_fresh_huge_page(h))
break;
}
- h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ h->max_huge_pages = i;
}
static void __init hugetlb_init_hstates(void)
@@ -629,7 +632,10 @@ static void __init hugetlb_init_hstates(
struct hstate *h;
for_each_hstate(h) {
- hugetlb_init_one_hstate(h);
+ /* oversize hugepages were init'ed in early boot */
+ if (h->order < MAX_ORDER)
+ hugetlb_init_one_hstate(h);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
}
}
@@ -692,6 +698,14 @@ static int __init hugetlb_setup(char *s)
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+ /*
+ * Global state is always initialized later in hugetlb_init.
+ * But we need to allocate >= MAX_ORDER hstates here early to still
+ * use the bootmem allocator.
+ */
+ if (max_hstate > 0 && parsed_hstate->order >= MAX_ORDER)
+ hugetlb_init_one_hstate(parsed_hstate);
+
return 1;
}
__setup("hugepages=", hugetlb_setup);
--
* Re: [patch 12/23] hugetlb: support boot allocate different sizes
2008-05-25 14:23 ` [patch 12/23] hugetlb: support boot allocate different sizes npiggin
@ 2008-05-27 17:04 ` Nishanth Aravamudan
2008-05-27 21:28 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 17:04 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara
On 26.05.2008 [00:23:29 +1000], npiggin@suse.de wrote:
> Acked-by: Andrew Hastings <abh@cray.com>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 12/23] hugetlb: support boot allocate different sizes
2008-05-25 14:23 ` [patch 12/23] hugetlb: support boot allocate different sizes npiggin
2008-05-27 17:04 ` Nishanth Aravamudan
@ 2008-05-27 21:28 ` Adam Litke
2008-05-28 10:57 ` Nick Piggin
1 sibling, 1 reply; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:28 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
Seems nice, but what exactly is this patch for? From reading the code
it would seem that this allows more than one >MAX_ORDER hstate to exist
and removes assumptions about their positioning within the hstates
array? A small patch leader would definitely clear up my confusion.
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-different-page-sizes.patch)
> Acked-by: Andrew Hastings <abh@cray.com>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -609,10 +609,13 @@ static void __init hugetlb_init_one_hsta
> {
> unsigned long i;
>
> - for (i = 0; i < MAX_NUMNODES; ++i)
> - INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> + /* Don't reinitialize lists if they have been already init'ed */
> + if (!h->hugepage_freelists[0].next) {
> + for (i = 0; i < MAX_NUMNODES; ++i)
> + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>
> - h->hugetlb_next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = first_node(node_online_map);
> + }
>
> for (i = 0; i < h->max_huge_pages; ++i) {
> if (h->order >= MAX_ORDER) {
> @@ -621,7 +624,7 @@ static void __init hugetlb_init_one_hsta
> } else if (!alloc_fresh_huge_page(h))
> break;
> }
> - h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> + h->max_huge_pages = i;
> }
>
> static void __init hugetlb_init_hstates(void)
> @@ -629,7 +632,10 @@ static void __init hugetlb_init_hstates(
> struct hstate *h;
>
> for_each_hstate(h) {
> - hugetlb_init_one_hstate(h);
> + /* oversize hugepages were init'ed in early boot */
> + if (h->order < MAX_ORDER)
> + hugetlb_init_one_hstate(h);
> + max_huge_pages[h - hstates] = h->max_huge_pages;
> }
> }
>
> @@ -692,6 +698,14 @@ static int __init hugetlb_setup(char *s)
> if (sscanf(s, "%lu", mhp) <= 0)
> *mhp = 0;
>
> + /*
> + * Global state is always initialized later in hugetlb_init.
> + * But we need to allocate >= MAX_ORDER hstates here early to still
> + * use the bootmem allocator.
> + */
> + if (max_hstate > 0 && parsed_hstate->order >= MAX_ORDER)
> + hugetlb_init_one_hstate(parsed_hstate);
> +
> return 1;
> }
> __setup("hugepages=", hugetlb_setup);
>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [patch 12/23] hugetlb: support boot allocate different sizes
2008-05-27 21:28 ` Adam Litke
@ 2008-05-28 10:57 ` Nick Piggin
2008-05-28 14:01 ` Nick Piggin
0 siblings, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 10:57 UTC (permalink / raw)
To: Adam Litke; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Tue, May 27, 2008 at 04:28:55PM -0500, Adam Litke wrote:
> Seems nice, but what exactly is this patch for? From reading the code
> it would seem that this allows more than one >MAX_ORDER hstate to exist
> and removes assumptions about their positioning within the hstates
> array? A small patch leader would definitely clear up my confusion.
Yes, I guess it allows hugetlb_init_one_hstate to be called multiple
times on an hstate, and also adds some logic dealing with giant page setup.
Though hmm, possibly it can be made a little cleaner by separating
hstate init from the actual page allocation a little more. I'll have
a look but it is kind of tricky... otherwise I can try a changelog.
>
> On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> > plain text document attachment (hugetlb-different-page-sizes.patch)
> > Acked-by: Andrew Hastings <abh@cray.com>
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > mm/hugetlb.c | 24 +++++++++++++++++++-----
> > 1 file changed, 19 insertions(+), 5 deletions(-)
> >
> > Index: linux-2.6/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.orig/mm/hugetlb.c
> > +++ linux-2.6/mm/hugetlb.c
> > @@ -609,10 +609,13 @@ static void __init hugetlb_init_one_hsta
> > {
> > unsigned long i;
> >
> > - for (i = 0; i < MAX_NUMNODES; ++i)
> > - INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> > + /* Don't reinitialize lists if they have been already init'ed */
> > + if (!h->hugepage_freelists[0].next) {
> > + for (i = 0; i < MAX_NUMNODES; ++i)
> > + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> >
> > - h->hugetlb_next_nid = first_node(node_online_map);
> > + h->hugetlb_next_nid = first_node(node_online_map);
> > + }
> >
> > for (i = 0; i < h->max_huge_pages; ++i) {
> > if (h->order >= MAX_ORDER) {
> > @@ -621,7 +624,7 @@ static void __init hugetlb_init_one_hsta
> > } else if (!alloc_fresh_huge_page(h))
> > break;
> > }
> > - h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> > + h->max_huge_pages = i;
> > }
> >
> > static void __init hugetlb_init_hstates(void)
> > @@ -629,7 +632,10 @@ static void __init hugetlb_init_hstates(
> > struct hstate *h;
> >
> > for_each_hstate(h) {
> > - hugetlb_init_one_hstate(h);
> > + /* oversize hugepages were init'ed in early boot */
> > + if (h->order < MAX_ORDER)
> > + hugetlb_init_one_hstate(h);
> > + max_huge_pages[h - hstates] = h->max_huge_pages;
> > }
> > }
> >
> > @@ -692,6 +698,14 @@ static int __init hugetlb_setup(char *s)
> > if (sscanf(s, "%lu", mhp) <= 0)
> > *mhp = 0;
> >
> > + /*
> > + * Global state is always initialized later in hugetlb_init.
> > + * But we need to allocate >= MAX_ORDER hstates here early to still
> > + * use the bootmem allocator.
> > + */
> > + if (max_hstate > 0 && parsed_hstate->order >= MAX_ORDER)
> > + hugetlb_init_one_hstate(parsed_hstate);
> > +
> > return 1;
> > }
> > __setup("hugepages=", hugetlb_setup);
> >
> --
> Adam Litke - (agl at us.ibm.com)
> IBM Linux Technology Center
* Re: [patch 12/23] hugetlb: support boot allocate different sizes
2008-05-28 10:57 ` Nick Piggin
@ 2008-05-28 14:01 ` Nick Piggin
2008-05-28 14:35 ` Adam Litke
0 siblings, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 14:01 UTC (permalink / raw)
To: Adam Litke; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Wed, May 28, 2008 at 12:57:59PM +0200, Nick Piggin wrote:
> On Tue, May 27, 2008 at 04:28:55PM -0500, Adam Litke wrote:
> > Seems nice, but what exactly is this patch for? From reading the code
> > it would seem that this allows more than one >MAX_ORDER hstate to exist
> > and removes assumptions about their positioning within the hstates
> > array? A small patch leader would definitely clear up my confusion.
>
> Yes, I guess it allows hugetlb_init_one_hstate to be called multiple
> times on an hstate, and also adds some logic dealing with giant page setup.
>
> Though hmm, possibly it can be made a little cleaner by separating
> hstate init from the actual page allocation a little more. I'll have
> a look but it is kind of tricky... otherwise I can try a changelog.
This is how I've made the patch:
---
hugetlb: support boot allocate different sizes
Make some infrastructure changes to allow boot allocation of different
hugepage page sizes.
- move all basic hstate initialisation into hugetlb_add_hstate
- create a new function hugetlb_hstate_alloc_pages() to do the
actual initial page allocations. Call this function early in
order to allocate giant pages from bootmem.
- Check for multiple hugepages= parameters
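To illustrate the last point (hypothetical command lines, assuming the
x86-64 hugepagesz= support from later in this series): a boot line like

	hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=4

is accepted, while

	hugepages=512 hugepages=4

now triggers the "specified twice without interleaving hugepagesz="
warning and the second value is ignored.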
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Andrew Hastings <abh@cray.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -617,15 +617,10 @@ static void __init gather_bootmem_preall
}
}
-static void __init hugetlb_init_one_hstate(struct hstate *h)
+static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
unsigned long i;
- for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&h->hugepage_freelists[i]);
-
- h->hugetlb_next_nid = first_node(node_online_map);
-
for (i = 0; i < h->max_huge_pages; ++i) {
if (h->order >= MAX_ORDER) {
if (!alloc_bootmem_huge_page(h))
@@ -633,7 +628,7 @@ static void __init hugetlb_init_one_hsta
} else if (!alloc_fresh_huge_page(h))
break;
}
- h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ h->max_huge_pages = i;
}
static void __init hugetlb_init_hstates(void)
@@ -641,7 +636,10 @@ static void __init hugetlb_init_hstates(
struct hstate *h;
for_each_hstate(h) {
- hugetlb_init_one_hstate(h);
+ /* oversize hugepages were init'ed in early boot */
+ if (h->order < MAX_ORDER)
+ hugetlb_hstate_alloc_pages(h);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
}
}
@@ -679,6 +677,8 @@ module_init(hugetlb_init);
void __init hugetlb_add_hstate(unsigned order)
{
struct hstate *h;
+ unsigned long i;
+
if (size_to_hstate(PAGE_SIZE << order)) {
printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
return;
@@ -688,13 +688,19 @@ void __init hugetlb_add_hstate(unsigned
h = &hstates[max_hstate++];
h->order = order;
h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
- hugetlb_init_one_hstate(h);
+ h->nr_huge_pages = 0;
+ h->free_huge_pages = 0;
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ h->hugetlb_next_nid = first_node(node_online_map);
+
parsed_hstate = h;
}
static int __init hugetlb_setup(char *s)
{
unsigned long *mhp;
+ static unsigned long *last_mhp;
/*
* !max_hstate means we haven't parsed a hugepagesz= parameter yet,
@@ -705,9 +711,24 @@ static int __init hugetlb_setup(char *s)
else
mhp = &parsed_hstate->max_huge_pages;
+ if (mhp == last_mhp) {
+ printk(KERN_WARNING "hugepages= specified twice without interleaving hugepagesz=, ignoring\n");
+ return 1;
+ }
+
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+ /*
+ * Global state is always initialized later in hugetlb_init.
+ * But we need to allocate >= MAX_ORDER hstates here early to still
+ * use the bootmem allocator.
+ */
+ if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+ hugetlb_hstate_alloc_pages(parsed_hstate);
+
+ last_mhp = mhp;
+
return 1;
}
__setup("hugepages=", hugetlb_setup);
* Re: [patch 12/23] hugetlb: support boot allocate different sizes
2008-05-28 14:01 ` Nick Piggin
@ 2008-05-28 14:35 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-28 14:35 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, kniht, andi, nacc, abh, joachim.deguara, Andi Kleen
On Wed, 2008-05-28 at 16:01 +0200, Nick Piggin wrote:
> On Wed, May 28, 2008 at 12:57:59PM +0200, Nick Piggin wrote:
> > On Tue, May 27, 2008 at 04:28:55PM -0500, Adam Litke wrote:
> > > Seems nice, but what exactly is this patch for? From reading the code
> > > it would seem that this allows more than one >MAX_ORDER hstate to exist
> > > and removes assumptions about their positioning within the hstates
> > > array? A small patch leader would definitely clear up my confusion.
> >
> > Yes, I guess it allows hugetlb_init_one_hstate to be called multiple
> > times on an hstate, and also adds some logic dealing with giant page setup.
> >
> > Though hmm, possibly it can be made a little cleaner by separating
> > hstate init from the actual page allocation a little more. I'll have
> > a look but it is kind of tricky... otherwise I can try a changelog.
>
> This is how I've made the patch:
Thanks. That's a lot clearer to me.
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 13/23] hugetlb: printk cleanup
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (11 preceding siblings ...)
2008-05-25 14:23 ` [patch 12/23] hugetlb: support boot allocate different sizes npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 17:05 ` Nishanth Aravamudan
2008-05-27 21:30 ` Adam Litke
2008-05-25 14:23 ` [patch 14/23] hugetlb: introduce huge_pud npiggin
` (10 subsequent siblings)
23 siblings, 2 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlb-printk-cleanup.patch --]
[-- Type: text/plain, Size: 1517 bytes --]
- Reword sentence to clarify meaning with multiple options
- Add support for using GB prefixes for the page size
- Add extra printk to delayed > MAX_ORDER allocation code
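With this applied, the boot log would read along these lines (illustrative
values):

	HugeTLB registered 2 MB page size, pre-allocated 512 pages
	HugeTLB registered 1 GB page size, pre-allocated 4 pages

replacing the old single "Total HugeTLB memory allocated" line.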
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -639,15 +639,28 @@ static void __init hugetlb_init_hstates(
}
}
+static char * __init memfmt(char *buf, unsigned long n)
+{
+ if (n >= (1UL << 30))
+ sprintf(buf, "%lu GB", n >> 30);
+ else if (n >= (1UL << 20))
+ sprintf(buf, "%lu MB", n >> 20);
+ else
+ sprintf(buf, "%lu KB", n >> 10);
+ return buf;
+}
+
static void __init report_hugepages(void)
{
struct hstate *h;
for_each_hstate(h) {
- printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
- h->free_huge_pages,
- 1 << (h->order + PAGE_SHIFT - 20));
- }
+ char buf[32];
+ printk(KERN_INFO "HugeTLB registered %s page size, "
+ "pre-allocated %ld pages\n",
+ memfmt(buf, huge_page_size(h)),
+ h->free_huge_pages);
+ }
}
static int __init hugetlb_init(void)
--
* Re: [patch 13/23] hugetlb: printk cleanup
2008-05-25 14:23 ` [patch 13/23] hugetlb: printk cleanup npiggin
@ 2008-05-27 17:05 ` Nishanth Aravamudan
2008-05-27 21:30 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 17:05 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Andi Kleen
On 26.05.2008 [00:23:30 +1000], npiggin@suse.de wrote:
> - Reword sentence to clarify meaning with multiple options
> - Add support for using GB prefixes for the page size
> - Add extra printk to delayed > MAX_ORDER allocation code
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 13/23] hugetlb: printk cleanup
2008-05-25 14:23 ` [patch 13/23] hugetlb: printk cleanup npiggin
2008-05-27 17:05 ` Nishanth Aravamudan
@ 2008-05-27 21:30 ` Adam Litke
1 sibling, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:30 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-printk-cleanup.patch)
> - Reword sentence to clarify meaning with multiple options
> - Add support for using GB prefixes for the page size
> - Add extra printk to delayed > MAX_ORDER allocation code
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 14/23] hugetlb: introduce huge_pud
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (12 preceding siblings ...)
2008-05-25 14:23 ` [patch 13/23] hugetlb: printk cleanup npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-26 11:09 ` Hugh Dickins
2008-05-25 14:23 ` [patch 15/23] x86: support GB hugepages on 64-bit npiggin
` (9 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: hugetlbfs-huge_pud.patch --]
[-- Type: text/plain, Size: 6220 bytes --]
Straightforward extensions for huge pages located in the PUD
instead of PMDs.
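A sketch of where the new level sits in a lookup such as huge_pte_offset()
(x86-64 naming; illustrative only, the real checks come with the x86 patch
later in the series):

	pgd = pgd_offset(mm, addr);
	pud = pud_offset(pgd, addr);
	if (pud_huge(*pud))		/* PSE set: the pud entry maps 1GB */
		return (pte_t *)pud;
	pmd = pmd_offset(pud, addr);	/* otherwise descend as before;
					   2MB huge pages are pmd leaves */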
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/ia64/mm/hugetlbpage.c | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 5 +++++
arch/sh/mm/hugetlbpage.c | 5 +++++
arch/sparc64/mm/hugetlbpage.c | 5 +++++
arch/x86/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 10 +++++++++-
8 files changed, 68 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -45,7 +45,10 @@ struct page *follow_huge_addr(struct mm_
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
+struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
+int pud_huge(pud_t pud);
void hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
@@ -68,8 +71,10 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_meminfo(buf) 0
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
+#define follow_huge_pud(mm, addr, pud, write) NULL
#define prepare_hugepage_range(file, addr, len) (-EINVAL)
#define pmd_huge(x) 0
+#define pud_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -106,6 +106,12 @@ int pmd_huge(pmd_t pmd)
{
return 0;
}
+
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
{
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -368,6 +368,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
+++ linux-2.6/arch/sh/mm/hugetlbpage.c
@@ -78,6 +78,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -294,6 +294,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -188,6 +188,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -208,6 +213,11 @@ int pmd_huge(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PSE);
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -216,9 +226,22 @@ follow_huge_pmd(struct mm_struct *mm, un
page = pte_page(*(pte_t *)pmd);
if (page)
- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+ page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
return page;
}
+
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ struct page *page;
+
+ page = pte_page(*(pte_t *)pud);
+ if (page)
+ page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
+ return page;
+}
+
#endif
/* x86_64 also uses this file */
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -1275,6 +1275,15 @@ int hugetlb_fault(struct mm_struct *mm,
return ret;
}
+/* Can be overridden by architectures */
+__attribute__((weak)) struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ BUG();
+ return NULL;
+}
+
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -998,7 +998,13 @@ struct page *follow_page(struct vm_area_
pud = pud_offset(pgd, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto no_page_table;
-
+
+ if (pud_huge(*pud)) {
+ BUG_ON(flags & FOLL_GET);
+ page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
+ goto out;
+ }
+
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
goto no_page_table;
@@ -1534,6 +1540,8 @@ static int apply_to_pmd_range(struct mm_
unsigned long next;
int err;
+ BUG_ON(pud_huge(*pud));
+
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
return -ENOMEM;
--
* Re: [patch 14/23] hugetlb: introduce huge_pud
2008-05-25 14:23 ` [patch 14/23] hugetlb: introduce huge_pud npiggin
@ 2008-05-26 11:09 ` Hugh Dickins
2008-05-27 2:24 ` Nick Piggin
0 siblings, 1 reply; 88+ messages in thread
From: Hugh Dickins @ 2008-05-26 11:09 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, nacc, agl, abh, joachim.deguara,
Andi Kleen
On Mon, 26 May 2008, npiggin@suse.de wrote:
> Straightforward extensions for huge pages located in the PUD
> instead of PMDs.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Sorry, I've not looked through all these, but the subject of this one
(which should say "pud_huge" rather than "huge_pud") led me to check:
please take a look at commit aeed5fce37196e09b4dac3a1c00d8b7122e040ce,
I believe your follow_page will need to try pud_huge before pud_bad.
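[Editorial sketch of the reordering being suggested here, mirroring the
pmd-level fix in that commit -- illustrative only, untested:

	pud = pud_offset(pgd, address);
	if (pud_none(*pud))
		goto no_page_table;
	if (pud_huge(*pud)) {
		BUG_ON(flags & FOLL_GET);
		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
		goto out;
	}
	if (unlikely(pud_bad(*pud)))
		goto no_page_table;
]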
Though note in the comment to that commit, I'm dubious whether we
can ever actually hit that case, or need follow_huge_pmd (or your
follow_huge_pud) at all: please cross check, you might prefer to
delete the huge pmd code there rather than add huge pud code,
if you agree that there's actually no way we need it.
Hugh
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -998,7 +998,13 @@ struct page *follow_page(struct vm_area_
> pud = pud_offset(pgd, address);
> if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> goto no_page_table;
> -
> +
> + if (pud_huge(*pud)) {
> + BUG_ON(flags & FOLL_GET);
> + page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> + goto out;
> + }
> +
> pmd = pmd_offset(pud, address);
> if (pmd_none(*pmd))
> goto no_page_table;
* Re: [patch 14/23] hugetlb: introduce huge_pud
2008-05-26 11:09 ` Hugh Dickins
@ 2008-05-27 2:24 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-27 2:24 UTC (permalink / raw)
To: Hugh Dickins
Cc: linux-mm, kniht, andi, nacc, agl, abh, joachim.deguara,
Andi Kleen
On Mon, May 26, 2008 at 12:09:05PM +0100, Hugh Dickins wrote:
> On Mon, 26 May 2008, npiggin@suse.de wrote:
> > Straightforward extensions for huge pages located in the PUD
> > instead of PMDs.
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> Sorry, I've not looked through all these, but the subject of this one
> (which should say "pud_huge" rather than "huge_pud") led me to check:
> please take a look at commit aeed5fce37196e09b4dac3a1c00d8b7122e040ce,
> I believe your follow_page will need to try pud_huge before pud_bad.
Ah, you're right there, yes thanks.
> Though note in the comment to that commit, I'm dubious whether we
> can ever actually hit that case, or need follow_huge_pmd (or your
> follow_huge_pud) at all: please cross check, you might prefer to
> delete the huge pmd code there rather than add huge pud code,
> if you agree that there's actually no way we need it.
Haven't had a look yet, but I'll probably leave that for another
person or time to do.
Thanks,
Nick
>
> Hugh
>
> > --- linux-2.6.orig/mm/memory.c
> > +++ linux-2.6/mm/memory.c
> > @@ -998,7 +998,13 @@ struct page *follow_page(struct vm_area_
> > pud = pud_offset(pgd, address);
> > if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > goto no_page_table;
> > -
> > +
> > + if (pud_huge(*pud)) {
> > + BUG_ON(flags & FOLL_GET);
> > + page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> > + goto out;
> > + }
> > +
> > pmd = pmd_offset(pud, address);
> > if (pmd_none(*pmd))
> > goto no_page_table;
* [patch 15/23] x86: support GB hugepages on 64-bit
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (13 preceding siblings ...)
2008-05-25 14:23 ` [patch 14/23] hugetlb: introduce huge_pud npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:35 ` Adam Litke
2008-05-25 14:23 ` [patch 16/23] x86: add hugepagesz option " npiggin
` (8 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: x86-support-GB-hugetlb-pages.patch --]
[-- Type: text/plain, Size: 3897 bytes --]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/x86/mm/hugetlbpage.c | 33 ++++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -133,9 +133,14 @@ pte_t *huge_pte_alloc(struct mm_struct *
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (pud) {
- if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ if (sz == PUD_SIZE) {
+ pte = (pte_t *)pud;
+ } else {
+ BUG_ON(sz != PMD_SIZE);
+ if (pud_none(*pud))
+ huge_pmd_share(mm, addr, pud);
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ }
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
@@ -151,8 +156,11 @@ pte_t *huge_pte_offset(struct mm_struct
pgd = pgd_offset(mm, addr);
if (pgd_present(*pgd)) {
pud = pud_offset(pgd, addr);
- if (pud_present(*pud))
+ if (pud_present(*pud)) {
+ if (pud_large(*pud))
+ return (pte_t *)pud;
pmd = pmd_offset(pud, addr);
+ }
}
return (pte_t *) pmd;
}
@@ -215,7 +223,7 @@ int pmd_huge(pmd_t pmd)
int pud_huge(pud_t pud)
{
- return 0;
+ return !!(pud_val(pud) & _PAGE_PSE);
}
struct page *
@@ -251,6 +259,7 @@ static unsigned long hugetlb_get_unmappe
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
@@ -263,7 +272,7 @@ static unsigned long hugetlb_get_unmappe
}
full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -285,7 +294,7 @@ full_search:
}
if (addr + mm->cached_hole_size < vma->vm_start)
mm->cached_hole_size = vma->vm_start - addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
@@ -293,6 +302,7 @@ static unsigned long hugetlb_get_unmappe
unsigned long addr0, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev_vma;
unsigned long base = mm->mmap_base, addr = addr0;
@@ -313,7 +323,7 @@ try_again:
goto fail;
/* either no address requested or cant fit in requested address hole */
- addr = (mm->free_area_cache - len) & HPAGE_MASK;
+ addr = (mm->free_area_cache - len) & huge_page_mask(h);
do {
/*
* Lookup failure means no vma is above this address,
@@ -344,7 +354,7 @@ try_again:
largest_hole = vma->vm_start - addr;
/* try just below the current vma->vm_start */
- addr = (vma->vm_start - len) & HPAGE_MASK;
+ addr = (vma->vm_start - len) & huge_page_mask(h);
} while (len <= vma->vm_start);
fail:
@@ -382,10 +392,11 @@ unsigned long
hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -397,7 +408,7 @@ hugetlb_get_unmapped_area(struct file *f
}
if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
--
* Re: [patch 15/23] x86: support GB hugepages on 64-bit
2008-05-25 14:23 ` [patch 15/23] x86: support GB hugepages on 64-bit npiggin
@ 2008-05-27 21:35 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:35 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (x86-support-GB-hugetlb-pages.patch)
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 16/23] x86: add hugepagesz option on 64-bit
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (14 preceding siblings ...)
2008-05-25 14:23 ` [patch 15/23] x86: support GB hugepages on 64-bit npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-25 14:23 ` [patch 17/23] hugetlb: do not always register default HPAGE_SIZE huge page size npiggin
` (7 subsequent siblings)
23 siblings, 0 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Andi Kleen
[-- Attachment #1: x86-64-implement-hugepagesz.patch --]
[-- Type: text/plain, Size: 3031 bytes --]
Add a hugepagesz=... option similar to IA64, PPC etc. to x86-64.
This finally allows selecting GB pages for hugetlbfs on x86 now
that all the infrastructure is in place.
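For example, on a CPU advertising "pdpe1gb", a boot line such as the
following (illustrative values) reserves two 1GB pages plus 512 2MB pages:

	hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=512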
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Documentation/kernel-parameters.txt | 11 +++++++++--
arch/x86/mm/hugetlbpage.c | 17 +++++++++++++++++
include/asm-x86/page.h | 2 ++
3 files changed, 28 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -424,3 +424,20 @@ hugetlb_get_unmapped_area(struct file *f
#endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
+#ifdef CONFIG_X86_64
+static __init int setup_hugepagesz(char *opt)
+{
+ unsigned long ps = memparse(opt, &opt);
+ if (ps == PMD_SIZE) {
+ hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+ } else if (ps == PUD_SIZE && cpu_has_gbpages) {
+ hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+ } else {
+ printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
+ ps >> 20);
+ return 0;
+ }
+ return 1;
+}
+__setup("hugepagesz=", setup_hugepagesz);
+#endif
Index: linux-2.6/include/asm-x86/page.h
===================================================================
--- linux-2.6.orig/include/asm-x86/page.h
+++ linux-2.6/include/asm-x86/page.h
@@ -29,6 +29,8 @@
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE 2
+
/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -737,8 +737,15 @@ and is between 256 and 4096 characters.
hisax= [HW,ISDN]
See Documentation/isdn/README.HiSax.
- hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages.
- hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages.
+ hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+ On x86 this option can be specified multiple times
+ interleaved with hugepages= to reserve huge pages
+ of different sizes. Valid page sizes on x86-64
+ are 2M (when the CPU supports "pse") and 1G (when the
+ CPU supports the "pdpe1gb" cpuinfo flag).
+ Note that 1GB pages can only be allocated at boot time
+ using hugepages= and not freed afterwards.
i8042.direct [HW] Put keyboard port into non-translated mode
i8042.dumbkbd [HW] Pretend that controller can only read data from
--
* [patch 17/23] hugetlb: do not always register default HPAGE_SIZE huge page size
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (15 preceding siblings ...)
2008-05-25 14:23 ` [patch 16/23] x86: add hugepagesz option " npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:39 ` Adam Litke
2008-05-25 14:23 ` [patch 18/23] hugetlb: allow arch overridden hugepage allocation npiggin
` (6 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara
[-- Attachment #1: hugetlb-non-default-hstate.patch --]
[-- Type: text/plain, Size: 1448 bytes --]
Allow configurations without the default HPAGE_SIZE huge page size (mainly useful for
testing -- the final form of the userspace API / cmdline is not quite
nailed down).
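For instance (assuming the x86-64 hugepagesz= option from earlier in the
series), booting with only

	hugepagesz=1G hugepages=2

registers no 2MB hstate, so the size_to_hstate(HPAGE_SIZE) lookup in
hugetlbfs_fill_super() comes up empty; with this patch it falls back to
hstates[0] (here the 1GB hstate) instead.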
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/hugetlbfs/inode.c | 2 ++
mm/hugetlb.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -667,7 +667,7 @@ static int __init hugetlb_init(void)
{
BUILD_BUG_ON(HPAGE_SHIFT == 0);
- if (!size_to_hstate(HPAGE_SIZE)) {
+ if (!max_hstate) {
hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
parsed_hstate->max_huge_pages = default_hstate_max_huge_pages;
}
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -858,6 +858,8 @@ hugetlbfs_fill_super(struct super_block
config.gid = current->fsgid;
config.mode = 0755;
config.hstate = size_to_hstate(HPAGE_SIZE);
+ if (!config.hstate)
+ config.hstate = &hstates[0];
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
--
* Re: [patch 17/23] hugetlb: do not always register default HPAGE_SIZE huge page size
2008-05-25 14:23 ` [patch 17/23] hugetlb: do not always register default HPAGE_SIZE huge page size npiggin
@ 2008-05-27 21:39 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:39 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-non-default-hstate.patch)
> Allow configurations without the default HPAGE_SIZE huge page size (mainly useful for
> testing -- the final form of the userspace API / cmdline is not quite
> nailed down).
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 18/23] hugetlb: allow arch overridden hugepage allocation
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (16 preceding siblings ...)
2008-05-25 14:23 ` [patch 17/23] hugetlb: do not always register default HPAGE_SIZE huge page size npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:41 ` Adam Litke
2008-05-25 14:23 ` [patch 19/23] powerpc: function to allocate gigantic hugepages npiggin
` (5 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Jon Tollefson
[-- Attachment #1: hugetlb-allow-arch-override-hugepage-allocation.patch --]
[-- Type: text/plain, Size: 3057 bytes --]
Allow alloc_bootmem_huge_page() to be overridden by architectures that can't
always use bootmem. This requires huge_boot_pages to be available for
use by this function. The 16G pages on ppc64 have to be reserved prior
to boot time. The locations of these pages are indicated in the device
tree.
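A sketch of what such an override might look like (modelled on the ppc64
patch later in this series; names like gpage_freearray and nr_gpages come
from that patch, and details here are illustrative):

	/* arch/powerpc/mm/hugetlbpage.c: a strong definition here
	 * overrides the __attribute__((weak)) one in mm/hugetlb.c */
	int alloc_bootmem_huge_page(struct hstate *hstate)
	{
		struct huge_bootmem_page *m;

		if (nr_gpages == 0)
			return 0;
		/* the 16G region was reserved early, via the device tree */
		m = phys_to_virt(gpage_freearray[--nr_gpages]);
		gpage_freearray[nr_gpages] = 0;
		list_add(&m->list, &huge_boot_pages);
		m->hstate = hstate;
		return 1;
	}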
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/hugetlb.h | 10 ++++++++++
mm/hugetlb.c | 12 ++++--------
2 files changed, 14 insertions(+), 8 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -35,6 +35,7 @@ void hugetlb_unreserve_pages(struct inod
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
+extern struct list_head huge_boot_pages;
/* arch callbacks */
@@ -176,6 +177,14 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
};
+struct huge_bootmem_page {
+ struct list_head list;
+ struct hstate *hstate;
+};
+
+/* arch callback */
+int __init alloc_bootmem_huge_page(struct hstate *h);
+
void __init hugetlb_add_hstate(unsigned order);
struct hstate *size_to_hstate(unsigned long size);
@@ -237,6 +246,7 @@ extern unsigned long sysctl_overcommit_h
#else
struct hstate {};
+#define alloc_bootmem_huge_page(h) NULL
#define hstate_file(f) NULL
#define hstate_vma(v) NULL
#define hstate_inode(i) NULL
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -27,6 +27,7 @@ unsigned long max_huge_pages[HUGE_MAX_HS
unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
+struct list_head huge_boot_pages;
static int max_hstate = 0;
struct hstate hstates[HUGE_MAX_HSTATE];
@@ -560,14 +561,7 @@ static struct page *alloc_huge_page(stru
return page;
}
-static __initdata LIST_HEAD(huge_boot_pages);
-
-struct huge_bootmem_page {
- struct list_head list;
- struct hstate *hstate;
-};
-
-static int __init alloc_bootmem_huge_page(struct hstate *h)
+__attribute__((weak)) int alloc_bootmem_huge_page(struct hstate *h)
{
struct huge_bootmem_page *m;
int nr_nodes = nodes_weight(node_online_map);
@@ -610,6 +604,8 @@ static void __init hugetlb_init_one_hsta
unsigned long i;
/* Don't reinitialize lists if they have been already init'ed */
+ if (!huge_boot_pages.next)
+ INIT_LIST_HEAD(&huge_boot_pages);
if (!h->hugepage_freelists[0].next) {
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
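The override in this patch works through weak linkage: a definition
marked __attribute__((weak)) is silently replaced at link time by any
ordinary (strong) definition of the same symbol, with no #ifdef needed.
A minimal userspace sketch of the pattern, with purely illustrative
names and file layout:

/* generic.c -- default implementation, marked weak so an arch can
 * replace it, just as mm/hugetlb.c does for alloc_bootmem_huge_page() */
#include <stdio.h>

__attribute__((weak)) int boot_alloc(void)
{
        printf("generic boot allocator\n");
        return 0;
}

/* arch.c -- a strong definition; when this file is linked in, the
 * linker picks it over the weak one above */
#include <stdio.h>

int boot_alloc(void)
{
        printf("arch-specific boot allocator\n");
        return 1;
}

/* main.c -- build with: cc generic.c arch.c main.c */
int boot_alloc(void);

int main(void)
{
        return boot_alloc();    /* prints "arch-specific boot allocator" */
}

Linking all three files picks the strong arch definition; dropping
arch.c from the link falls back to the weak one, which is how the
powerpc override in the following patches slots in.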
* Re: [patch 18/23] hugetlb: allow arch-overridden hugepage allocation
2008-05-25 14:23 ` [patch 18/23] hugetlb: allow arch-overridden hugepage allocation npiggin
@ 2008-05-27 21:41 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:41 UTC (permalink / raw)
To: npiggin
Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara,
Jon Tollefson
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment
> (hugetlb-allow-arch-override-hugepage-allocation.patch)
> Allow alloc_bootmem_huge_page() to be overridden by architectures that can't
> always use bootmem. This requires huge_boot_pages to be available for
> use by this function. The 16G pages on ppc64 have to be reserved prior
> to boot-time. The locations of these pages are indicated in the device
> tree.
>
> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 19/23] powerpc: function to allocate gigantic hugepages
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (17 preceding siblings ...)
2008-05-25 14:23 ` [patch 18/23] hugetlb: allow arch-overridden hugepage allocation npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:44 ` Adam Litke
2008-05-25 14:23 ` [patch 20/23] powerpc: scan device tree for gigantic pages npiggin
` (4 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Jon Tollefson
[-- Attachment #1: powerpc-function-for-gigantic-hugepage-allocation.patch --]
[-- Type: text/plain, Size: 1805 bytes --]
The 16G page locations have been saved during early boot in an array.
The alloc_bootmem_huge_page() function adds a page from here to the
huge_boot_pages list.
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/powerpc/mm/hugetlbpage.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -29,6 +29,12 @@
#define NUM_LOW_AREAS (0x100000000UL >> SID_SHIFT)
#define NUM_HIGH_AREAS (PGTABLE_RANGE >> HTLB_AREA_SHIFT)
+#define MAX_NUMBER_GPAGES 1024
+
+/* Tracks the 16G pages after the device tree is scanned and before the
+ * huge_boot_pages list is ready. */
+static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
+static unsigned nr_gpages;
unsigned int hugepte_shift;
#define PTRS_PER_HUGEPTE (1 << hugepte_shift)
@@ -104,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm,
}
#endif
+/* Moves the gigantic page addresses from the temporary list to the
+ * huge_boot_pages list.
+ */
+int alloc_bootmem_huge_page(struct hstate *h)
+{
+ struct huge_bootmem_page *m;
+ if (nr_gpages == 0)
+ return 0;
+ m = phys_to_virt(gpage_freearray[--nr_gpages]);
+ gpage_freearray[nr_gpages] = 0;
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ return 1;
+}
+
+
/* Modelled after find_linux_pte() */
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
{
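Two details worth spelling out: gpage_freearray is treated as a simple
LIFO stack, and the struct huge_bootmem_page bookkeeping lives in the
first bytes of the 16G page itself (reached via phys_to_virt), because
no allocator is available that early. A rough userspace analogue of the
pop-and-embed idea, with illustrative names:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-ins for the kernel's structures. */
struct bootmem_page {
        struct bootmem_page *next;      /* plays struct list_head */
        int hstate_idx;                 /* plays m->hstate */
};

#define MAX_GPAGES 4
static void *gpage_free[MAX_GPAGES];    /* plays gpage_freearray */
static unsigned nr_gpages;
static struct bootmem_page *boot_pages; /* plays huge_boot_pages */

/* Pop the most recently saved region and embed the bookkeeping record
 * in the region's own memory, as alloc_bootmem_huge_page() does. */
static int pop_gpage(int hstate_idx)
{
        struct bootmem_page *m;

        if (nr_gpages == 0)
                return 0;
        m = gpage_free[--nr_gpages];
        gpage_free[nr_gpages] = NULL;
        m->hstate_idx = hstate_idx;
        m->next = boot_pages;
        boot_pages = m;
        return 1;
}

int main(void)
{
        gpage_free[nr_gpages++] = malloc(4096); /* pretend 16G region */
        if (pop_gpage(0))
                printf("queued region %p for hstate %d\n",
                       (void *)boot_pages, boot_pages->hstate_idx);
        free(boot_pages);
        return 0;
}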
* Re: [patch 19/23] powerpc: function to allocate gigantic hugepages
2008-05-25 14:23 ` [patch 19/23] powerpc: function to allocate gigantic hugepages npiggin
@ 2008-05-27 21:44 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:44 UTC (permalink / raw)
To: npiggin
Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara,
Jon Tollefson
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment
> (powerpc-function-for-gigantic-hugepage-allocation.patch)
> The 16G page locations have been saved during early boot in an array.
> The alloc_bootmem_huge_page() function adds a page from here to the
> huge_boot_pages list.
>
> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
> ---
>
> arch/powerpc/mm/hugetlbpage.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
> +++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
> @@ -29,6 +29,12 @@
>
> #define NUM_LOW_AREAS (0x100000000UL >> SID_SHIFT)
> #define NUM_HIGH_AREAS (PGTABLE_RANGE >> HTLB_AREA_SHIFT)
> +#define MAX_NUMBER_GPAGES 1024
> +
> +/* Tracks the 16G pages after the device tree is scanned and before the
> + * huge_boot_pages list is ready. */
Minor nit: This comment format looks a bit wacky.
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 20/23] powerpc: scan device tree for gigantic pages
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (18 preceding siblings ...)
2008-05-25 14:23 ` [patch 19/23] powerpc: function to allocate gigantic hugepages npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 21:47 ` Adam Litke
2008-05-25 14:23 ` [patch 21/23] powerpc: define support for 16G hugepages npiggin
` (3 subsequent siblings)
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Jon Tollefson
[-- Attachment #1: powerpc-scan-device-tree-and-save-gigantic-page-locations.patch --]
[-- Type: text/plain, Size: 4436 bytes --]
The 16G huge pages have to be reserved in the HMC prior to boot. The
locations of the pages are placed in the device tree. This patch adds
code to scan the device tree during very early boot and save these page
locations until hugetlbfs is ready for them.
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/powerpc/mm/hash_utils_64.c | 44 ++++++++++++++++++++++++++++++++++++++-
arch/powerpc/mm/hugetlbpage.c | 16 ++++++++++++++
include/asm-powerpc/mmu-hash64.h | 2 +
3 files changed, 61 insertions(+), 1 deletion(-)
Index: linux-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hash_utils_64.c
+++ linux-2.6/arch/powerpc/mm/hash_utils_64.c
@@ -68,6 +68,7 @@
#define KB (1024)
#define MB (1024*KB)
+#define GB (1024L*MB)
/*
* Note: pte --> Linux PTE
@@ -329,6 +330,44 @@ static int __init htab_dt_scan_page_size
return 0;
}
+/* Scan for 16G memory blocks that have been set aside for huge pages
+ * and reserve those blocks for 16G huge pages.
+ */
+static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
+ const char *uname, int depth,
+ void *data) {
+ char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+ unsigned long *addr_prop;
+ u32 *page_count_prop;
+ unsigned int expected_pages;
+ long unsigned int phys_addr;
+ long unsigned int block_size;
+
+ /* We are scanning "memory" nodes only */
+ if (type == NULL || strcmp(type, "memory") != 0)
+ return 0;
+
+ /* This property is the log base 2 of the number of virtual pages that
+ * will represent this memory block. */
+ page_count_prop = of_get_flat_dt_prop(node, "ibm,expected#pages", NULL);
+ if (page_count_prop == NULL)
+ return 0;
+ expected_pages = (1 << page_count_prop[0]);
+ addr_prop = of_get_flat_dt_prop(node, "reg", NULL);
+ if (addr_prop == NULL)
+ return 0;
+ phys_addr = addr_prop[0];
+ block_size = addr_prop[1];
+ if (block_size != (16 * GB))
+ return 0;
+ printk(KERN_INFO "Huge page(16GB) memory: "
+ "addr = 0x%lX size = 0x%lX pages = %d\n",
+ phys_addr, block_size, expected_pages);
+ lmb_reserve(phys_addr, block_size * expected_pages);
+ add_gpage(phys_addr, block_size, expected_pages);
+ return 0;
+}
+
static void __init htab_init_page_sizes(void)
{
int rc;
@@ -418,7 +457,10 @@ static void __init htab_init_page_sizes(
);
#ifdef CONFIG_HUGETLB_PAGE
- /* Init large page size. Currently, we pick 16M or 1M depending
+ /* Reserve 16G huge page memory sections for huge pages */
+ of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+
+/* Init large page size. Currently, we pick 16M or 1M depending
* on what is available
*/
if (mmu_psize_defs[MMU_PAGE_16M].shift)
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -110,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm,
}
#endif
+/* Build list of addresses of gigantic pages. This function is used in early
+ * boot before the buddy or bootmem allocator is setup.
+ */
+void add_gpage(unsigned long addr, unsigned long page_size,
+ unsigned long number_of_pages)
+{
+ if (!addr)
+ return;
+ while (number_of_pages > 0) {
+ gpage_freearray[nr_gpages] = addr;
+ nr_gpages++;
+ number_of_pages--;
+ addr += page_size;
+ }
+}
+
/* Moves the gigantic page addresses from the temporary list to the
* huge_boot_pages list.
*/
Index: linux-2.6/include/asm-powerpc/mmu-hash64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/mmu-hash64.h
+++ linux-2.6/include/asm-powerpc/mmu-hash64.h
@@ -280,6 +280,8 @@ extern int htab_bolt_mapping(unsigned lo
unsigned long pstart, unsigned long mode,
int psize, int ssize);
extern void set_huge_psize(int psize);
+extern void add_gpage(unsigned long addr, unsigned long page_size,
+ unsigned long number_of_pages);
extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);
extern void htab_initialize(void);
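One subtlety in the scanner above: the "ibm,expected#pages" property
holds the log base 2 of the page count, so expected_pages = 1 << prop,
and the reservation covers block_size * expected_pages bytes. A
standalone sketch of the arithmetic (the property values here are made
up; assumes a 64-bit long, as on ppc64):

#include <stdio.h>

int main(void)
{
        unsigned long GB = 1024UL * 1024 * 1024;
        unsigned int log2_pages = 2;            /* "ibm,expected#pages" */
        unsigned long block_size = 16 * GB;     /* from the "reg" property */
        unsigned int expected_pages = 1u << log2_pages;

        /* mirrors lmb_reserve(phys_addr, block_size * expected_pages) */
        unsigned long reserve = block_size * expected_pages;

        printf("%u x 16G pages -> reserving 0x%lx bytes\n",
               expected_pages, reserve);
        return 0;
}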
* Re: [patch 20/23] powerpc: scan device tree for gigantic pages
2008-05-25 14:23 ` [patch 20/23] powerpc: scan device tree for gigantic pages npiggin
@ 2008-05-27 21:47 ` Adam Litke
0 siblings, 0 replies; 88+ messages in thread
From: Adam Litke @ 2008-05-27 21:47 UTC (permalink / raw)
To: npiggin
Cc: linux-mm, kniht, andi-suse, nacc, abh, joachim.deguara,
Jon Tollefson
On Mon, 2008-05-26 at 00:23 +1000, npiggin@suse.de wrote:
> plain text document attachment
> (powerpc-scan-device-tree-and-save-gigantic-page-locations.patch)
> The 16G huge pages have to be reserved in the HMC prior to boot. The
> locations of the pages are placed in the device tree. This patch adds
> code to scan the device tree during very early boot and save these page
> locations until hugetlbfs is ready for them.
>
> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
I am not really qualified to pass judgment on the device tree-specific
parts of this patch, but as for the rest:
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [patch 21/23] powerpc: define support for 16G hugepages
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (19 preceding siblings ...)
2008-05-25 14:23 ` [patch 20/23] powerpc: scan device tree for gigantic pages npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-25 14:23 ` [patch 22/23] fs: check for statfs overflow npiggin
` (2 subsequent siblings)
23 siblings, 0 replies; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Jon Tollefson
[-- Attachment #1: powerpc-define-page-support-for-16g-hugepages.patch --]
[-- Type: text/plain, Size: 5492 bytes --]
The huge page size is defined for 16G pages. If a hugepagesz of 16G is
specified at boot-time then it becomes the huge page size instead of
the default 16M.
The change in pgtable-64k.h is to the macro
pte_iterate_hashed_subpages: it makes the 1 that is shifted to form
the increment to va a long, so that it is not shifted to 0. Otherwise
the loop would never terminate when the shift value is that of a 16G
page (with a 64K base page size).
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/powerpc/mm/hugetlbpage.c | 62 ++++++++++++++++++++++++++------------
include/asm-powerpc/pgtable-64k.h | 2 -
2 files changed, 45 insertions(+), 19 deletions(-)
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -24,8 +24,9 @@
#include <asm/cputable.h>
#include <asm/spu.h>
-#define HPAGE_SHIFT_64K 16
-#define HPAGE_SHIFT_16M 24
+#define PAGE_SHIFT_64K 16
+#define PAGE_SHIFT_16M 24
+#define PAGE_SHIFT_16G 34
#define NUM_LOW_AREAS (0x100000000UL >> SID_SHIFT)
#define NUM_HIGH_AREAS (PGTABLE_RANGE >> HTLB_AREA_SHIFT)
@@ -95,7 +96,7 @@ static int __hugepte_alloc(struct mm_str
static inline
pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
{
- if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+ if (HPAGE_SHIFT == PAGE_SHIFT_64K)
return pmd_offset(pud, addr);
else
return (pmd_t *) pud;
@@ -103,7 +104,7 @@ pmd_t *hpmd_offset(pud_t *pud, unsigned
static inline
pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
{
- if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+ if (HPAGE_SHIFT == PAGE_SHIFT_64K)
return pmd_alloc(mm, pud, addr);
else
return (pmd_t *) pud;
@@ -260,7 +261,7 @@ static void hugetlb_free_pud_range(struc
continue;
hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
#else
- if (HPAGE_SHIFT == HPAGE_SHIFT_64K) {
+ if (HPAGE_SHIFT == PAGE_SHIFT_64K) {
if (pud_none_or_clear_bad(pud))
continue;
hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
@@ -591,20 +592,40 @@ void set_huge_psize(int psize)
{
/* Check that it is a page size supported by the hardware and
* that it fits within pagetable limits. */
- if (mmu_psize_defs[psize].shift && mmu_psize_defs[psize].shift < SID_SHIFT &&
+ if (mmu_psize_defs[psize].shift &&
+ mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
- mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K)) {
+ mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
+ mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
+ /* Return if huge page size is the same as the
+ * base page size. */
+ if (mmu_psize_defs[psize].shift == PAGE_SHIFT)
+ return;
+
HPAGE_SHIFT = mmu_psize_defs[psize].shift;
mmu_huge_psize = psize;
-#ifdef CONFIG_PPC_64K_PAGES
- hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
-#else
- if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
- hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
- else
- hugepte_shift = (PUD_SHIFT-HPAGE_SHIFT);
-#endif
+ switch (HPAGE_SHIFT) {
+ case PAGE_SHIFT_64K:
+ /* We only allow 64k hpages with 4k base page,
+ * which was checked above, and always put them
+ * at the PMD */
+ hugepte_shift = PMD_SHIFT;
+ break;
+ case PAGE_SHIFT_16M:
+ /* 16M pages can be at two different levels
+ * of pagestables based on base page size */
+ if (PAGE_SHIFT == PAGE_SHIFT_64K)
+ hugepte_shift = PMD_SHIFT;
+ else /* 4k base page */
+ hugepte_shift = PUD_SHIFT;
+ break;
+ case PAGE_SHIFT_16G:
+ /* 16G pages are always at PGD level */
+ hugepte_shift = PGDIR_SHIFT;
+ break;
+ }
+ hugepte_shift -= HPAGE_SHIFT;
} else
HPAGE_SHIFT = 0;
}
@@ -620,17 +641,22 @@ static int __init hugepage_setup_sz(char
shift = __ffs(size);
switch (shift) {
#ifndef CONFIG_PPC_64K_PAGES
- case HPAGE_SHIFT_64K:
+ case PAGE_SHIFT_64K:
mmu_psize = MMU_PAGE_64K;
break;
#endif
- case HPAGE_SHIFT_16M:
+ case PAGE_SHIFT_16M:
mmu_psize = MMU_PAGE_16M;
break;
+ case PAGE_SHIFT_16G:
+ mmu_psize = MMU_PAGE_16G;
+ break;
}
- if (mmu_psize >=0 && mmu_psize_defs[mmu_psize].shift)
+ if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift) {
set_huge_psize(mmu_psize);
+ hugetlb_add_hstate(shift - PAGE_SHIFT);
+ }
else
printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
Index: linux-2.6/include/asm-powerpc/pgtable-64k.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable-64k.h
+++ linux-2.6/include/asm-powerpc/pgtable-64k.h
@@ -125,7 +125,7 @@ static inline struct subpage_prot_table
unsigned __split = (psize == MMU_PAGE_4K || \
psize == MMU_PAGE_64K_AP); \
shift = mmu_psize_defs[psize].shift; \
- for (index = 0; va < __end; index++, va += (1 << shift)) { \
+ for (index = 0; va < __end; index++, va += (1L << shift)) { \
if (!__split || __rpte_sub_valid(rpte, index)) do { \
#define pte_iterate_hashed_end() } while(0); } } while(0)
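The one-character 1 -> 1L fix at the end deserves illustration: the
constant 1 has type int, and shifting a 32-bit int by 34 (the 16G page
shift) is undefined behaviour; on powerpc the result is 0, so va would
never advance. A minimal demonstration (assumes a 64-bit long, as on
ppc64; the int result varies by architecture):

#include <stdio.h>

int main(void)
{
        unsigned int shift = 34;        /* PAGE_SHIFT_16G */

        /* 1 is an int: a 32-bit shift by 34 is undefined behaviour.
         * On powerpc it yields 0, which is why the loop never ended. */
        long bad = 1 << shift;

        /* 1L is a long, 64 bits on ppc64, so the shift is well defined. */
        long good = 1L << shift;

        printf("1  << 34 = %ld (broken increment)\n", bad);
        printf("1L << 34 = %ld\n", good);
        return 0;
}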
* [patch 22/23] fs: check for statfs overflow
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (20 preceding siblings ...)
2008-05-25 14:23 ` [patch 21/23] powerpc: define support for 16G hugepages npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 17:14 ` Nishanth Aravamudan
2008-05-25 14:23 ` [patch 23/23] powerpc: support multiple hugepage sizes npiggin
2008-05-25 14:42 ` [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc Nick Piggin
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Jon Tollefson
[-- Attachment #1: fs-check-for-statfs-overflow.patch --]
[-- Type: text/plain, Size: 1340 bytes --]
Adds a check for an overflow in the filesystem size so that if someone
checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
report back EOVERFLOW instead of a size of 0.
Are there other places that need a similar check? I had tried a similar
check in put_compat_statfs64 too but it didn't seem to generate an
EOVERFLOW in my test case.
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/compat.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/fs/compat.c
===================================================================
--- linux-2.6.orig/fs/compat.c
+++ linux-2.6/fs/compat.c
@@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
{
if (sizeof ubuf->f_blocks == 4) {
- if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
- 0xffffffff00000000ULL)
+ if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
+ kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
return -EOVERFLOW;
/* f_files and f_ffree may be -1; it's okay
* to stuff that into 32 bits */
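From userspace the behaviour change looks like this; a minimal sketch
of a 32-bit caller (the mount point is illustrative):

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/vfs.h>

int main(void)
{
        struct statfs sb;

        /* Before the patch, a 32-bit statfs() on a 16G hugetlbfs mount
         * reported f_blocks == 0; with the patch the call fails with
         * EOVERFLOW instead. */
        if (statfs("/mnt/huge16g", &sb) < 0) {
                if (errno == EOVERFLOW)
                        fprintf(stderr, "sizes don't fit in 32 bits; "
                                "try statfs64()\n");
                else
                        fprintf(stderr, "statfs: %s\n", strerror(errno));
                return 1;
        }
        printf("blocks=%lu bsize=%ld\n",
               (unsigned long)sb.f_blocks, (long)sb.f_bsize);
        return 0;
}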
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-25 14:23 ` [patch 22/23] fs: check for statfs overflow npiggin
@ 2008-05-27 17:14 ` Nishanth Aravamudan
2008-05-27 17:19 ` Jon Tollefson
0 siblings, 1 reply; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 17:14 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Jon Tollefson
On 26.05.2008 [00:23:39 +1000], npiggin@suse.de wrote:
> Adds a check for an overflow in the filesystem size so that if someone
> checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
> report back EOVERFLOW instead of a size of 0.
>
> Are there other places that need a similar check? I had tried a similar
> check in put_compat_statfs64 too but it didn't seem to generate an
> EOVERFLOW in my test case.
I think this part of the changelog was meant to be a post-"---"
question, which I don't have an answer for, but probably shouldn't go in
the final changelog?
> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-27 17:14 ` Nishanth Aravamudan
@ 2008-05-27 17:19 ` Jon Tollefson
2008-05-28 9:02 ` Nick Piggin
0 siblings, 1 reply; 88+ messages in thread
From: Jon Tollefson @ 2008-05-27 17:19 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: npiggin, linux-mm, andi, agl, abh, joachim.deguara, Jon Tollefson
Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:39 +1000], npiggin@suse.de wrote:
>
>> Adds a check for an overflow in the filesystem size so that if someone
>> checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
>> report back EOVERFLOW instead of a size of 0.
>>
>> Are there other places that need a similar check? I had tried a similar
>> check in put_compat_statfs64 too but it didn't seem to generate an
>> EOVERFLOW in my test case.
>>
>
> I think this part of the changelog was meant to be a post-"---"
> question, which I don't have an answer for, but probably shouldn't go in
> the final changelog?
>
You are correct.
>
>> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
>> Signed-off-by: Nick Piggin <npiggin@suse.de>
>>
>
> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
>
> Thanks,
> Nish
>
>
Jon
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-27 17:19 ` Jon Tollefson
@ 2008-05-28 9:02 ` Nick Piggin
2008-05-29 23:56 ` Andreas Dilger
0 siblings, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 9:02 UTC (permalink / raw)
To: Jon Tollefson
Cc: Nishanth Aravamudan, linux-mm, andi, agl, abh, joachim.deguara,
linux-fsdevel
On Tue, May 27, 2008 at 12:19:53PM -0500, Jon Tollefson wrote:
> Nishanth Aravamudan wrote:
> > On 26.05.2008 [00:23:39 +1000], npiggin@suse.de wrote:
> >
> >> Adds a check for an overflow in the filesystem size so that if someone
> >> checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
> >> report back EOVERFLOW instead of a size of 0.
> >>
> >> Are there other places that need a similar check? I had tried a similar
> >> check in put_compat_statfs64 too but it didn't seem to generate an
> >> EOVERFLOW in my test case.
> >>
> >
> > I think this part of the changelog was meant to be a post-"---"
> > question, which I don't have an answer for, but probably shouldn't go in
> > the final changelog?
> >
> You are correct.
I think the question is OK for the changelog. Unless we can get
somebody answering it yes or no, I'll leave it (but I'd rather get
an answer first).
I'm pretty unfamiliar with how the APIs work, but I'd think statfs64
is less likely to overflow because f_blocks is likely to be 8 bytes.
But I still think the check might be good to have.
The non-compat statfs() (and statfs64 even) might also need the EOVERFLOW
check. cc'ing fsdevel with the patch attached again.
---
fs: check for statfs overflow
Adds a check for an overflow in the filesystem size so that if someone
checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
report back EOVERFLOW instead of a size of 0.
Are there other places that need a similar check? I had tried a similar
check in put_compat_statfs64 too but it didn't seem to generate an
EOVERFLOW in my test case.
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/compat.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/fs/compat.c
===================================================================
--- linux-2.6.orig/fs/compat.c
+++ linux-2.6/fs/compat.c
@@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
{
if (sizeof ubuf->f_blocks == 4) {
- if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
- 0xffffffff00000000ULL)
+ if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
+ kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
return -EOVERFLOW;
/* f_files and f_ffree may be -1; it's okay
* to stuff that into 32 bits */
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-28 9:02 ` Nick Piggin
@ 2008-05-29 23:56 ` Andreas Dilger
2008-05-30 0:12 ` Nishanth Aravamudan
2008-05-30 1:14 ` Nick Piggin
0 siblings, 2 replies; 88+ messages in thread
From: Andreas Dilger @ 2008-05-29 23:56 UTC (permalink / raw)
To: Nick Piggin
Cc: Jon Tollefson, Nishanth Aravamudan, linux-mm, andi, agl, abh,
joachim.deguara, linux-fsdevel
On May 28, 2008 11:02 +0200, Nick Piggin wrote:
> fs: check for statfs overflow
>
> Adds a check for an overflow in the filesystem size so that if someone
> checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
> report back EOVERFLOW instead of a size of 0.
>
> Are there other places that need a similar check? I had tried a similar
> check in put_compat_statfs64 too but it didn't seem to generate an
> EOVERFLOW in my test case.
>
> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>
> fs/compat.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>
> Index: linux-2.6/fs/compat.c
> ===================================================================
> --- linux-2.6.orig/fs/compat.c
> +++ linux-2.6/fs/compat.c
> @@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
> {
>
> if (sizeof ubuf->f_blocks == 4) {
> - if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
> - 0xffffffff00000000ULL)
> + if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
> + kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
> return -EOVERFLOW;
Hmm, doesn't this check break every filesystem > 16TB on 4kB PAGE_SIZE
nodes? It would be better, IMHO, to scale down f_blocks, f_bfree, and
f_bavail and correspondingly scale up f_bsize to fit into the 32-bit
statfs structure.
We did this for several years with Lustre, as the first installation was
already larger than 16TB on 32-bit clients at the time. There was never
a problem with statfs returning a larger f_bsize, since applications
generally use the fstat() st_blksize to determine IO size and not the
statfs() data.
Returning statfs data accurate to within a few kB is better than failing
the request outright, IMHO.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
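Nish asks below for the idea in patch form; as a rough, untested sketch
of the scaling Andreas describes (field names follow struct kstatfs,
but this is illustration, not a proposed patch), halve the counts and
double f_bsize until everything fits in 32 bits:

#include <stdio.h>

struct kstatfs_sketch {
        unsigned long long f_blocks, f_bfree, f_bavail;
        long f_bsize;
};

/* Scale counts down and the block size up until everything fits in
 * the 32-bit statfs fields; total bytes stay (almost) the same. */
static void scale_statfs(struct kstatfs_sketch *k)
{
        while ((k->f_blocks | k->f_bfree | k->f_bavail)
                        & 0xffffffff00000000ULL) {
                k->f_blocks >>= 1;
                k->f_bfree  >>= 1;
                k->f_bavail >>= 1;
                k->f_bsize  <<= 1;
        }
}

int main(void)
{
        /* a 20TB filesystem with 4kB blocks: f_blocks exceeds 2^32 */
        struct kstatfs_sketch k = {
                .f_blocks = 5368709120ULL,
                .f_bfree  = 4000000000ULL,
                .f_bavail = 3900000000ULL,
                .f_bsize  = 4096,
        };

        scale_statfs(&k);
        printf("f_blocks=%llu f_bsize=%ld\n", k.f_blocks, k.f_bsize);
        return 0;
}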
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-29 23:56 ` Andreas Dilger
@ 2008-05-30 0:12 ` Nishanth Aravamudan
2008-05-30 1:14 ` Nick Piggin
1 sibling, 0 replies; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-30 0:12 UTC (permalink / raw)
To: Andreas Dilger
Cc: Nick Piggin, Jon Tollefson, linux-mm, andi, agl, abh,
joachim.deguara, linux-fsdevel
On 29.05.2008 [17:56:07 -0600], Andreas Dilger wrote:
> On May 28, 2008 11:02 +0200, Nick Piggin wrote:
> > fs: check for statfs overflow
> >
> > Adds a check for an overflow in the filesystem size so that if someone
> > checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
> > report back EOVERFLOW instead of a size of 0.
> >
> > Are there other places that need a similar check? I had tried a similar
> > check in put_compat_statfs64 too but it didn't seem to generate an
> > EOVERFLOW in my test case.
> >
> > Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> >
> > fs/compat.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> >
> > Index: linux-2.6/fs/compat.c
> > ===================================================================
> > --- linux-2.6.orig/fs/compat.c
> > +++ linux-2.6/fs/compat.c
> > @@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
> > {
> >
> > if (sizeof ubuf->f_blocks == 4) {
> > - if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
> > - 0xffffffff00000000ULL)
> > + if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
> > + kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
> > return -EOVERFLOW;
>
> Hmm, doesn't this check break every filesystem > 16TB on 4kB PAGE_SIZE
> nodes? It would be better, IMHO, to scale down f_blocks, f_bfree, and
> f_bavail and correspondingly scale up f_bsize to fit into the 32-bit
> statfs structure.
Being a FS newbie, I'm not entirely sure I follow; could you say that
again in patch-form? :) Seriously, it might make it clear to me.
> We did this for several years with Lustre, as the first installation
> was already larger than 16TB on 32-bit clients at the time. There was
> never a problem with statfs returning a larger f_bsize, since
> applications generally use the fstat() st_blocksize to determine IO
> size and not the statfs() data.
I'm not sure that's a good reason to give bad data back to userspace...
We have both interfaces and both should work?
> Returning statfs data accurate to within a few kB is better than
> failing the request outright, IMHO.
Well, currently (iirc), we see statfs() give bad values for 16gb
hugetlbfs mountpoints. That's not good, and is inconsistent with the
other hugetlbfs mountpoints. We actually do want to indicate EOVERFLOW
there to the 32-bit binary, or some kind of error, although the binary
will notice it can't use the pages from that mountpoint when mmap()
fails :)
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-29 23:56 ` Andreas Dilger
2008-05-30 0:12 ` Nishanth Aravamudan
@ 2008-05-30 1:14 ` Nick Piggin
2008-06-02 3:16 ` Andreas Dilger
1 sibling, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2008-05-30 1:14 UTC (permalink / raw)
To: Andreas Dilger
Cc: Jon Tollefson, Nishanth Aravamudan, linux-mm, andi, agl, abh,
joachim.deguara, linux-fsdevel
On Thu, May 29, 2008 at 05:56:07PM -0600, Andreas Dilger wrote:
> On May 28, 2008 11:02 +0200, Nick Piggin wrote:
> > fs: check for statfs overflow
> >
> > Adds a check for an overflow in the filesystem size so that if someone
> > checks with statfs() on a 16G hugetlbfs in a 32-bit binary, it will
> > report back EOVERFLOW instead of a size of 0.
> >
> > Are there other places that need a similar check? I had tried a similar
> > check in put_compat_statfs64 too but it didn't seem to generate an
> > EOVERFLOW in my test case.
> >
> > Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> >
> > fs/compat.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> >
> > Index: linux-2.6/fs/compat.c
> > ===================================================================
> > --- linux-2.6.orig/fs/compat.c
> > +++ linux-2.6/fs/compat.c
> > @@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
> > {
> >
> > if (sizeof ubuf->f_blocks == 4) {
> > - if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
> > - 0xffffffff00000000ULL)
> > + if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
> > + kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
> > return -EOVERFLOW;
>
> Hmm, doesn't this check break every filesystem > 16TB on 4kB PAGE_SIZE
> nodes? It would be better, IMHO, to scale down f_blocks, f_bfree, and
> f_bavail and correspondingly scale up f_bsize to fit into the 32-bit
> statfs structure.
Oh? Hmm, from my reading, such filesystems will already overflow the f_blocks
check which is already there. Jon's patch only adds checks for f_bsize
and f_frsize.
One thing I'm a little worried about is the _exact_ semantics required
of the syscall wrt overflow, and type sizes. In the man page here for
example, ubuf->f_blocks is a different type to f_bsize and f_frsize...
Thanks,
Nick
> We did this for several years with Lustre, as the first installation was
> already larger than 16TB on 32-bit clients at the time. There was never
> a problem with statfs returning a larger f_bsize, since applications
> generally use the fstat() st_blocksize to determine IO size and not the
> statfs() data.
>
> Returning statfs data accurate to within a few kB is better than failing
> the request outright, IMHO.
>
> Cheers, Andreas
* Re: [patch 22/23] fs: check for statfs overflow
2008-05-30 1:14 ` Nick Piggin
@ 2008-06-02 3:16 ` Andreas Dilger
2008-06-03 3:27 ` Nick Piggin
0 siblings, 1 reply; 88+ messages in thread
From: Andreas Dilger @ 2008-06-02 3:16 UTC (permalink / raw)
To: Nick Piggin
Cc: Jon Tollefson, Nishanth Aravamudan, linux-mm, andi, agl, abh,
joachim.deguara, linux-fsdevel
On May 30, 2008 03:14 +0200, Nick Piggin wrote:
> On Thu, May 29, 2008 at 05:56:07PM -0600, Andreas Dilger wrote:
> > On May 28, 2008 11:02 +0200, Nick Piggin wrote:
> > > @@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
> > > if (sizeof ubuf->f_blocks == 4) {
> > > + if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
> > > + kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
> > > return -EOVERFLOW;
> >
> > Hmm, doesn't this check break every filesystem > 16TB on 4kB PAGE_SIZE
> > nodes? It would be better, IMHO, to scale down f_blocks, f_bfree, and
> > f_bavail and correspondingly scale up f_bsize to fit into the 32-bit
> > statfs structure.
>
> Oh? Hmm, from my reading, such filesystems will already overflow the f_blocks
> check which is already there. Jon's patch only adds checks for f_bsize
> and f_frsize.
Sorry, you are right - I meant that the whole f_blocks check is broken
for filesystems > 16TB. Scaling f_bsize is easy, and prevents gratuitous
breakage of old applications for a few kB of accuracy.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [patch 22/23] fs: check for statfs overflow
2008-06-02 3:16 ` Andreas Dilger
@ 2008-06-03 3:27 ` Nick Piggin
2008-06-03 17:17 ` Andreas Dilger
0 siblings, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2008-06-03 3:27 UTC (permalink / raw)
To: Andreas Dilger
Cc: Jon Tollefson, Nishanth Aravamudan, linux-mm, andi, agl, abh,
joachim.deguara, linux-fsdevel
On Sun, Jun 01, 2008 at 09:16:02PM -0600, Andreas Dilger wrote:
> On May 30, 2008 03:14 +0200, Nick Piggin wrote:
> > On Thu, May 29, 2008 at 05:56:07PM -0600, Andreas Dilger wrote:
> > > On May 28, 2008 11:02 +0200, Nick Piggin wrote:
> > > > @@ -197,8 +197,8 @@ static int put_compat_statfs(struct comp
> > > > if (sizeof ubuf->f_blocks == 4) {
> > > > + if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
> > > > + kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
> > > > return -EOVERFLOW;
> > >
> > > Hmm, doesn't this check break every filesystem > 16TB on 4kB PAGE_SIZE
> > > nodes? It would be better, IMHO, to scale down f_blocks, f_bfree, and
> > > f_bavail and correspondingly scale up f_bsize to fit into the 32-bit
> > > statfs structure.
> >
> > Oh? Hmm, from my reading, such filesystems will already overflow the f_blocks
> > check which is already there. Jon's patch only adds checks for f_bsize
> > and f_frsize.
>
> Sorry, you are right - I meant that the whole f_blocks check is broken
> for filesystems > 16TB. Scaling f_bsize is easy, and prevents gratuitous
> breakage of old applications for a few kB of accuracy.
Oh... hmm OK but they do have statfs64 I guess, although maybe they aren't
coded for it.
Anyway, point is noted, but I'm not the person (nor is this the patchset)
to make such changes.
Do you agree that if we have these checks in compat_statfs, then we
should put the same ones in the non-compat as well as the 64 bit
versions?
Thanks,
Nick
* Re: [patch 22/23] fs: check for statfs overflow
2008-06-03 3:27 ` Nick Piggin
@ 2008-06-03 17:17 ` Andreas Dilger
0 siblings, 0 replies; 88+ messages in thread
From: Andreas Dilger @ 2008-06-03 17:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Jon Tollefson, Nishanth Aravamudan, linux-mm, andi, agl, abh,
joachim.deguara, linux-fsdevel
On Jun 03, 2008 05:27 +0200, Nick Piggin wrote:
> On Sun, Jun 01, 2008 at 09:16:02PM -0600, Andreas Dilger wrote:
> > On May 30, 2008 03:14 +0200, Nick Piggin wrote:
> > > Oh? Hmm, from my reading, such filesystems will already overflow the f_blocks
> > > check which is already there. Jon's patch only adds checks for f_bsize
> > > and f_frsize.
> >
> > Sorry, you are right - I meant that the whole f_blocks check is broken
> > for filesystems > 16TB. Scaling f_bsize is easy, and prevents gratuitous
> > breakage of old applications for a few kB of accuracy.
>
> Oh... hmm OK but they do have statfs64 I guess, although maybe they aren't
> coded for it.
Right - we had this problem with all of the tools on some older distros
being compiled against the old statfs syscall and we had to put the statfs
scaling inside Lustre to avoid the 16TB overflow.
The problem with the current kernel VFS interface is that the filesystem
doesn't know whether the 32-bit or 64-bit statfs interface is being called,
and rather than returning an error to an application we'd prefer to return
scaled statfs results (with some small amount of rounding error). Even
for 20PB filesystems (the largest planned for this year) the free/used/avail
space would only be rounded to 4MB sizes, which isn't so bad.
> Anyway, point is noted, but I'm not the person (nor is this the patchset)
> to make such changes.
Right...
> Do you agree that if we have these checks in compat_statfs, then we
> should put the same ones in the non-compat as well as the 64 bit
> versions?
If it only affects hugetlbfs then I'm not too concerned.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [patch 23/23] powerpc: support multiple hugepage sizes
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (21 preceding siblings ...)
2008-05-25 14:23 ` [patch 22/23] fs: check for statfs overflow npiggin
@ 2008-05-25 14:23 ` npiggin
2008-05-27 17:14 ` Nishanth Aravamudan
2008-05-25 14:42 ` [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc Nick Piggin
23 siblings, 1 reply; 88+ messages in thread
From: npiggin @ 2008-05-25 14:23 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara, Jon Tollefson
[-- Attachment #1: powerpc-support-multiple-hugepage-sizes.patch --]
[-- Type: text/plain, Size: 24745 bytes --]
Instead of using the variable mmu_huge_psize to keep track of the huge
page size, we use an array of MMU_PAGE_* values. For each supported
huge page size we need to know the hugepte_shift value and have a
pgtable_cache. The hstate or an mmu_huge_psizes index is passed to
functions so that they know which huge page size they should use.
The hugepage sizes 16M and 64K are set up (if available on the
hardware) so that they don't have to be set on the boot command line
in order to use them. The number of 16G pages has to be specified at
boot-time though (e.g. hugepagesz=16G hugepages=5).
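The central data-structure change, sketched before the diff: the single
global mmu_huge_psize becomes the table mmu_huge_psizes[], where a
non-zero entry both marks an MMU_PAGE_* size as enabled and records its
hugepte_shift. A toy version of the lookup pattern (the indices and
shift values here are illustrative):

#include <stdio.h>

enum { PSIZE_4K, PSIZE_64K, PSIZE_16M, PSIZE_16G, PSIZE_COUNT };

/* non-zero entry == size enabled; the value is its hugepte_shift */
static unsigned int huge_psizes[PSIZE_COUNT];

static int psize_enabled(int psize)
{
        return huge_psizes[psize] != 0;
}

int main(void)
{
        int p;

        huge_psizes[PSIZE_16M] = 12;    /* illustrative shift values */
        huge_psizes[PSIZE_16G] = 4;

        for (p = 0; p < PSIZE_COUNT; p++)
                printf("psize %d: %s\n", p,
                       psize_enabled(p) ? "huge page size" : "disabled");
        return 0;
}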
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/powerpc/mm/hash_utils_64.c | 9 -
arch/powerpc/mm/hugetlbpage.c | 272 +++++++++++++++++++++++++--------------
arch/powerpc/mm/init_64.c | 8 -
arch/powerpc/mm/tlb_64.c | 2
include/asm-powerpc/hugetlb.h | 5
include/asm-powerpc/mmu-hash64.h | 4
include/asm-powerpc/page_64.h | 1
include/asm-powerpc/pgalloc-64.h | 4
8 files changed, 192 insertions(+), 113 deletions(-)
Index: linux-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hash_utils_64.c
+++ linux-2.6/arch/powerpc/mm/hash_utils_64.c
@@ -103,7 +103,6 @@ int mmu_kernel_ssize = MMU_SEGSIZE_256M;
int mmu_highuser_ssize = MMU_SEGSIZE_256M;
u16 mmu_slb_size = 64;
#ifdef CONFIG_HUGETLB_PAGE
-int mmu_huge_psize = MMU_PAGE_16M;
unsigned int HPAGE_SHIFT;
#endif
#ifdef CONFIG_PPC_64K_PAGES
@@ -460,15 +459,15 @@ static void __init htab_init_page_sizes(
/* Reserve 16G huge page memory sections for huge pages */
of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
-/* Init large page size. Currently, we pick 16M or 1M depending
+/* Set default large page size. Currently, we pick 16M or 1M depending
* on what is available
*/
if (mmu_psize_defs[MMU_PAGE_16M].shift)
- set_huge_psize(MMU_PAGE_16M);
+ HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
/* With 4k/4level pagetables, we can't (for now) cope with a
* huge page size < PMD_SIZE */
else if (mmu_psize_defs[MMU_PAGE_1M].shift)
- set_huge_psize(MMU_PAGE_1M);
+ HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
#endif /* CONFIG_HUGETLB_PAGE */
}
@@ -873,7 +872,7 @@ int hash_page(unsigned long ea, unsigned
#ifdef CONFIG_HUGETLB_PAGE
/* Handle hugepage regions */
- if (HPAGE_SHIFT && psize == mmu_huge_psize) {
+ if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
DBG_LOW(" -> huge page !\n");
return hash_huge_page(mm, access, ea, vsid, local, trap);
}
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -37,15 +37,30 @@
static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
static unsigned nr_gpages;
-unsigned int hugepte_shift;
-#define PTRS_PER_HUGEPTE (1 << hugepte_shift)
-#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) << hugepte_shift)
-
-#define HUGEPD_SHIFT (HPAGE_SHIFT + hugepte_shift)
-#define HUGEPD_SIZE (1UL << HUGEPD_SHIFT)
-#define HUGEPD_MASK (~(HUGEPD_SIZE-1))
+/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
+ * stored for the huge page sizes that are valid.
+ */
+unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
-#define huge_pgtable_cache (pgtable_cache[HUGEPTE_CACHE_NUM])
+#define hugepte_shift mmu_huge_psizes
+#define PTRS_PER_HUGEPTE(psize) (1 << hugepte_shift[psize])
+#define HUGEPTE_TABLE_SIZE(psize) (sizeof(pte_t) << hugepte_shift[psize])
+
+#define HUGEPD_SHIFT(psize) (mmu_psize_to_shift(psize) \
+ + hugepte_shift[psize])
+#define HUGEPD_SIZE(psize) (1UL << HUGEPD_SHIFT(psize))
+#define HUGEPD_MASK(psize) (~(HUGEPD_SIZE(psize)-1))
+
+/* Subtract one from array size because we don't need a cache for 4K since
+ * it is not a huge page size */
+#define huge_pgtable_cache(psize) (pgtable_cache[HUGEPTE_CACHE_NUM \
+ + psize-1])
+#define HUGEPTE_CACHE_NAME(psize) (huge_pgtable_cache_name[psize])
+
+static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
+ "unused_4K", "hugepte_cache_64K", "unused_64K_AP",
+ "hugepte_cache_1M", "hugepte_cache_16M", "hugepte_cache_16G"
+};
/* Flag to mark huge PD pointers. This means pmd_bad() and pud_bad()
* will choke on pointers to hugepte tables, which is handy for
@@ -56,24 +71,49 @@ typedef struct { unsigned long pd; } hug
#define hugepd_none(hpd) ((hpd).pd == 0)
+static inline int shift_to_mmu_psize(unsigned int shift)
+{
+ switch (shift) {
+#ifndef CONFIG_PPC_64K_PAGES
+ case PAGE_SHIFT_64K:
+ return MMU_PAGE_64K;
+#endif
+ case PAGE_SHIFT_16M:
+ return MMU_PAGE_16M;
+ case PAGE_SHIFT_16G:
+ return MMU_PAGE_16G;
+ }
+ return -1;
+}
+
+static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
+{
+ if (mmu_psize_defs[mmu_psize].shift)
+ return mmu_psize_defs[mmu_psize].shift;
+ BUG();
+}
+
static inline pte_t *hugepd_page(hugepd_t hpd)
{
BUG_ON(!(hpd.pd & HUGEPD_OK));
return (pte_t *)(hpd.pd & ~HUGEPD_OK);
}
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr)
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
+ struct hstate *hstate)
{
- unsigned long idx = ((addr >> HPAGE_SHIFT) & (PTRS_PER_HUGEPTE-1));
+ unsigned int shift = huge_page_shift(hstate);
+ int psize = shift_to_mmu_psize(shift);
+ unsigned long idx = ((addr >> shift) & (PTRS_PER_HUGEPTE(psize)-1));
pte_t *dir = hugepd_page(*hpdp);
return dir + idx;
}
static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
- unsigned long address)
+ unsigned long address, unsigned int psize)
{
- pte_t *new = kmem_cache_alloc(huge_pgtable_cache,
+ pte_t *new = kmem_cache_alloc(huge_pgtable_cache(psize),
GFP_KERNEL|__GFP_REPEAT);
if (! new)
@@ -81,7 +121,7 @@ static int __hugepte_alloc(struct mm_str
spin_lock(&mm->page_table_lock);
if (!hugepd_none(*hpdp))
- kmem_cache_free(huge_pgtable_cache, new);
+ kmem_cache_free(huge_pgtable_cache(psize), new);
else
hpdp->pd = (unsigned long)new | HUGEPD_OK;
spin_unlock(&mm->page_table_lock);
@@ -90,21 +130,22 @@ static int __hugepte_alloc(struct mm_str
/* Base page size affects how we walk hugetlb page tables */
#ifdef CONFIG_PPC_64K_PAGES
-#define hpmd_offset(pud, addr) pmd_offset(pud, addr)
-#define hpmd_alloc(mm, pud, addr) pmd_alloc(mm, pud, addr)
+#define hpmd_offset(pud, addr, h) pmd_offset(pud, addr)
+#define hpmd_alloc(mm, pud, addr, h) pmd_alloc(mm, pud, addr)
#else
static inline
-pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
+pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
{
- if (HPAGE_SHIFT == PAGE_SHIFT_64K)
+ if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
return pmd_offset(pud, addr);
else
return (pmd_t *) pud;
}
static inline
-pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
+ struct hstate *hstate)
{
- if (HPAGE_SHIFT == PAGE_SHIFT_64K)
+ if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
return pmd_alloc(mm, pud, addr);
else
return (pmd_t *) pud;
@@ -130,7 +171,7 @@ void add_gpage(unsigned long addr, unsig
/* Moves the gigantic page addresses from the temporary list to the
* huge_boot_pages list.
*/
-int alloc_bootmem_huge_page(struct hstate *h)
+int alloc_bootmem_huge_page(struct hstate *hstate)
{
struct huge_bootmem_page *m;
if (nr_gpages == 0)
@@ -138,7 +179,7 @@ int alloc_bootmem_huge_page(struct hstat
m = phys_to_virt(gpage_freearray[--nr_gpages]);
gpage_freearray[nr_gpages] = 0;
list_add(&m->list, &huge_boot_pages);
- m->hstate = h;
+ m->hstate = hstate;
return 1;
}
@@ -150,17 +191,25 @@ pte_t *huge_pte_offset(struct mm_struct
pud_t *pu;
pmd_t *pm;
- BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
+ unsigned int psize;
+ unsigned int shift;
+ unsigned long sz;
+ struct hstate *hstate;
+ psize = get_slice_psize(mm, addr);
+ shift = mmu_psize_to_shift(psize);
+ sz = ((1UL) << shift);
+ hstate = size_to_hstate(sz);
- addr &= HPAGE_MASK;
+ addr &= hstate->mask;
pg = pgd_offset(mm, addr);
if (!pgd_none(*pg)) {
pu = pud_offset(pg, addr);
if (!pud_none(*pu)) {
- pm = hpmd_offset(pu, addr);
+ pm = hpmd_offset(pu, addr, hstate);
if (!pmd_none(*pm))
- return hugepte_offset((hugepd_t *)pm, addr);
+ return hugepte_offset((hugepd_t *)pm, addr,
+ hstate);
}
}
@@ -173,16 +222,20 @@ pte_t *huge_pte_alloc(struct mm_struct *
pud_t *pu;
pmd_t *pm;
hugepd_t *hpdp = NULL;
+ struct hstate *hstate;
+ unsigned int psize;
+ hstate = size_to_hstate(sz);
- BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
+ psize = get_slice_psize(mm, addr);
+ BUG_ON(!mmu_huge_psizes[psize]);
- addr &= HPAGE_MASK;
+ addr &= hstate->mask;
pg = pgd_offset(mm, addr);
pu = pud_alloc(mm, pg, addr);
if (pu) {
- pm = hpmd_alloc(mm, pu, addr);
+ pm = hpmd_alloc(mm, pu, addr, hstate);
if (pm)
hpdp = (hugepd_t *)pm;
}
@@ -190,10 +243,10 @@ pte_t *huge_pte_alloc(struct mm_struct *
if (! hpdp)
return NULL;
- if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr))
+ if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
return NULL;
- return hugepte_offset(hpdp, addr);
+ return hugepte_offset(hpdp, addr, hstate);
}
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
@@ -201,19 +254,22 @@ int huge_pmd_unshare(struct mm_struct *m
return 0;
}
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp)
+static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
+ unsigned int psize)
{
pte_t *hugepte = hugepd_page(*hpdp);
hpdp->pd = 0;
tlb->need_flush = 1;
- pgtable_free_tlb(tlb, pgtable_free_cache(hugepte, HUGEPTE_CACHE_NUM,
+ pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
+ HUGEPTE_CACHE_NUM+psize-1,
PGF_CACHENUM_MASK));
}
static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ unsigned int psize)
{
pmd_t *pmd;
unsigned long next;
@@ -225,7 +281,7 @@ static void hugetlb_free_pmd_range(struc
next = pmd_addr_end(addr, end);
if (pmd_none(*pmd))
continue;
- free_hugepte_range(tlb, (hugepd_t *)pmd);
+ free_hugepte_range(tlb, (hugepd_t *)pmd, psize);
} while (pmd++, addr = next, addr != end);
start &= PUD_MASK;
@@ -251,6 +307,9 @@ static void hugetlb_free_pud_range(struc
pud_t *pud;
unsigned long next;
unsigned long start;
+ unsigned int shift;
+ unsigned int psize = get_slice_psize(tlb->mm, addr);
+ shift = mmu_psize_to_shift(psize);
start = addr;
pud = pud_offset(pgd, addr);
@@ -259,16 +318,18 @@ static void hugetlb_free_pud_range(struc
#ifdef CONFIG_PPC_64K_PAGES
if (pud_none_or_clear_bad(pud))
continue;
- hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+ hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling,
+ psize);
#else
- if (HPAGE_SHIFT == PAGE_SHIFT_64K) {
+ if (shift == PAGE_SHIFT_64K) {
if (pud_none_or_clear_bad(pud))
continue;
- hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+ hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
+ ceiling, psize);
} else {
if (pud_none(*pud))
continue;
- free_hugepte_range(tlb, (hugepd_t *)pud);
+ free_hugepte_range(tlb, (hugepd_t *)pud, psize);
}
#endif
} while (pud++, addr = next, addr != end);
@@ -336,27 +397,29 @@ void hugetlb_free_pgd_range(struct mmu_g
* now has no other vmas using it, so can be freed, we don't
* bother to round floor or end up - the tests don't need that.
*/
+ unsigned int psize = get_slice_psize((*tlb)->mm, addr);
- addr &= HUGEPD_MASK;
+ addr &= HUGEPD_MASK(psize);
if (addr < floor) {
- addr += HUGEPD_SIZE;
+ addr += HUGEPD_SIZE(psize);
if (!addr)
return;
}
if (ceiling) {
- ceiling &= HUGEPD_MASK;
+ ceiling &= HUGEPD_MASK(psize);
if (!ceiling)
return;
}
if (end - 1 > ceiling - 1)
- end -= HUGEPD_SIZE;
+ end -= HUGEPD_SIZE(psize);
if (addr > end - 1)
return;
start = addr;
pgd = pgd_offset((*tlb)->mm, addr);
do {
- BUG_ON(get_slice_psize((*tlb)->mm, addr) != mmu_huge_psize);
+ psize = get_slice_psize((*tlb)->mm, addr);
+ BUG_ON(!mmu_huge_psizes[psize]);
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
@@ -373,7 +436,11 @@ void set_huge_pte_at(struct mm_struct *m
* necessary anymore if we make hpte_need_flush() get the
* page size from the slices
*/
- pte_update(mm, addr & HPAGE_MASK, ptep, ~0UL, 1);
+ unsigned int psize = get_slice_psize(mm, addr);
+ unsigned int shift = mmu_psize_to_shift(psize);
+ unsigned long sz = ((1UL) << shift);
+ struct hstate *hstate = size_to_hstate(sz);
+ pte_update(mm, addr & hstate->mask, ptep, ~0UL, 1);
}
*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
}
@@ -390,14 +457,19 @@ follow_huge_addr(struct mm_struct *mm, u
{
pte_t *ptep;
struct page *page;
+ unsigned int mmu_psize = get_slice_psize(mm, address);
- if (get_slice_psize(mm, address) != mmu_huge_psize)
+ /* Verify it is a huge page else bail. */
+ if (!mmu_huge_psizes[mmu_psize])
return ERR_PTR(-EINVAL);
ptep = huge_pte_offset(mm, address);
page = pte_page(*ptep);
- if (page)
- page += (address % HPAGE_SIZE) / PAGE_SIZE;
+ if (page) {
+ unsigned int shift = mmu_psize_to_shift(mmu_psize);
+ unsigned long sz = ((1UL) << shift);
+ page += (address % sz) / PAGE_SIZE;
+ }
return page;
}
@@ -425,15 +497,16 @@ unsigned long hugetlb_get_unmapped_area(
unsigned long len, unsigned long pgoff,
unsigned long flags)
{
- return slice_get_unmapped_area(addr, len, flags,
- mmu_huge_psize, 1, 0);
+ struct hstate *hstate = hstate_file(file);
+ int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
+ return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
}
/*
* Called by asm hashtable.S for doing lazy icache flush
*/
static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
- pte_t pte, int trap)
+ pte_t pte, int trap, unsigned long sz)
{
struct page *page;
int i;
@@ -446,7 +519,7 @@ static unsigned int hash_huge_page_do_la
/* page is dirty */
if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
if (trap == 0x400) {
- for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++)
+ for (i = 0; i < (sz / PAGE_SIZE); i++)
__flush_dcache_icache(page_address(page+i));
set_bit(PG_arch_1, &page->flags);
} else {
@@ -462,11 +535,16 @@ int hash_huge_page(struct mm_struct *mm,
{
pte_t *ptep;
unsigned long old_pte, new_pte;
- unsigned long va, rflags, pa;
+ unsigned long va, rflags, pa, sz;
long slot;
int err = 1;
int ssize = user_segment_size(ea);
+ unsigned int mmu_psize;
+ int shift;
+ mmu_psize = get_slice_psize(mm, ea);
+ if (!mmu_huge_psizes[mmu_psize])
+ goto out;
ptep = huge_pte_offset(mm, ea);
/* Search the Linux page table for a match with va */
@@ -510,30 +588,32 @@ int hash_huge_page(struct mm_struct *mm,
rflags = 0x2 | (!(new_pte & _PAGE_RW));
/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
+ shift = mmu_psize_to_shift(mmu_psize);
+ sz = ((1UL) << shift);
if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
/* No CPU has hugepages but lacks no execute, so we
* don't need to worry about that case */
rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
- trap);
+ trap, sz);
/* Check if pte already has an hpte (case 2) */
if (unlikely(old_pte & _PAGE_HASHPTE)) {
/* There MIGHT be an HPTE for this pte */
unsigned long hash, slot;
- hash = hpt_hash(va, HPAGE_SHIFT, ssize);
+ hash = hpt_hash(va, shift, ssize);
if (old_pte & _PAGE_F_SECOND)
hash = ~hash;
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
slot += (old_pte & _PAGE_F_GIX) >> 12;
- if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_huge_psize,
+ if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
ssize, local) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
if (likely(!(old_pte & _PAGE_HASHPTE))) {
- unsigned long hash = hpt_hash(va, HPAGE_SHIFT, ssize);
+ unsigned long hash = hpt_hash(va, shift, ssize);
unsigned long hpte_group;
pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
@@ -552,7 +632,7 @@ repeat:
/* Insert into the hash table, primary slot */
slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
- mmu_huge_psize, ssize);
+ mmu_psize, ssize);
/* Primary is full, try the secondary */
if (unlikely(slot == -1)) {
@@ -560,7 +640,7 @@ repeat:
HPTES_PER_GROUP) & ~0x7UL;
slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
HPTE_V_SECONDARY,
- mmu_huge_psize, ssize);
+ mmu_psize, ssize);
if (slot == -1) {
if (mftb() & 0x1)
hpte_group = ((hash & htab_hash_mask) *
@@ -597,66 +677,50 @@ void set_huge_psize(int psize)
(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
- /* Return if huge page size is the same as the
- * base page size. */
- if (mmu_psize_defs[psize].shift == PAGE_SHIFT)
+ /* Return if huge page size has already been set up or is the
+ * same as the base page size. */
+ if (mmu_huge_psizes[psize] ||
+ mmu_psize_defs[psize].shift == PAGE_SHIFT)
return;
+ hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
- HPAGE_SHIFT = mmu_psize_defs[psize].shift;
- mmu_huge_psize = psize;
-
- switch (HPAGE_SHIFT) {
+ switch (mmu_psize_defs[psize].shift) {
case PAGE_SHIFT_64K:
/* We only allow 64k hpages with 4k base page,
* which was checked above, and always put them
* at the PMD */
- hugepte_shift = PMD_SHIFT;
+ hugepte_shift[psize] = PMD_SHIFT;
break;
case PAGE_SHIFT_16M:
/* 16M pages can be at two different levels
* of pagestables based on base page size */
if (PAGE_SHIFT == PAGE_SHIFT_64K)
- hugepte_shift = PMD_SHIFT;
+ hugepte_shift[psize] = PMD_SHIFT;
else /* 4k base page */
- hugepte_shift = PUD_SHIFT;
+ hugepte_shift[psize] = PUD_SHIFT;
break;
case PAGE_SHIFT_16G:
/* 16G pages are always at PGD level */
- hugepte_shift = PGDIR_SHIFT;
+ hugepte_shift[psize] = PGDIR_SHIFT;
break;
}
- hugepte_shift -= HPAGE_SHIFT;
+ hugepte_shift[psize] -= mmu_psize_defs[psize].shift;
} else
- HPAGE_SHIFT = 0;
+ hugepte_shift[psize] = 0;
}
static int __init hugepage_setup_sz(char *str)
{
unsigned long long size;
- int mmu_psize = -1;
+ int mmu_psize;
int shift;
size = memparse(str, &str);
shift = __ffs(size);
- switch (shift) {
-#ifndef CONFIG_PPC_64K_PAGES
- case PAGE_SHIFT_64K:
- mmu_psize = MMU_PAGE_64K;
- break;
-#endif
- case PAGE_SHIFT_16M:
- mmu_psize = MMU_PAGE_16M;
- break;
- case PAGE_SHIFT_16G:
- mmu_psize = MMU_PAGE_16G;
- break;
- }
-
- if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift) {
+ mmu_psize = shift_to_mmu_psize(shift);
+ if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)
set_huge_psize(mmu_psize);
- hugetlb_add_hstate(shift - PAGE_SHIFT);
- }
else
printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
@@ -671,16 +735,30 @@ static void zero_ctor(struct kmem_cache
static int __init hugetlbpage_init(void)
{
+ unsigned int psize;
if (!cpu_has_feature(CPU_FTR_16M_PAGE))
return -ENODEV;
-
- huge_pgtable_cache = kmem_cache_create("hugepte_cache",
- HUGEPTE_TABLE_SIZE,
- HUGEPTE_TABLE_SIZE,
- 0,
- zero_ctor);
- if (! huge_pgtable_cache)
- panic("hugetlbpage_init(): could not create hugepte cache\n");
+ /* Add supported huge page sizes. Need to change HUGE_MAX_HSTATE
+ * and adjust PTE_NONCACHE_NUM if the number of supported huge page
+ * sizes changes.
+ */
+ set_huge_psize(MMU_PAGE_16M);
+ set_huge_psize(MMU_PAGE_64K);
+ set_huge_psize(MMU_PAGE_16G);
+
+ for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+ if (mmu_huge_psizes[psize]) {
+ huge_pgtable_cache(psize) = kmem_cache_create(
+ HUGEPTE_CACHE_NAME(psize),
+ HUGEPTE_TABLE_SIZE(psize),
+ HUGEPTE_TABLE_SIZE(psize),
+ 0,
+ zero_ctor);
+ if (!huge_pgtable_cache(psize))
+ panic("hugetlbpage_init(): could not create %s"\
+ "\n", HUGEPTE_CACHE_NAME(psize));
+ }
+ }
return 0;
}
Index: linux-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/init_64.c
+++ linux-2.6/arch/powerpc/mm/init_64.c
@@ -153,10 +153,10 @@ static const char *pgtable_cache_name[AR
};
#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need one extra cache, initialized in hugetlbpage.c. We
- * can't put into the tables above, because HPAGE_SHIFT is not compile
- * time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+1];
+/* Hugepages need an extra cache per hugepagesize, initialized in
+ * hugetlbpage.c. We can't put them into the tables above, because HPAGE_SHIFT
+ * is not compile time constant. */
+struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
#else
struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
#endif
Index: linux-2.6/arch/powerpc/mm/tlb_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_64.c
+++ linux-2.6/arch/powerpc/mm/tlb_64.c
@@ -150,7 +150,7 @@ void hpte_need_flush(struct mm_struct *m
*/
if (huge) {
#ifdef CONFIG_HUGETLB_PAGE
- psize = mmu_huge_psize;
+ psize = get_slice_psize(mm, addr);;
#else
BUG();
psize = pte_pagesize_index(mm, addr, pte); /* shutup gcc */
Index: linux-2.6/include/asm-powerpc/mmu-hash64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/mmu-hash64.h
+++ linux-2.6/include/asm-powerpc/mmu-hash64.h
@@ -193,9 +193,9 @@ extern int mmu_ci_restrictions;
#ifdef CONFIG_HUGETLB_PAGE
/*
- * The page size index of the huge pages for use by hugetlbfs
+ * The page size indexes of the huge pages for use by hugetlbfs
*/
-extern int mmu_huge_psize;
+extern unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
#endif /* CONFIG_HUGETLB_PAGE */
Index: linux-2.6/include/asm-powerpc/page_64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/page_64.h
+++ linux-2.6/include/asm-powerpc/page_64.h
@@ -90,6 +90,7 @@ extern unsigned int HPAGE_SHIFT;
#define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE 3
#endif /* __ASSEMBLY__ */
Index: linux-2.6/include/asm-powerpc/pgalloc-64.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgalloc-64.h
+++ linux-2.6/include/asm-powerpc/pgalloc-64.h
@@ -22,7 +22,7 @@ extern struct kmem_cache *pgtable_cache[
#define PUD_CACHE_NUM 1
#define PMD_CACHE_NUM 1
#define HUGEPTE_CACHE_NUM 2
-#define PTE_NONCACHE_NUM 3 /* from GFP rather than kmem_cache */
+#define PTE_NONCACHE_NUM 7 /* from GFP rather than kmem_cache */
static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
@@ -119,7 +119,7 @@ static inline void pte_free(struct mm_st
__free_page(ptepage);
}
-#define PGF_CACHENUM_MASK 0x3
+#define PGF_CACHENUM_MASK 0x7
typedef struct pgtable_free {
unsigned long val;
Index: linux-2.6/include/asm-powerpc/hugetlb.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/hugetlb.h
+++ linux-2.6/include/asm-powerpc/hugetlb.h
@@ -23,9 +23,10 @@ pte_t huge_ptep_get_and_clear(struct mm_
*/
static inline int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
{
- if (len & ~HPAGE_MASK)
+ struct hstate *h = hstate_file(file);
+ if (len & ~huge_page_mask(h))
return -EINVAL;
- if (addr & ~HPAGE_MASK)
+ if (addr & ~huge_page_mask(h))
return -EINVAL;
return 0;
}
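The lookup pattern this patch repeats in set_huge_pte_at(),
follow_huge_addr() and hash_huge_page() is: resolve the slice psize for
the address, check it against mmu_huge_psizes[], then convert it to a
shift and an hstate. A minimal sketch of that lookup, using only helpers
that appear in the diff above (get_slice_psize(), mmu_psize_to_shift(),
size_to_hstate()); the wrapper name addr_to_huge_hstate() is
hypothetical and not part of the patch:

	/* Sketch only: resolve the hstate governing a hugepage address
	 * once mmu_huge_psize has become the mmu_huge_psizes[] array.
	 * Returns NULL if the slice is not a supported huge page size. */
	static struct hstate *addr_to_huge_hstate(struct mm_struct *mm,
						  unsigned long addr)
	{
		unsigned int psize = get_slice_psize(mm, addr);

		if (!mmu_huge_psizes[psize])
			return NULL;
		return size_to_hstate(1UL << mmu_psize_to_shift(psize));
	}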
* Re: [patch 23/23] powerpc: support multiple hugepage sizes
2008-05-25 14:23 ` [patch 23/23] powerpc: support multiple hugepage sizes npiggin
@ 2008-05-27 17:14 ` Nishanth Aravamudan
2008-05-28 8:49 ` Nick Piggin
0 siblings, 1 reply; 88+ messages in thread
From: Nishanth Aravamudan @ 2008-05-27 17:14 UTC (permalink / raw)
To: npiggin; +Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Jon Tollefson
On 26.05.2008 [00:23:40 +1000], npiggin@suse.de wrote:
> Instead of using the variable mmu_huge_psize to keep track of the huge
> page size, we use an array of MMU_PAGE_* values. For each supported
> huge page size we need to know the hugepte_shift value and have a
> pgtable_cache. The hstate or an mmu_huge_psizes index is passed to
> functions so that they know which huge page size they should use.
>
> The hugepage sizes 16M and 64K are set up (if available on the
> hardware) so that they don't have to be set on the boot cmd line in
> order to use them. The number of 16G pages has to be specified at
> boot time, though (e.g. hugepagesz=16G hugepages=5).
This patch should probably update Documentation as well, to indicate
that powerpc can also specify hugepagesz multiple times?
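For illustration, a powerpc command line enabling two sizes would
presumably look something like this (extrapolating from the changelog's
own hugepagesz=16G hugepages=5 example; the documented syntax may
differ):

	hugepagesz=16M hugepages=128 hugepagesz=16G hugepages=2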
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 23/23] powerpc: support multiple hugepage sizes
2008-05-27 17:14 ` Nishanth Aravamudan
@ 2008-05-28 8:49 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-28 8:49 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: linux-mm, kniht, andi, agl, abh, joachim.deguara, Jon Tollefson
On Tue, May 27, 2008 at 10:14:03AM -0700, Nishanth Aravamudan wrote:
> On 26.05.2008 [00:23:40 +1000], npiggin@suse.de wrote:
> > Instead of using the variable mmu_huge_psize to keep track of the huge
> > page size, we use an array of MMU_PAGE_* values. For each supported
> > huge page size we need to know the hugepte_shift value and have a
> > pgtable_cache. The hstate or an mmu_huge_psizes index is passed to
> > functions so that they know which huge page size they should use.
> >
> > The hugepage sizes 16M and 64K are set up (if available on the
> > hardware) so that they don't have to be set on the boot cmd line in
> > order to use them. The number of 16G pages has to be specified at
> > boot time, though (e.g. hugepagesz=16G hugepages=5).
>
> This patch should probably update Documentation as well, to indicate
> that powerpc can also specify hugepagesz multiple times?
OK, added a small bit to it.
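Presumably something along these lines (illustrative wording only, not
the actual text of the addition):

	hugepagesz=	On powerpc this option may be specified multiple
			times to enable more than one huge page size. The
			number of 16G pages must still be given on the boot
			command line (e.g. hugepagesz=16G hugepages=5).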
* Re: [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc
2008-05-25 14:23 [patch 00/23] multi size, giant hugetlb support, 1GB for x86, 16GB for powerpc npiggin
` (22 preceding siblings ...)
2008-05-25 14:23 ` [patch 23/23] powerpc: support multiple hugepage sizes npiggin
@ 2008-05-25 14:42 ` Nick Piggin
23 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2008-05-25 14:42 UTC (permalink / raw)
To: linux-mm; +Cc: kniht, andi, nacc, agl, abh, joachim.deguara
On Mon, May 26, 2008 at 12:23:17AM +1000, npiggin@suse.de wrote:
> Hi all,
>
> Given the amount of feedback this has had, and the powerpc patches from Jon,
> I'll send out one more request for review and testing before asking Andrew
> to merge in -mm.
>
> Patches are against Linus's current git (eb90d81d). I will have to rebase
> to -mm next.
>
> The patches pass the libhugetlbfs regression test suite here on x86 and
> powerpc (although my G5 can only run 16MB hugepages, so it is less
> interesting...).
>
> So, review and testing welcome.
>
> Thanks!
> Nick
Arg, sorry, I've got Andi's old SUSE address on some of these:
quilt mail extracts SOBs and adds them to the Cc list, which always
gets me :(