* [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
@ 2008-04-23 1:53 npiggin
2008-04-23 1:53 ` [patch 01/18] hugetlb: fix lockdep spew npiggin
` (18 more replies)
0 siblings, 19 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
Hi
Patches 1 and 2 are good to merge upstream now. Patch 3 I hope gets merged
too. After Andrew's big upstream merge, and this round of review, I plan to
push the rest of the patchset into -mm. It would be very nice to have the
powerpc patches integrated and tested by that point too -- is there something
I can pick up?
I'm again not sure about the sysfs work. I think this patchset probably
does make sense to go in first, because it will necessarily change the
layout of the sysfs directories.
I have integrated the bounds fixes and type-size fixes suggested by
reviewers, and merged those, together with my previous round of fixes,
into the earlier patches in the patchset.
I have also done a little more juggling of the patchset (without changing
the end result, but trying to improve the intermediate steps).
Then I have done another set of fixes in the last patch of the patchset,
which will again be merged after review.
Testing-wise, I've changed the registration mechanism so that if you specify
hugepagesz=1G on the command line, then you do not get the 2M pages by default
(you have to also specify hugepagesz=2M). Also, when only one hstate is
registered, all the proc outputs appear unchanged, so this makes it very easy
to test with.
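For reference, the behaviour described above would translate into boot lines like the following (hypothetical examples; the exact parameter pairing is whatever this patchset's command-line parser accepts):

```
# 1GB pages only -- 2M pages are no longer registered by default:
hugepagesz=1G

# Register both sizes explicitly:
hugepagesz=1G hugepagesz=2M
```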
Thanks,
Nick
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [patch 01/18] hugetlb: fix lockdep spew
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 13:06 ` KOSAKI Motohiro
2008-04-23 1:53 ` [patch 02/18] hugetlb: factor out huge_new_page npiggin
` (17 subsequent siblings)
18 siblings, 1 reply; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-copy-lockdep.patch --]
[-- Type: text/plain, Size: 769 bytes --]
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -761,7 +761,7 @@ int copy_hugetlb_page_range(struct mm_st
continue;
spin_lock(&dst->page_table_lock);
- spin_lock(&src->page_table_lock);
+ spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
if (!pte_none(*src_pte)) {
if (cow)
ptep_set_wrprotect(src, addr, src_pte);
* [patch 02/18] hugetlb: factor out huge_new_page
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
2008-04-23 1:53 ` [patch 01/18] hugetlb: fix lockdep spew npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-24 23:49 ` Nishanth Aravamudan
2008-04-24 23:54 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 03/18] mm: offset align in alloc_bootmem npiggin, Yinghai Lu
` (16 subsequent siblings)
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-factor-page-prep.patch --]
[-- Type: text/plain, Size: 1825 bytes --]
Needed to avoid code duplication in follow-up patches.
This happens to fix a minor bug: when alloc_bootmem_node returns a
page on a fallback node different from the one passed in, the old code
would have put it into the free lists of the wrong node. Now it ends
up in the freelist of the correct node.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -190,6 +190,17 @@ static int adjust_pool_surplus(int delta
return ret;
}
+static void prep_new_huge_page(struct page *page)
+{
+ unsigned nid = pfn_to_nid(page_to_pfn(page));
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ nr_huge_pages++;
+ nr_huge_pages_node[nid]++;
+ spin_unlock(&hugetlb_lock);
+ put_page(page); /* free it into the hugepage allocator */
+}
+
static struct page *alloc_fresh_huge_page_node(int nid)
{
struct page *page;
@@ -197,14 +208,8 @@ static struct page *alloc_fresh_huge_pag
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
- if (page) {
- set_compound_page_dtor(page, free_huge_page);
- spin_lock(&hugetlb_lock);
- nr_huge_pages++;
- nr_huge_pages_node[nid]++;
- spin_unlock(&hugetlb_lock);
- put_page(page); /* free it into the hugepage allocator */
- }
+ if (page)
+ prep_new_huge_page(page);
return page;
}
* [patch 03/18] mm: offset align in alloc_bootmem
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
2008-04-23 1:53 ` [patch 01/18] hugetlb: fix lockdep spew npiggin
2008-04-23 1:53 ` [patch 02/18] hugetlb: factor out huge_new_page npiggin
@ 2008-04-23 1:53 ` npiggin, Yinghai Lu
2008-04-23 1:53 ` [patch 04/18] hugetlb: modular state npiggin
` (15 subsequent siblings)
18 siblings, 0 replies; 123+ messages in thread
From: npiggin, Yinghai Lu @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: mm-offset-align-in-alloc_bootmem.patch --]
[-- Type: text/plain, Size: 5586 bytes --]
Offset alignment is needed when node_boot_start's alignment is less than
the required align.
Use a local node_boot_start to match the alignment, so that no extra
operation is added to the search loop.
[this is in -mm already, but needs to be applied to mainline to run this
patchset]
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
---
mm/bootmem.c | 60 +++++++++++++++++++++++++++++++++--------------------------
1 file changed, 34 insertions(+), 26 deletions(-)
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -206,9 +206,11 @@ void * __init
__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
unsigned long align, unsigned long goal, unsigned long limit)
{
- unsigned long offset, remaining_size, areasize, preferred;
+ unsigned long areasize, preferred;
unsigned long i, start = 0, incr, eidx, end_pfn;
void *ret;
+ unsigned long node_boot_start;
+ void *node_bootmem_map;
if (!size) {
printk("__alloc_bootmem_core(): zero-sized request\n");
@@ -216,54 +218,61 @@ __alloc_bootmem_core(struct bootmem_data
}
BUG_ON(align & (align-1));
- if (limit && bdata->node_boot_start >= limit)
- return NULL;
-
/* on nodes without memory - bootmem_map is NULL */
if (!bdata->node_bootmem_map)
return NULL;
+ /* bdata->node_boot_start is supposed to be (12+6)bits alignment on x86_64 ? */
+ node_boot_start = bdata->node_boot_start;
+ node_bootmem_map = bdata->node_bootmem_map;
+ if (align) {
+ node_boot_start = ALIGN(bdata->node_boot_start, align);
+ if (node_boot_start > bdata->node_boot_start)
+ node_bootmem_map = (unsigned long *)bdata->node_bootmem_map +
+ PFN_DOWN(node_boot_start - bdata->node_boot_start)/BITS_PER_LONG;
+ }
+
+ if (limit && node_boot_start >= limit)
+ return NULL;
+
end_pfn = bdata->node_low_pfn;
limit = PFN_DOWN(limit);
if (limit && end_pfn > limit)
end_pfn = limit;
- eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
- offset = 0;
- if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
- offset = align - (bdata->node_boot_start & (align - 1UL));
- offset = PFN_DOWN(offset);
+ eidx = end_pfn - PFN_DOWN(node_boot_start);
/*
* We try to allocate bootmem pages above 'goal'
* first, then we try to allocate lower pages.
*/
- if (goal && goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
- preferred = goal - bdata->node_boot_start;
+ if (goal && goal >= node_boot_start && PFN_DOWN(goal) < end_pfn) {
+ preferred = goal - node_boot_start;
- if (bdata->last_success >= preferred)
+ if (bdata->last_success > node_boot_start &&
+ bdata->last_success - node_boot_start >= preferred)
if (!limit || (limit && limit > bdata->last_success))
- preferred = bdata->last_success;
+ preferred = bdata->last_success - node_boot_start;
} else
preferred = 0;
- preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+ preferred = PFN_DOWN(ALIGN(preferred, align));
areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
incr = align >> PAGE_SHIFT ? : 1;
restart_scan:
for (i = preferred; i < eidx; i += incr) {
unsigned long j;
- i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
+ i = find_next_zero_bit(node_bootmem_map, eidx, i);
i = ALIGN(i, incr);
if (i >= eidx)
break;
- if (test_bit(i, bdata->node_bootmem_map))
+ if (test_bit(i, node_bootmem_map))
continue;
for (j = i + 1; j < i + areasize; ++j) {
if (j >= eidx)
goto fail_block;
- if (test_bit(j, bdata->node_bootmem_map))
+ if (test_bit(j, node_bootmem_map))
goto fail_block;
}
start = i;
@@ -272,14 +281,14 @@ restart_scan:
i = ALIGN(j, incr);
}
- if (preferred > offset) {
- preferred = offset;
+ if (preferred > 0) {
+ preferred = 0;
goto restart_scan;
}
return NULL;
found:
- bdata->last_success = PFN_PHYS(start);
+ bdata->last_success = PFN_PHYS(start) + node_boot_start;
BUG_ON(start >= eidx);
/*
@@ -289,6 +298,7 @@ found:
*/
if (align < PAGE_SIZE &&
bdata->last_offset && bdata->last_pos+1 == start) {
+ unsigned long offset, remaining_size;
offset = ALIGN(bdata->last_offset, align);
BUG_ON(offset > PAGE_SIZE);
remaining_size = PAGE_SIZE - offset;
@@ -297,14 +307,12 @@ found:
/* last_pos unchanged */
bdata->last_offset = offset + size;
ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
- offset +
- bdata->node_boot_start);
+ offset + node_boot_start);
} else {
remaining_size = size - remaining_size;
areasize = (remaining_size + PAGE_SIZE-1) / PAGE_SIZE;
ret = phys_to_virt(bdata->last_pos * PAGE_SIZE +
- offset +
- bdata->node_boot_start);
+ offset + node_boot_start);
bdata->last_pos = start + areasize - 1;
bdata->last_offset = remaining_size;
}
@@ -312,14 +320,14 @@ found:
} else {
bdata->last_pos = start + areasize - 1;
bdata->last_offset = size & ~PAGE_MASK;
- ret = phys_to_virt(start * PAGE_SIZE + bdata->node_boot_start);
+ ret = phys_to_virt(start * PAGE_SIZE + node_boot_start);
}
/*
* Reserve the area now:
*/
for (i = start; i < start + areasize; i++)
- if (unlikely(test_and_set_bit(i, bdata->node_bootmem_map)))
+ if (unlikely(test_and_set_bit(i, node_bootmem_map)))
BUG();
memset(ret, 0, size);
return ret;
* [patch 04/18] hugetlb: modular state
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (2 preceding siblings ...)
2008-04-23 1:53 ` [patch 03/18] mm: offset align in alloc_bootmem npiggin, Yinghai Lu
@ 2008-04-23 1:53 ` npiggin
2008-04-23 15:21 ` Jon Tollefson
2008-04-25 17:13 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 05/18] hugetlb: multiple hstates npiggin
` (14 subsequent siblings)
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-modular-state.patch --]
[-- Type: text/plain, Size: 40457 bytes --]
Large, but rather mechanical patch that converts most of the hugetlb.c
globals into structure members and passes them around.
Right now there is only a single global hstate structure, but
most of the infrastructure to extend it is there.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/ia64/mm/hugetlbpage.c | 2
arch/powerpc/mm/hugetlbpage.c | 2
arch/sh/mm/hugetlbpage.c | 2
arch/sparc64/mm/hugetlbpage.c | 2
arch/x86/mm/hugetlbpage.c | 2
fs/hugetlbfs/inode.c | 47 +++---
include/linux/hugetlb.h | 72 +++++++++-
ipc/shm.c | 3
mm/hugetlb.c | 302 ++++++++++++++++++++++--------------------
mm/memory.c | 2
mm/mempolicy.c | 10 -
mm/mmap.c | 3
12 files changed, 276 insertions(+), 173 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -22,30 +22,24 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
unsigned long max_huge_pages;
unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate global_hstate;
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
static DEFINE_SPINLOCK(hugetlb_lock);
-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page, unsigned long addr, unsigned long sz)
{
int i;
might_sleep();
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+ for (i = 0; i < sz/PAGE_SIZE; i++) {
cond_resched();
clear_user_highpage(page + i, addr + i * PAGE_SIZE);
}
@@ -55,34 +49,35 @@ static void copy_huge_page(struct page *
unsigned long addr, struct vm_area_struct *vma)
{
int i;
+ struct hstate *h = hstate_vma(vma);
might_sleep();
- for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+ for (i = 0; i < 1 << huge_page_order(h); i++) {
cond_resched();
copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
}
}
-static void enqueue_huge_page(struct page *page)
+static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &hugepage_freelists[nid]);
- free_huge_pages++;
- free_huge_pages_node[nid]++;
+ list_add(&page->lru, &h->hugepage_freelists[nid]);
+ h->free_huge_pages++;
+ h->free_huge_pages_node[nid]++;
}
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct hstate *h)
{
int nid;
struct page *page = NULL;
for (nid = 0; nid < MAX_NUMNODES; ++nid) {
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
break;
}
}
@@ -98,18 +93,19 @@ static struct page *dequeue_huge_page_vm
struct zonelist *zonelist = huge_zonelist(vma, address,
htlb_alloc_mask, &mpol);
struct zone **z;
+ struct hstate *h = hstate_vma(vma);
for (z = zonelist->zones; *z; z++) {
nid = zone_to_nid(*z);
if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
- !list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ !list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
if (vma && vma->vm_flags & VM_MAYSHARE)
- resv_huge_pages--;
+ h->resv_huge_pages--;
break;
}
}
@@ -117,23 +113,24 @@ static struct page *dequeue_huge_page_vm
return page;
}
-static void update_and_free_page(struct page *page)
+static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;
- nr_huge_pages--;
- nr_huge_pages_node[page_to_nid(page)]--;
- for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+ h->nr_huge_pages--;
+ h->nr_huge_pages_node[page_to_nid(page)]--;
+ for (i = 0; i < (1 << huge_page_order(h)); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1<< PG_writeback);
}
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
- __free_pages(page, HUGETLB_PAGE_ORDER);
+ __free_pages(page, huge_page_order(h));
}
static void free_huge_page(struct page *page)
{
+ struct hstate *h = &global_hstate;
int nid = page_to_nid(page);
struct address_space *mapping;
@@ -143,12 +140,12 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages_node[nid]) {
- update_and_free_page(page);
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ if (h->surplus_huge_pages_node[nid]) {
+ update_and_free_page(h, page);
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
} else {
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
if (mapping)
@@ -160,7 +157,7 @@ static void free_huge_page(struct page *
* balanced by operating on them in a round-robin fashion.
* Returns 1 if an adjustment was made.
*/
-static int adjust_pool_surplus(int delta)
+static int adjust_pool_surplus(struct hstate *h, int delta)
{
static int prev_nid;
int nid = prev_nid;
@@ -173,15 +170,15 @@ static int adjust_pool_surplus(int delta
nid = first_node(node_online_map);
/* To shrink on this node, there must be a surplus page */
- if (delta < 0 && !surplus_huge_pages_node[nid])
+ if (delta < 0 && !h->surplus_huge_pages_node[nid])
continue;
/* Surplus cannot exceed the total number of pages */
- if (delta > 0 && surplus_huge_pages_node[nid] >=
- nr_huge_pages_node[nid])
+ if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+ h->nr_huge_pages_node[nid])
continue;
- surplus_huge_pages += delta;
- surplus_huge_pages_node[nid] += delta;
+ h->surplus_huge_pages += delta;
+ h->surplus_huge_pages_node[nid] += delta;
ret = 1;
break;
} while (nid != prev_nid);
@@ -190,41 +187,41 @@ static int adjust_pool_surplus(int delta
return ret;
}
-static void prep_new_huge_page(struct page *page)
+static void prep_new_huge_page(struct hstate *h, struct page *page)
{
unsigned nid = pfn_to_nid(page_to_pfn(page));
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
- nr_huge_pages++;
- nr_huge_pages_node[nid]++;
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
}
-static struct page *alloc_fresh_huge_page_node(int nid)
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
if (page)
- prep_new_huge_page(page);
+ prep_new_huge_page(h, page);
return page;
}
-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
int start_nid;
int next_nid;
int ret = 0;
- start_nid = hugetlb_next_nid;
+ start_nid = h->hugetlb_next_nid;
do {
- page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+ page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
/*
@@ -238,17 +235,18 @@ static int alloc_fresh_huge_page(void)
* if we just successfully allocated a hugepage so that
* the next caller gets hugepages on the next node.
*/
- next_nid = next_node(hugetlb_next_nid, node_online_map);
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
if (next_nid == MAX_NUMNODES)
next_nid = first_node(node_online_map);
- hugetlb_next_nid = next_nid;
- } while (!page && hugetlb_next_nid != start_nid);
+ h->hugetlb_next_nid = next_nid;
+ } while (!page && h->hugetlb_next_nid != start_nid);
return ret;
}
-static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
- unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long address)
{
struct page *page;
unsigned int nid;
@@ -277,17 +275,17 @@ static struct page *alloc_buddy_huge_pag
* per-node value is checked there.
*/
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+ if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
spin_unlock(&hugetlb_lock);
return NULL;
} else {
- nr_huge_pages++;
- surplus_huge_pages++;
+ h->nr_huge_pages++;
+ h->surplus_huge_pages++;
}
spin_unlock(&hugetlb_lock);
page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
spin_lock(&hugetlb_lock);
if (page) {
@@ -302,11 +300,11 @@ static struct page *alloc_buddy_huge_pag
/*
* We incremented the global counters already
*/
- nr_huge_pages_node[nid]++;
- surplus_huge_pages_node[nid]++;
+ h->nr_huge_pages_node[nid]++;
+ h->surplus_huge_pages_node[nid]++;
} else {
- nr_huge_pages--;
- surplus_huge_pages--;
+ h->nr_huge_pages--;
+ h->surplus_huge_pages--;
}
spin_unlock(&hugetlb_lock);
@@ -317,16 +315,16 @@ static struct page *alloc_buddy_huge_pag
* Increase the hugetlb pool such that it can accomodate a reservation
* of size 'delta'.
*/
-static int gather_surplus_pages(int delta)
+static int gather_surplus_pages(struct hstate *h, int delta)
{
struct list_head surplus_list;
struct page *page, *tmp;
int ret, i;
int needed, allocated;
- needed = (resv_huge_pages + delta) - free_huge_pages;
+ needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
return 0;
}
@@ -337,7 +335,7 @@ static int gather_surplus_pages(int delt
retry:
spin_unlock(&hugetlb_lock);
for (i = 0; i < needed; i++) {
- page = alloc_buddy_huge_page(NULL, 0);
+ page = alloc_buddy_huge_page(h, NULL, 0);
if (!page) {
/*
* We were not able to allocate enough pages to
@@ -358,7 +356,8 @@ retry:
* because either resv_huge_pages or free_huge_pages may have changed.
*/
spin_lock(&hugetlb_lock);
- needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+ needed = (h->resv_huge_pages + delta) -
+ (h->free_huge_pages + allocated);
if (needed > 0)
goto retry;
@@ -371,13 +370,13 @@ retry:
* before they are reserved.
*/
needed += allocated;
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
ret = 0;
free:
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
list_del(&page->lru);
if ((--needed) >= 0)
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
else {
/*
* The page has a reference count of zero already, so
@@ -400,7 +399,8 @@ free:
* allocated to satisfy the reservation must be explicitly freed if they were
* never used.
*/
-static void return_unused_surplus_pages(unsigned long unused_resv_pages)
+static void return_unused_surplus_pages(struct hstate *h,
+ unsigned long unused_resv_pages)
{
static int nid = -1;
struct page *page;
@@ -415,27 +415,27 @@ static void return_unused_surplus_pages(
unsigned long remaining_iterations = num_online_nodes();
/* Uncommit the reservation */
- resv_huge_pages -= unused_resv_pages;
+ h->resv_huge_pages -= unused_resv_pages;
- nr_pages = min(unused_resv_pages, surplus_huge_pages);
+ nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
while (remaining_iterations-- && nr_pages) {
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
nid = first_node(node_online_map);
- if (!surplus_huge_pages_node[nid])
+ if (!h->surplus_huge_pages_node[nid])
continue;
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ update_and_free_page(h, page);
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
nr_pages--;
remaining_iterations = num_online_nodes();
}
@@ -458,16 +458,17 @@ static struct page *alloc_huge_page_priv
unsigned long addr)
{
struct page *page = NULL;
+ struct hstate *h = hstate_vma(vma);
if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
- if (free_huge_pages > resv_huge_pages)
+ if (h->free_huge_pages > h->resv_huge_pages)
page = dequeue_huge_page_vma(vma, addr);
spin_unlock(&hugetlb_lock);
if (!page) {
- page = alloc_buddy_huge_page(vma, addr);
+ page = alloc_buddy_huge_page(h, vma, addr);
if (!page) {
hugetlb_put_quota(vma->vm_file->f_mapping, 1);
return ERR_PTR(-VM_FAULT_OOM);
@@ -497,21 +498,27 @@ static struct page *alloc_huge_page(stru
static int __init hugetlb_init(void)
{
unsigned long i;
+ struct hstate *h = &global_hstate;
if (HPAGE_SHIFT == 0)
return 0;
+ if (!h->order) {
+ h->order = HPAGE_SHIFT - PAGE_SHIFT;
+ h->mask = HPAGE_MASK;
+ }
+
for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
for (i = 0; i < max_huge_pages; ++i) {
- if (!alloc_fresh_huge_page())
+ if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = free_huge_pages = nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+ max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
return 0;
}
module_init(hugetlb_init);
@@ -539,19 +546,21 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(unsigned long count)
{
+ struct hstate *h = &global_hstate;
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
- list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+ struct list_head *freel = &h->hugepage_freelists[i];
+ list_for_each_entry_safe(page, next, freel, lru) {
if (count >= nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[page_to_nid(page)]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
@@ -561,10 +570,11 @@ static inline void try_to_free_low(unsig
}
#endif
-#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
+#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
static unsigned long set_max_huge_pages(unsigned long count)
{
unsigned long min_count, ret;
+ struct hstate *h = &global_hstate;
/*
* Increase the pool size
@@ -578,12 +588,12 @@ static unsigned long set_max_huge_pages(
* within all the constraints specified by the sysctls.
*/
spin_lock(&hugetlb_lock);
- while (surplus_huge_pages && count > persistent_huge_pages) {
- if (!adjust_pool_surplus(-1))
+ while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, -1))
break;
}
- while (count > persistent_huge_pages) {
+ while (count > persistent_huge_pages(h)) {
int ret;
/*
* If this allocation races such that we no longer need the
@@ -591,7 +601,7 @@ static unsigned long set_max_huge_pages(
* and reducing the surplus.
*/
spin_unlock(&hugetlb_lock);
- ret = alloc_fresh_huge_page();
+ ret = alloc_fresh_huge_page(h);
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -613,21 +623,21 @@ static unsigned long set_max_huge_pages(
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
*/
- min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
+ min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
try_to_free_low(min_count);
- while (min_count < persistent_huge_pages) {
- struct page *page = dequeue_huge_page();
+ while (min_count < persistent_huge_pages(h)) {
+ struct page *page = dequeue_huge_page(h);
if (!page)
break;
- update_and_free_page(page);
+ update_and_free_page(h, page);
}
- while (count < persistent_huge_pages) {
- if (!adjust_pool_surplus(1))
+ while (count < persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, 1))
break;
}
out:
- ret = persistent_huge_pages;
+ ret = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
return ret;
}
@@ -657,9 +667,10 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ struct hstate *h = &global_hstate;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
spin_unlock(&hugetlb_lock);
return 0;
}
@@ -668,34 +679,37 @@ int hugetlb_overcommit_handler(struct ct
int hugetlb_report_meminfo(char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
"HugePages_Rsvd: %5lu\n"
"HugePages_Surp: %5lu\n"
"Hugepagesize: %5lu kB\n",
- nr_huge_pages,
- free_huge_pages,
- resv_huge_pages,
- surplus_huge_pages,
- HPAGE_SIZE/1024);
+ h->nr_huge_pages,
+ h->free_huge_pages,
+ h->resv_huge_pages,
+ h->surplus_huge_pages,
+ 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"Node %d HugePages_Total: %5u\n"
"Node %d HugePages_Free: %5u\n"
"Node %d HugePages_Surp: %5u\n",
- nid, nr_huge_pages_node[nid],
- nid, free_huge_pages_node[nid],
- nid, surplus_huge_pages_node[nid]);
+ nid, h->nr_huge_pages_node[nid],
+ nid, h->free_huge_pages_node[nid],
+ nid, h->surplus_huge_pages_node[nid]);
}
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+ struct hstate *h = &global_hstate;
+ return h->nr_huge_pages * (1 << huge_page_order(h));
}
/*
@@ -750,14 +764,16 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
- dst_pte = huge_pte_alloc(dst, addr);
+ dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte)
goto nomem;
@@ -793,6 +809,9 @@ void __unmap_hugepage_range(struct vm_ar
pte_t pte;
struct page *page;
struct page *tmp;
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
+
/*
* A page gathering list, protected by per file i_mmap_lock. The
* lock is used to avoid list corruption from multiple unmapping
@@ -801,11 +820,11 @@ void __unmap_hugepage_range(struct vm_ar
LIST_HEAD(page_list);
WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON(start & ~HPAGE_MASK);
- BUG_ON(end & ~HPAGE_MASK);
+ BUG_ON(start & ~huge_page_mask(h));
+ BUG_ON(end & ~huge_page_mask(h));
spin_lock(&mm->page_table_lock);
- for (address = start; address < end; address += HPAGE_SIZE) {
+ for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -853,6 +872,7 @@ static int hugetlb_cow(struct mm_struct
{
struct page *old_page, *new_page;
int avoidcopy;
+ struct hstate *h = hstate_vma(vma);
old_page = pte_page(pte);
@@ -877,7 +897,7 @@ static int hugetlb_cow(struct mm_struct
__SetPageUptodate(new_page);
spin_lock(&mm->page_table_lock);
- ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+ ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(*ptep, pte))) {
/* Break COW */
set_huge_pte_at(mm, address, ptep,
@@ -899,10 +919,11 @@ static int hugetlb_no_page(struct mm_str
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ struct hstate *h = hstate_vma(vma);
mapping = vma->vm_file->f_mapping;
- idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+ idx = ((address - vma->vm_start) >> huge_page_shift(h))
+ + (vma->vm_pgoff >> huge_page_order(h));
/*
* Use page lock to guard against racing truncation
@@ -911,7 +932,7 @@ static int hugetlb_no_page(struct mm_str
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
page = alloc_huge_page(vma, address);
@@ -919,7 +940,7 @@ retry:
ret = -PTR_ERR(page);
goto out;
}
- clear_huge_page(page, address);
+ clear_huge_page(page, address, huge_page_size(h));
__SetPageUptodate(page);
if (vma->vm_flags & VM_SHARED) {
@@ -935,14 +956,14 @@ retry:
}
spin_lock(&inode->i_lock);
- inode->i_blocks += BLOCKS_PER_HUGEPAGE;
+ inode->i_blocks += (huge_page_size(h)) / 512;
spin_unlock(&inode->i_lock);
} else
lock_page(page);
}
spin_lock(&mm->page_table_lock);
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;
@@ -978,8 +999,9 @@ int hugetlb_fault(struct mm_struct *mm,
pte_t entry;
int ret;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+ struct hstate *h = hstate_vma(vma);
- ptep = huge_pte_alloc(mm, address);
+ ptep = huge_pte_alloc(mm, address, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
@@ -1017,6 +1039,7 @@ int follow_hugetlb_page(struct mm_struct
unsigned long pfn_offset;
unsigned long vaddr = *position;
int remainder = *length;
+ struct hstate *h = hstate_vma(vma);
spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
@@ -1028,7 +1051,7 @@ int follow_hugetlb_page(struct mm_struct
* each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
*/
- pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
if (!pte || pte_none(*pte) || (write && !pte_write(*pte))) {
int ret;
@@ -1045,7 +1068,7 @@ int follow_hugetlb_page(struct mm_struct
break;
}
- pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
+ pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(*pte);
same_page:
if (pages) {
@@ -1061,7 +1084,7 @@ same_page:
--remainder;
++i;
if (vaddr < vma->vm_end && remainder &&
- pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+ pfn_offset < (1 << huge_page_order(h))) {
/*
* We use pfn_offset to avoid touching the pageframes
* of this compound page.
@@ -1083,13 +1106,14 @@ void hugetlb_change_protection(struct vm
unsigned long start = address;
pte_t *ptep;
pte_t pte;
+ struct hstate *h = hstate_vma(vma);
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
spin_lock(&mm->page_table_lock);
- for (; address < end; address += HPAGE_SIZE) {
+ for (; address < end; address += huge_page_size(h)) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -1228,7 +1252,7 @@ static long region_truncate(struct list_
return chg;
}
-static int hugetlb_acct_memory(long delta)
+static int hugetlb_acct_memory(struct hstate *h, long delta)
{
int ret = -ENOMEM;
@@ -1251,18 +1275,18 @@ static int hugetlb_acct_memory(long delt
* semantics that cpuset has.
*/
if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
+ if (gather_surplus_pages(h, delta) < 0)
goto out;
- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
+ if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+ return_unused_surplus_pages(h, delta);
goto out;
}
}
ret = 0;
if (delta < 0)
- return_unused_surplus_pages((unsigned long) -delta);
+ return_unused_surplus_pages(h, (unsigned long) -delta);
out:
spin_unlock(&hugetlb_lock);
@@ -1272,6 +1296,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
+ struct hstate *h = &global_hstate;
chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1279,7 +1304,7 @@ int hugetlb_reserve_pages(struct inode *
if (hugetlb_get_quota(inode->i_mapping, chg))
return -ENOSPC;
- ret = hugetlb_acct_memory(chg);
+ ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
hugetlb_put_quota(inode->i_mapping, chg);
return ret;
@@ -1290,12 +1315,13 @@ int hugetlb_reserve_pages(struct inode *
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
+ struct hstate *h = &global_hstate;
long chg = region_truncate(&inode->i_mapping->private_list, offset);
spin_lock(&inode->i_lock);
- inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
+ inode->i_blocks -= ((huge_page_size(h))/512) * freed;
spin_unlock(&inode->i_lock);
hugetlb_put_quota(inode->i_mapping, (chg - freed));
- hugetlb_acct_memory(-(chg - freed));
+ hugetlb_acct_memory(h, -(chg - freed));
}
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
return NULL;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pg;
pud_t *pu;
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
pgoff, flags);
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
+++ linux-2.6/arch/sh/mm/hugetlbpage.c
@@ -22,7 +22,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -24,7 +24,7 @@
unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
pte_t *
-huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
+huge_pte_alloc (struct mm_struct *mm, unsigned long addr, int sz)
{
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
return 1;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
/* arch callbacks */
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz);
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -95,7 +95,6 @@ pte_t huge_ptep_get_and_clear(struct mm_
#else
void hugetlb_prefault_arch_hook(struct mm_struct *mm);
#endif
-
#else /* !CONFIG_HUGETLB_PAGE */
static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -169,8 +168,6 @@ struct file *hugetlb_file_setup(const ch
int hugetlb_get_quota(struct address_space *mapping, long delta);
void hugetlb_put_quota(struct address_space *mapping, long delta);
-#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
-
static inline int is_file_hugepages(struct file *file)
{
if (file->f_op == &hugetlbfs_file_operations)
@@ -199,4 +196,71 @@ unsigned long hugetlb_get_unmapped_area(
unsigned long flags);
#endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Defines one hugetlb page size */
+struct hstate {
+ int hugetlb_next_nid;
+ unsigned int order;
+ unsigned long mask;
+ unsigned long max_huge_pages;
+ unsigned long nr_huge_pages;
+ unsigned long free_huge_pages;
+ unsigned long resv_huge_pages;
+ unsigned long surplus_huge_pages;
+ unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_freelists[MAX_NUMNODES];
+ unsigned int nr_huge_pages_node[MAX_NUMNODES];
+ unsigned int free_huge_pages_node[MAX_NUMNODES];
+ unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+};
+
+extern struct hstate global_hstate;
+
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_file(struct file *f)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_inode(struct inode *i)
+{
+ return &global_hstate;
+}
+
+static inline unsigned long huge_page_size(struct hstate *h)
+{
+ return (unsigned long)PAGE_SIZE << h->order;
+}
+
+static inline unsigned long huge_page_mask(struct hstate *h)
+{
+ return h->mask;
+}
+
+static inline unsigned long huge_page_order(struct hstate *h)
+{
+ return h->order;
+}
+
+static inline unsigned huge_page_shift(struct hstate *h)
+{
+ return h->order + PAGE_SHIFT;
+}
+
+#else
+struct hstate {};
+#define hstate_file(f) NULL
+#define hstate_vma(v) NULL
+#define hstate_inode(i) NULL
+#define huge_page_size(h) PAGE_SIZE
+#define huge_page_mask(h) PAGE_MASK
+#define huge_page_order(h) 0
+#define huge_page_shift(h) PAGE_SHIFT
+#endif
+
#endif /* _LINUX_HUGETLB_H */
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
struct inode *inode = file->f_path.dentry->d_inode;
loff_t len, vma_len;
int ret;
+ struct hstate *h = hstate_file(file);
/*
* vma address alignment (but not the pgoff alignment) has
@@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
- if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
+ if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
if (vma->vm_flags & VM_MAYSHARE &&
- hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
- len >> HPAGE_SHIFT))
+ hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h),
+ len >> huge_page_shift(h)))
goto out;
ret = 0;
@@ -130,8 +131,9 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
+ struct hstate *h = hstate_file(file);
- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -143,7 +145,7 @@ hugetlb_get_unmapped_area(struct file *f
}
if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
@@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
start_addr = TASK_UNMAPPED_BASE;
full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -174,7 +176,7 @@ full_search:
if (!vma || addr + len <= vma->vm_start)
return addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
#endif
@@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page,
static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
size_t len, loff_t *ppos)
{
+ struct hstate *h = hstate_file(filp);
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
- unsigned long index = *ppos >> HPAGE_SHIFT;
- unsigned long offset = *ppos & ~HPAGE_MASK;
+ unsigned long index = *ppos >> huge_page_shift(h);
+ unsigned long offset = *ppos & ~huge_page_mask(h);
unsigned long end_index;
loff_t isize;
ssize_t retval = 0;
@@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
if (!isize)
goto out;
- end_index = (isize - 1) >> HPAGE_SHIFT;
+ end_index = (isize - 1) >> huge_page_shift(h);
for (;;) {
struct page *page;
- int nr, ret;
+ unsigned long nr, ret;
/* nr is the maximum number of bytes to copy from this page */
- nr = HPAGE_SIZE;
+ nr = huge_page_size(h);
if (index >= end_index) {
if (index > end_index)
goto out;
- nr = ((isize - 1) & ~HPAGE_MASK) + 1;
+ nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
if (nr <= offset) {
goto out;
}
@@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
offset += ret;
retval += ret;
len -= ret;
- index += offset >> HPAGE_SHIFT;
- offset &= ~HPAGE_MASK;
+ index += offset >> huge_page_shift(h);
+ offset &= ~huge_page_mask(h);
if (page)
page_cache_release(page);
@@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
break;
}
out:
- *ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
+ *ppos = ((loff_t)index << huge_page_shift(h)) + offset;
mutex_unlock(&inode->i_mutex);
return retval;
}
@@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
+ struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
- const pgoff_t start = lstart >> HPAGE_SHIFT;
+ const pgoff_t start = lstart >> huge_page_shift(h);
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
@@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
{
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
+ struct hstate *h = hstate_inode(inode);
- BUG_ON(offset & ~HPAGE_MASK);
+ BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, offset);
@@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
+ struct hstate *h = hstate_inode(inode);
int error;
unsigned int ia_valid = attr->ia_valid;
@@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
if (ia_valid & ATTR_SIZE) {
error = -EINVAL;
- if (!(attr->ia_size & ~HPAGE_MASK))
+ if (!(attr->ia_size & ~huge_page_mask(h)))
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
@@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
+ struct hstate *h = hstate_inode(dentry->d_inode);
buf->f_type = HUGETLBFS_MAGIC;
- buf->f_bsize = HPAGE_SIZE;
+ buf->f_bsize = huge_page_size(h);
if (sbinfo) {
spin_lock(&sbinfo->stat_lock);
/* If no limits set, just report 0 for max/free/used
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -613,7 +613,8 @@ static void shm_get_stat(struct ipc_name
if (is_file_hugepages(shp->shm_file)) {
struct address_space *mapping = inode->i_mapping;
- *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+ struct hstate *h = hstate_file(shp->shm_file);
+ *rss += (1 << huge_page_order(h)) * mapping->nrpages;
} else {
struct shmem_inode_info *info = SHMEM_I(inode);
spin_lock(&info->lock);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -848,7 +848,7 @@ unsigned long unmap_vmas(struct mmu_gath
if (unlikely(is_vm_hugetlb_page(vma))) {
unmap_hugepage_range(vma, start, end);
zap_work -= (end - start) /
- (HPAGE_SIZE / PAGE_SIZE);
+ (1 << huge_page_order(hstate_vma(vma)));
start = end;
} else
start = unmap_page_range(*tlbp, vma,
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -1295,7 +1295,8 @@ struct zonelist *huge_zonelist(struct vm
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
- nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ nid = interleave_nid(pol, vma, addr,
+ huge_page_shift(hstate_vma(vma)));
if (unlikely(pol != &default_policy &&
pol != current->mempolicy))
__mpol_free(pol); /* finished with pol */
@@ -1944,9 +1945,12 @@ static void check_huge_range(struct vm_a
{
unsigned long addr;
struct page *page;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);
- for (addr = start; addr < end; addr += HPAGE_SIZE) {
- pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
+ for (addr = start; addr < end; addr += sz) {
+ pte_t *ptep = huge_pte_offset(vma->vm_mm,
+ addr & huge_page_mask(h));
pte_t pte;
if (!ptep)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1793,7 +1793,8 @@ int split_vma(struct mm_struct * mm, str
struct mempolicy *pol;
struct vm_area_struct *new;
- if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
+ if (is_vm_hugetlb_page(vma) && (addr &
+ ~(huge_page_mask(hstate_vma(vma)))))
return -EINVAL;
if (mm->map_count >= sysctl_max_map_count)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [patch 05/18] hugetlb: multiple hstates
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (3 preceding siblings ...)
2008-04-23 1:53 ` [patch 04/18] hugetlb: modular state npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-25 17:38 ` Nishanth Aravamudan
2008-04-29 17:27 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 06/18] hugetlb: multi hstate proc files npiggin
` (13 subsequent siblings)
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-multiple-hstates.patch --]
[-- Type: text/plain, Size: 6971 bytes --]
Add basic support for more than one hstate in hugetlbfs
- Convert hstates to an array
- Add a first default entry covering the standard huge page size
- Add functions for architectures to register new hstates
- Add basic iterators over hstates
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/hugetlb.h | 11 ++++
mm/hugetlb.c | 112 +++++++++++++++++++++++++++++++++++++-----------
2 files changed, 97 insertions(+), 26 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -27,7 +27,17 @@ unsigned long sysctl_overcommit_huge_pag
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-struct hstate global_hstate;
+static int max_hstate = 0;
+
+static unsigned long default_hstate_resv = 0;
+
+struct hstate hstates[HUGE_MAX_HSTATE];
+
+/* for command line parsing */
+struct hstate *parsed_hstate __initdata = NULL;
+
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -128,9 +138,19 @@ static void update_and_free_page(struct
__free_pages(page, huge_page_order(h));
}
+struct hstate *size_to_hstate(unsigned long size)
+{
+ struct hstate *h;
+ for_each_hstate (h) {
+ if (huge_page_size(h) == size)
+ return h;
+ }
+ return NULL;
+}
+
static void free_huge_page(struct page *page)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h = size_to_hstate(PAGE_SIZE << compound_order(page));
int nid = page_to_nid(page);
struct address_space *mapping;
@@ -495,38 +515,80 @@ static struct page *alloc_huge_page(stru
return page;
}
-static int __init hugetlb_init(void)
+static void __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
- struct hstate *h = &global_hstate;
-
- if (HPAGE_SHIFT == 0)
- return 0;
-
- if (!h->order) {
- h->order = HPAGE_SHIFT - PAGE_SHIFT;
- h->mask = HPAGE_MASK;
- }
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
h->hugetlb_next_nid = first_node(node_online_map);
- for (i = 0; i < max_huge_pages; ++i) {
+ for (i = 0; i < h->max_huge_pages; ++i) {
if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
+ h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+}
+
+static void __init hugetlb_init_hstates(void)
+{
+ struct hstate *h;
+
+ for_each_hstate(h) {
+ hugetlb_init_hstate(h);
+ }
+}
+
+static void __init report_hugepages(void)
+{
+ struct hstate *h;
+
+ for_each_hstate(h) {
+ printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ h->free_huge_pages,
+ 1 << (h->order + PAGE_SHIFT - 20));
+ }
+}
+
+static int __init hugetlb_init(void)
+{
+ BUILD_BUG_ON(HPAGE_SHIFT == 0);
+
+ if (!size_to_hstate(HPAGE_SIZE)) {
+ huge_add_hstate(HUGETLB_PAGE_ORDER);
+ parsed_hstate->max_huge_pages = default_hstate_resv;
+ }
+
+ hugetlb_init_hstates();
+
+ report_hugepages();
+
return 0;
}
module_init(hugetlb_init);
+/* Should be called on processing a hugepagesz=... option */
+void __init huge_add_hstate(unsigned order)
+{
+ struct hstate *h;
+ if (size_to_hstate(PAGE_SIZE << order)) {
+ printk("hugepagesz= specified twice, ignoring\n");
+ return;
+ }
+ BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(order < HPAGE_SHIFT - PAGE_SHIFT);
+ h = &hstates[max_hstate++];
+ h->order = order;
+ h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+ hugetlb_init_hstate(h);
+ parsed_hstate = h;
+}
+
static int __init hugetlb_setup(char *s)
{
- if (sscanf(s, "%lu", &max_huge_pages) <= 0)
- max_huge_pages = 0;
+ if (sscanf(s, "%lu", &default_hstate_resv) <= 0)
+ default_hstate_resv = 0;
return 1;
}
__setup("hugepages=", hugetlb_setup);
@@ -544,28 +606,27 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_SYSCTL
#ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count)
{
- struct hstate *h = &global_hstate;
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
- if (count >= nr_huge_pages)
+ if (count >= h->nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
- update_and_free_page(page);
+ update_and_free_page(h, page);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
#else
-static inline void try_to_free_low(unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count)
{
}
#endif
@@ -625,7 +686,7 @@ static unsigned long set_max_huge_pages(
*/
min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
- try_to_free_low(min_count);
+ try_to_free_low(h, min_count);
while (min_count < persistent_huge_pages(h)) {
struct page *page = dequeue_huge_page(h);
if (!page)
@@ -648,6 +709,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
{
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
max_huge_pages = set_max_huge_pages(max_huge_pages);
+ global_hstate.max_huge_pages = max_huge_pages;
return 0;
}
@@ -1296,7 +1358,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
- struct hstate *h = &global_hstate;
+ struct hstate *h = hstate_inode(inode);
chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1315,7 +1377,7 @@ int hugetlb_reserve_pages(struct inode *
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h = hstate_inode(inode);
long chg = region_truncate(&inode->i_mapping->private_list, offset);
spin_lock(&inode->i_lock);
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -215,7 +215,16 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
};
-extern struct hstate global_hstate;
+void __init huge_add_hstate(unsigned order);
+struct hstate *size_to_hstate(unsigned long size);
+
+#ifndef HUGE_MAX_HSTATE
+#define HUGE_MAX_HSTATE 1
+#endif
+
+extern struct hstate hstates[HUGE_MAX_HSTATE];
+
+#define global_hstate (hstates[0])
static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
--
* [patch 06/18] hugetlb: multi hstate proc files
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (4 preceding siblings ...)
2008-04-23 1:53 ` [patch 05/18] hugetlb: multiple hstates npiggin
@ 2008-04-23 1:53 ` npiggin
2008-05-02 19:53 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 07/18] hugetlbfs: per mount hstates npiggin
` (12 subsequent siblings)
18 siblings, 1 reply; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-proc-hstates.patch --]
[-- Type: text/plain, Size: 3367 bytes --]
Convert /proc output code over to report multiple hstates
I chose to just report the numbers in a row, in the hope
of minimizing breakage of existing software. The "compat" page size
is always the first number.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 64 ++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 42 insertions(+), 22 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -739,39 +739,59 @@ int hugetlb_overcommit_handler(struct ct
#endif /* CONFIG_SYSCTL */
+static int dump_field(char *buf, unsigned field)
+{
+ int n = 0;
+ struct hstate *h;
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
+ buf[n++] = '\n';
+ return n;
+}
+
int hugetlb_report_meminfo(char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "HugePages_Total: %5lu\n"
- "HugePages_Free: %5lu\n"
- "HugePages_Rsvd: %5lu\n"
- "HugePages_Surp: %5lu\n"
- "Hugepagesize: %5lu kB\n",
- h->nr_huge_pages,
- h->free_huge_pages,
- h->resv_huge_pages,
- h->surplus_huge_pages,
- 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
+ struct hstate *h;
+ int n = 0;
+ n += sprintf(buf + 0, "HugePages_Total:");
+ n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
+ n += sprintf(buf + n, "HugePages_Free: ");
+ n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
+ n += sprintf(buf + n, "HugePages_Rsvd: ");
+ n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
+ n += sprintf(buf + n, "HugePages_Surp: ");
+ n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
+ n += sprintf(buf + n, "Hugepagesize: ");
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", huge_page_size(h) / 1024);
+ n += sprintf(buf + n, " kB\n");
+ return n;
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "Node %d HugePages_Total: %5u\n"
- "Node %d HugePages_Free: %5u\n"
- "Node %d HugePages_Surp: %5u\n",
- nid, h->nr_huge_pages_node[nid],
- nid, h->free_huge_pages_node[nid],
- nid, h->surplus_huge_pages_node[nid]);
+ int n = 0;
+ n += sprintf(buf, "Node %d HugePages_Total: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ nr_huge_pages_node[nid]));
+ n += sprintf(buf + n, "Node %d HugePages_Free: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ free_huge_pages_node[nid]));
+ n += sprintf(buf + n, "Node %d HugePages_Surp: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ surplus_huge_pages_node[nid]));
+ return n;
}
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- struct hstate *h = &global_hstate;
- return h->nr_huge_pages * (1 << huge_page_order(h));
+ long x = 0;
+ struct hstate *h;
+ for_each_hstate (h) {
+ x += h->nr_huge_pages * (1 << huge_page_order(h));
+ }
+ return x;
}
/*
--
* [patch 07/18] hugetlbfs: per mount hstates
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (5 preceding siblings ...)
2008-04-23 1:53 ` [patch 06/18] hugetlb: multi hstate proc files npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-25 18:09 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 08/18] hugetlb: multi hstate sysctls npiggin
` (11 subsequent siblings)
18 siblings, 1 reply; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlbfs-per-mount-hstate.patch --]
[-- Type: text/plain, Size: 7997 bytes --]
Add support to have individual hstates for each hugetlbfs mount
- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- Set up pointers to the hstate matching the configured page size
in the super block, the inode, and the vma.
- Change the hstate accessors to use this information
- Add code to the hstate init function to set parsed_hstate for command
line processing
- Handle duplicated hstate registrations to make the command line user-proof
[np: take hstate out of hugetlbfs inode and vma->vm_private_data]
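With per-mount hstates, the page size becomes a hugetlbfs mount option. Assuming a kernel booted with both sizes registered (the mount points below are purely illustrative, and the exact size suffixes depend on memparse() accepting K/M/G as the patch notes), usage would look roughly like:

```shell
# kernel booted with e.g. hugepagesz=1G hugepagesz=2M so both hstates exist
mkdir -p /mnt/huge-2m /mnt/huge-1g
mount -t hugetlbfs none /mnt/huge-2m -o pagesize=2M
mount -t hugetlbfs none /mnt/huge-1g -o pagesize=1G
```

Files created in each mount would then be backed by that mount's huge page size.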
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
include/linux/hugetlb.h | 14 +++++++++-----
mm/hugetlb.c | 16 +++-------------
mm/memory.c | 18 ++++++++++++++++--
4 files changed, 66 insertions(+), 30 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -136,6 +136,7 @@ struct hugetlbfs_config {
umode_t mode;
long nr_blocks;
long nr_inodes;
+ struct hstate *hstate;
};
struct hugetlbfs_sb_info {
@@ -144,6 +145,7 @@ struct hugetlbfs_sb_info {
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
+ struct hstate *hstate;
};
@@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
#define global_hstate (hstates[0])
-static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+static inline struct hstate *hstate_inode(struct inode *i)
{
- return &global_hstate;
+ struct hugetlbfs_sb_info *hsb;
+ hsb = HUGETLBFS_SB(i->i_sb);
+ return hsb->hstate;
}
static inline struct hstate *hstate_file(struct file *f)
{
- return &global_hstate;
+ return hstate_inode(f->f_dentry->d_inode);
}
-static inline struct hstate *hstate_inode(struct inode *i)
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
- return &global_hstate;
+ return hstate_file(vma->vm_file);
}
static inline unsigned long huge_page_size(struct hstate *h)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -53,6 +53,7 @@ int sysctl_hugetlb_shm_group;
enum {
Opt_size, Opt_nr_inodes,
Opt_mode, Opt_uid, Opt_gid,
+ Opt_pagesize,
Opt_err,
};
@@ -62,6 +63,7 @@ static match_table_t tokens = {
{Opt_mode, "mode=%o"},
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_pagesize, "pagesize=%s"},
{Opt_err, NULL},
};
@@ -750,6 +752,8 @@ hugetlbfs_parse_options(char *options, s
char *p, *rest;
substring_t args[MAX_OPT_ARGS];
int option;
+ unsigned long long size = 0;
+ enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE;
if (!options)
return 0;
@@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
break;
case Opt_size: {
- unsigned long long size;
/* memparse() will accept a K/M/G without a digit */
if (!isdigit(*args[0].from))
goto bad_val;
size = memparse(args[0].from, &rest);
- if (*rest == '%') {
- size <<= HPAGE_SHIFT;
- size *= max_huge_pages;
- do_div(size, 100);
- }
- pconfig->nr_blocks = (size >> HPAGE_SHIFT);
+ setsize = SIZE_STD;
+ if (*rest == '%')
+ setsize = SIZE_PERCENT;
break;
}
@@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
pconfig->nr_inodes = memparse(args[0].from, &rest);
break;
+ case Opt_pagesize: {
+ unsigned long ps;
+ ps = memparse(args[0].from, &rest);
+ pconfig->hstate = size_to_hstate(ps);
+ if (!pconfig->hstate) {
+ printk(KERN_ERR
+ "hugetlbfs: Unsupported page size %lu MB\n",
+ ps >> 20);
+ return -EINVAL;
+ }
+ break;
+ }
+
default:
printk(KERN_ERR "hugetlbfs: Bad mount option: \"%s\"\n",
p);
@@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
break;
}
}
+
+ /* Do size after hstate is set up */
+ if (setsize > NO_SIZE) {
+ struct hstate *h = pconfig->hstate;
+ if (setsize == SIZE_PERCENT) {
+ size <<= huge_page_shift(h);
+ size *= h->max_huge_pages;
+ do_div(size, 100);
+ }
+ pconfig->nr_blocks = (size >> huge_page_shift(h));
+ }
+
return 0;
bad_val:
@@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block
config.uid = current->fsuid;
config.gid = current->fsgid;
config.mode = 0755;
+ config.hstate = size_to_hstate(HPAGE_SIZE);
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
@@ -840,14 +866,15 @@ hugetlbfs_fill_super(struct super_block
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
+ sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
sbinfo->max_blocks = config.nr_blocks;
sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = HPAGE_SIZE;
- sb->s_blocksize_bits = HPAGE_SHIFT;
+ sb->s_blocksize = huge_page_size(config.hstate);
+ sb->s_blocksize_bits = huge_page_shift(config.hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
@@ -949,7 +976,8 @@ struct file *hugetlb_file_setup(const ch
goto out_dentry;
error = -ENOMEM;
- if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
+ if (hugetlb_reserve_pages(inode, 0,
+ size >> huge_page_shift(hstate_inode(inode))))
goto out_inode;
d_instantiate(dentry, inode);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -934,19 +934,9 @@ void __unmap_hugepage_range(struct vm_ar
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)
{
- /*
- * It is undesirable to test vma->vm_file as it should be non-null
- * for valid hugetlb area. However, vm_file will be NULL in the error
- * cleanup path of do_mmap_pgoff. When hugetlbfs ->mmap method fails,
- * do_mmap_pgoff() nullifies vma->vm_file before calling this function
- * to clean up. Since no pte has actually been setup, it is safe to
- * do nothing in this case.
- */
- if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
- __unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
- }
+ spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ __unmap_hugepage_range(vma, start, end);
+ spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -846,9 +846,23 @@ unsigned long unmap_vmas(struct mmu_gath
}
if (unlikely(is_vm_hugetlb_page(vma))) {
- unmap_hugepage_range(vma, start, end);
- zap_work -= (end - start) /
+ /*
+ * It is undesirable to test vma->vm_file as it
+ * should be non-null for valid hugetlb area.
+ * However, vm_file will be NULL in the error
+ * cleanup path of do_mmap_pgoff. When
+ * hugetlbfs ->mmap method fails,
+ * do_mmap_pgoff() nullifies vma->vm_file
+ * before calling this function to clean up.
+ * Since no pte has actually been setup, it is
+ * safe to do nothing in this case.
+ */
+ if (vma->vm_file) {
+ unmap_hugepage_range(vma, start, end);
+ zap_work -= (end - start) /
(1 << huge_page_order(hstate_vma(vma)));
+ }
+
start = end;
} else
start = unmap_page_range(*tlbp, vma,
* [patch 08/18] hugetlb: multi hstate sysctls
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (6 preceding siblings ...)
2008-04-23 1:53 ` [patch 07/18] hugetlbfs: per mount hstates npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-25 18:14 ` Nishanth Aravamudan
2008-04-25 23:35 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 09/18] hugetlb: abstract numa round robin selection npiggin
` (10 subsequent siblings)
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlbfs-sysctl-hstates.patch --]
[-- Type: text/plain, Size: 5157 bytes --]
Expand the hugetlbfs sysctls to handle arrays for all hstates. This
now allows the removal of global_hstate -- everything is now hstate
aware.
- I didn't bother with hugetlb_shm_group and treat_as_movable;
these remain single globals.
- Also improve error propagation in the sysctl handlers a bit
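Since nr_hugepages now backs an array, a write to the sysctl carries a
whitespace-separated vector, one value per registered hstate. A userspace
sketch of that style of parsing (parse_ulong_vec is a hypothetical helper,
not proc_doulongvec_minmax, and the HUGE_MAX_HSTATE value is a stand-in):

```c
#include <stdlib.h>
#include <stddef.h>

#define HUGE_MAX_HSTATE 4	/* stand-in value for illustration */

/* Consume up to max whitespace-separated unsigned longs from s,
 * e.g. "1024 4\n" for two hstates.  Returns how many were parsed. */
size_t parse_ulong_vec(const char *s, unsigned long *vec, size_t max)
{
	size_t n = 0;
	char *end;

	while (n < max) {
		unsigned long v = strtoul(s, &end, 10);
		if (end == s)	/* no more digits */
			break;
		vec[n++] = v;
		s = end;
	}
	return n;
}
```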
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/hugetlb.h | 7 ++----
kernel/sysctl.c | 2 -
mm/hugetlb.c | 53 +++++++++++++++++++++++++++++++++++++-----------
3 files changed, 45 insertions(+), 17 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -32,8 +32,6 @@ int hugetlb_fault(struct mm_struct *mm,
int hugetlb_reserve_pages(struct inode *inode, long from, long to);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-extern unsigned long max_huge_pages;
-extern unsigned long sysctl_overcommit_huge_pages;
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
@@ -226,8 +224,6 @@ struct hstate *size_to_hstate(unsigned l
extern struct hstate hstates[HUGE_MAX_HSTATE];
-#define global_hstate (hstates[0])
-
static inline struct hstate *hstate_inode(struct inode *i)
{
struct hugetlbfs_sb_info *hsb;
@@ -265,6 +261,9 @@ static inline unsigned huge_page_shift(s
return h->order + PAGE_SHIFT;
}
+extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+
#else
struct hstate {};
#define hstate_file(f) NULL
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -924,7 +924,7 @@ static struct ctl_table vm_table[] = {
{
.procname = "nr_hugepages",
.data = &max_huge_pages,
- .maxlen = sizeof(unsigned long),
+ .maxlen = sizeof(max_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_sysctl_handler,
.extra1 = (void *)&hugetlb_zero,
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -22,8 +22,8 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-unsigned long max_huge_pages;
-unsigned long sysctl_overcommit_huge_pages;
+unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
@@ -587,8 +587,16 @@ void __init huge_add_hstate(unsigned ord
static int __init hugetlb_setup(char *s)
{
- if (sscanf(s, "%lu", &default_hstate_resv) <= 0)
- default_hstate_resv = 0;
+ unsigned long *mhp;
+
+ if (!max_hstate)
+ mhp = &default_hstate_resv;
+ else
+ mhp = &parsed_hstate->max_huge_pages;
+
+ if (sscanf(s, "%lu", mhp) <= 0)
+ *mhp = 0;
+
return 1;
}
__setup("hugepages=", hugetlb_setup);
@@ -632,10 +640,12 @@ static inline void try_to_free_low(struc
#endif
#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(unsigned long count)
+static unsigned long
+set_max_huge_pages(struct hstate *h, unsigned long count, int *err)
{
unsigned long min_count, ret;
- struct hstate *h = &global_hstate;
+
+ *err = 0;
/*
* Increase the pool size
@@ -707,10 +717,25 @@ int hugetlb_sysctl_handler(struct ctl_ta
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
- max_huge_pages = set_max_huge_pages(max_huge_pages);
- global_hstate.max_huge_pages = max_huge_pages;
- return 0;
+ int err = 0;
+ struct hstate *h;
+
+ err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+ if (err)
+ return err;
+
+ if (write) {
+ for_each_hstate (h) {
+ int tmp;
+
+ h->max_huge_pages = set_max_huge_pages(h,
+ max_huge_pages[h - hstates], &tmp);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
+ if (tmp && !err)
+ err = tmp;
+ }
+ }
+ return err;
}
int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
@@ -729,10 +754,14 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h;
+ int i = 0;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ for_each_hstate (h) {
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
+ i++;
+ }
spin_unlock(&hugetlb_lock);
return 0;
}
* [patch 09/18] hugetlb: abstract numa round robin selection
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (7 preceding siblings ...)
2008-04-23 1:53 ` [patch 08/18] hugetlb: multi hstate sysctls npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 1:53 ` [patch 10/18] mm: introduce non panic alloc_bootmem npiggin
` (9 subsequent siblings)
18 siblings, 0 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-abstract-numa-rr.patch --]
[-- Type: text/plain, Size: 2543 bytes --]
Need this as a separate function for a future patch.
No behaviour change.
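The factored-out helper just advances a cursor through the online node mask,
wrapping to the first online node at the end. A simplified, stateless model
using an array in place of a nodemask (next_online_node and the online[]
table are illustrative only):

```c
/* Simplified model of hstate_next_node(): find the next online node
 * after nid, wrapping around at the end of the node table. */
#define NR_NODES 4
static const int online[NR_NODES] = {0, 1, 1, 1};	/* node 0 offline */

int next_online_node(int nid)
{
	int i;

	for (i = nid + 1; i < NR_NODES; i++)
		if (online[i])
			return i;
	for (i = 0; i < NR_NODES; i++)		/* wrap around */
		if (online[i])
			return i;
	return -1;				/* no online nodes */
}
```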
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -231,6 +231,27 @@ static struct page *alloc_fresh_huge_pag
return page;
}
+/*
+ * Use a helper variable to find the next node and then
+ * copy it back to hugetlb_next_nid afterwards:
+ * otherwise there's a window in which a racer might
+ * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * But we don't need to use a spin_lock here: it really
+ * doesn't matter if occasionally a racer chooses the
+ * same nid as we do. Move nid forward in the mask even
+ * if we just successfully allocated a hugepage so that
+ * the next caller gets hugepages on the next node.
+ */
+static int hstate_next_node(struct hstate *h)
+{
+ int next_nid;
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = next_nid;
+ return next_nid;
+}
+
static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
@@ -244,21 +265,7 @@ static int alloc_fresh_huge_page(struct
page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
- /*
- * Use a helper variable to find the next node and then
- * copy it back to hugetlb_next_nid afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do. Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
- */
- next_nid = next_node(h->hugetlb_next_nid, node_online_map);
- if (next_nid == MAX_NUMNODES)
- next_nid = first_node(node_online_map);
- h->hugetlb_next_nid = next_nid;
+ next_nid = hstate_next_node(h);
} while (!page && h->hugetlb_next_nid != start_nid);
return ret;
* [patch 10/18] mm: introduce non panic alloc_bootmem
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (8 preceding siblings ...)
2008-04-23 1:53 ` [patch 09/18] hugetlb: abstract numa round robin selection npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 1:53 ` [patch 11/18] mm: export prep_compound_page to mm npiggin
` (8 subsequent siblings)
18 siblings, 0 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: __alloc_bootmem_node_nopanic.patch --]
[-- Type: text/plain, Size: 1816 bytes --]
Straightforward variant of the existing __alloc_bootmem_node; the only
difference is that it doesn't panic on failure.
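The pattern is: try the preferred node's data first, fall back to the generic
allocator, and return NULL only when both fail rather than panicking. A toy
model of that fallback with stub bump allocators (all names here are
hypothetical, not the bootmem API):

```c
#include <stddef.h>

static char node_pool[64];	/* stand-in for the node-local arena */
static char global_pool[64];	/* stand-in for the global arena */
static size_t node_used, global_used;

static void *node_alloc(size_t size)
{
	if (node_used + size > sizeof(node_pool))
		return NULL;
	void *p = node_pool + node_used;
	node_used += size;
	return p;
}

static void *global_alloc(size_t size)
{
	if (global_used + size > sizeof(global_pool))
		return NULL;
	void *p = global_pool + global_used;
	global_used += size;
	return p;
}

/* Try node-local first, then fall back; NULL instead of panic. */
void *alloc_node_nopanic(size_t size)
{
	void *p = node_alloc(size);
	if (p)
		return p;
	return global_alloc(size);
}
```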
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
include/linux/bootmem.h | 4 ++++
mm/bootmem.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -492,6 +492,18 @@ void * __init __alloc_bootmem_node(pg_da
return __alloc_bootmem(size, align, goal);
}
+void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
+ unsigned long align, unsigned long goal)
+{
+ void *ptr;
+
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+ if (ptr)
+ return ptr;
+
+ return __alloc_bootmem_nopanic(size, align, goal);
+}
+
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
Index: linux-2.6/include/linux/bootmem.h
===================================================================
--- linux-2.6.orig/include/linux/bootmem.h
+++ linux-2.6/include/linux/bootmem.h
@@ -90,6 +90,10 @@ extern void *__alloc_bootmem_node(pg_dat
unsigned long size,
unsigned long align,
unsigned long goal);
+extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
+ unsigned long size,
+ unsigned long align,
+ unsigned long goal);
extern unsigned long init_bootmem_node(pg_data_t *pgdat,
unsigned long freepfn,
unsigned long startpfn,
* [patch 11/18] mm: export prep_compound_page to mm
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (9 preceding siblings ...)
2008-04-23 1:53 ` [patch 10/18] mm: introduce non panic alloc_bootmem npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 16:12 ` Andrew Hastings
2008-04-23 1:53 ` [patch 12/18] hugetlbfs: support larger than MAX_ORDER npiggin
` (7 subsequent siblings)
18 siblings, 1 reply; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: mm-export-prep_compound_page.patch --]
[-- Type: text/plain, Size: 1415 bytes --]
hugetlb will need to get compound pages from bootmem to handle
the case of them being larger than MAX_ORDER. Export
the constructor function needed for this.
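What prep_compound_page() establishes, roughly, is a head/tail relationship
across the 1 << order constituent pages. A toy model with a reduced page
structure (toy_page and toy_prep_compound are illustrative names, and real
struct page encodes this with flags and compound fields, not plain members):

```c
struct toy_page {
	int is_head;			/* first page of the compound */
	struct toy_page *first_page;	/* tails point back at the head */
};

/* Mark pages[0] as head and link every tail back to it. */
void toy_prep_compound(struct toy_page *pages, unsigned long order)
{
	unsigned long i, nr = 1UL << order;

	pages[0].is_head = 1;
	pages[0].first_page = &pages[0];
	for (i = 1; i < nr; i++) {
		pages[i].is_head = 0;
		pages[i].first_page = &pages[0];
	}
}
```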
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/internal.h | 2 ++
mm/page_alloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -13,6 +13,8 @@
#include <linux/mm.h>
+extern void prep_compound_page(struct page *page, unsigned long order);
+
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -272,7 +272,7 @@ static void free_compound_page(struct pa
__free_pages_ok(page, compound_order(page));
}
-static void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;
* [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (10 preceding siblings ...)
2008-04-23 1:53 ` [patch 11/18] mm: export prep_compound_page to mm npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 16:15 ` Andrew Hastings
` (2 more replies)
2008-04-23 1:53 ` [patch 13/18] hugetlb: support boot allocate different sizes npiggin
` (6 subsequent siblings)
18 siblings, 3 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-unlimited-order.patch --]
[-- Type: text/plain, Size: 5086 bytes --]
This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
not practical to enlarge MAX_ORDER to 1GB.
Instead the 1GB pages are only allocated at boot using the bootmem
allocator using the hugepages=... option.
These 1G bootmem pages are never freed. In theory it would be possible
to implement that with some complications, but since it would be a one-way
street (>= MAX_ORDER pages cannot be allocated later) I decided against it
for now.
The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very big,
and the ifdef ugliness did not seem worth it.
Known problems: /proc/meminfo and "free" do not account the memory
allocated for GB pages in "Total". This is a little confusing for the
user.
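One invariant worth noting: a gigantic page handed out by the bootmem
allocator must be naturally aligned, which the patch checks with
addr & (size - 1) in a BUG_ON. A sketch of that power-of-two alignment test:

```c
/* addr is aligned to a power-of-two size iff the low bits are clear;
 * this is the same test the BUG_ON in alloc_bm_huge_page() performs. */
int is_huge_aligned(unsigned long addr, unsigned long size)
{
	return (addr & (size - 1)) == 0;
}
```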
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 72 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/bootmem.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -160,7 +161,7 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (h->surplus_huge_pages_node[nid]) {
+ if (h->surplus_huge_pages_node[nid] && h->order < MAX_ORDER) {
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -222,6 +223,9 @@ static struct page *alloc_fresh_huge_pag
{
struct page *page;
+ if (h->order >= MAX_ORDER)
+ return NULL;
+
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
@@ -278,6 +282,9 @@ static struct page *alloc_buddy_huge_pag
struct page *page;
unsigned int nid;
+ if (h->order >= MAX_ORDER)
+ return NULL;
+
/*
* Assume we will successfully allocate the surplus page to
* prevent racing processes from causing the surplus to exceed
@@ -444,6 +451,10 @@ static void return_unused_surplus_pages(
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
+ /* Cannot return gigantic pages currently */
+ if (h->order >= MAX_ORDER)
+ return;
+
nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
while (remaining_iterations-- && nr_pages) {
@@ -522,6 +533,51 @@ static struct page *alloc_huge_page(stru
return page;
}
+static __initdata LIST_HEAD(huge_boot_pages);
+
+struct huge_bm_page {
+ struct list_head list;
+ struct hstate *hstate;
+};
+
+static int __init alloc_bm_huge_page(struct hstate *h)
+{
+ struct huge_bm_page *m;
+ int nr_nodes = nodes_weight(node_online_map);
+
+ while (nr_nodes) {
+ m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
+ huge_page_size(h), huge_page_size(h),
+ 0);
+ if (m)
+ goto found;
+ hstate_next_node(h);
+ nr_nodes--;
+ }
+ return 0;
+
+found:
+ BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+ /* Put them into a private list first because mem_map is not up yet */
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ return 1;
+}
+
+/* Put bootmem huge pages into the standard lists after mem_map is up */
+static void __init gather_bootmem_prealloc(void)
+{
+ struct huge_bm_page *m;
+ list_for_each_entry (m, &huge_boot_pages, list) {
+ struct page *page = virt_to_page(m);
+ struct hstate *h = m->hstate;
+ __ClearPageReserved(page);
+ WARN_ON(page_count(page) != 1);
+ prep_compound_page(page, h->order);
+ prep_new_huge_page(h, page);
+ }
+}
+
static void __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
@@ -532,7 +588,10 @@ static void __init hugetlb_init_hstate(s
h->hugetlb_next_nid = first_node(node_online_map);
for (i = 0; i < h->max_huge_pages; ++i) {
- if (!alloc_fresh_huge_page(h))
+ if (h->order >= MAX_ORDER) {
+ if (!alloc_bm_huge_page(h))
+ break;
+ } else if (!alloc_fresh_huge_page(h))
break;
}
h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
@@ -569,6 +628,8 @@ static int __init hugetlb_init(void)
hugetlb_init_hstates();
+ gather_bootmem_prealloc();
+
report_hugepages();
return 0;
@@ -625,6 +686,9 @@ static void try_to_free_low(struct hstat
{
int i;
+ if (h->order >= MAX_ORDER)
+ return;
+
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
@@ -654,6 +718,12 @@ set_max_huge_pages(struct hstate *h, uns
*err = 0;
+ if (h->order >= MAX_ORDER) {
+ if (count != h->max_huge_pages)
+ *err = -EINVAL;
+ return h->max_huge_pages;
+ }
+
/*
* Increase the pool size
* First take pages out of surplus state. Then make up the
* [patch 13/18] hugetlb: support boot allocate different sizes
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (11 preceding siblings ...)
2008-04-23 1:53 ` [patch 12/18] hugetlbfs: support larger than MAX_ORDER npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 16:15 ` Andrew Hastings
2008-04-25 18:40 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 14/18] hugetlb: printk cleanup npiggin
` (5 subsequent siblings)
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-different-page-sizes.patch --]
[-- Type: text/plain, Size: 2133 bytes --]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -582,10 +582,13 @@ static void __init hugetlb_init_hstate(s
{
unsigned long i;
- for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ /* Don't reinitialize lists if they have been already init'ed */
+ if (!h->hugepage_freelists[0].next) {
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- h->hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
+ }
for (i = 0; i < h->max_huge_pages; ++i) {
if (h->order >= MAX_ORDER) {
@@ -594,7 +597,7 @@ static void __init hugetlb_init_hstate(s
} else if (!alloc_fresh_huge_page(h))
break;
}
- h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ h->max_huge_pages = i;
}
static void __init hugetlb_init_hstates(void)
@@ -602,7 +605,10 @@ static void __init hugetlb_init_hstates(
struct hstate *h;
for_each_hstate(h) {
- hugetlb_init_hstate(h);
+ /* oversize hugepages were init'ed in early boot */
+ if (h->order < MAX_ORDER)
+ hugetlb_init_hstate(h);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
}
}
@@ -665,6 +671,14 @@ static int __init hugetlb_setup(char *s)
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+ /*
+ * Global state is always initialized later in hugetlb_init.
+ * But we need to allocate >= MAX_ORDER hstates here early to still
+ * use the bootmem allocator.
+ */
+ if (max_hstate > 0 && parsed_hstate->order >= MAX_ORDER)
+ hugetlb_init_hstate(parsed_hstate);
+
return 1;
}
__setup("hugepages=", hugetlb_setup);
* [patch 14/18] hugetlb: printk cleanup
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (12 preceding siblings ...)
2008-04-23 1:53 ` [patch 13/18] hugetlb: support boot allocate different sizes npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-27 3:32 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 15/18] hugetlb: introduce huge_pud npiggin
` (4 subsequent siblings)
18 siblings, 1 reply; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-printk-cleanup.patch --]
[-- Type: text/plain, Size: 1516 bytes --]
- Reword sentence to clarify meaning with multiple options
- Add support for using GB prefixes for the page size
- Add extra printk to delayed > MAX_ORDER allocation code
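For trying the suffix selection outside the kernel, here is a userspace copy
of the memfmt() helper this patch adds: sprintf into a caller-supplied
buffer with the largest fitting KB/MB/GB unit.

```c
#include <stdio.h>

/* Userspace copy of the kernel's memfmt() from this patch:
 * format n bytes with the largest unit that fits. */
static char *memfmt(char *buf, unsigned long n)
{
	if (n >= (1UL << 30))
		sprintf(buf, "%lu GB", n >> 30);
	else if (n >= (1UL << 20))
		sprintf(buf, "%lu MB", n >> 20);
	else
		sprintf(buf, "%lu KB", n >> 10);
	return buf;
}
```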
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/hugetlb.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -612,15 +612,28 @@ static void __init hugetlb_init_hstates(
}
}
+static __init char *memfmt(char *buf, unsigned long n)
+{
+ if (n >= (1UL << 30))
+ sprintf(buf, "%lu GB", n >> 30);
+ else if (n >= (1UL << 20))
+ sprintf(buf, "%lu MB", n >> 20);
+ else
+ sprintf(buf, "%lu KB", n >> 10);
+ return buf;
+}
+
static void __init report_hugepages(void)
{
struct hstate *h;
for_each_hstate(h) {
- printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
- h->free_huge_pages,
- 1 << (h->order + PAGE_SHIFT - 20));
- }
+ char buf[32];
+ printk(KERN_INFO "HugeTLB registered %s page size, "
+ "pre-allocated %ld pages\n",
+ memfmt(buf, huge_page_size(h)),
+ h->free_huge_pages);
+ }
}
static int __init hugetlb_init(void)
* [patch 15/18] hugetlb: introduce huge_pud
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (13 preceding siblings ...)
2008-04-23 1:53 ` [patch 14/18] hugetlb: printk cleanup npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 1:53 ` [patch 16/18] x86: support GB hugepages on 64-bit npiggin
` (3 subsequent siblings)
18 siblings, 0 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlbfs-huge_pud.patch --]
[-- Type: text/plain, Size: 6242 bytes --]
Straightforward extensions for huge pages located at the PUD level
instead of the PMD level.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/ia64/mm/hugetlbpage.c | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 5 +++++
arch/sh/mm/hugetlbpage.c | 5 +++++
arch/sparc64/mm/hugetlbpage.c | 5 +++++
arch/x86/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 10 +++++++++-
8 files changed, 68 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -45,7 +45,10 @@ struct page *follow_huge_addr(struct mm_
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
+struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
+int pud_huge(pud_t pud);
void hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
@@ -112,8 +115,10 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_meminfo(buf) 0
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
+#define follow_huge_pud(mm, addr, pud, write) NULL
#define prepare_hugepage_range(addr,len) (-EINVAL)
#define pmd_huge(x) 0
+#define pud_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -106,6 +106,12 @@ int pmd_huge(pmd_t pmd)
{
return 0;
}
+
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
{
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -368,6 +368,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
+++ linux-2.6/arch/sh/mm/hugetlbpage.c
@@ -78,6 +78,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -294,6 +294,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -188,6 +188,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -208,6 +213,11 @@ int pmd_huge(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PSE);
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -216,9 +226,22 @@ follow_huge_pmd(struct mm_struct *mm, un
page = pte_page(*(pte_t *)pmd);
if (page)
- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+ page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
return page;
}
+
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ struct page *page;
+
+ page = pte_page(*(pte_t *)pud);
+ if (page)
+ page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
+ return page;
+}
+
#endif
/* x86_64 also uses this file */
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -1236,6 +1236,15 @@ int hugetlb_fault(struct mm_struct *mm,
return ret;
}
+/* Can be overridden by architectures */
+__attribute__((weak)) struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ BUG();
+ return NULL;
+}
+
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -945,7 +945,13 @@ struct page *follow_page(struct vm_area_
pud = pud_offset(pgd, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto no_page_table;
-
+
+ if (pud_huge(*pud)) {
+ BUG_ON(flags & FOLL_GET);
+ page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
+ goto out;
+ }
+
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1436,6 +1442,8 @@ static int apply_to_pmd_range(struct mm_
unsigned long next;
int err;
+ BUG_ON(pud_huge(*pud));
+
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
return -ENOMEM;
* [patch 16/18] x86: support GB hugepages on 64-bit
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (14 preceding siblings ...)
2008-04-23 1:53 ` [patch 15/18] hugetlb: introduce huge_pud npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 1:53 ` [patch 17/18] x86: add hugepagesz option " npiggin
` (2 subsequent siblings)
18 siblings, 0 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: x86-support-GB-hugetlb-pages.patch --]
[-- Type: text/plain, Size: 3897 bytes --]
---
arch/x86/mm/hugetlbpage.c | 33 ++++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -133,9 +133,14 @@ pte_t *huge_pte_alloc(struct mm_struct *
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (pud) {
- if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ if (sz == PUD_SIZE) {
+ pte = (pte_t *)pud;
+ } else {
+ BUG_ON(sz != PMD_SIZE);
+ if (pud_none(*pud))
+ huge_pmd_share(mm, addr, pud);
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ }
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
@@ -151,8 +156,11 @@ pte_t *huge_pte_offset(struct mm_struct
pgd = pgd_offset(mm, addr);
if (pgd_present(*pgd)) {
pud = pud_offset(pgd, addr);
- if (pud_present(*pud))
+ if (pud_present(*pud)) {
+ if (pud_large(*pud))
+ return (pte_t *)pud;
pmd = pmd_offset(pud, addr);
+ }
}
return (pte_t *) pmd;
}
@@ -215,7 +223,7 @@ int pmd_huge(pmd_t pmd)
int pud_huge(pud_t pud)
{
- return 0;
+ return !!(pud_val(pud) & _PAGE_PSE);
}
struct page *
@@ -251,6 +259,7 @@ static unsigned long hugetlb_get_unmappe
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
@@ -263,7 +272,7 @@ static unsigned long hugetlb_get_unmappe
}
full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -285,7 +294,7 @@ full_search:
}
if (addr + mm->cached_hole_size < vma->vm_start)
mm->cached_hole_size = vma->vm_start - addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
@@ -293,6 +302,7 @@ static unsigned long hugetlb_get_unmappe
unsigned long addr0, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev_vma;
unsigned long base = mm->mmap_base, addr = addr0;
@@ -313,7 +323,7 @@ try_again:
goto fail;
/* either no address requested or cant fit in requested address hole */
- addr = (mm->free_area_cache - len) & HPAGE_MASK;
+ addr = (mm->free_area_cache - len) & huge_page_mask(h);
do {
/*
* Lookup failure means no vma is above this address,
@@ -344,7 +354,7 @@ try_again:
largest_hole = vma->vm_start - addr;
/* try just below the current vma->vm_start */
- addr = (vma->vm_start - len) & HPAGE_MASK;
+ addr = (vma->vm_start - len) & huge_page_mask(h);
} while (len <= vma->vm_start);
fail:
@@ -382,10 +392,11 @@ unsigned long
hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -397,7 +408,7 @@ hugetlb_get_unmapped_area(struct file *f
}
if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
* [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (15 preceding siblings ...)
2008-04-23 1:53 ` [patch 16/18] x86: support GB hugepages on 64-bit npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-30 19:34 ` Nishanth Aravamudan
2008-04-30 20:48 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 18/18] hugetlb: my fixes 2 npiggin
2008-04-23 8:05 ` [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Andi Kleen
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: x86-64-implement-hugepagesz.patch --]
[-- Type: text/plain, Size: 3025 bytes --]
Add a hugepagesz=... option, similar to IA64, PPC etc., to x86-64.
This finally allows selecting GB pages for hugetlbfs on x86 now
that all the infrastructure is in place.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Documentation/kernel-parameters.txt | 11 +++++++++--
arch/x86/mm/hugetlbpage.c | 17 +++++++++++++++++
include/asm-x86/page.h | 2 ++
3 files changed, 28 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -424,3 +424,20 @@ hugetlb_get_unmapped_area(struct file *f
#endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
+#ifdef CONFIG_X86_64
+static __init int setup_hugepagesz(char *opt)
+{
+ unsigned long ps = memparse(opt, &opt);
+ if (ps == PMD_SIZE) {
+ huge_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+ } else if (ps == PUD_SIZE && cpu_has_gbpages) {
+ huge_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+ } else {
+ printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
+ ps >> 20);
+ return 0;
+ }
+ return 1;
+}
+__setup("hugepagesz=", setup_hugepagesz);
+#endif
Index: linux-2.6/include/asm-x86/page.h
===================================================================
--- linux-2.6.orig/include/asm-x86/page.h
+++ linux-2.6/include/asm-x86/page.h
@@ -21,6 +21,8 @@
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE 2
+
/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -728,8 +728,15 @@ and is between 256 and 4096 characters.
hisax= [HW,ISDN]
See Documentation/isdn/README.HiSax.
- hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages.
- hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages.
+ hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+ On x86 this option can be specified multiple times
+ interleaved with hugepages= to reserve huge pages
+ of different sizes. Valid page sizes on x86-64
+ are 2M (when the CPU supports "pse") and 1G (when the
+ CPU supports the "pdpe1gb" cpuinfo flag).
+ Note that 1GB pages can only be allocated at boot time
+ using hugepages= and not freed afterwards.
i8042.direct [HW] Put keyboard port into non-translated mode
i8042.dumbkbd [HW] Pretend that controller can only read data from
* [patch 18/18] hugetlb: my fixes 2
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (16 preceding siblings ...)
2008-04-23 1:53 ` [patch 17/18] x86: add hugepagesz option " npiggin
@ 2008-04-23 1:53 ` npiggin
2008-04-23 10:48 ` Andi Kleen
2008-04-23 15:20 ` Jon Tollefson
2008-04-23 8:05 ` [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Andi Kleen
18 siblings, 2 replies; 123+ messages in thread
From: npiggin @ 2008-04-23 1:53 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, andi, kniht, nacc, abh, wli
[-- Attachment #1: hugetlb-fixes2.patch --]
[-- Type: text/plain, Size: 6869 bytes --]
Here is my next set of fixes and changes:
- Allow configurations without the default HPAGE_SIZE size (mainly useful
for testing but maybe it is the right way to go).
- Fixed another case where mappings would be set up on incorrect boundaries
because prepare_hugepage_range was not hpage-ified.
- Changed the sysctl table behaviour so it only displays as many values in
the vector as there are hstates configured.
- Fixed oops in overcommit sysctl handler
This fixes several oopses seen on the libhugetlbfs test suite. Now it seems to
pass most of them and fail reasonably on others (e.g. most 32-bit tests fail
due to being unable to map enough virtual memory, others due to not enough
hugepages given that I only have 2).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
---
arch/x86/mm/hugetlbpage.c | 4 ++--
fs/hugetlbfs/inode.c | 4 +++-
include/linux/hugetlb.h | 19 ++-----------------
kernel/sysctl.c | 2 ++
mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++-----
5 files changed, 39 insertions(+), 25 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
return 1;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
pud_t *pud;
@@ -402,7 +402,7 @@ hugetlb_get_unmapped_area(struct file *f
return -ENOMEM;
if (flags & MAP_FIXED) {
- if (prepare_hugepage_range(addr, len))
+ if (prepare_hugepage_range(file, addr, len))
return -EINVAL;
return addr;
}
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -640,7 +640,7 @@ static int __init hugetlb_init(void)
{
BUILD_BUG_ON(HPAGE_SHIFT == 0);
- if (!size_to_hstate(HPAGE_SIZE)) {
+ if (!max_hstate) {
huge_add_hstate(HUGETLB_PAGE_ORDER);
parsed_hstate->max_huge_pages = default_hstate_resv;
}
@@ -821,9 +821,10 @@ int hugetlb_sysctl_handler(struct ctl_ta
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- int err = 0;
+ int err;
struct hstate *h;
+ table->maxlen = max_hstate * sizeof(unsigned long);
err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
if (err)
return err;
@@ -846,6 +847,7 @@ int hugetlb_treat_movable_handler(struct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ table->maxlen = max_hstate * sizeof(int);
proc_dointvec(table, write, file, buffer, length, ppos);
if (hugepages_treat_as_movable)
htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
@@ -858,15 +860,22 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ int err;
struct hstate *h;
- int i = 0;
- proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+
+ table->maxlen = max_hstate * sizeof(unsigned long);
+ err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+ if (err)
+ return err;
+
spin_lock(&hugetlb_lock);
for_each_hstate (h) {
- h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
+ h->nr_overcommit_huge_pages =
+ sysctl_overcommit_huge_pages[h - hstates];
i++;
}
spin_unlock(&hugetlb_lock);
+
return 0;
}
@@ -1015,6 +1024,22 @@ nomem:
return -ENOMEM;
}
+#ifndef ARCH_HAS_PREPARE_HUGEPAGE_RANGE
+/*
+ * If the arch doesn't supply something else, assume that hugepage
+ * size aligned regions are ok without further preparation.
+ */
+int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len)
+{
+ struct hstate *h = hstate_file(file);
+ if (len & ~huge_page_mask(h))
+ return -EINVAL;
+ if (addr & ~huge_page_mask(h))
+ return -EINVAL;
+ return 0;
+}
+#endif
+
void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)
{
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -141,7 +141,7 @@ hugetlb_get_unmapped_area(struct file *f
return -ENOMEM;
if (flags & MAP_FIXED) {
- if (prepare_hugepage_range(addr, len))
+ if (prepare_hugepage_range(file, addr, len))
return -EINVAL;
return addr;
}
@@ -858,6 +858,8 @@ hugetlbfs_fill_super(struct super_block
config.gid = current->fsgid;
config.mode = 0755;
config.hstate = size_to_hstate(HPAGE_SIZE);
+ if (!config.hstate)
+ config.hstate = &hstates[0];
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -64,22 +64,7 @@ void hugetlb_free_pgd_range(struct mmu_g
unsigned long ceiling);
#endif
-#ifndef ARCH_HAS_PREPARE_HUGEPAGE_RANGE
-/*
- * If the arch doesn't supply something else, assume that hugepage
- * size aligned regions are ok without further preparation.
- */
-static inline int prepare_hugepage_range(unsigned long addr, unsigned long len)
-{
- if (len & ~HPAGE_MASK)
- return -EINVAL;
- if (addr & ~HPAGE_MASK)
- return -EINVAL;
- return 0;
-}
-#else
-int prepare_hugepage_range(unsigned long addr, unsigned long len);
-#endif
+int prepare_hugepage_range(struct file *file, unsigned long addr, unsigned long len);
#ifndef ARCH_HAS_SETCLEAR_HUGE_PTE
#define set_huge_pte_at(mm, addr, ptep, pte) set_pte_at(mm, addr, ptep, pte)
@@ -116,7 +101,7 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
#define follow_huge_pud(mm, addr, pud, write) NULL
-#define prepare_hugepage_range(addr,len) (-EINVAL)
+#define prepare_hugepage_range(file,addr,len) (-EINVAL)
#define pmd_huge(x) 0
#define pud_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -953,6 +953,8 @@ static struct ctl_table vm_table[] = {
.maxlen = sizeof(sysctl_overcommit_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_overcommit_handler,
+ .extra1 = (void *)&hugetlb_zero,
+ .extra2 = (void *)&hugetlb_infinity,
},
#endif
{
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (17 preceding siblings ...)
2008-04-23 1:53 ` [patch 18/18] hugetlb: my fixes 2 npiggin
@ 2008-04-23 8:05 ` Andi Kleen
2008-04-23 15:34 ` Nick Piggin
2008-04-23 18:43 ` Nishanth Aravamudan
18 siblings, 2 replies; 123+ messages in thread
From: Andi Kleen @ 2008-04-23 8:05 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, kniht, nacc, abh, wli
> Testing-wise, I've changed the registration mechanism so that if you specify
> hugepagesz=1G on the command line, then you do not get the 2M pages by default
> (you have to also specify hugepagesz=2M). Also, when only one hstate is
> registered, all the proc outputs appear unchanged, so this makes it very easy
> to test with.
Are you sure that's a good idea? Just replacing the 2M count in meminfo
with 1G pages is not fully compatible proc ABI wise I think.
I think, rather, that applications that only know about 2M pages should
see "0" in this case and not be confused by larger pages. And only
applications that are multi-page-size aware should see the new page
sizes.
If you prefer it you could move all the new page sizes to sysfs
and only ever display the "legacy page size" in meminfo,
but frankly I personally prefer the quite simple and comparatively
efficient /proc/meminfo with multiple numbers interface.
-Andi
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 1:53 ` [patch 18/18] hugetlb: my fixes 2 npiggin
@ 2008-04-23 10:48 ` Andi Kleen
2008-04-23 15:36 ` Nick Piggin
2008-04-23 18:49 ` Nishanth Aravamudan
2008-04-23 15:20 ` Jon Tollefson
1 sibling, 2 replies; 123+ messages in thread
From: Andi Kleen @ 2008-04-23 10:48 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, kniht, nacc, abh, wli
npiggin@suse.de wrote:
Thanks for these fixes. The subject definitely needs improvement, or
rather all these fixes should be folded into the original patches.
> Here is my next set of fixes and changes:
> - Allow configurations without the default HPAGE_SIZE size (mainly useful
> for testing but maybe it is the right way to go).
I don't think it is the correct way. If you want to do it this way you
would need to special case it in /proc/meminfo to keep things compatible.
Also in general I would think that always keeping the old huge page size
around is a good idea. There is some chance at least to allocate 2MB
pages after boot (especially with the new movable zone and with lumpy
reclaim), so it doesn't need to be configured at boot time strictly. And
why take that option away from the user?
Also I would hope that distributions keep their existing /hugetlbfs
(if they have one) at the compat size for 100% compatibility to existing
applications.
-Andi
* Re: [patch 01/18] hugetlb: fix lockdep spew
2008-04-23 1:53 ` [patch 01/18] hugetlb: fix lockdep spew npiggin
@ 2008-04-23 13:06 ` KOSAKI Motohiro
0 siblings, 0 replies; 123+ messages in thread
From: KOSAKI Motohiro @ 2008-04-23 13:06 UTC (permalink / raw)
To: npiggin; +Cc: kosaki.motohiro, akpm, linux-mm, andi, kniht, nacc, abh, wli
Hi
>
> spin_lock(&dst->page_table_lock);
> - spin_lock(&src->page_table_lock);
> + spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
> if (!pte_none(*src_pte)) {
> if (cow)
> ptep_set_wrprotect(src, addr, src_pte);
>
Good improvement :)
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 1:53 ` [patch 18/18] hugetlb: my fixes 2 npiggin
2008-04-23 10:48 ` Andi Kleen
@ 2008-04-23 15:20 ` Jon Tollefson
2008-04-23 15:44 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Jon Tollefson @ 2008-04-23 15:20 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, nacc, abh, wli
On Wed, 2008-04-23 at 11:53 +1000, npiggin@suse.de wrote:
> plain text document attachment (hugetlb-fixes2.patch)
> Here is my next set of fixes and changes:
> - Allow configurations without the default HPAGE_SIZE size (mainly useful
> for testing but maybe it is the right way to go).
> - Fixed another case where mappings would be set up on incorrect boundaries
> because prepare_hugepage_range was not hpage-ified.
> - Changed the sysctl table behaviour so it only displays as many values in
> the vector as there are hstates configured.
> - Fixed oops in overcommit sysctl handler
>
> This fixes several oopses seen on the libhugetlbfs test suite. Now it seems to
> pass most of them and fails reasonably on others (eg. most 32-bit tests fail
> due to being unable to map enough virtual memory, others due to not enough
> hugepages given that I only have 2).
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> ---
> arch/x86/mm/hugetlbpage.c | 4 ++--
> fs/hugetlbfs/inode.c | 4 +++-
> include/linux/hugetlb.h | 19 ++-----------------
> kernel/sysctl.c | 2 ++
> mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++-----
> 5 files changed, 39 insertions(+), 25 deletions(-)
>
> Index: linux-2.6/arch/x86/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
> +++ linux-2.6/arch/x86/mm/hugetlbpage.c
> @@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
> return 1;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> @@ -402,7 +402,7 @@ hugetlb_get_unmapped_area(struct file *f
> return -ENOMEM;
>
> if (flags & MAP_FIXED) {
> - if (prepare_hugepage_range(addr, len))
> + if (prepare_hugepage_range(file, addr, len))
> return -EINVAL;
> return addr;
> }
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -640,7 +640,7 @@ static int __init hugetlb_init(void)
> {
> BUILD_BUG_ON(HPAGE_SHIFT == 0);
>
> - if (!size_to_hstate(HPAGE_SIZE)) {
> + if (!max_hstate) {
> huge_add_hstate(HUGETLB_PAGE_ORDER);
> parsed_hstate->max_huge_pages = default_hstate_resv;
> }
> @@ -821,9 +821,10 @@ int hugetlb_sysctl_handler(struct ctl_ta
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> - int err = 0;
> + int err;
> struct hstate *h;
>
> + table->maxlen = max_hstate * sizeof(unsigned long);
> err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> if (err)
> return err;
> @@ -846,6 +847,7 @@ int hugetlb_treat_movable_handler(struct
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> + table->maxlen = max_hstate * sizeof(int);
> proc_dointvec(table, write, file, buffer, length, ppos);
> if (hugepages_treat_as_movable)
> htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
> @@ -858,15 +860,22 @@ int hugetlb_overcommit_handler(struct ct
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> + int err;
> struct hstate *h;
> - int i = 0;
> - proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> +
> + table->maxlen = max_hstate * sizeof(unsigned long);
> + err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> + if (err)
> + return err;
> +
> spin_lock(&hugetlb_lock);
> for_each_hstate (h) {
> - h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
> + h->nr_overcommit_huge_pages =
> + sysctl_overcommit_huge_pages[h - hstates];
> i++;
The increment of i can be removed since it is no longer used or defined.
<snip>
Jon Tollefson
* Re: [patch 04/18] hugetlb: modular state
2008-04-23 1:53 ` [patch 04/18] hugetlb: modular state npiggin
@ 2008-04-23 15:21 ` Jon Tollefson
2008-04-23 15:38 ` Nick Piggin
2008-04-25 17:13 ` Nishanth Aravamudan
1 sibling, 1 reply; 123+ messages in thread
From: Jon Tollefson @ 2008-04-23 15:21 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, nacc, abh, wli
On Wed, 2008-04-23 at 11:53 +1000, npiggin@suse.de wrote:
<snip>
> Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
> +++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
> @@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
> return NULL;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
The sz has to be an unsigned long to match the definition in the header.
The same is true for the other architectures too.
Jon Tollefson
> {
> pgd_t *pg;
> pud_t *pu;
> Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
> +++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
> @@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
> pgoff, flags);
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux-2.6/arch/sh/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
> +++ linux-2.6/arch/sh/mm/hugetlbpage.c
> @@ -22,7 +22,7 @@
> #include <asm/tlbflush.h>
> #include <asm/cacheflush.h>
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
> +++ linux-2.6/arch/ia64/mm/hugetlbpage.c
> @@ -24,7 +24,7 @@
> unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
>
> pte_t *
> -huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
> +huge_pte_alloc (struct mm_struct *mm, unsigned long addr, int sz)
> {
> unsigned long taddr = htlbpage_to_page(addr);
> pgd_t *pgd;
> Index: linux-2.6/arch/x86/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
> +++ linux-2.6/arch/x86/mm/hugetlbpage.c
> @@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
> return 1;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux-2.6/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/linux/hugetlb.h
> +++ linux-2.6/include/linux/hugetlb.h
> @@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
>
> /* arch callbacks */
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz);
<snip>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 8:05 ` [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Andi Kleen
@ 2008-04-23 15:34 ` Nick Piggin
2008-04-23 15:46 ` Andi Kleen
2008-04-23 18:43 ` Nishanth Aravamudan
1 sibling, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-04-23 15:34 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-mm, kniht, nacc, abh, wli
On Wed, Apr 23, 2008 at 10:05:45AM +0200, Andi Kleen wrote:
>
> > Testing-wise, I've changed the registration mechanism so that if you specify
> > hugepagesz=1G on the command line, then you do not get the 2M pages by default
> > (you have to also specify hugepagesz=2M). Also, when only one hstate is
> > registered, all the proc outputs appear unchanged, so this makes it very easy
> > to test with.
>
> Are you sure that's a good idea? Just replacing the 2M count in meminfo
> with 1G pages is not fully compatible proc ABI wise I think.
Not sure that it is a good idea, but it did allow the test suite to pass
more tests ;)
What the best option is for backwards compatibility, I don't know. I
think this approach would give things a better chance of actually
working with 1G hugepages and old userspace, but it probably also
increases the chances of funny bugs.
> I think rather that applications who only know about 2M pages should
> see "0" in this case and not be confused by larger pages. And only
> applications who are multi page size aware should see the new page
> sizes.
>
> If you prefer it you could move all the new page sizes to sysfs
> and only ever display the "legacy page size" in meminfo,
> but frankly I personally prefer the quite simple and comparatively
> efficient /proc/meminfo with multiple numbers interface.
Well, I've changed it so it just has single numbers if a single hstate
is registered: that way we're completely backwards compatible in the
case of only using 2M pages.
But I think your multiple-hstates-in-/proc/meminfo idea isn't too bad
given the bad situation. Maybe just adding more meminfo lines would
be better, though?
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 10:48 ` Andi Kleen
@ 2008-04-23 15:36 ` Nick Piggin
2008-04-23 18:49 ` Nishanth Aravamudan
1 sibling, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-04-23 15:36 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-mm, kniht, nacc, abh, wli
On Wed, Apr 23, 2008 at 12:48:21PM +0200, Andi Kleen wrote:
> npiggin@suse.de wrote:
>
> Thanks for these fixes. The subject definitely needs improvement, or
> rather all these fixes should be folded into the original patches.
Yes, that's what I intend. I just have the broken-out patch at the end
so it is easy to review. Afterwards I will fold it into your patches.
> > Here is my next set of fixes and changes:
> > - Allow configurations without the default HPAGE_SIZE size (mainly useful
> > for testing but maybe it is the right way to go).
>
> I don't think it is the correct way. If you want to do it this way you
> would need to special case it in /proc/meminfo to keep things compatible.
>
> Also in general I would think that always keeping the old huge page size
> around is a good idea. There is some chance at least to allocate 2MB
> pages after boot (especially with the new movable zone and with lumpy
> reclaim), so it doesn't need to be configured at boot time strictly. And
> why take that option away from the user?
>
> Also I would hope that distributions keep their existing /hugetlbfs
> (if they have one) at the compat size for 100% compatibility to existing
> applications.
You are probably right on all counts here. I did intend to stress
that it was mainly for my ease of testing and I don't know so
much about the userspace aspect of it.
* Re: [patch 04/18] hugetlb: modular state
2008-04-23 15:21 ` Jon Tollefson
@ 2008-04-23 15:38 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-04-23 15:38 UTC (permalink / raw)
To: Jon Tollefson; +Cc: akpm, linux-mm, andi, nacc, abh, wli
On Wed, Apr 23, 2008 at 10:21:38AM -0500, Jon Tollefson wrote:
>
> On Wed, 2008-04-23 at 11:53 +1000, npiggin@suse.de wrote:
>
> <snip>
>
> > Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
> > ===================================================================
> > --- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
> > +++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
> > @@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
> > return NULL;
> > }
> >
> > -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> > +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
>
> The sz has to be an unsigned long to match the definition in the header.
> The same is true for the other architectures too.
Ah, sorry I forgot to do an arch sweep after the change :P
Thanks for picking that up
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 15:20 ` Jon Tollefson
@ 2008-04-23 15:44 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-04-23 15:44 UTC (permalink / raw)
To: Jon Tollefson; +Cc: akpm, linux-mm, andi, nacc, abh, wli
On Wed, Apr 23, 2008 at 10:20:53AM -0500, Jon Tollefson wrote:
>
> On Wed, 2008-04-23 at 11:53 +1000, npiggin@suse.de wrote:
> > plain text document attachment (hugetlb-fixes2.patch)
> > Here is my next set of fixes and changes:
> > - Allow configurations without the default HPAGE_SIZE size (mainly useful
> > for testing but maybe it is the right way to go).
> > - Fixed another case where mappings would be set up on incorrect boundaries
> > because prepare_hugepage_range was not hpage-ified.
> > - Changed the sysctl table behaviour so it only displays as many values in
> > the vector as there are hstates configured.
> > - Fixed oops in overcommit sysctl handler
> >
> > This fixes several oopses seen on the libhugetlbfs test suite. Now it seems to
> > pass most of them and fails reasonably on others (eg. most 32-bit tests fail
> > due to being unable to map enough virtual memory, others due to not enough
> > hugepages given that I only have 2).
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > ---
> > arch/x86/mm/hugetlbpage.c | 4 ++--
> > fs/hugetlbfs/inode.c | 4 +++-
> > include/linux/hugetlb.h | 19 ++-----------------
> > kernel/sysctl.c | 2 ++
> > mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++-----
> > 5 files changed, 39 insertions(+), 25 deletions(-)
> >
> > Index: linux-2.6/arch/x86/mm/hugetlbpage.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
> > +++ linux-2.6/arch/x86/mm/hugetlbpage.c
> > @@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
> > return 1;
> > }
> >
> > -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> > +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> > {
> > pgd_t *pgd;
> > pud_t *pud;
> > @@ -402,7 +402,7 @@ hugetlb_get_unmapped_area(struct file *f
> > return -ENOMEM;
> >
> > if (flags & MAP_FIXED) {
> > - if (prepare_hugepage_range(addr, len))
> > + if (prepare_hugepage_range(file, addr, len))
> > return -EINVAL;
> > return addr;
> > }
> > Index: linux-2.6/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.orig/mm/hugetlb.c
> > +++ linux-2.6/mm/hugetlb.c
> > @@ -640,7 +640,7 @@ static int __init hugetlb_init(void)
> > {
> > BUILD_BUG_ON(HPAGE_SHIFT == 0);
> >
> > - if (!size_to_hstate(HPAGE_SIZE)) {
> > + if (!max_hstate) {
> > huge_add_hstate(HUGETLB_PAGE_ORDER);
> > parsed_hstate->max_huge_pages = default_hstate_resv;
> > }
> > @@ -821,9 +821,10 @@ int hugetlb_sysctl_handler(struct ctl_ta
> > struct file *file, void __user *buffer,
> > size_t *length, loff_t *ppos)
> > {
> > - int err = 0;
> > + int err;
> > struct hstate *h;
> >
> > + table->maxlen = max_hstate * sizeof(unsigned long);
> > err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > if (err)
> > return err;
> > @@ -846,6 +847,7 @@ int hugetlb_treat_movable_handler(struct
> > struct file *file, void __user *buffer,
> > size_t *length, loff_t *ppos)
> > {
> > + table->maxlen = max_hstate * sizeof(int);
> > proc_dointvec(table, write, file, buffer, length, ppos);
> > if (hugepages_treat_as_movable)
> > htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
> > @@ -858,15 +860,22 @@ int hugetlb_overcommit_handler(struct ct
> > struct file *file, void __user *buffer,
> > size_t *length, loff_t *ppos)
> > {
> > + int err;
> > struct hstate *h;
> > - int i = 0;
> > - proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > +
> > + table->maxlen = max_hstate * sizeof(unsigned long);
> > + err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > + if (err)
> > + return err;
> > +
> > spin_lock(&hugetlb_lock);
> > for_each_hstate (h) {
> > - h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
> > + h->nr_overcommit_huge_pages =
> > + sysctl_overcommit_huge_pages[h - hstates];
> > i++;
>
> The increment of i can be removed since it is no longer used or defined.
Thanks... sorry for the poor quality of the patch. I honestly thought
I had compiled and run this one, but I must have made the compile
fix on my test box and not picked it up in my working tree. Otherwise
it looks good to go.
Thanks,
Nick
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 15:34 ` Nick Piggin
@ 2008-04-23 15:46 ` Andi Kleen
2008-04-23 15:53 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-23 15:46 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, akpm, linux-mm, kniht, nacc, abh, wli
On Wed, Apr 23, 2008 at 05:34:04PM +0200, Nick Piggin wrote:
> On Wed, Apr 23, 2008 at 10:05:45AM +0200, Andi Kleen wrote:
> >
> > > Testing-wise, I've changed the registration mechanism so that if you specify
> > > hugepagesz=1G on the command line, then you do not get the 2M pages by default
> > > (you have to also specify hugepagesz=2M). Also, when only one hstate is
> > > registered, all the proc outputs appear unchanged, so this makes it very easy
> > > to test with.
> >
> > Are you sure that's a good idea? Just replacing the 2M count in meminfo
> > with 1G pages is not fully compatible proc ABI wise I think.
>
> Not sure that it is a good idea, but it did allow the test suite to pass
> more tests ;)
Then the test suite is wrong. Really I expect programs that want
to use 1G pages to be adapted to it.
> What the best option is for backwards compatibility, I don't know. I
The first number always has to be the "legacy" size for compatibility.
I don't know why you don't know that; it really seems like an
obvious fact to me.
> think this approach would give things a better chance of actually
> working with 1G hugepags and old userspace, but it probably also
> increases the chances of funny bugs.
It's not fully compatible. And that is bad.
> > I think rather that applications who only know about 2M pages should
> > see "0" in this case and not be confused by larger pages. And only
> > applications who are multi page size aware should see the new page
> > sizes.
> >
> > If you prefer it you could move all the new page sizes to sysfs
> > and only ever display the "legacy page size" in meminfo,
> > but frankly I personally prefer the quite simple and comparatively
> > efficient /proc/meminfo with multiple numbers interface.
>
> Well I've chance it so it just has single numbers if a single hstate
> is registered: that way we're completely backwards compatible in the
> case of only using 2M pages.
That makes sense.
But please also undo the change to not have the legacy page size.
> But I think your multiple hstates in /proc/meminfo isn't too bad
> given the bad situation. Maybe just adding more meminfo lines would
> be better though?
Would also work for me, no particular opinion either way.
-Andi
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 15:46 ` Andi Kleen
@ 2008-04-23 15:53 ` Nick Piggin
2008-04-23 16:02 ` Andi Kleen
2008-04-23 18:52 ` Nishanth Aravamudan
0 siblings, 2 replies; 123+ messages in thread
From: Nick Piggin @ 2008-04-23 15:53 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-mm, kniht, nacc, abh, wli
On Wed, Apr 23, 2008 at 05:46:52PM +0200, Andi Kleen wrote:
> On Wed, Apr 23, 2008 at 05:34:04PM +0200, Nick Piggin wrote:
> > On Wed, Apr 23, 2008 at 10:05:45AM +0200, Andi Kleen wrote:
> > >
> > > > Testing-wise, I've changed the registration mechanism so that if you specify
> > > > hugepagesz=1G on the command line, then you do not get the 2M pages by default
> > > > (you have to also specify hugepagesz=2M). Also, when only one hstate is
> > > > registered, all the proc outputs appear unchanged, so this makes it very easy
> > > > to test with.
> > >
> > > Are you sure that's a good idea? Just replacing the 2M count in meminfo
> > > with 1G pages is not fully compatible proc ABI wise I think.
> >
> > Not sure that it is a good idea, but it did allow the test suite to pass
> > more tests ;)
>
> Then the test suite is wrong. Really I expect programs that want
> to use 1G pages to be adapted to it.
No, it can generally determine the size of the hugepages. It would
be more wrong (but probably more common) for portable code to assume
2MB hugepages.
> > What the best option is for backwards compatibility, I don't know. I
>
> The first number has to be always the "legacy" size for compatibility.
> I don't think know why you don't know that, it really seems like an
> obvious fact to me.
Obvious? When you want your legacy userspace to use 1G pages and don't
have any 2MB pages in the machine? In that case IMO there is no question
that my way is the most likely possibility. We have a hugepagesize
field there, so the assumption would be that it gets used.
If you want your legacy userspace to have 2MB hugepages, then you would
have a 2MB hstate and see the 2MB sizes there.
> > think this approach would give things a better chance of actually
> > working with 1G hugepags and old userspace, but it probably also
> > increases the chances of funny bugs.
>
> It's not fully compatible. And that is bad.
It is fully compatible because if you don't actually ask for any new
option then you don't get it. What you see will be exactly unchanged.
If you ask for _only_ 1G pages, then this new scheme is very likely to
work with well-written applications, whereas if you also print out the 2MB
legacy values first, then they have little to no chance of working.
Then if you want legacy apps to use 2MB pages, and new ones to use 1G,
then you ask for both and get the 2MB column printed in /proc/meminfo
(actually it can probably get printed 2nd if you ask for 2MB pages
after asking for 1G pages -- that is something I'll fix).
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 15:53 ` Nick Piggin
@ 2008-04-23 16:02 ` Andi Kleen
2008-04-23 16:02 ` Nick Piggin
2008-04-23 18:54 ` Nishanth Aravamudan
2008-04-23 18:52 ` Nishanth Aravamudan
1 sibling, 2 replies; 123+ messages in thread
From: Andi Kleen @ 2008-04-23 16:02 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, akpm, linux-mm, kniht, nacc, abh, wli
> No, it can generally determine the size of the hugepages. It would
> be more wrong (but probably more common) for portable code to assume
For compatibility we have to assume code does that.
> 2MB hugepages.
Well then it should just run with 2MB pages on a kernel where both
1G and 2M are configured. Does it not do that?
> If you want your legacy userspace to have 2MB hugepages, then you would
I think all legacy user space should only use 2MB huge pages.
-Andi
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 16:02 ` Andi Kleen
@ 2008-04-23 16:02 ` Nick Piggin
2008-04-23 18:54 ` Nishanth Aravamudan
1 sibling, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-04-23 16:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-mm, kniht, nacc, abh, wli
On Wed, Apr 23, 2008 at 06:02:10PM +0200, Andi Kleen wrote:
> > No, it can generally determine the size of the hugepages. It would
> > be more wrong (but probably more common) for portable code to assume
>
> For compatibility we have to assume code does that.
True, and that's definitely what it does by default. But the option is
there for people to ask for 1G pages as the primary size, in which case
well written and legacy applications will be able to use them. I assume
most important Java HPC and database codes will be because they are
multi platform and at the very least they would have to deal with 2MB
and 4MB hugepages for x86.
> > 2MB hugepages.
>
> Well then it should just run with 2MB pages on a kernel where both
> 1G and 2M are configured. Does it not do that?
Yes, if you ask for 1G and 2M it will run with them OK.
> > If you want your legacy userspace to have 2MB hugepages, then you would
>
> I think all legacy user space should only use 2MB huge pages.
Why? Are you wary of bugs coming up?
Anyway, I'm sure it is not a problem to just allow the opportunity to
have 1GB as primary page size. Of course it will be 2MB only by default.
* Re: [patch 11/18] mm: export prep_compound_page to mm
2008-04-23 1:53 ` [patch 11/18] mm: export prep_compound_page to mm npiggin
@ 2008-04-23 16:12 ` Andrew Hastings
2008-05-23 5:29 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Andrew Hastings @ 2008-04-23 16:12 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, nacc, wli
npiggin@suse.de wrote:
> hugetlb will need to get compound pages from bootmem to handle
> the case of them being larger than MAX_ORDER. Export
s/larger/greater than or equal to/
> the constructor function needed for this.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/internal.h | 2 ++
> mm/page_alloc.c | 2 +-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/internal.h
> ===================================================================
> --- linux-2.6.orig/mm/internal.h
> +++ linux-2.6/mm/internal.h
> @@ -13,6 +13,8 @@
>
> #include <linux/mm.h>
>
> +extern void prep_compound_page(struct page *page, unsigned long order);
> +
> static inline void set_page_count(struct page *page, int v)
> {
> atomic_set(&page->_count, v);
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -272,7 +272,7 @@ static void free_compound_page(struct pa
> __free_pages_ok(page, compound_order(page));
> }
>
> -static void prep_compound_page(struct page *page, unsigned long order)
> +void prep_compound_page(struct page *page, unsigned long order)
> {
> int i;
> int nr_pages = 1 << order;
>
-Andrew Hastings
Cray Inc.
* Re: [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-23 1:53 ` [patch 12/18] hugetlbfs: support larger than MAX_ORDER npiggin
@ 2008-04-23 16:15 ` Andrew Hastings
2008-04-23 16:25 ` Andi Kleen
2008-04-25 18:55 ` Nishanth Aravamudan
2008-04-30 21:01 ` Dave Hansen
2 siblings, 1 reply; 123+ messages in thread
From: Andrew Hastings @ 2008-04-23 16:15 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, nacc, wli
npiggin@suse.de wrote:
> This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
> not practical to enlarge MAX_ORDER to 1GB.
Sorry to ask what is probably a dumb question, but why is it not
practical to increase MAX_ORDER to 1GB for a 64-bit platform like
x86-64? Doing so would make 1GB pages much more practical to use.
> Instead the 1GB pages are only allocated at boot using the bootmem
> allocator using the hugepages=... option.
>
> These 1G bootmem pages are never freed. In theory it would be possible
> to implement that with some complications, but since it would be a one-way
> street (>= MAX_ORDER pages cannot be allocated later) I decided not to
> currently.
>
> > The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very big
> > and the ifdef ugliness seemed not to be worth it.
>
> Known problems: /proc/meminfo and "free" do not display the memory
> allocated for gb pages in "Total". This is a little confusing for the
> user.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 72 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -14,6 +14,7 @@
> #include <linux/mempolicy.h>
> #include <linux/cpuset.h>
> #include <linux/mutex.h>
> +#include <linux/bootmem.h>
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> @@ -160,7 +161,7 @@ static void free_huge_page(struct page *
> INIT_LIST_HEAD(&page->lru);
>
> spin_lock(&hugetlb_lock);
> - if (h->surplus_huge_pages_node[nid]) {
> + if (h->surplus_huge_pages_node[nid] && h->order < MAX_ORDER) {
> update_and_free_page(h, page);
> h->surplus_huge_pages--;
> h->surplus_huge_pages_node[nid]--;
> @@ -222,6 +223,9 @@ static struct page *alloc_fresh_huge_pag
> {
> struct page *page;
>
> + if (h->order >= MAX_ORDER)
> + return NULL;
> +
> page = alloc_pages_node(nid,
> htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> huge_page_order(h));
> @@ -278,6 +282,9 @@ static struct page *alloc_buddy_huge_pag
> struct page *page;
> unsigned int nid;
>
> + if (h->order >= MAX_ORDER)
> + return NULL;
> +
> /*
> * Assume we will successfully allocate the surplus page to
> * prevent racing processes from causing the surplus to exceed
> @@ -444,6 +451,10 @@ static void return_unused_surplus_pages(
> /* Uncommit the reservation */
> h->resv_huge_pages -= unused_resv_pages;
>
> + /* Cannot return gigantic pages currently */
> + if (h->order >= MAX_ORDER)
> + return;
> +
> nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
>
> while (remaining_iterations-- && nr_pages) {
> @@ -522,6 +533,51 @@ static struct page *alloc_huge_page(stru
> return page;
> }
>
> +static __initdata LIST_HEAD(huge_boot_pages);
> +
> +struct huge_bm_page {
> + struct list_head list;
> + struct hstate *hstate;
> +};
> +
> +static int __init alloc_bm_huge_page(struct hstate *h)
> +{
> + struct huge_bm_page *m;
> + int nr_nodes = nodes_weight(node_online_map);
> +
> + while (nr_nodes) {
> + m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
> + huge_page_size(h), huge_page_size(h),
> + 0);
> + if (m)
> + goto found;
> + hstate_next_node(h);
> + nr_nodes--;
> + }
> + return 0;
> +
> +found:
> + BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
> + /* Put them into a private list first because mem_map is not up yet */
> + list_add(&m->list, &huge_boot_pages);
> + m->hstate = h;
> + return 1;
> +}
> +
> +/* Put bootmem huge pages into the standard lists after mem_map is up */
> +static void __init gather_bootmem_prealloc(void)
> +{
> + struct huge_bm_page *m;
> + list_for_each_entry (m, &huge_boot_pages, list) {
> + struct page *page = virt_to_page(m);
> + struct hstate *h = m->hstate;
> + __ClearPageReserved(page);
> + WARN_ON(page_count(page) != 1);
> + prep_compound_page(page, h->order);
> + prep_new_huge_page(h, page);
> + }
> +}
> +
> static void __init hugetlb_init_hstate(struct hstate *h)
> {
> unsigned long i;
> @@ -532,7 +588,10 @@ static void __init hugetlb_init_hstate(s
> h->hugetlb_next_nid = first_node(node_online_map);
>
> for (i = 0; i < h->max_huge_pages; ++i) {
> - if (!alloc_fresh_huge_page(h))
> + if (h->order >= MAX_ORDER) {
> + if (!alloc_bm_huge_page(h))
> + break;
> + } else if (!alloc_fresh_huge_page(h))
> break;
> }
> h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> @@ -569,6 +628,8 @@ static int __init hugetlb_init(void)
>
> hugetlb_init_hstates();
>
> + gather_bootmem_prealloc();
> +
> report_hugepages();
>
> return 0;
> @@ -625,6 +686,9 @@ static void try_to_free_low(struct hstat
> {
> int i;
>
> + if (h->order >= MAX_ORDER)
> + return;
> +
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> struct list_head *freel = &h->hugepage_freelists[i];
> @@ -654,6 +718,12 @@ set_max_huge_pages(struct hstate *h, uns
>
> *err = 0;
>
> + if (h->order >= MAX_ORDER) {
> + if (count != h->max_huge_pages)
> + *err = -EINVAL;
> + return h->max_huge_pages;
> + }
> +
> /*
> * Increase the pool size
> * First take pages out of surplus state. Then make up the
>
Acked-by: Andrew Hastings <abh@cray.com>
-Andrew Hastings
Cray Inc.
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-04-23 1:53 ` [patch 13/18] hugetlb: support boot allocate different sizes npiggin
@ 2008-04-23 16:15 ` Andrew Hastings
2008-04-25 18:40 ` Nishanth Aravamudan
1 sibling, 0 replies; 123+ messages in thread
From: Andrew Hastings @ 2008-04-23 16:15 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, nacc, wli
npiggin@suse.de wrote:
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -582,10 +582,13 @@ static void __init hugetlb_init_hstate(s
> {
> unsigned long i;
>
> - for (i = 0; i < MAX_NUMNODES; ++i)
> - INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> + /* Don't reinitialize lists if they have been already init'ed */
> + if (!h->hugepage_freelists[0].next) {
> + for (i = 0; i < MAX_NUMNODES; ++i)
> + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>
> - h->hugetlb_next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = first_node(node_online_map);
> + }
>
> for (i = 0; i < h->max_huge_pages; ++i) {
> if (h->order >= MAX_ORDER) {
> @@ -594,7 +597,7 @@ static void __init hugetlb_init_hstate(s
> } else if (!alloc_fresh_huge_page(h))
> break;
> }
> - h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> + h->max_huge_pages = i;
> }
>
> static void __init hugetlb_init_hstates(void)
> @@ -602,7 +605,10 @@ static void __init hugetlb_init_hstates(
> struct hstate *h;
>
> for_each_hstate(h) {
> - hugetlb_init_hstate(h);
> + /* oversize hugepages were init'ed in early boot */
> + if (h->order < MAX_ORDER)
> + hugetlb_init_hstate(h);
> + max_huge_pages[h - hstates] = h->max_huge_pages;
> }
> }
>
> @@ -665,6 +671,14 @@ static int __init hugetlb_setup(char *s)
> if (sscanf(s, "%lu", mhp) <= 0)
> *mhp = 0;
>
> + /*
> + * Global state is always initialized later in hugetlb_init.
> + * But we need to allocate >= MAX_ORDER hstates here early to still
> + * use the bootmem allocator.
> + */
> + if (max_hstate > 0 && parsed_hstate->order >= MAX_ORDER)
> + hugetlb_init_hstate(parsed_hstate);
> +
> return 1;
> }
> __setup("hugepages=", hugetlb_setup);
>
Acked-by: Andrew Hastings <abh@cray.com>
-Andrew Hastings
Cray Inc.
* Re: [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-23 16:15 ` Andrew Hastings
@ 2008-04-23 16:25 ` Andi Kleen
0 siblings, 0 replies; 123+ messages in thread
From: Andi Kleen @ 2008-04-23 16:25 UTC (permalink / raw)
To: Andrew Hastings; +Cc: npiggin, akpm, linux-mm, andi, kniht, nacc, wli
On Wed, Apr 23, 2008 at 11:15:07AM -0500, Andrew Hastings wrote:
> npiggin@suse.de wrote:
> >This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
> >not practical to enlarge MAX_ORDER to 1GB.
>
> Sorry to ask what is probably a dumb question, but why is it not
> practical to increase MAX_ORDER to 1GB for a 64-bit platform like
> x86-64?
That would mean all zones would need to be 1GB aligned.
That would make it impossible to have a 16MB DMA zone
followed by the normal zone. That one is actually going
away with the mask allocator patchkit, but also the
movable zone is not necessarily aligned to 1GB.
The other issue is that it would increase the cache footprint
of the page allocator significantly, and that is very sensitive
in important benchmarks.
> Doing so would make 1GB pages much more practical to use.
It's very doubtful that even with an increased MAX_ORDER you would
be actually able to allocate GB pages efficiently after boot.
Even with all tricks like movable zone etc.
-Andi
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 8:05 ` [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Andi Kleen
2008-04-23 15:34 ` Nick Piggin
@ 2008-04-23 18:43 ` Nishanth Aravamudan
1 sibling, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-23 18:43 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 23.04.2008 [10:05:45 +0200], Andi Kleen wrote:
>
> > Testing-wise, I've changed the registration mechanism so that if you
> > specify hugepagesz=1G on the command line, then you do not get the
> > 2M pages by default (you have to also specify hugepagesz=2M). Also,
> > when only one hstate is registered, all the proc outputs appear
> > unchanged, so this makes it very easy to test with.
>
> Are you sure that's a good idea? Just replacing the 2M count in
> meminfo with 1G pages is not fully compatible proc ABI wise I think.
If this is the case, then providing hugepagesz at all seems absurd on
x86_64?
That is, hugepagesz = 1G implies hugepagesz = 2M must also be specified?
If you're going to require that, then why not just have hugepages= with
a strict ordering? e.g., hugepages=10 is 10 2M pages; hugepages=10,2 is
10 2M pages and 2 1G pages. Well, I guess you're future-proofing against
adding another hugepage size in-between. If we're going to require this,
I hope the patchset has huge printk()s announcing that functionality is
being disabled because the command line was not specified the right way.
And I'm not sure I buy the ABI argument? That implies that you can't
have differing hugepage sizes period, between boots, which we clearly
can on IA64, power, etc. Applications should be examining meminfo as is
for the underlying hugepagesize. Any app that hard-coded the size was
going to break eventually and was completely non-portable.
> I think rather that applications who only know about 2M pages should
> see "0" in this case and not be confused by larger pages. And only
> applications who are multi page size aware should see the new page
> sizes.
Applications could be using libhugetlbfs and not need to know about the
pages in particular (we also export a gethugepagesize() call, which will
need adjustment for multiple fields in /proc/meminfo -- something like
gethugepagesizes(), I guess, where gethugepagesize() returns the default
hugepages size, which should always be the first listed in
/proc/meminfo).
> If you prefer it you could move all the new page sizes to sysfs
> and only ever display the "legacy page size" in meminfo,
> but frankly I personally prefer the quite simple and comparatively
> efficient /proc/meminfo with multiple numbers interface.
Well, some things should be moved to sysfs, I'd say. I'm working on it
as we speak.
Thanks,
Nish
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 10:48 ` Andi Kleen
2008-04-23 15:36 ` Nick Piggin
@ 2008-04-23 18:49 ` Nishanth Aravamudan
2008-04-23 19:37 ` Andi Kleen
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-23 18:49 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 23.04.2008 [12:48:21 +0200], Andi Kleen wrote:
> npiggin@suse.de wrote:
>
> Thanks for these fixes. The subject definitely needs improvement, or
> rather all these fixes should be folded into the original patches.
>
> > Here is my next set of fixes and changes:
> > - Allow configurations without the default HPAGE_SIZE size (mainly useful
> > for testing but maybe it is the right way to go).
>
> I don't think it is the correct way. If you want to do it this way you
> would need to special case it in /proc/meminfo to keep things
> compatible.
I'm not sure I believe you here. /proc/meminfo displays both the number
of hugepages and the size. If any app relied on hugepages being a fixed
size, well, they blatantly are ignoring information being provided by
the kernel *and* are non-portable.
> Also in general I would think that always keeping the old huge page
> size around is a good idea. There is some chance at least to allocate
> 2MB pages after boot (especially with the new movable zone and with
> lumpy reclaim), so it doesn't need to be configured at boot time
> strictly. And why take that option away from the user?
Sure, but that's an administrative choice and might be the default.
We're already requiring extra effort to even use 1G pages, right, by
specifying hugepagesz=1G, why does it matter if they also have to
specify hugepagesz=2M. So nothing is being taken away from the user,
unless their administrator only explicitly specified one hugepage size.
Otherwise, we get implicit command-line arguments like:
hugepagesz=1G hugepages=10 hugepages=20
I prefer the flexibility of allowing an administrator to specify exactly
what pool-sizes they want to allow users access to. They also have to
mount hugetlbfs, and specify the size there, but still, I think Nick's
way is the right way forward, especially given the potential for more
than 2 hugepage sizes available.
So I'd say the cmdline should function like:
a) no hugepagesz= specified. hugepages= defaults to the "default"
hugepage size, which is arch-defined (as the historical value, I guess).
b) hugepagesz= specified. every hugepagesz that should be available must
be specified (if the pool is not going to be allocated at boot-time, say
for 64K and 16M pages on power, could the admin do
hugepagesz=64k,16m?)
> Also I would hope that distributions keep their existing /hugetlbfs
> (if they have one) at the compat size for 100% compatibility to
> existing applications.
Sure, but this is again an administrative decision and such, decided by
the distro, not the kernel.
Thanks,
Nish
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 15:53 ` Nick Piggin
2008-04-23 16:02 ` Andi Kleen
@ 2008-04-23 18:52 ` Nishanth Aravamudan
2008-04-24 2:08 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-23 18:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, akpm, linux-mm, kniht, abh, wli
On 23.04.2008 [17:53:38 +0200], Nick Piggin wrote:
> On Wed, Apr 23, 2008 at 05:46:52PM +0200, Andi Kleen wrote:
> > On Wed, Apr 23, 2008 at 05:34:04PM +0200, Nick Piggin wrote:
> > > On Wed, Apr 23, 2008 at 10:05:45AM +0200, Andi Kleen wrote:
> > > >
> > > > > Testing-wise, I've changed the registration mechanism so that if you specify
> > > > > hugepagesz=1G on the command line, then you do not get the 2M pages by default
> > > > > (you have to also specify hugepagesz=2M). Also, when only one hstate is
> > > > > registered, all the proc outputs appear unchanged, so this makes it very easy
> > > > > to test with.
> > > >
> > > > Are you sure that's a good idea? Just replacing the 2M count in meminfo
> > > > with 1G pages is not fully compatible proc ABI wise I think.
> > >
> > > Not sure that it is a good idea, but it did allow the test suite to pass
> > > more tests ;)
> >
> > Then the test suite is wrong. Really I expect programs that want
> > to use 1G pages to be adapted to it.
>
> No, it can generally determine the size of the hugepages. It would
> be more wrong (but probably more common) for portable code to assume
> 2MB hugepages.
Ack.
> > > What the best option is for backwards compatibility, I don't know. I
> >
> > The first number has to be always the "legacy" size for compatibility.
> > I don't think know why you don't know that, it really seems like an
> > obvious fact to me.
>
> Obvious? When you want your legacy userspace to use 1G pages and don't
> have any 2MB pages in the machine? In that case IMO there is no question
> that my way is the most likely possibility. We have a hugepagesize
> field there, so the assumption would be that it gets used.
>
> If you want your legacy userspace to have 2MB hugepages, then you would
> have a 2MB hstate and see the 2MB sizes there.
Ack.
> > > think this approach would give things a better chance of actually
> > > working with 1G hugepags and old userspace, but it probably also
> > > increases the chances of funny bugs.
> >
> > It's not fully compatible. And that is bad.
>
> It is fully compatible because if you don't actually ask for any new
> option then you don't get it. What you see will be exactly unchanged.
> If you ask for _only_ 1G pages, then this new scheme is very likely to
> work with well written applications whereas if you also print out the 2MB
> legacy values first, then they have little to no chance of working.
>
> Then if you want legacy apps to use 2MB pages, and new ones to use 1G,
> then you ask for both and get the 2MB column printed in /proc/meminfo
> (actually it can probably get printed 2nd if you ask for 2MB pages
> after asking for 1G pages -- that is something I'll fix).
Yep, the "default hugepagesz" was something I was going to ask about. I
believe hugepagesz= should function kind of like console= where the
order matters if specified multiple times for where /dev/console points.
I agree with you that hugepagesz=XX hugepagesz=YY implies XX is the
default, and YY is the "other", regardless of their values, and that is
how they should be presented in meminfo.
Thanks,
Nish
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 16:02 ` Andi Kleen
2008-04-23 16:02 ` Nick Piggin
@ 2008-04-23 18:54 ` Nishanth Aravamudan
1 sibling, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-23 18:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nick Piggin, akpm, linux-mm, kniht, abh, wli
On 23.04.2008 [18:02:10 +0200], Andi Kleen wrote:
> > No, it can generally determine the size of the hugepages. It would
> > be more wrong (but probably more common) for portable code to assume
>
> For compatibility we have to assume code does that.
>
> > 2MB hugepages.
>
> Well then it should just run with 2MB pages on a kernel where both
> 1G and 2M are configured. Does it not do that?
>
> > If you want your legacy userspace to have 2MB hugepages, then you would
>
> I think all legacy user space should only use 2MB huge pages.
Even with what you're saying (that 1G implies 2M is also there), let's
say a legacy app just looks in /proc/mounts for hugetlbfs mountpoints
and then creates a file in the first one it finds. If the system
administrator mounted a 1G hugetlbfs first, then the legacy app is going
to get 1G pages, regardless of whether or not 2M are presented to
userspace. So that legacy app just broke -- I don't see any way of
preventing that.
I think Nick's method is sane and reasonable. Do you know of specific
legacy apps that require what you're saying?
Thanks,
Nish
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 18:49 ` Nishanth Aravamudan
@ 2008-04-23 19:37 ` Andi Kleen
2008-04-23 21:11 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-23 19:37 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
> they blatantly are ignoring information being provided by
> the kernel *and* are non-portable.
And? I'm sure both descriptions apply to significant parts of the
deployed userland, including software that deals with hugepages. You
should watch one of the Dave Jones' "why user space sucks" talks at some
point @)
> Sure, but that's an administrative choice and might be the default.
> We're already requiring extra effort to even use 1G pages, right, by
> specifying hugepagesz=1G, why does it matter if they also have to
> specify hugepagesz=2M.
Like I said earlier hugepagesz=2M is basically free, so there is no
reason to not have it even when you happen to have 1GB pages too.
-Andi
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 19:37 ` Andi Kleen
@ 2008-04-23 21:11 ` Nishanth Aravamudan
2008-04-23 21:38 ` Nishanth Aravamudan
2008-04-23 22:06 ` Dave Hansen
0 siblings, 2 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-23 21:11 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 23.04.2008 [21:37:09 +0200], Andi Kleen wrote:
>
> > they blatantly are ignoring information being provided by
> > the kernel *and* are non-portable.
>
> And? I'm sure both descriptions apply to significant parts of the
> deployed userland, including software that deals with hugepages. You
> should watch one of the Dave Jones' "why user space sucks" talks at
> some point @)
My point was simply that I don't know of any applications that are so
hard-coded (although there might be some in SHM_HUGETLB land). If you
know of any that would be great.
> > Sure, but that's an administrative choice and might be the default.
> > We're already requiring extra effort to even use 1G pages, right, by
> > specifying hugepagesz=1G, why does it matter if they also have to
> > specify hugepagesz=2M.
>
> Like I said earlier hugepagesz=2M is basically free, so there is no
> reason to not have it even when you happen to have 1GB pages too.
I think I was getting confused by the talk about legacy apps and
hugepage pool allocations. And I think I might need to change my
stance...
On the one hand, there is the discussion about /proc/meminfo.
On the other, there is discussion about kernel command-line.
For the latter, I believe that only sizes that wish to be preallocated
should need to be specified on the command-line. That is, all available
hugepage sizes are visible in /proc and /sys once the kernel has booted.
But only the ones that have been specified on the kernel-cmdline *might*
have hugepages allocated during boot (depends on the success of the
allocations, for instance).
Outstanding issues:
- specifying default hugepagesize other than the one on the kernel
cmdline when only one is specified on the kernel cmdline. This might
be a case for just making the default hugepagesize the only one
available previously (2M on x86_64, 4M/2M on x86, 16M on power, etc).
That is, regardless of the kernel boot-stanza, the default
hugepagesize if CONFIG_HUGETLB_PAGE is set is the same on a per-arch
basis.
- How to deal with archs with many hugepage sizes available (IA64?) Do
we show all of them in /proc/meminfo?
Using ppc with 64K, 16M, and 16G hugepages as an example, here is the
result (meminfo shows all three sizes always, with 16M first) for
various kernel command-lines:
hugepages=20
allocates 20 16M hugepages
hugepages=20 hugepagesz=64k hugepages=40
hugepagesz=64k hugepages=20 hugepages=40
allocates 20 16M hugepages and 40 64K hugepages
hugepagesz=16G hugepages=2 hugepages=20 hugepagesz=64k hugepages=40
hugepagesz=16G hugepages=2 hugepagesz=16M hugepages=20 hugepagesz=64k hugepages=40
allocates 2 16G hugepages, 20 16M hugepages and 40 64K hugepages
hugepagesz=64k hugepages=40
allocates 40 64k hugepages
In all of the above cases, at run-time, all three hugepage sizes are
visible in the sense that we can try to echo commands into
/proc/sys/vm/nr_hugepages (or the appropriate replacement sysfs
interface). Availability to applications depends on administrators
mounting hugetlbfs with the appropriate size= parameter (I believe).
Does that all seem sane?
Thanks,
Nish
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 21:11 ` Nishanth Aravamudan
@ 2008-04-23 21:38 ` Nishanth Aravamudan
2008-04-23 22:06 ` Dave Hansen
1 sibling, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-23 21:38 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 23.04.2008 [14:11:36 -0700], Nishanth Aravamudan wrote:
> On 23.04.2008 [21:37:09 +0200], Andi Kleen wrote:
<snip>
> - How to deal with archs with many hugepage sizes available (IA64?) Do
> we show all of them in /proc/meminfo?
Hrmm, IA64 here may be a red herring. As I understand it, with short
VHPT mode, there is one hugepage size for all of the hugepage region. In
which case, I think, we'd just make IA64 special in that the first
hugepagesz specified is the one used (and the only visible) or whatever
is the current native default with hugepages= is the one visible
(256M?). That is, IA64 will always only have one hugepagesize available
at run-time on a given boot, so we only need to show the one set of
files in /proc/meminfo. If IA64 moves to long VHPT mode, things would
need adjusting, I guess.
Clearly, we want to document this in kernel-parameters.txt :)
We also should bring in the sparc and sh maintainers, in case they want
to chime in on how things might be presented to those architectures, if
they want to move to multiple hugepage pools?
* Re: [patch 18/18] hugetlb: my fixes 2
2008-04-23 21:11 ` Nishanth Aravamudan
2008-04-23 21:38 ` Nishanth Aravamudan
@ 2008-04-23 22:06 ` Dave Hansen
1 sibling, 0 replies; 123+ messages in thread
From: Dave Hansen @ 2008-04-23 22:06 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On Wed, 2008-04-23 at 14:11 -0700, Nishanth Aravamudan wrote:
> hugepagesz=16G hugepages=2 hugepages=20 hugepagesz=64k hugepages=40
> hugepagesz=16G hugepages=2 hugepagesz=16M hugepages=20 hugepagesz=64k
> hugepages=40
>
> allocates 2 16G hugepages, 20 16M hugepages and 40 64K
> hugepages
>
> hugepagesz=64k hugepages=40
>
> allocates 40 64k hugepages
Following up after a chat on irc...
How about letting hugepages take a size argument:
hugepages=33G
That would allocate 2 16G pages, then 1GB of 16M pages. If there were
any remainder, then the rest in 64k pages. Actually instantiating the
pages could be left to when the mounts are created.
I'm just not sure there's a really good reason to be specifying the
hugepagesz= at boot-time when we really don't need to commit to it at
that point.
That said, can ppc64 16G pages ever get used as 16M or 64K pages?
-- Dave
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-23 18:52 ` Nishanth Aravamudan
@ 2008-04-24 2:08 ` Nick Piggin
2008-04-24 6:43 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-04-24 2:08 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, akpm, linux-mm, kniht, abh, wli
On Wed, Apr 23, 2008 at 11:52:23AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [17:53:38 +0200], Nick Piggin wrote:
> > > It's not fully compatible. And that is bad.
> >
> > It is fully compatible because if you don't actually ask for any new
> > option then you don't get it. What you see will be exactly unchanged.
> > If you ask for _only_ 1G pages, then this new scheme is very likely to
> > > work with well written applications whereas if you also print out the 2MB
> > legacy values first, then they have little to no chance of working.
> >
> > Then if you want legacy apps to use 2MB pages, and new ones to use 1G,
> > then you ask for both and get the 2MB column printed in /proc/meminfo
> > (actually it can probably get printed 2nd if you ask for 2MB pages
> > after asking for 1G pages -- that is something I'll fix).
>
> Yep, the "default hugepagesz" was something I was going to ask about. I
> believe hugepagesz= should function kind of like console= where the
> order matters if specified multiple times for where /dev/console points.
> I agree with you that hugepagesz=XX hugepagesz=YY implies XX is the
> default, and YY is the "other", regardless of their values, and that is
> how they should be presented in meminfo.
OK, that would be fine. I was going to do it the other way and make
2M always come first. However so long as we document as such the
command line parameters, I don't see why we couldn't have this extra
flexibility (and that means I shouldn't have to write any more code ;))
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-24 2:08 ` Nick Piggin
@ 2008-04-24 6:43 ` Nishanth Aravamudan
2008-04-24 7:06 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-24 6:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, akpm, linux-mm, kniht, abh, wli
On 24.04.2008 [04:08:28 +0200], Nick Piggin wrote:
> On Wed, Apr 23, 2008 at 11:52:23AM -0700, Nishanth Aravamudan wrote:
> > On 23.04.2008 [17:53:38 +0200], Nick Piggin wrote:
> > > > It's not fully compatible. And that is bad.
> > >
> > > It is fully compatible because if you don't actually ask for any new
> > > option then you don't get it. What you see will be exactly unchanged.
> > > If you ask for _only_ 1G pages, then this new scheme is very likely to
> > > work with well written applications wheras if you also print out the 2MB
> > > legacy values first, then they have little to no chance of working.
> > >
> > > Then if you want legacy apps to use 2MB pages, and new ones to use 1G,
> > > then you ask for both and get the 2MB column printed in /proc/meminfo
> > > (actually it can probably get printed 2nd if you ask for 2MB pages
> > > after asking for 1G pages -- that is something I'll fix).
> >
> > Yep, the "default hugepagesz" was something I was going to ask about. I
> > believe hugepagesz= should function kind of like console= where the
> > order matters if specified multiple times for where /dev/console points.
> > I agree with you that hugepagesz=XX hugepagesz=YY implies XX is the
> > default, and YY is the "other", regardless of their values, and that is
> > how they should be presented in meminfo.
>
> OK, that would be fine. I was going to do it the other way and make
> 2M always come first. However so long as we document as such the
> command line parameters, I don't see why we couldn't have this extra
> flexibility (and that means I shouldn't have to write any more code ;))
Keep in mind, I did retract this to some extent in my other
reply...After thinking about Andi's points a bit more, I believe the
most flexible (not too-x86_64-centric, either) option is to have all
potential hugepage sizes be "available" at run-time. What hugepages are
allocated at boot-time is all that is specified on the kernel
command-line, in that case (and is only truly necessary for the
ginormous hugepages, and needs to be heavily documented as such).
Realistically, yes, we could have it either way (hugepagesz= determines
the order), but it shouldn't matter to well-written applications, so
keeping things reflecting current reality as much as possible does make
sense -- that is, 2M would always come first in meminfo on x86_64.
If you want, I can send you a patch to do that, as I start the sysfs
patches.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-24 6:43 ` Nishanth Aravamudan
@ 2008-04-24 7:06 ` Nick Piggin
2008-04-24 17:08 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-04-24 7:06 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, akpm, linux-mm, kniht, abh, wli
On Wed, Apr 23, 2008 at 11:43:50PM -0700, Nishanth Aravamudan wrote:
> On 24.04.2008 [04:08:28 +0200], Nick Piggin wrote:
> > On Wed, Apr 23, 2008 at 11:52:23AM -0700, Nishanth Aravamudan wrote:
> > > On 23.04.2008 [17:53:38 +0200], Nick Piggin wrote:
> > > > > It's not fully compatible. And that is bad.
> > > >
> > > > It is fully compatible because if you don't actually ask for any new
> > > > option then you don't get it. What you see will be exactly unchanged.
> > > > If you ask for _only_ 1G pages, then this new scheme is very likely to
> > > > work with well written applications whereas if you also print out the 2MB
> > > > legacy values first, then they have little to no chance of working.
> > > >
> > > > Then if you want legacy apps to use 2MB pages, and new ones to use 1G,
> > > > then you ask for both and get the 2MB column printed in /proc/meminfo
> > > > (actually it can probably get printed 2nd if you ask for 2MB pages
> > > > after asking for 1G pages -- that is something I'll fix).
> > >
> > > Yep, the "default hugepagesz" was something I was going to ask about. I
> > > believe hugepagesz= should function kind of like console= where the
> > > order matters if specified multiple times for where /dev/console points.
> > > I agree with you that hugepagesz=XX hugepagesz=YY implies XX is the
> > > default, and YY is the "other", regardless of their values, and that is
> > > how they should be presented in meminfo.
> >
> > OK, that would be fine. I was going to do it the other way and make
> > 2M always come first. However so long as we document as such the
> > command line parameters, I don't see why we couldn't have this extra
> > flexibility (and that means I shouldn't have to write any more code ;))
>
> Keep in mind, I did retract this to some extent in my other
> reply...After thinking about Andi's points a bit more, I believe the
> most flexible (not too-x86_64-centric, either) option is to have all
> potential hugepage sizes be "available" at run-time. What hugepages are
> allocated at boot-time is all that is specified on the kernel
> command-line, in that case (and is only truly necessary for the
> ginormous hugepages, and needs to be heavily documented as such).
>
> Realistically, yes, we could have it either way (hugepagesz= determines
> the order), but it shouldn't matter to well-written applications, so
> keeping things reflecting current reality as much as possible does make
> > sense -- that is, 2M would always come first in meminfo on x86_64.
>
> If you want, I can send you a patch to do that, as I start the sysfs
> patches.
Honestly, I don't really care about the exact behaviour and user APIs.
I agree with the point Andi stresses that backwards compatibility is
#1 priority; and with unchanged kernel command line / config options,
I think we need to have /proc/meminfo give *unchanged* (ie. single
column) output.
Second, future apps obviously should use some more appropriate sysfs
tunables and be aware of multiple hstates.
Finally, I would have thought people would be interested in *trying*
to get legacy apps to work with 1G hugepages (eg. oracle/db2 or HPC
stuff could probably make use of them quite nicely). However this 3rd
consideration is obviously the least important of the 3. I wouldn't
lose any sleep if my option doesn't get in.
* Re: [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-24 7:06 ` Nick Piggin
@ 2008-04-24 17:08 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-24 17:08 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, akpm, linux-mm, kniht, abh, wli
On 24.04.2008 [09:06:24 +0200], Nick Piggin wrote:
> On Wed, Apr 23, 2008 at 11:43:50PM -0700, Nishanth Aravamudan wrote:
> > On 24.04.2008 [04:08:28 +0200], Nick Piggin wrote:
> > > On Wed, Apr 23, 2008 at 11:52:23AM -0700, Nishanth Aravamudan wrote:
> > > > On 23.04.2008 [17:53:38 +0200], Nick Piggin wrote:
> > > > > > It's not fully compatible. And that is bad.
> > > > >
> > > > > It is fully compatible because if you don't actually ask for
> > > > > any new option then you don't get it. What you see will be
> > > > > exactly unchanged. If you ask for _only_ 1G pages, then this
> > > > > new scheme is very likely to work with well written
> > > > > applications whereas if you also print out the 2MB legacy
> > > > > values first, then they have little to no chance of working.
> > > > >
> > > > > Then if you want legacy apps to use 2MB pages, and new ones to
> > > > > use 1G, then you ask for both and get the 2MB column printed
> > > > > in /proc/meminfo (actually it can probably get printed 2nd if
> > > > > you ask for 2MB pages after asking for 1G pages -- that is
> > > > > something I'll fix).
> > > >
> > > > Yep, the "default hugepagesz" was something I was going to ask
> > > > about. I believe hugepagesz= should function kind of like
> > > > console= where the order matters if specified multiple times for
> > > > where /dev/console points. I agree with you that hugepagesz=XX
> > > > hugepagesz=YY implies XX is the
> > > > default, and YY is the "other", regardless of their values, and that is
> > > > how they should be presented in meminfo.
> > >
> > > OK, that would be fine. I was going to do it the other way and
> > > make 2M always come first. However so long as we document as such
> > > the command line parameters, I don't see why we couldn't have this
> > > extra flexibility (and that means I shouldn't have to write any
> > > more code ;))
> >
> > Keep in mind, I did retract this to some extent in my other
> > reply...After thinking about Andi's points a bit more, I believe the
> > most flexible (not too-x86_64-centric, either) option is to have all
> > potential hugepage sizes be "available" at run-time. What hugepages
> > are allocated at boot-time is all that is specified on the kernel
> > command-line, in that case (and is only truly necessary for the
> > ginormous hugepages, and needs to be heavily documented as such).
> >
> > Realistically, yes, we could have it either way (hugepagesz=
> > determines the order), but it shouldn't matter to well-written
> > applications, so keeping things reflecting current reality as much
> > as possible does make sense -- that is, 2M would always come first in
> > meminfo on x86_64.
> >
> > If you want, I can send you a patch to do that, as I start the sysfs
> > patches.
>
> Honestly, I don't really care about the exact behaviour and user APIs.
>
> I agree with the point Andi stresses that backwards compatibility is
> #1 priority; and with unchanged kernel command line / config options,
> I think we need to have /proc/meminfo give *unchanged* (ie. single
> column) output.
Ok -- so meminfo will have one format (single column) if the command
line is unchanged, and a different one if, say "hugepagesz=1G" is
specified?
Should we just leave the default hugepage size info in /proc/meminfo
(always single column) and use sysfs for everything else? Including
hugepage meminfo's on a page-size basis? I guess that would violate
sysfs rules, but might be fine for a proof-of-concept?
> Second, future apps obviously should use some more appropriate sysfs
> tunables and be aware of multiple hstates.
Indeed.
> Finally, I would have thought people would be interested in *trying*
> to get legacy apps to work with 1G hugepages (eg. oracle/db2 or HPC
> stuff could probably make use of them quite nicely). However this 3rd
> consideration is obviously the least important of the 3. I wouldn't
> lose any sleep if my option doesn't get in.
Well, there are two interfaces, right?
1) SHM_HUGETLB
I'm not sure how to extend this best. iirc, SHM_HUGETLB uses an
internal (invisible) hugetlbfs mount. And I don't think it specifies a
size or anything to said mount...so unless *only* 1G hugepages are
available (which we've decided will not be the case?), I believe
SHM_HUGETLB as currently used will never use them.
2) hugetlbfs
By mounting hugetlbfs with size= (I believe), we can specify which
pool should be accessed by files in the mount. This is what
libhugetlbfs would leverage to use different hugepage sizes. There has
been some discussion on that list and among some of us working on
libhugetlbfs on how best to allow applications to specify the size
they'd prefer. Eric Munson has been working on a binary (hugectl) to
demonstrate hugepage-backed stacks in-kernel, which might be
extended to include a --preferred-size flag (it's essentially an
exec() wrapper, in the same vein as numactl). In any case,
libhugetlbfs could be used (by only mounting the 1G sized hugetlbfs)
for legacy apps without modification (well, segment remapping may not
work due to alignments, but should be easy to fix, and will probably
be fixed in 2.0, which will change our remapping algorithm).
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-23 1:53 ` [patch 02/18] hugetlb: factor out huge_new_page npiggin
@ 2008-04-24 23:49 ` Nishanth Aravamudan
2008-04-24 23:54 ` Nishanth Aravamudan
1 sibling, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-24 23:49 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:04 +1000], npiggin@suse.de wrote:
> Needed to avoid code duplication in follow up patches.
>
> This happens to fix a minor bug. When alloc_bootmem_node returns
> a fallback node on a different node than passed the old code
> would have put it into the free lists of the wrong node.
> Now it would end up in the freelist of the correct node.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -190,6 +190,17 @@ static int adjust_pool_surplus(int delta
> return ret;
> }
>
> +static void prep_new_huge_page(struct page *page)
> +{
> + unsigned nid = pfn_to_nid(page_to_pfn(page));
Why not just pass the nid here, which we've already got in the caller? I
assume because the future caller doesn't have it, but then that caller
can do this calculation in the invocation, rather than making all
callers do it?
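The API question above -- derive the nid inside the helper, or have the
caller pass the nid it already holds -- can be sketched in plain userspace C
(the `struct page`, `pfn_to_nid()` and node layout here are toy stand-ins,
not the kernel's):

```c
#include <assert.h>

/* Toy stand-ins: one "node" per 1024 pfns. */
struct page { unsigned long pfn; };

#define PFNS_PER_NODE 1024UL

static unsigned long page_to_pfn(const struct page *page) { return page->pfn; }
static int pfn_to_nid(unsigned long pfn) { return (int)(pfn / PFNS_PER_NODE); }

/* As in the patch: the helper derives the nid itself. */
static int prep_derives_nid(const struct page *page)
{
	return pfn_to_nid(page_to_pfn(page));
}

/* As suggested above: the caller passes the nid it already holds. */
static int prep_takes_nid(const struct page *page, int nid)
{
	(void)page;	/* per-nid freelist setup would go here */
	return nid;
}
```

Both variants agree as long as every caller passes
`pfn_to_nid(page_to_pfn(page))`; the question is only on which side of the
call the arithmetic lives.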
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-23 1:53 ` [patch 02/18] hugetlb: factor out huge_new_page npiggin
2008-04-24 23:49 ` Nishanth Aravamudan
@ 2008-04-24 23:54 ` Nishanth Aravamudan
2008-04-24 23:58 ` Nishanth Aravamudan
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-24 23:54 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:04 +1000], npiggin@suse.de wrote:
> Needed to avoid code duplication in follow up patches.
>
> This happens to fix a minor bug. When alloc_bootmem_node returns
> a fallback node on a different node than passed the old code
> would have put it into the free lists of the wrong node.
> Now it would end up in the freelist of the correct node.
This is rather frustrating. The whole point of having the __GFP_THISNODE
flag is to indicate off-node allocations are *not* supported from the
caller... This was all worked on quite heavily a while back.
I expect this will lead to imbalanced allocations, as hugetlb.c will
assume that the previous node successfully allocated and move the
iterator forward, allocating again on the fallback node. Since hugetlb
code will anyway iterate over the nodes appropriately, can't we just
make alloc_bootmem_node (or the eventual call-path that results therein)
do the right thing for __GFP_THISNODE allocations? Minimally, the
varying semantics need to be documented somewhere...
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-24 23:54 ` Nishanth Aravamudan
@ 2008-04-24 23:58 ` Nishanth Aravamudan
2008-04-25 7:10 ` Andi Kleen
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-24 23:58 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 24.04.2008 [16:54:31 -0700], Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:04 +1000], npiggin@suse.de wrote:
> > Needed to avoid code duplication in follow up patches.
> >
> > This happens to fix a minor bug. When alloc_bootmem_node returns
> > a fallback node on a different node than passed the old code
> > would have put it into the free lists of the wrong node.
> > Now it would end up in the freelist of the correct node.
>
> This is rather frustrating. The whole point of having the __GFP_THISNODE
> flag is to indicate off-node allocations are *not* supported from the
> caller... This was all worked on quite heavily a while back.
Oh I see. This patch refers to a bug that is only introduced by patch
12/18...perhaps *that* patch should add the nid calculation in the
helper, if it is truly needed.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-24 23:58 ` Nishanth Aravamudan
@ 2008-04-25 7:10 ` Andi Kleen
2008-04-25 16:54 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-25 7:10 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
Nishanth Aravamudan wrote:
> On 24.04.2008 [16:54:31 -0700], Nishanth Aravamudan wrote:
>> On 23.04.2008 [11:53:04 +1000], npiggin@suse.de wrote:
>>> Needed to avoid code duplication in follow up patches.
>>>
>>> This happens to fix a minor bug. When alloc_bootmem_node returns
>>> a fallback node on a different node than passed the old code
>>> would have put it into the free lists of the wrong node.
>>> Now it would end up in the freelist of the correct node.
>> This is rather frustrating. The whole point of having the __GFP_THISNODE
>> flag is to indicate off-node allocations are *not* supported from the
>> caller... This was all worked on quite heavily a while back.
Perhaps it was, but the result in hugetlb.c was not correct.
> Oh I see. This patch refers to a bug that only is introduced by patch
> 12/18...perhaps *that* patch should add the nid calculation in the
> helper, if it is truly needed.
No, the bug is already there even without the bootmem patch.
-Andi
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-25 7:10 ` Andi Kleen
@ 2008-04-25 16:54 ` Nishanth Aravamudan
2008-04-25 19:13 ` Christoph Lameter
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 16:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli, clameter
On 25.04.2008 [09:10:52 +0200], Andi Kleen wrote:
> Nishanth Aravamudan wrote:
> > On 24.04.2008 [16:54:31 -0700], Nishanth Aravamudan wrote:
> >> On 23.04.2008 [11:53:04 +1000], npiggin@suse.de wrote:
> >>> Needed to avoid code duplication in follow up patches.
> >>>
> >>> This happens to fix a minor bug. When alloc_bootmem_node returns
> >>> a fallback node on a different node than passed the old code
> >>> would have put it into the free lists of the wrong node.
> >>> Now it would end up in the freelist of the correct node.
> >> This is rather frustrating. The whole point of having the __GFP_THISNODE
> >> flag is to indicate off-node allocations are *not* supported from the
> >> caller... This was all worked on quite heavily a while back.
>
> Perhaps it was, but the result in hugetlb.c was not correct.
Huh? There is a case in current code (current hugepage sizes) that
allows __GFP_THISNODE to go off-node?
> > Oh I see. This patch refers to a bug that only is introduced by patch
> > 12/18...perhaps *that* patch should add the nid calculation in the
> > helper, if it is truly needed.
>
> No, the bug is already there even without the bootmem patch.
Where does alloc_pages_node go off-node? It is a bug in the core VM if
it does, as we decided __GFP_THISNODE semantics with a nid specified
indicates *no* fallback should occur.
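The no-fallback semantics being argued for here can be modeled with a toy
per-node allocator (the `toy_`-prefixed names are stand-ins, not the real
page allocator's code path): with the THISNODE flag set, a miss on the
requested node must fail rather than fall back off-node.

```c
#include <assert.h>

#define TOY_GFP_THISNODE 0x1u
#define TOY_NR_NODES     2

static int toy_free_pages[TOY_NR_NODES];

/* Returns the node the page was taken from, or -1 on failure. */
static int toy_alloc_pages_node(int nid, unsigned int flags)
{
	if (toy_free_pages[nid] > 0) {
		toy_free_pages[nid]--;
		return nid;
	}
	if (flags & TOY_GFP_THISNODE)
		return -1;	/* strict: no off-node fallback */
	for (int other = 0; other < TOY_NR_NODES; other++) {
		if (toy_free_pages[other] > 0) {
			toy_free_pages[other]--;
			return other;	/* fell back off-node */
		}
	}
	return -1;
}
```

Under these semantics a caller iterating nodes itself, as hugetlb.c does,
sees an honest failure per node and can keep its round-robin accounting
consistent.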
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 04/18] hugetlb: modular state
2008-04-23 1:53 ` [patch 04/18] hugetlb: modular state npiggin
2008-04-23 15:21 ` Jon Tollefson
@ 2008-04-25 17:13 ` Nishanth Aravamudan
2008-05-23 5:02 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 17:13 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:06 +1000], npiggin@suse.de wrote:
> Large, but rather mechanical patch that converts most of the hugetlb.c
> globals into structure members and passes them around.
>
> Right now there is only a single global hstate structure, but most of
> the infrastructure to extend it is there.
While going through the patches as I apply them to 2.6.25-mm1 (as none
will apply cleanly so far :), I have a few comments. I like this patch
overall.
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
<snip>
> +struct hstate global_hstate;
One thing I noticed throughout is that it's sort of inconsistent where a
hstate is passed to a function and where it's locally determined in
functions. It seems like we should obtain the hstate as early as
possible and just pass the pointer down as needed ... except in those
contexts that we don't control the caller, of course. That seems to be
more flexible than the way this patch does it, especially given that the
whole thing is a series that immediately extends this infrastructure to
multiple hugepage sizes. That would seem to, at least, make the
follow-on patches easier to follow.
>
> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
Not sure if this makes sense or not, but would it be useful to make the
lock be per-hstate? It is designed to protect the counters and the
freelists, but those are per-hstate, right? Would need heavy testing,
but might be useful for varying apps both trying to use different size
hugepages simultaneously?
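A per-hstate lock would look roughly like the sketch below (hypothetical --
the posted patches keep the single global hugetlb_lock; a plain flag stands
in for the kernel's spinlock_t so the structure is visible):

```c
#include <assert.h>

/* Hypothetical: each pool carries the lock protecting its own counters. */
struct toy_hstate {
	int lock;	/* stand-in for spinlock_t */
	unsigned long nr_huge_pages;
	unsigned long free_huge_pages;
};

static void toy_lock(int *l)   { assert(*l == 0); *l = 1; }
static void toy_unlock(int *l) { assert(*l == 1); *l = 0; }

static void toy_account_new_page(struct toy_hstate *h)
{
	toy_lock(&h->lock);
	h->nr_huge_pages++;
	h->free_huge_pages++;
	toy_unlock(&h->lock);
}
```

Resizing the 2MB pool would then never contend with a concurrent resize of
the 1GB pool, at the cost of auditing every current hugetlb_lock user.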
<snip>
> @@ -98,18 +93,19 @@ static struct page *dequeue_huge_page_vm
> struct zonelist *zonelist = huge_zonelist(vma, address,
> htlb_alloc_mask, &mpol);
> struct zone **z;
> + struct hstate *h = hstate_vma(vma);
Why not make dequeue_huge_page_vma() take an hstate too? All the callers
have the vma, which means they can do this call themselves ... makes
for a more consistent API between the two dequeue_ variants.
<snip>
> static void free_huge_page(struct page *page)
> {
> + struct hstate *h = &global_hstate;
> int nid = page_to_nid(page);
> struct address_space *mapping;
Similarly, the only caller of free_huge_page has already figured out the
hstate to use (even if there is only one) -- why not pass it down here?
Oh here it might be because free_huge_page is used as the destructor --
perhaps add a comment?
<snip>
> -static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> - unsigned long address)
> +static struct page *alloc_buddy_huge_page(struct hstate *h,
> + struct vm_area_struct *vma,
> + unsigned long address)
> {
> struct page *page;
> unsigned int nid;
> @@ -277,17 +275,17 @@ static struct page *alloc_buddy_huge_pag
> * per-node value is checked there.
> */
> spin_lock(&hugetlb_lock);
> - if (surplus_huge_pages >= nr_overcommit_huge_pages) {
> + if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> spin_unlock(&hugetlb_lock);
> return NULL;
> } else {
> - nr_huge_pages++;
> - surplus_huge_pages++;
> + h->nr_huge_pages++;
> + h->surplus_huge_pages++;
> }
> spin_unlock(&hugetlb_lock);
>
> page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));
Nit: odd indentation?
<snip>
> @@ -539,19 +546,21 @@ static unsigned int cpuset_mems_nr(unsig
> #ifdef CONFIG_HIGHMEM
> static void try_to_free_low(unsigned long count)
> {
Shouldn't this just take an hstate as a parameter?
> + struct hstate *h = &global_hstate;
> int i;
>
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> - list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> + struct list_head *freel = &h->hugepage_freelists[i];
> + list_for_each_entry_safe(page, next, freel, lru) {
Was this done just to make the line shorter? Just want to make sure I'm
not missing something.
<snip>
> int hugetlb_report_meminfo(char *buf)
> {
> + struct hstate *h = &global_hstate;
> return sprintf(buf,
> "HugePages_Total: %5lu\n"
> "HugePages_Free: %5lu\n"
> "HugePages_Rsvd: %5lu\n"
> "HugePages_Surp: %5lu\n"
> "Hugepagesize: %5lu kB\n",
> - nr_huge_pages,
> - free_huge_pages,
> - resv_huge_pages,
> - surplus_huge_pages,
> - HPAGE_SIZE/1024);
> + h->nr_huge_pages,
> + h->free_huge_pages,
> + h->resv_huge_pages,
> + h->surplus_huge_pages,
> + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
"- 10"? I think this should be easier to get at than this? Oh I guess
it's to get it into kilobytes... Seems kind of odd, but I guess it's
fine.
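The `- 10` is exactly the bytes-to-kilobytes conversion folded into the
shift (2^10 = 1024). A quick check with x86-64 values as toy constants
(`PAGE_SHIFT` = 12, 2MB huge page order = 9):

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12		/* x86-64 base page: 4kB */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* The expression from the patch: huge page size in kB, in one shift. */
static unsigned long huge_page_kb(unsigned int order)
{
	return 1UL << (order + TOY_PAGE_SHIFT - 10);	/* >> 10 == / 1024 */
}
```

For order 9 this gives 1 << 11 = 2048, i.e. the familiar
"Hugepagesize:     2048 kB" line.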
<snip>
> Index: linux-2.6/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/linux/hugetlb.h
> +++ linux-2.6/include/linux/hugetlb.h
> @@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
>
> /* arch callbacks */
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz);
> pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
> int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> @@ -95,7 +95,6 @@ pte_t huge_ptep_get_and_clear(struct mm_
> #else
> void hugetlb_prefault_arch_hook(struct mm_struct *mm);
> #endif
> -
Unrelated whitespace change?
> #else /* !CONFIG_HUGETLB_PAGE */
>
> static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> @@ -169,8 +168,6 @@ struct file *hugetlb_file_setup(const ch
> int hugetlb_get_quota(struct address_space *mapping, long delta);
> void hugetlb_put_quota(struct address_space *mapping, long delta);
>
> -#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
> -
Rather than deleting this and then putting the similar calculation in
the two callers, perhaps use an inline to calculate it and call that in
the two places you change?
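Such a helper might look like the sketch below (the name and the `toy_`
types are hypothetical; 512 is the block size the old BLOCKS_PER_HUGEPAGE
macro hard-coded as HPAGE_SIZE / 512):

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12

struct toy_hstate { unsigned int order; };

static unsigned long toy_huge_page_size(const struct toy_hstate *h)
{
	return 1UL << (h->order + TOY_PAGE_SHIFT);
}

/* Per-hstate replacement for the deleted BLOCKS_PER_HUGEPAGE macro. */
static unsigned long toy_blocks_per_huge_page(const struct toy_hstate *h)
{
	return toy_huge_page_size(h) / 512;
}
```

Both callers would then pass their hstate instead of open-coding the
division.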
> static inline int is_file_hugepages(struct file *file)
> {
> if (file->f_op == &hugetlbfs_file_operations)
> @@ -199,4 +196,71 @@ unsigned long hugetlb_get_unmapped_area(
> unsigned long flags);
> #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
>
> +#ifdef CONFIG_HUGETLB_PAGE
Why another block of HUGETLB_PAGE? Shouldn't this go at the end of the
other one? And the !HUGETLB_PAGE within the corresponding #else?
> +
> +/* Defines one hugetlb page size */
> +struct hstate {
> + int hugetlb_next_nid;
> + unsigned int order;
Which is actually a shift, too, right? So why not just call it that? No
function should be directly accessing these members, so the function
name indicates how the shift is being used?
> + unsigned long mask;
> + unsigned long max_huge_pages;
> + unsigned long nr_huge_pages;
> + unsigned long free_huge_pages;
> + unsigned long resv_huge_pages;
> + unsigned long surplus_huge_pages;
> + unsigned long nr_overcommit_huge_pages;
> + struct list_head hugepage_freelists[MAX_NUMNODES];
> + unsigned int nr_huge_pages_node[MAX_NUMNODES];
> + unsigned int free_huge_pages_node[MAX_NUMNODES];
> + unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +};
> +
> +extern struct hstate global_hstate;
> +
> +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> +{
> + return &global_hstate;
> +}
After having looked at this functions while reviewing, it does seem like
it might be more intuitive to read vma_hstate ("vma's hstate") rather
than hstate_vma ("hstate's vma"?). But your call.
<snip>
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c
> +++ linux-2.6/mm/mempolicy.c
> @@ -1295,7 +1295,8 @@ struct zonelist *huge_zonelist(struct vm
> if (pol->policy == MPOL_INTERLEAVE) {
> unsigned nid;
>
> - nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> + nid = interleave_nid(pol, vma, addr,
> + huge_page_shift(hstate_vma(vma)));
> if (unlikely(pol != &default_policy &&
> pol != current->mempolicy))
> __mpol_free(pol); /* finished with pol */
> @@ -1944,9 +1945,12 @@ static void check_huge_range(struct vm_a
> {
> unsigned long addr;
> struct page *page;
> + struct hstate *h = hstate_vma(vma);
> + unsigned sz = huge_page_size(h);
This should be unsigned long?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-23 1:53 ` [patch 05/18] hugetlb: multiple hstates npiggin
@ 2008-04-25 17:38 ` Nishanth Aravamudan
2008-04-25 17:48 ` Nishanth Aravamudan
` (2 more replies)
2008-04-29 17:27 ` Nishanth Aravamudan
1 sibling, 3 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 17:38 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:07 +1000], npiggin@suse.de wrote:
> Add basic support for more than one hstate in hugetlbfs
>
> - Convert hstates to an array
> - Add a first default entry covering the standard huge page size
> - Add functions for architectures to register new hstates
> - Add basic iterators over hstates
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> include/linux/hugetlb.h | 11 ++++
> mm/hugetlb.c | 112 +++++++++++++++++++++++++++++++++++++-----------
> 2 files changed, 97 insertions(+), 26 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -27,7 +27,17 @@ unsigned long sysctl_overcommit_huge_pag
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
>
> -struct hstate global_hstate;
> +static int max_hstate = 0;
> +
> +static unsigned long default_hstate_resv = 0;
Unnecessary initializations (and whitespace)?
What's the purpose of default_hstate_resv? Isn't it basically just
replacing max_huge_pages? "resv" has a very special meaning in
hugetlb.c, can another name be chosen?
> +struct hstate hstates[HUGE_MAX_HSTATE];
> +
> +/* for command line parsing */
> +struct hstate *parsed_hstate __initdata = NULL;
Unnecessary initialization (checkpatch caught the first two, not this
one). Should this be static? Isn't __initdata traditionally put closer
to the front of the line?
> +#define for_each_hstate(h) \
> + for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
>
> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> @@ -128,9 +138,19 @@ static void update_and_free_page(struct
> __free_pages(page, huge_page_order(h));
> }
>
> +struct hstate *size_to_hstate(unsigned long size)
> +{
> + struct hstate *h;
> + for_each_hstate (h) {
Extraneous space?
> + if (huge_page_size(h) == size)
> + return h;
> + }
> + return NULL;
> +}
Might become annoying if we add many hugepagesizes, but I guess we'll
never have enough to really matter. Just don't want to have to worry
about this loop for performance reasons when only one hugepage size is
in use? Would it make sense to cache the last value used? Probably
overkill for now.
> static void free_huge_page(struct page *page)
> {
> - struct hstate *h = &global_hstate;
> + struct hstate *h = size_to_hstate(PAGE_SIZE << compound_order(page));
Perhaps this could be made a static inline function?
static inline struct hstate *page_hstate(struct page *page)
{
	return size_to_hstate(PAGE_SIZE << compound_order(page));
}
I guess I haven't checked yet if it's used anywhere else, but it makes
things a little clearer, perhaps?
And this is only actually needed for the destructor case?
Technically, we have the hstate already in the set_max_huge_pages()
path? Might be worth a cleanup down-the-road.
> int nid = page_to_nid(page);
> struct address_space *mapping;
>
> @@ -495,38 +515,80 @@ static struct page *alloc_huge_page(stru
> return page;
> }
>
> -static int __init hugetlb_init(void)
> +static void __init hugetlb_init_hstate(struct hstate *h)
Could this perhaps be named hugetlb_init_one_hstate()? Makes it harder
for me to go cross-eyed as I go between the functions :)
<snip>
> +static void __init report_hugepages(void)
> +{
> + struct hstate *h;
> +
> + for_each_hstate(h) {
> + printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> + h->free_huge_pages,
> + 1 << (h->order + PAGE_SHIFT - 20));
This will need to be changed for 64K hugepages (which already exist in
mainline). Perhaps we need a hugepage_units() function :)
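A hypothetical hugepage_units()-style formatter could pick the unit per
hstate, so that 64kB (ppc64) and 2MB/1GB (x86) pages all print sensibly --
a userspace sketch, with the name and interface invented here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter: choose GB/MB/kB based on the page size. */
static void toy_format_huge_size(unsigned long bytes, char *buf, size_t len)
{
	if (bytes >= (1UL << 30))
		snprintf(buf, len, "%luGB", bytes >> 30);
	else if (bytes >= (1UL << 20))
		snprintf(buf, len, "%luMB", bytes >> 20);
	else
		snprintf(buf, len, "%lukB", bytes >> 10);
}
```

The report_hugepages() printk above would then not bake "MB" into its
format string.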
<snip>
> +/* Should be called on processing a hugepagesz=... option */
> +void __init huge_add_hstate(unsigned order)
> +{
> + struct hstate *h;
> + if (size_to_hstate(PAGE_SIZE << order)) {
> + printk("hugepagesz= specified twice, ignoring\n");
Needs a KERN_ level.
And did we decide whether specifying hugepagesz= multiple times is ok,
or not?
> + return;
> + }
> + BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> + BUG_ON(order < HPAGE_SHIFT - PAGE_SHIFT);
> + h = &hstates[max_hstate++];
> + h->order = order;
> + h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
> + hugetlb_init_hstate(h);
> + parsed_hstate = h;
> +}
> +
> static int __init hugetlb_setup(char *s)
> {
> - if (sscanf(s, "%lu", &max_huge_pages) <= 0)
> - max_huge_pages = 0;
> + if (sscanf(s, "%lu", &default_hstate_resv) <= 0)
> + default_hstate_resv = 0;
> return 1;
> }
> __setup("hugepages=", hugetlb_setup);
> @@ -544,28 +606,27 @@ static unsigned int cpuset_mems_nr(unsig
>
> #ifdef CONFIG_SYSCTL
> #ifdef CONFIG_HIGHMEM
> -static void try_to_free_low(unsigned long count)
> +static void try_to_free_low(struct hstate *h, unsigned long count)
> {
> - struct hstate *h = &global_hstate;
> int i;
>
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> struct list_head *freel = &h->hugepage_freelists[i];
> list_for_each_entry_safe(page, next, freel, lru) {
> - if (count >= nr_huge_pages)
> + if (count >= h->nr_huge_pages)
> return;
> if (PageHighMem(page))
> continue;
> list_del(&page->lru);
> - update_and_free_page(page);
> + update_and_free_page(h, page);
> h->free_huge_pages--;
> h->free_huge_pages_node[page_to_nid(page)]--;
> }
> }
> }
> #else
> -static inline void try_to_free_low(unsigned long count)
> +static inline void try_to_free_low(struct hstate *h, unsigned long count)
> {
> }
> #endif
> @@ -625,7 +686,7 @@ static unsigned long set_max_huge_pages(
> */
> min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> min_count = max(count, min_count);
> - try_to_free_low(min_count);
> + try_to_free_low(h, min_count);
> while (min_count < persistent_huge_pages(h)) {
> struct page *page = dequeue_huge_page(h);
> if (!page)
> @@ -648,6 +709,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> {
> proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> max_huge_pages = set_max_huge_pages(max_huge_pages);
> + global_hstate.max_huge_pages = max_huge_pages;
So this implies the sysctl still only controls the single hstate? Perhaps
it would be better if this patch made set_max_huge_pages() take an
hstate? Also, this seems to be the only place where max_huge_pages is
still used, so can't you just do:
global_hstate.max_huge_pages = set_max_huge_pages(max_huge_pages); ?
<snip>
> @@ -1296,7 +1358,7 @@ out:
> int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> {
> long ret, chg;
> - struct hstate *h = &global_hstate;
> + struct hstate *h = hstate_inode(inode);
>
> chg = region_chg(&inode->i_mapping->private_list, from, to);
> if (chg < 0)
> @@ -1315,7 +1377,7 @@ int hugetlb_reserve_pages(struct inode *
>
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> {
> - struct hstate *h = &global_hstate;
> + struct hstate *h = hstate_inode(inode);
Couldn't both of these changes have been made in the previous patch?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-25 17:38 ` Nishanth Aravamudan
@ 2008-04-25 17:48 ` Nishanth Aravamudan
2008-04-25 17:55 ` Andi Kleen
2008-05-23 5:18 ` Nick Piggin
2 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 17:48 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 25.04.2008 [10:38:27 -0700], Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:07 +1000], npiggin@suse.de wrote:
<snip>
> > @@ -648,6 +709,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> > {
> > proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > max_huge_pages = set_max_huge_pages(max_huge_pages);
> > + global_hstate.max_huge_pages = max_huge_pages;
>
> So this implies the sysctl still only controls the singe state? Perhaps
> it would be better if this patch made set_max_huge_pages() take an
> hstate? Also, this seems to be the only place where max_huge_pages is
> still used, so can't you just do:
>
> global_hstate.max_huge_pages = set_max_huge_pages(max_huge_pages); ?
Oops, sorry about the noise, max_huge_pages is the variable actually
modified by the sysctl.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-25 17:55 ` Andi Kleen
@ 2008-04-25 17:52 ` Nishanth Aravamudan
2008-04-25 18:10 ` Andi Kleen
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 17:52 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli, apw
On 25.04.2008 [19:55:03 +0200], Andi Kleen wrote:
> > Unnecessary initializations (and whitespace)?
>
> Actually gcc generates exactly the same code for 0 and no
> initialization.
All supported gcc's? Then checkpatch should be fixed?
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-25 17:38 ` Nishanth Aravamudan
2008-04-25 17:48 ` Nishanth Aravamudan
@ 2008-04-25 17:55 ` Andi Kleen
2008-04-25 17:52 ` Nishanth Aravamudan
2008-05-23 5:18 ` Nick Piggin
2 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-25 17:55 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, andi, kniht, abh, wli
> Unnecessary initializations (and whitespace)?
Actually gcc generates exactly the same code for 0 and no initialization.
-Andi
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-04-23 1:53 ` [patch 07/18] hugetlbfs: per mount hstates npiggin
@ 2008-04-25 18:09 ` Nishanth Aravamudan
2008-04-25 20:36 ` Nishanth Aravamudan
2008-05-23 5:24 ` Nick Piggin
0 siblings, 2 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 18:09 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> Add support to have individual hstates for each hugetlbfs mount
>
> - Add a new pagesize= option to the hugetlbfs mount that allows setting
> the page size
> - Set up pointers to a suitable hstate for the set page size option
> to the super block and the inode and the vma.
> - Change the hstate accessors to use this information
> - Add code to the hstate init function to set parsed_hstate for command
> line processing
> > - Handle duplicated hstate registrations to make the command line user-proof
>
> [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> include/linux/hugetlb.h | 14 +++++++++-----
> mm/hugetlb.c | 16 +++-------------
> mm/memory.c | 18 ++++++++++++++++--
> 4 files changed, 66 insertions(+), 30 deletions(-)
>
> Index: linux-2.6/include/linux/hugetlb.h
> ===================================================================
<snip>
> @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
>
> #define global_hstate (hstates[0])
>
> -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> +static inline struct hstate *hstate_inode(struct inode *i)
> {
> - return &global_hstate;
> + struct hugetlbfs_sb_info *hsb;
> + hsb = HUGETLBFS_SB(i->i_sb);
> + return hsb->hstate;
> }
>
> static inline struct hstate *hstate_file(struct file *f)
> {
> - return &global_hstate;
> + return hstate_inode(f->f_dentry->d_inode);
> }
>
> -static inline struct hstate *hstate_inode(struct inode *i)
> +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> {
> - return &global_hstate;
> + return hstate_file(vma->vm_file);
Odd, diff seems to think you've moved these two functions around
(hstate_{vma,inode})...
> static inline unsigned long huge_page_size(struct hstate *h)
> Index: linux-2.6/fs/hugetlbfs/inode.c
> ===================================================================
<snip>
> @@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
> break;
>
> case Opt_size: {
> - unsigned long long size;
> /* memparse() will accept a K/M/G without a digit */
> if (!isdigit(*args[0].from))
> goto bad_val;
> size = memparse(args[0].from, &rest);
> - if (*rest == '%') {
> - size <<= HPAGE_SHIFT;
> - size *= max_huge_pages;
> - do_div(size, 100);
> - }
> - pconfig->nr_blocks = (size >> HPAGE_SHIFT);
> + setsize = SIZE_STD;
> + if (*rest == '%')
> + setsize = SIZE_PERCENT;
This seems like a change that could be pulled into its own clean-up
patch and merged up quicker?
> @@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
> pconfig->nr_inodes = memparse(args[0].from, &rest);
> break;
>
> + case Opt_pagesize: {
> + unsigned long ps;
> + ps = memparse(args[0].from, &rest);
> + pconfig->hstate = size_to_hstate(ps);
> + if (!pconfig->hstate) {
> + printk(KERN_ERR
> + "hugetlbfs: Unsupported page size %lu MB\n",
> + ps >> 20);
This again will give odd output for page sizes < 1MB (64k on power).
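A suffix-aware formatter would avoid that. A minimal userspace sketch — a hypothetical helper, not the kernel's code — that picks the largest suffix dividing the size evenly:

```c
#include <stdio.h>

/* Userspace sketch addressing the "odd output for page sizes < 1MB"
 * concern (e.g. 64k on power). Hypothetical helper, not the kernel's
 * code: choose the largest suffix that divides the size evenly. */
static char *fmt_size(char *buf, size_t buflen, unsigned long n)
{
	if (n >= (1UL << 30) && !(n & ((1UL << 30) - 1)))
		snprintf(buf, buflen, "%lu GB", n >> 30);
	else if (n >= (1UL << 20) && !(n & ((1UL << 20) - 1)))
		snprintf(buf, buflen, "%lu MB", n >> 20);
	else
		snprintf(buf, buflen, "%lu KB", n >> 10);
	return buf;
}
```

With this, an unsupported 64k page size would print as "64 KB" rather than "0 MB".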
> @@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
> break;
> }
> }
> +
> + /* Do size after hstate is set up */
> + if (setsize > NO_SIZE) {
> + struct hstate *h = pconfig->hstate;
> + if (setsize == SIZE_PERCENT) {
> + size <<= huge_page_shift(h);
> + size *= h->max_huge_pages;
> + do_div(size, 100);
> + }
> + pconfig->nr_blocks = (size >> huge_page_shift(h));
> + }
Oh, I see. We just moved the percent calculation down here. Sorry about
that; it seems sensible to leave it in this patch then.
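The moved size=N% computation, as a standalone sketch (hypothetical helper name; do_div() becomes plain division in userspace):

```c
/* Standalone sketch of the size=N% path above: the percentage can only
 * be resolved after the mount's hstate (and so its page shift) is
 * known, which is why the calculation moved below hstate setup. */
static unsigned long long pct_to_blocks(unsigned long long pct,
					unsigned long max_huge_pages,
					unsigned int page_shift)
{
	unsigned long long size = pct << page_shift;

	size *= max_huge_pages;
	size /= 100;			/* do_div(size, 100) in the kernel */
	return size >> page_shift;	/* nr_blocks, in huge pages */
}
```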
> bad_val:
> @@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block
> config.uid = current->fsuid;
> config.gid = current->fsgid;
> config.mode = 0755;
> + config.hstate = size_to_hstate(HPAGE_SIZE);
So, we still only have one hugepage size, which is why this is written
this way. Seems odd that an early patch adds multiple hugepage size
support, but we don't actually need it in the series until much later...
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-25 17:52 ` Nishanth Aravamudan
@ 2008-04-25 18:10 ` Andi Kleen
2008-04-28 10:13 ` Andy Whitcroft
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-25 18:10 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli, apw
On Fri, Apr 25, 2008 at 10:52:49AM -0700, Nishanth Aravamudan wrote:
> On 25.04.2008 [19:55:03 +0200], Andi Kleen wrote:
> > > Unnecessary initializations (and whitespace)?
> >
> > Actually gcc generates exactly the same code for 0 and no
> > initialization.
>
> All supported gcc's? Then checkpatch should be fixed?
3.3-hammer did it already, 3.2 didn't. 3.2 is nominally still
supported, but I don't think we care particularly about its code
quality.
Yes checkpatch should be fixed.
-Andi
* Re: [patch 08/18] hugetlb: multi hstate sysctls
2008-04-23 1:53 ` [patch 08/18] hugetlb: multi hstate sysctls npiggin
@ 2008-04-25 18:14 ` Nishanth Aravamudan
2008-05-23 5:25 ` Nick Piggin
2008-04-25 23:35 ` Nishanth Aravamudan
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 18:14 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:10 +1000], npiggin@suse.de wrote:
> Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> now allows the removal of global_hstate -- everything is now hstate
> aware.
>
> - I didn't bother with hugetlb_shm_group and treat_as_movable,
> these are still single global.
> - Also improve error propagation for the sysctl handlers a bit
So, I may be mis-remembering, but the hugepages that are gigantic, that
is > MAX_ORDER, cannot be allocated or freed at run-time? If so, why do
we need to report them in the sysctl? It's a read-only value, right?
Similarly, for the sysfs interface thereto, can I just make them
read-only? I guess it would be an arbitrary difference from the other
files, but reflects reality?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-04-23 1:53 ` [patch 13/18] hugetlb: support boot allocate different sizes npiggin
2008-04-23 16:15 ` Andrew Hastings
@ 2008-04-25 18:40 ` Nishanth Aravamudan
2008-04-25 18:50 ` Andi Kleen
2008-05-23 5:36 ` Nick Piggin
1 sibling, 2 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 18:40 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:15 +1000], npiggin@suse.de wrote:
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -582,10 +582,13 @@ static void __init hugetlb_init_hstate(s
> {
> unsigned long i;
>
> - for (i = 0; i < MAX_NUMNODES; ++i)
> - INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> + /* Don't reinitialize lists if they have been already init'ed */
> + if (!h->hugepage_freelists[0].next) {
> + for (i = 0; i < MAX_NUMNODES; ++i)
> + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>
> - h->hugetlb_next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = first_node(node_online_map);
> + }
When would this be the case (the list is already init'd)?
> for (i = 0; i < h->max_huge_pages; ++i) {
> if (h->order >= MAX_ORDER) {
> @@ -594,7 +597,7 @@ static void __init hugetlb_init_hstate(s
> } else if (!alloc_fresh_huge_page(h))
> break;
> }
> - h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> + h->max_huge_pages = i;
Why don't we need to set these other values anymore?
> }
>
> static void __init hugetlb_init_hstates(void)
> @@ -602,7 +605,10 @@ static void __init hugetlb_init_hstates(
> struct hstate *h;
>
> for_each_hstate(h) {
> - hugetlb_init_hstate(h);
> + /* oversize hugepages were init'ed in early boot */
> + if (h->order < MAX_ORDER)
> + hugetlb_init_hstate(h);
> + max_huge_pages[h - hstates] = h->max_huge_pages;
So, you made max_huge_pages an array of the same size as the hstates
array, right?
So why can't we directly use h->max_huge_pages everywhere, and *only*
touch max_huge_pages in the sysctl path?
Oh right, I have a patch to do exactly this, but haven't posted it yet
(kind of got caught between my patchset and yours and forgotten).
max_huge_pages is a confusing variable (to me). I think its use should
be restricted to the sysctl as much as possible (and the sysctls
should be updated to only do work if write is set).
Does that seem sane to you?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-04-25 18:40 ` Nishanth Aravamudan
@ 2008-04-25 18:50 ` Andi Kleen
2008-04-25 20:05 ` Nishanth Aravamudan
2008-05-23 5:36 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-25 18:50 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
Nishanth Aravamudan wrote:
> When would this be the case (the list is already init'd)?
It can happen inside the series, before all the final checks for
multiple arguments are in place. In theory it could be removed at the
end, but it doesn't hurt either.
>
>> for (i = 0; i < h->max_huge_pages; ++i) {
>> if (h->order >= MAX_ORDER) {
>> @@ -594,7 +597,7 @@ static void __init hugetlb_init_hstate(s
>> } else if (!alloc_fresh_huge_page(h))
>> break;
>> }
>> - h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
>> + h->max_huge_pages = i;
>
> Why don't we need to set these other values anymore?
Because the low level functions handle them already (as a simple grep
would have told you)
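The pattern being referred to, as a minimal userspace sketch (simplified stand-in struct; the real helpers also link the page onto the per-node free list):

```c
/* Minimal sketch of the counter handling: the low-level free-list
 * helpers update free_huge_pages themselves, which is why the init
 * loop no longer assigns it. Simplified stand-in; MAX_NODES stands
 * in for MAX_NUMNODES. */
#define MAX_NODES 8

struct hstate {
	unsigned long free_huge_pages;
	unsigned int free_huge_pages_node[MAX_NODES];
};

static void enqueue_huge_page(struct hstate *h, int nid)
{
	h->free_huge_pages++;
	h->free_huge_pages_node[nid]++;
}

static void dequeue_huge_page(struct hstate *h, int nid)
{
	h->free_huge_pages--;
	h->free_huge_pages_node[nid]--;
}
```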
> I think its use should be restricted to the sysctl as much as possible
> (and the sysctls should be updated to only do work if write is set).
> Does that seem sane to you?
Fundamental rule of programming: Information should be only kept at a
single place if possible.
-Andi
* Re: [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-23 1:53 ` [patch 12/18] hugetlbfs: support larger than MAX_ORDER npiggin
2008-04-23 16:15 ` Andrew Hastings
@ 2008-04-25 18:55 ` Nishanth Aravamudan
2008-05-23 5:29 ` Nick Piggin
2008-04-30 21:01 ` Dave Hansen
2 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 18:55 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:14 +1000], npiggin@suse.de wrote:
> This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
> not practical to enlarge MAX_ORDER to 1GB.
>
> Instead the 1GB pages are only allocated at boot using the bootmem
> allocator using the hugepages=... option.
>
> These 1G bootmem pages are never freed. In theory it would be possible
> to implement that with some complications, but since it would be a one-way
> street (>= MAX_ORDER pages cannot be allocated later) I decided not to
> currently.
>
> The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very big
> and the ifdef ugliness did not seem worth it.
>
> Known problems: /proc/meminfo and "free" do not display the memory
> allocated for gb pages in "Total". This is a little confusing for the
> user.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 72 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -14,6 +14,7 @@
> #include <linux/mempolicy.h>
> #include <linux/cpuset.h>
> #include <linux/mutex.h>
> +#include <linux/bootmem.h>
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> @@ -160,7 +161,7 @@ static void free_huge_page(struct page *
> INIT_LIST_HEAD(&page->lru);
>
> spin_lock(&hugetlb_lock);
> - if (h->surplus_huge_pages_node[nid]) {
> + if (h->surplus_huge_pages_node[nid] && h->order < MAX_ORDER) {
Shouldn't all h->order accesses actually be using the huge_page_order()
to be consistent?
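The accessor pattern in question, as a minimal sketch: call sites go through huge_page_order()/huge_page_size() rather than touching h->order directly (simplified stand-in struct; PAGE_SHIFT assumed to be 12 here):

```c
/* Sketch of the hstate accessors from the patchset; the struct is a
 * simplified stand-in, not the kernel definition. */
#define PAGE_SHIFT 12	/* assumption: 4k base pages */

struct hstate {
	unsigned int order;
};

static inline unsigned int huge_page_order(struct hstate *h)
{
	return h->order;
}

static inline unsigned long huge_page_size(struct hstate *h)
{
	return 1UL << (huge_page_order(h) + PAGE_SHIFT);
}
```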
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-25 16:54 ` Nishanth Aravamudan
@ 2008-04-25 19:13 ` Christoph Lameter
2008-04-25 19:29 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Christoph Lameter @ 2008-04-25 19:13 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On Fri, 25 Apr 2008, Nishanth Aravamudan wrote:
> > >>> This happens to fix a minor bug. When alloc_bootmem_node returns
> > >>> a fallback node on a different node than passed the old code
> > >>> would have put it into the free lists of the wrong node.
> > >>> Now it would end up in the freelist of the correct node.
> > >> This is rather frustrating. The whole point of having the __GFP_THISNODE
> > >> flag is to indicate off-node allocations are *not* supported from the
> > >> caller... This was all worked on quite heavily a while back.
> >
> > Perhaps it was, but the result in hugetlb.c was not correct.
>
> Huh? There is a case in current code (current hugepage sizes) that
> allows __GFP_THISNODE to go off-node?
Argh. Danger. SLAB will crash and/or corrupt data if that occurs.
> > No, the bug is already there even without the bootmem patch.
>
> Where does alloc_pages_node go off-node? It is a bug in the core VM if
> it does, as we decided __GFP_THISNODE semantics with a nid specified
> indicates *no* fallback should occur.
But this is only for bootmem right? SLAB is not using bootmem so we could
make an exception there. The issue is support of __GFP_THISNODE in the
bootmem allocator?
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-25 19:13 ` Christoph Lameter
@ 2008-04-25 19:29 ` Nishanth Aravamudan
2008-04-30 19:16 ` Christoph Lameter
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 19:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On 25.04.2008 [12:13:19 -0700], Christoph Lameter wrote:
> On Fri, 25 Apr 2008, Nishanth Aravamudan wrote:
>
> > > >>> This happens to fix a minor bug. When alloc_bootmem_node returns
> > > >>> a fallback node on a different node than passed the old code
> > > >>> would have put it into the free lists of the wrong node.
> > > >>> Now it would end up in the freelist of the correct node.
> > > >> This is rather frustrating. The whole point of having the __GFP_THISNODE
> > > >> flag is to indicate off-node allocations are *not* supported from the
> > > >> caller... This was all worked on quite heavily a while back.
> > >
> > > Perhaps it was, but the result in hugetlb.c was not correct.
> >
> > Huh? There is a case in current code (current hugepage sizes) that
> > allows __GFP_THISNODE to go off-node?
>
> Argh. Danger. SLAB will crash and/or corrupt data if that occurs.
>
> > > No, the bug is already there even without the bootmem patch.
> >
> > Where does alloc_pages_node go off-node? It is a bug in the core VM if
> > it does, as we decided __GFP_THISNODE semantics with a nid specified
> > indicates *no* fallback should occur.
>
> But this is only for bootmem right? SLAB is not using bootmem so we could
> make an exception there. The issue is support of __GFP_THISNODE in the
> bootmem allocator?
I think so -- I'm not entirely sure. Andi, can you elucidate?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-04-25 18:50 ` Andi Kleen
@ 2008-04-25 20:05 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 20:05 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 25.04.2008 [20:50:39 +0200], Andi Kleen wrote:
> Nishanth Aravamudan wrote:
>
> > When would this be the case (the list is already init'd)?
>
> It can happen inside the series before all the final checks are in
> with multiple arguments. In theory it could be removed at the end,
> but then it doesn't hurt.
Ok, I guess that indicates to me an ordering issue, but perhaps it is
unavoidable.
> >> for (i = 0; i < h->max_huge_pages; ++i) {
> >> if (h->order >= MAX_ORDER) {
> >> @@ -594,7 +597,7 @@ static void __init hugetlb_init_hstate(s
> >> } else if (!alloc_fresh_huge_page(h))
> >> break;
> >> }
> >> - h->max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> >> + h->max_huge_pages = i;
> >
> > Why don't we need to set these other values anymore?
>
> Because the low level functions handle them already (as a simple grep
> would have told you)
[12:36:40]nacc@arkanoid:~/linux/views/linux-2.6-work$ rgrep free_huge_pages *
arch/x86/ia32/ia32entry.S: .quad quiet_ni_syscall /* free_huge_pages */
include/linux/hugetlb.h: unsigned long free_huge_pages;
include/linux/hugetlb.h: unsigned int free_huge_pages_node[MAX_NUMNODES];
mm/hugetlb.c: * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
mm/hugetlb.c: h->free_huge_pages++;
mm/hugetlb.c: h->free_huge_pages_node[nid]++;
mm/hugetlb.c: h->free_huge_pages--;
mm/hugetlb.c: h->free_huge_pages_node[nid]--;
mm/hugetlb.c: h->free_huge_pages--;
mm/hugetlb.c: h->free_huge_pages_node[nid]--;
mm/hugetlb.c: needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
mm/hugetlb.c: * because either resv_huge_pages or free_huge_pages may have changed.
mm/hugetlb.c: (h->free_huge_pages + allocated);
mm/hugetlb.c: h->free_huge_pages--;
mm/hugetlb.c: h->free_huge_pages_node[nid]--;
mm/hugetlb.c: if (h->free_huge_pages > h->resv_huge_pages)
mm/hugetlb.c: h->free_huge_pages);
mm/hugetlb.c: h->free_huge_pages--;
mm/hugetlb.c: h->free_huge_pages_node[page_to_nid(page)]--;
mm/hugetlb.c: min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
mm/hugetlb.c: n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
mm/hugetlb.c: free_huge_pages_node[nid]));
mm/hugetlb.c: if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
Hrm, I don't see a single assignment to free_huge_pages there.
grep'ing through the patches from Nick's series, I don't see any there
either (which would indicate I misapplied a patch).
Andi, I'm doing a review of the patches because it is needed (I haven't
seen a comprehensive set of responses yet) and because my work depends
on these patches doing the right thing. I would appreciate it if you
could give me slightly more useful responses -- for instance, the aside
comment in your reply was entirely unnecessary, as grep didn't shed any
insight *and* I am looking at the code in question as I work. Instead
please try to help me understand the patches.
If I didn't hope differently, I'd believe you don't want me to review
the patches at all.
> > I think its use should be restricted to the sysctl as much as
> > possible (and the sysctls should be updated to only do work if
> > write is set). Does that seem sane to you?
>
> Fundamental rule of programming: Information should be only kept at a
> single place if possible.
Ok ... I'm tired of reading one-sentence responses that don't answer my
questions and come across as insulting. The current patches duplicate
max_huge_pages *already*. My point was reduction. So if your response
was meant to be
"Yes, that does seem sane."
then that is all you needed to write. If it was
"No, that does not seem sane."
that would have been equally fine. But what you wrote has neither a
"yes" nor a "no" in it.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-04-25 18:09 ` Nishanth Aravamudan
@ 2008-04-25 20:36 ` Nishanth Aravamudan
2008-04-25 22:39 ` Nishanth Aravamudan
2008-05-23 5:24 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 20:36 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 25.04.2008 [11:09:33 -0700], Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> > Add support to have individual hstates for each hugetlbfs mount
> >
> > - Add a new pagesize= option to the hugetlbfs mount that allows setting
> > the page size
> > - Set up pointers to a suitable hstate for the set page size option
> > to the super block and the inode and the vma.
> > - Change the hstate accessors to use this information
> > - Add code to the hstate init function to set parsed_hstate for command
> > line processing
> > > - Handle duplicated hstate registrations to make the command line user-proof
> >
> > [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> > include/linux/hugetlb.h | 14 +++++++++-----
> > mm/hugetlb.c | 16 +++-------------
> > mm/memory.c | 18 ++++++++++++++++--
> > 4 files changed, 66 insertions(+), 30 deletions(-)
> >
> > Index: linux-2.6/include/linux/hugetlb.h
> > ===================================================================
>
> <snip>
>
> > @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
> >
> > #define global_hstate (hstates[0])
> >
> > -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > +static inline struct hstate *hstate_inode(struct inode *i)
> > {
> > - return &global_hstate;
> > + struct hugetlbfs_sb_info *hsb;
> > + hsb = HUGETLBFS_SB(i->i_sb);
> > + return hsb->hstate;
> > }
> >
> > static inline struct hstate *hstate_file(struct file *f)
> > {
> > - return &global_hstate;
> > + return hstate_inode(f->f_dentry->d_inode);
> > }
> >
> > -static inline struct hstate *hstate_inode(struct inode *i)
> > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > {
> > - return &global_hstate;
> > + return hstate_file(vma->vm_file);
>
> Odd, diff seems to think you've moved these two functions around
> (hstate_{vma,inode})...
Err, duh, which of course you have to because of the definitions :)
However, doesn't this now make core hugetlb functionality (which
really should only depend on CONFIG_HUGETLB_PAGE) depend on HUGETLBFS
being set to have access to HUGETLBFS_SB()? That seems to go in the
opposite direction from where we want to... Perhaps some of these
functions should be in the CONFIG_HUGETLBFS section of hugetlb.h?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-04-25 20:36 ` Nishanth Aravamudan
@ 2008-04-25 22:39 ` Nishanth Aravamudan
2008-04-28 18:20 ` Adam Litke
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 22:39 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli, agl
On 25.04.2008 [13:36:39 -0700], Nishanth Aravamudan wrote:
> On 25.04.2008 [11:09:33 -0700], Nishanth Aravamudan wrote:
> > On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> > > Add support to have individual hstates for each hugetlbfs mount
> > >
> > > - Add a new pagesize= option to the hugetlbfs mount that allows setting
> > > the page size
> > > - Set up pointers to a suitable hstate for the set page size option
> > > to the super block and the inode and the vma.
> > > - Change the hstate accessors to use this information
> > > - Add code to the hstate init function to set parsed_hstate for command
> > > line processing
> > > > - Handle duplicated hstate registrations to make the command line user-proof
> > >
> > > [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
> > >
> > > Signed-off-by: Andi Kleen <ak@suse.de>
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > ---
> > > fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> > > include/linux/hugetlb.h | 14 +++++++++-----
> > > mm/hugetlb.c | 16 +++-------------
> > > mm/memory.c | 18 ++++++++++++++++--
> > > 4 files changed, 66 insertions(+), 30 deletions(-)
> > >
> > > Index: linux-2.6/include/linux/hugetlb.h
> > > ===================================================================
> >
> > <snip>
> >
> > > @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
> > >
> > > #define global_hstate (hstates[0])
> > >
> > > -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > +static inline struct hstate *hstate_inode(struct inode *i)
> > > {
> > > - return &global_hstate;
> > > + struct hugetlbfs_sb_info *hsb;
> > > + hsb = HUGETLBFS_SB(i->i_sb);
> > > + return hsb->hstate;
> > > }
> > >
> > > static inline struct hstate *hstate_file(struct file *f)
> > > {
> > > - return &global_hstate;
> > > + return hstate_inode(f->f_dentry->d_inode);
> > > }
> > >
> > > -static inline struct hstate *hstate_inode(struct inode *i)
> > > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > {
> > > - return &global_hstate;
> > > + return hstate_file(vma->vm_file);
> >
> > Odd, diff seems to think you've moved these two functions around
> > (hstate_{vma,inode})...
>
> Err, duh, which of course you have to because of the definitions :)
>
> However, doesn't this now make core hugetlb functionality (which
> really should only depend on CONFIG_HUGETLB_PAGE) depend on HUGETLBFS
> being set to have access to HUGETLBFS_SB()? That seems to go in the
> opposite direction from where we want to... Perhaps some of these
> functions should be in the CONFIG_HUGETLBFS section of hugetlb.h?
Even if you don't move anything as I had originally suggested, I think
you need to express the CONFIG_ dependencies more clearly (that now
HUGETLB_PAGE depends on HUGETLBFS, afaict).
Urgh, there's actually other similar issue(s) in this file already...
if CONFIG_HUGETLBFS, is_file_hugepages() is defined and calls
is_file_shm_hugepages(), but that is defined in shm.h, which is only
included if CONFIG_HUGETLB_PAGE... Adam, that seems buggy? Is this just
further evidence that our current separation of the two options is
bull-honky?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 08/18] hugetlb: multi hstate sysctls
2008-04-23 1:53 ` [patch 08/18] hugetlb: multi hstate sysctls npiggin
2008-04-25 18:14 ` Nishanth Aravamudan
@ 2008-04-25 23:35 ` Nishanth Aravamudan
2008-05-23 5:28 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-25 23:35 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:10 +1000], npiggin@suse.de wrote:
> Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> now allows the removal of global_hstate -- everything is now hstate
> aware.
>
> - I didn't bother with hugetlb_shm_group and treat_as_movable,
> these are still single global.
> - Also improve error propagation for the sysctl handlers a bit
<snip>
> @@ -707,10 +717,25 @@ int hugetlb_sysctl_handler(struct ctl_ta
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> - proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> - max_huge_pages = set_max_huge_pages(max_huge_pages);
> - global_hstate.max_huge_pages = max_huge_pages;
> - return 0;
> + int err = 0;
> + struct hstate *h;
> +
> + err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> + if (err)
> + return err;
> +
> + if (write) {
> + for_each_hstate (h) {
> + int tmp;
> +
> + h->max_huge_pages = set_max_huge_pages(h,
> + max_huge_pages[h - hstates], &tmp);
> + max_huge_pages[h - hstates] = h->max_huge_pages;
> + if (tmp && !err)
> + err = tmp;
> + }
> + }
Could this same condition be added to the overcommit handler, please?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 14/18] hugetlb: printk cleanup
2008-04-23 1:53 ` [patch 14/18] hugetlb: printk cleanup npiggin
@ 2008-04-27 3:32 ` Nishanth Aravamudan
2008-05-23 5:37 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-27 3:32 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:16 +1000], npiggin@suse.de wrote:
> - Reword sentence to clarify meaning with multiple options
> - Add support for using GB prefixes for the page size
> - Add extra printk to delayed > MAX_ORDER allocation code
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> mm/hugetlb.c | 21 +++++++++++++++++----
> 1 file changed, 17 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c
> +++ linux-2.6/mm/hugetlb.c
> @@ -612,15 +612,28 @@ static void __init hugetlb_init_hstates(
> }
> }
>
> +static __init char *memfmt(char *buf, unsigned long n)
Nit: this function is the only one where __init precedes the return type?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-25 18:10 ` Andi Kleen
@ 2008-04-28 10:13 ` Andy Whitcroft
0 siblings, 0 replies; 123+ messages in thread
From: Andy Whitcroft @ 2008-04-28 10:13 UTC (permalink / raw)
To: Andi Kleen, Nishanth Aravamudan, akpm; +Cc: npiggin, linux-mm, kniht, abh, wli
On Fri, Apr 25, 2008 at 08:10:56PM +0200, Andi Kleen wrote:
> On Fri, Apr 25, 2008 at 10:52:49AM -0700, Nishanth Aravamudan wrote:
> > On 25.04.2008 [19:55:03 +0200], Andi Kleen wrote:
> > > > Unnecessary initializations (and whitespace)?
> > >
> > > Actually gcc generates exactly the same code for 0 and no
> > > initialization.
> >
> > All supported gcc's? Then checkpatch should be fixed?
>
> 3.3-hammer did it already, 3.2 didn't. 3.2 is nominally still
> supported, but I don't think we care particularly about its code
> quality.
>
> Yes checkpatch should be fixed.
Certainly, on the 4.1.2 I randomly picked to test, the size of the data
segment seems unchanged by initialisation to zero. It ends up in the
BSS as expected.
So I guess the question is do we want to maintain this recommendation
for consistency or has it outlived its usefulness?
Opinions?
-apw
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-04-25 22:39 ` Nishanth Aravamudan
@ 2008-04-28 18:20 ` Adam Litke
2008-04-28 18:46 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Adam Litke @ 2008-04-28 18:20 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, andi, kniht, abh, wli
On Fri, 2008-04-25 at 15:39 -0700, Nishanth Aravamudan wrote:
> On 25.04.2008 [13:36:39 -0700], Nishanth Aravamudan wrote:
> > On 25.04.2008 [11:09:33 -0700], Nishanth Aravamudan wrote:
> > > On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> > > > Add support to have individual hstates for each hugetlbfs mount
> > > >
> > > > - Add a new pagesize= option to the hugetlbfs mount that allows setting
> > > > the page size
> > > > - Set up pointers to a suitable hstate for the set page size option
> > > > to the super block and the inode and the vma.
> > > > - Change the hstate accessors to use this information
> > > > - Add code to the hstate init function to set parsed_hstate for command
> > > > line processing
> > > > - Handle duplicate hstate registrations to make the command line user-proof
> > > >
> > > > [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
> > > >
> > > > Signed-off-by: Andi Kleen <ak@suse.de>
> > > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > > ---
> > > > fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> > > > include/linux/hugetlb.h | 14 +++++++++-----
> > > > mm/hugetlb.c | 16 +++-------------
> > > > mm/memory.c | 18 ++++++++++++++++--
> > > > 4 files changed, 66 insertions(+), 30 deletions(-)
> > > >
> > > > Index: linux-2.6/include/linux/hugetlb.h
> > > > ===================================================================
> > >
> > > <snip>
> > >
> > > > @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
> > > >
> > > > #define global_hstate (hstates[0])
> > > >
> > > > -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > > +static inline struct hstate *hstate_inode(struct inode *i)
> > > > {
> > > > - return &global_hstate;
> > > > + struct hugetlbfs_sb_info *hsb;
> > > > + hsb = HUGETLBFS_SB(i->i_sb);
> > > > + return hsb->hstate;
> > > > }
> > > >
> > > > static inline struct hstate *hstate_file(struct file *f)
> > > > {
> > > > - return &global_hstate;
> > > > + return hstate_inode(f->f_dentry->d_inode);
> > > > }
> > > >
> > > > -static inline struct hstate *hstate_inode(struct inode *i)
> > > > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > > {
> > > > - return &global_hstate;
> > > > + return hstate_file(vma->vm_file);
> > >
> > > Odd, diff seems to think you've moved these two functions around
> > > (hstate_{vma,inode})...
> >
> > Err, duh, which of course you have to because of the definitions :)
> >
> > However, doesn't this now make a core hugetlb functionality (which
> > really should only depend on CONFIG_HUGETLB_PAGE) depend on HUGETLBFS
> > being set to have access to HUGETLBFS_SB()? That seems to go in the
> > opposite direction from where we want to... Perhaps some of these
> > functions should be in the CONFIG_HUGETLBFS section of hugetlb.h?
>
> Even if you don't move anything as I had originally suggested, I think
> you need to express the CONFIG_ dependencies more clearly (that now
> HUGETLB_PAGE depends on HUGETLBFS, afaict).
>
> Urgh, there's actually other similar issue(s) in this file already...
>
> if CONFIG_HUGETLBFS, is_file_hugepages() is defined and calls
> is_file_shm_hugepages(), but that is defined in shm.h, which is only
> included if CONFIG_HUGETLB_PAGE... Adam, that seems buggy? Is this just
> further evidence that our current separation of the two options is
> bull-honky?
Yeah. I'd say there is little reason to separate them anymore. I am
not an expert on the history here, but I suspect the original reason for
separating CONFIG_HUGETLBFS and CONFIG_HUGETLB_PAGE was a lack of
psychic abilities. Hugetlbfs is ubiquitous now and there is no other
valid way to use huge pages. Even SHM_HUGETLB shared memory segments
use hugetlbfs.
One thing you should check is which config options are required for the
hugetlb kernel mappings. Otherwise, I think we are in the clear to
merge them.
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-04-28 18:20 ` Adam Litke
@ 2008-04-28 18:46 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-28 18:46 UTC (permalink / raw)
To: Adam Litke; +Cc: npiggin, akpm, linux-mm, andi, kniht, abh, wli
On 28.04.2008 [13:20:49 -0500], Adam Litke wrote:
> On Fri, 2008-04-25 at 15:39 -0700, Nishanth Aravamudan wrote:
> > On 25.04.2008 [13:36:39 -0700], Nishanth Aravamudan wrote:
> > > On 25.04.2008 [11:09:33 -0700], Nishanth Aravamudan wrote:
> > > > On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> > > > > Add support to have individual hstates for each hugetlbfs mount
> > > > >
> > > > > - Add a new pagesize= option to the hugetlbfs mount that allows setting
> > > > > the page size
> > > > > - Set up pointers to a suitable hstate for the set page size option
> > > > > to the super block and the inode and the vma.
> > > > > - Change the hstate accessors to use this information
> > > > > - Add code to the hstate init function to set parsed_hstate for command
> > > > > line processing
> > > > > - Handle duplicate hstate registrations to make the command line user-proof
> > > > >
> > > > > [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
> > > > >
> > > > > Signed-off-by: Andi Kleen <ak@suse.de>
> > > > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > > > ---
> > > > > fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> > > > > include/linux/hugetlb.h | 14 +++++++++-----
> > > > > mm/hugetlb.c | 16 +++-------------
> > > > > mm/memory.c | 18 ++++++++++++++++--
> > > > > 4 files changed, 66 insertions(+), 30 deletions(-)
> > > > >
> > > > > Index: linux-2.6/include/linux/hugetlb.h
> > > > > ===================================================================
> > > >
> > > > <snip>
> > > >
> > > > > @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
> > > > >
> > > > > #define global_hstate (hstates[0])
> > > > >
> > > > > -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > > > +static inline struct hstate *hstate_inode(struct inode *i)
> > > > > {
> > > > > - return &global_hstate;
> > > > > + struct hugetlbfs_sb_info *hsb;
> > > > > + hsb = HUGETLBFS_SB(i->i_sb);
> > > > > + return hsb->hstate;
> > > > > }
> > > > >
> > > > > static inline struct hstate *hstate_file(struct file *f)
> > > > > {
> > > > > - return &global_hstate;
> > > > > + return hstate_inode(f->f_dentry->d_inode);
> > > > > }
> > > > >
> > > > > -static inline struct hstate *hstate_inode(struct inode *i)
> > > > > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > > > {
> > > > > - return &global_hstate;
> > > > > + return hstate_file(vma->vm_file);
> > > >
> > > > Odd, diff seems to think you've moved these two functions around
> > > > (hstate_{vma,inode})...
> > >
> > > Err, duh, which of course you have to because of the definitions :)
> > >
> > > However, doesn't this now make a core hugetlb functionality (which
> > > really should only depend on CONFIG_HUGETLB_PAGE) depend on HUGETLBFS
> > > being set to have access to HUGETLBFS_SB()? That seems to go in the
> > > opposite direction from where we want to... Perhaps some of these
> > > functions should be in the CONFIG_HUGETLBFS section of hugetlb.h?
> >
> > Even if you don't move anything as I had originally suggested, I think
> > you need to express the CONFIG_ dependencies more clearly (that now
> > HUGETLB_PAGE depends on HUGETLBFS, afaict).
> >
> > Urgh, there's actually other similar issue(s) in this file already...
> >
> > if CONFIG_HUGETLBFS, is_file_hugepages() is defined and calls
> > is_file_shm_hugepages(), but that is defined in shm.h, which is only
> > included if CONFIG_HUGETLB_PAGE... Adam, that seems buggy? Is this just
> > further evidence that our current separation of the two options is
> > bull-honky?
>
> Yeah. I'd say there is little reason to separate them anymore. I am
> not an expert on the history here, but I suspect the original reason for
> separating CONFIG_HUGETLBFS and CONFIG_HUGETLB_PAGE was a lack of
> psychic abilities. Hugetlbfs is ubiquitous now and there is no other
> valid way to use huge pages. Even SHM_HUGETLB shared memory segments
> use hugetlbfs.
Yeah, I was thinking it might make sense to merge them now and then
separate them back out later, if we do add any other interfaces to
hugepages.
> One thing you should check is which config options are required for
> the hugetlb kernel mappings. Otherwise, I think we are in the clear
> to merge them.
Yep, thanks,
Nish
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-23 1:53 ` [patch 05/18] hugetlb: multiple hstates npiggin
2008-04-25 17:38 ` Nishanth Aravamudan
@ 2008-04-29 17:27 ` Nishanth Aravamudan
2008-05-23 5:19 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-29 17:27 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:07 +1000], npiggin@suse.de wrote:
> Add basic support for more than one hstate in hugetlbfs
>
> - Convert hstates to an array
> - Add a first default entry covering the standard huge page size
> - Add functions for architectures to register new hstates
> - Add basic iterators over hstates
>
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> include/linux/hugetlb.h | 11 ++++
> mm/hugetlb.c | 112 +++++++++++++++++++++++++++++++++++++-----------
> 2 files changed, 97 insertions(+), 26 deletions(-)
>
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
<snip>
> +/* Should be called on processing a hugepagesz=... option */
> +void __init huge_add_hstate(unsigned order)
For consistency's sake, can we call this hugetlb_add_hstate()?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-25 19:29 ` Nishanth Aravamudan
@ 2008-04-30 19:16 ` Christoph Lameter
2008-04-30 20:44 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Christoph Lameter @ 2008-04-30 19:16 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On Fri, 25 Apr 2008, Nishanth Aravamudan wrote:
> I think so -- I'm not entirely sure. Andi, can you elucidate?
Finally had a look at the patch. This is fine because the GFP_THISNODE
option during the alloc will return a page on the indicated node or none.
page_to_nid must therefore return the node that was specified at alloc
time.
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-23 1:53 ` [patch 17/18] x86: add hugepagesz option " npiggin
@ 2008-04-30 19:34 ` Nishanth Aravamudan
2008-04-30 19:52 ` Andi Kleen
2008-04-30 20:40 ` Jon Tollefson
2008-04-30 20:48 ` Nishanth Aravamudan
1 sibling, 2 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-30 19:34 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:19 +1000], npiggin@suse.de wrote:
> Add a hugepagesz=... option similar to IA64, PPC etc. to x86-64.
>
> This finally allows selecting GB pages for hugetlbfs on x86 now
> that all the infrastructure is in place.
So, this patch sort of indicates how archs will need to be modified to
take advantage of the new infrastructure?
> Signed-off-by: Andi Kleen <ak@suse.de>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Documentation/kernel-parameters.txt | 11 +++++++++--
> arch/x86/mm/hugetlbpage.c | 17 +++++++++++++++++
> include/asm-x86/page.h | 2 ++
> 3 files changed, 28 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/arch/x86/mm/hugetlbpage.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
> +++ linux-2.6/arch/x86/mm/hugetlbpage.c
> @@ -424,3 +424,20 @@ hugetlb_get_unmapped_area(struct file *f
>
> #endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
>
> +#ifdef CONFIG_X86_64
> +static __init int setup_hugepagesz(char *opt)
> +{
> + unsigned long ps = memparse(opt, &opt);
> + if (ps == PMD_SIZE) {
> + huge_add_hstate(PMD_SHIFT - PAGE_SHIFT);
> + } else if (ps == PUD_SIZE && cpu_has_gbpages) {
> + huge_add_hstate(PUD_SHIFT - PAGE_SHIFT);
> + } else {
> + printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
> + ps >> 20);
> + return 0;
> + }
> + return 1;
> +}
> +__setup("hugepagesz=", setup_hugepagesz);
> +#endif
Did we decide whether the set of available hugepage sizes should depend
on the kernel command line or not?
I would prefer not; that is, the architecture specifies via an init-time
call what hstates it can support (through calls back into generic code
via huge_add_hstate()) and then generic code just supports/iterates over
those pagesizes. It doesn't depend on the administrator specifying
hugepagesz= at all, except if they want to preallocate a certain size
hugepage at boot-time (only strictly necessary for 1G/16G).
Now, this does mean that, for instance, a powerpc kernel may have
HUGE_MAX_HSTATE set to 3 (statically), but not actually be able to
support 3 huge page sizes (if the basepage size is 64k). So either we
need to make HUGE_MAX_HSTATE depend on the CONFIG options, which might
be ok, or we need to make for_each_hstate() also test that the hstate
entry in the array is !NULL?
> Index: linux-2.6/include/asm-x86/page.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/page.h
> +++ linux-2.6/include/asm-x86/page.h
> @@ -21,6 +21,8 @@
> #define HPAGE_MASK (~(HPAGE_SIZE - 1))
> #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
>
> +#define HUGE_MAX_HSTATE 2
> +
power would presumably make this 3, to support 64K,16M,16G (and 2, if
basepage size is 64K).
Another issue for power, though, is that there are local variables in
arch/powerpc/hugetlbpage.c that depend on the hugepage size in use (and
since there is only one, they're global). We really want those variables
to be per-hstate, though, right? The three I see are mmu_huge_psize,
HPAGE_SHIFT and hugepte_shift. For HPAGE_SHIFT, I think we could just
switch them over to huge_page_shift(h) given an hstate, but we would
need to make sure an hstate is available/obtainable at each point? Jon,
do you have any insight here? I want to make sure struct hstate is
future-proofed for other architectures than x86_64...
We probably want to see how converting powerpc looks, then get IA64,
sparc64 and sh on-board?
Thanks,
Nish
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 19:34 ` Nishanth Aravamudan
@ 2008-04-30 19:52 ` Andi Kleen
2008-04-30 20:02 ` Nishanth Aravamudan
2008-04-30 20:40 ` Jon Tollefson
1 sibling, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-30 19:52 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, andi, kniht, abh, wli
> I want to make sure struct hstate is
> future-proofed for other architectures than x86_64...
Kernel code doesn't need to be future-proof, because it can be changed at
any time.
-Andi
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 19:52 ` Andi Kleen
@ 2008-04-30 20:02 ` Nishanth Aravamudan
2008-04-30 20:19 ` Andi Kleen
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-30 20:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 30.04.2008 [21:52:37 +0200], Andi Kleen wrote:
> > I want to make sure struct hstate is
> > future-proofed for other architectures than x86_64...
>
> Kernel code doesn't need to be future-proof, because it can be changed
> at any time.
Then let's just merge whatever we'd like all the time? Why have review
at all?
To quote Nick from a separate discussion on similar future-proofing:
"Let's really try to put some thought into new sysfs locations. Not just
will it work, but is it logical and will it work tomorrow..."
So maybe future-proof is the wrong term, but I want to make sure the
infrastructure we have in place, where it claims to be generic and
usable by architectures (as has been my impression from the discussions
so far -- that it is extensible to other architectures), I want to be
sure that is really the case.
Thanks,
Nish
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 20:02 ` Nishanth Aravamudan
@ 2008-04-30 20:19 ` Andi Kleen
2008-04-30 20:23 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-30 20:19 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
> Then let's just merge whatever we'd like all the time? Why have review
> at all?
Looking for bugs and problems is good, but to be honest many of your
comments read more like
"nit-picking until it looks exactly like what I would have written"
and that is not the purpose of a review.
>
> To quote Nick from a separate discussion on similar future-proofing:
>
> "Let's really try to put some thought into new sysfs locations. Not just
> will it work, but is it logical and will it work tomorrow..."
ABIs are different from code: they have to be more future-proof because
changing them has more impact (although that seems to be commonly ignored
in sysfs).
> infrastructure we have in place, where it claims to be generic and
The hugetlbfs code actually doesn't claim that.
> usable by architectures (as has been my impression from the discussions
> so far -- that it is extensible to other architectures), I want to be
> sure that is really the case.
It is, with some future changes; but there is no need to do them for
the initial merge, as they can be done when additional architectures
are added.
-Andi
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 20:19 ` Andi Kleen
@ 2008-04-30 20:23 ` Nishanth Aravamudan
2008-04-30 20:45 ` Andi Kleen
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-30 20:23 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 30.04.2008 [22:19:32 +0200], Andi Kleen wrote:
> > Then let's just merge whatever we'd like all the time? Why have review
> > at all?
>
> Looking for bugs and problems is good, but to be honest many of your
> comments like more like "nit picking until it looks exactly what I
> would have written" and that is not the purpose of a review.
I'm sorry if that is how it appears. I promise you that is not the case.
You and Nick have tackled a hard problem and come up with an overall
good solution to it. I believe I said something similar in my reply to
the core (first) patch which abstracts the hstate in the first place. I
admittedly have not heard from Nick on any of my comments, so perhaps he
shares the same view that they are not productive. If so, I'll hold off
on any further review.
> > To quote Nick from a separate discussion on similar future-proofing:
> >
> > "Let's really try to put some thought into new sysfs locations. Not just
> > will it work, but is it logical and will it work tomorrow..."
>
> ABIs are different from code, they have to be more future proof
> because changing them has more impact (although that seems to be
> commonly ignored in sysfs)
>
> > infrastructure we have in place, where it claims to be generic and
>
> The hugetlbfs code actually doesn't claim that.
The hugetlb.c code is architecture independent and roughly generic (it
doesn't know a whole lot about the underlying architecture itself).
hstates are defined and used in this independent code -- hence my
perspective that we want to make sure it is flexible enough to handle
other architectures than x86_64, or at least easily extensible to them.
> > usable by architectures (as has been my impression from the discussions
> > so far -- that it is extensible to other architectures), I want to be
> > sure that is really the case.
>
> It is with some future changes, but there is no need to do them for
> the initial merge, but they can be done as additional architectures
> are added.
Well, Nick was talking about adding the powerpc bits to his stack when
he submitted for -mm, so these discussions should be happening now,
AFAICT.
Thanks,
Nish
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 19:34 ` Nishanth Aravamudan
2008-04-30 19:52 ` Andi Kleen
@ 2008-04-30 20:40 ` Jon Tollefson
1 sibling, 0 replies; 123+ messages in thread
From: Jon Tollefson @ 2008-04-30 20:40 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: npiggin, akpm, linux-mm, andi, abh, wli
Nishanth Aravamudan wrote:
<snip>
> power would presumably make this 3, to support 64K,16M,16G (and 2, if
> basepage size is 64K).
>
> Another issue for power, though, is that there are local variables in
> arch/powerpc/hugetlbpage.c that depend on the hugepage size in use (and
> since there is only one, they're global). We really want those variables
> to be per-hstate, though, right? The three I see are mmu_huge_psize,
> HPAGE_SHIFT and hugepte_shift. For HPAGE_SHIFT, I think we could just
> switch them over to huge_page_shift(h) given an hstate, but we would
> need to make sure an hstate is available/obtainable at each point? Jon,
> do you have any insight here? I want to make sure struct hstate is
>
So far I have used the page size or other lookup functions to determine
the hstate, and then use the hstate to get the information I need from
it. For private functions I have been passing the hstate around so that
it doesn't have to be looked up each time.
The only other item of note for power is the huge_pgtable_cache for each
huge page size, which is built based on the value of hugepte_shift.
> future-proofed for other architectures than x86_64...
>
> We probably want to see how converting powerpc looks, then get IA64,
> sparc64 and sh on-board?
>
> Thanks,
> Nish
>
> --
>
Jon
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-30 19:16 ` Christoph Lameter
@ 2008-04-30 20:44 ` Nishanth Aravamudan
2008-05-01 19:23 ` Christoph Lameter
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-30 20:44 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On 30.04.2008 [12:16:39 -0700], Christoph Lameter wrote:
> On Fri, 25 Apr 2008, Nishanth Aravamudan wrote:
>
> > I think so -- I'm not entirely sure. Andi, can you elucidate?
>
> Finally had a look at the patch. This is fine because the GFP_THISNODE
> option during the alloc will return a page on the indicated node or
> none.
Right...
> page_to_nid must therefore return the node that was specified at alloc
> time.
Sure, my point was that we already have the nid in the caller (because
we specify it along with GFP_THISNODE). So if we pass that nid down into
this new function, we shouldn't need to do the page_to_nid() call,
right?
Thanks,
Nish
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 20:23 ` Nishanth Aravamudan
@ 2008-04-30 20:45 ` Andi Kleen
2008-04-30 20:51 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-04-30 20:45 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
> If so, I'll hold off
> on any further review.
That's not what I asked for. Some of your comments were very useful,
pointing to real bugs and other problems; just some others were not.
Please continue reviewing, just make sure that all the comments are
focused on improving that particular code in its concrete current
application.
For example, I expect the bulk of the changes needed for PPC will just
be an additional add-on patchkit.
> > The hugetlbfs code actually doesn't claim that.
>
> The hugetlb.c code is architecture independent and roughly generic (it
> doesn't know a whole lot about the underlying architecture itself).
> hstates are defined and used in this independent code -- hence my
> perspective that we want to make sure it is flexible enough to handle
> other architectures than x86_64, or at least easily extensible to them.
It is extensible to them, but with some further changes (that is what
the patchkit claimed).
For power I think it would be best if you just started on the
incremental patches needed (in fact there was already such an add-on;
perhaps it can just be improved).
> Well, Nick was talking about adding the powerpc bits to his stack when
> he submited for -mm, so these discussions should be happening now,
> AFAICT.
The whole thing is work in progress and will undoubtedly change more
before it is really used. Nothing is put in stone yet.
-Andi
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-23 1:53 ` [patch 17/18] x86: add hugepagesz option " npiggin
2008-04-30 19:34 ` Nishanth Aravamudan
@ 2008-04-30 20:48 ` Nishanth Aravamudan
2008-05-23 5:41 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-30 20:48 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:19 +1000], npiggin@suse.de wrote:
> Add an hugepagesz=... option similar to IA64, PPC etc. to x86-64.
>
> This finally allows to select GB pages for hugetlbfs in x86 now
> that all the infrastructure is in place.
Another more basic question ... how do we plan on making these hugepages
available to applications? Obviously, an administrator can mount
hugetlbfs with pagesize=1G or whatever and then users (with appropriate
permissions) can mmap() files created therein. But what about
SHM_HUGETLB? It uses a private internal mount of hugetlbfs, which I
don't believe I saw a patch to add a pagesize= parameter for.
So SHM_HUGETLB will (for now) always get the "default" hugepagesize,
right, which should be the same as the legacy size? Given that an
architecture may support several hugepage sizes, I haven't been able to
come up with a good way to extend shmget() to specify the preferred
hugepagesize when SHM_HUGETLB is specified. I think for libhugetlbfs
purposes, we will probably add another environment variable to control
that...
Thanks,
Nish
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 20:45 ` Andi Kleen
@ 2008-04-30 20:51 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-04-30 20:51 UTC (permalink / raw)
To: Andi Kleen; +Cc: npiggin, akpm, linux-mm, kniht, abh, wli
On 30.04.2008 [22:45:09 +0200], Andi Kleen wrote:
> > If so, I'll hold off
> > on any further review.
>
> That's not what I asked for. Some of your comments were very useful
> by pointing to real bugs and other problems, just some others were
> not. Please continue reviewing, just make sure that all the comments
> are focused on improving that particular code in the concrete current
> application.
I will focus on this, thanks for the feedback. Per my just-sent mail,
I'm not sure what the "concrete application" is for 1G pages --
either a custom application or using libhugetlbfs with 1G pages. And the
latter is where most of my comments are coming from. When I've been
making nit-picky comments, I've tried to prefix them with "Nit". Those
have mostly been cosmetic or style-issues that simply show up obviously
in the diffs.
> For example the bulk of the changes needed for PPC I expect will just
> be an additional add on patchkit.
I agree. But it might be nice to minimize the churn and be aware of any
gotchas ahead of time. Hence why I asked you and Nick, as the
original authors, about the separation between arch-independent and
arch-dependent code. x86_64 seems to be relatively easy in this regard,
while power requires more state per-hugepagesize.
> > > The hugetlbfs code actually doesn't claim that.
> >
> > The hugetlb.c code is architecture independent and roughly generic (it
> > doesn't know a whole lot about the underlying architecture itself).
> > hstates are defined and used in this independent code -- hence my
> > perspective that we want to make sure it is flexible enough to handle
> > other architectures than x86_64, or at least easily extensible to them.
>
> It is extensible to them, but with some further changes (that is what
> the patchkit claimed)
>
> For power I think it would be best if you just started on the
> incremental patches needed (in fact there were already such an addon,
> perhaps that can be just improved)
Agreed, I'm starting to look at that with Jon.
> > Well, Nick was talking about adding the powerpc bits to his stack
> > when he submitted for -mm, so these discussions should be happening
> > now, AFAICT.
>
> The whole thing is work in progress and will undoubtedly change more
> before it is really used. Nothing is put in stone yet.
Thanks,
Nish
* Re: [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-23 1:53 ` [patch 12/18] hugetlbfs: support larger than MAX_ORDER npiggin
2008-04-23 16:15 ` Andrew Hastings
2008-04-25 18:55 ` Nishanth Aravamudan
@ 2008-04-30 21:01 ` Dave Hansen
2008-05-23 5:30 ` Nick Piggin
2 siblings, 1 reply; 123+ messages in thread
From: Dave Hansen @ 2008-04-30 21:01 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, nacc, abh, wli
On Wed, 2008-04-23 at 11:53 +1000, npiggin@suse.de wrote:
> +static int __init alloc_bm_huge_page(struct hstate *h)
I was just reading one of Jon's patches, and saw this. Could we expand
the '_bm_' to '_boot_'? Or, maybe rename to bootmem_alloc_hpage()?
'bm' just doesn't seem to register in my teeny brain.
-- Dave
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-04-30 20:44 ` Nishanth Aravamudan
@ 2008-05-01 19:23 ` Christoph Lameter
2008-05-01 20:25 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Christoph Lameter @ 2008-05-01 19:23 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On Wed, 30 Apr 2008, Nishanth Aravamudan wrote:
> Sure, my point was that we already have the nid in the caller (because
> we specify it along with GFP_THISNODE). So if we pass that nid down into
> this new function, we shouldn't need to do the page_to_nid() call,
> right?
Right. But it's safer to do the page_to_nid() there, in case we do an alloc
without __GFP_THISNODE in the future.
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-05-01 19:23 ` Christoph Lameter
@ 2008-05-01 20:25 ` Nishanth Aravamudan
2008-05-01 20:34 ` Christoph Lameter
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-01 20:25 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On 01.05.2008 [12:23:04 -0700], Christoph Lameter wrote:
> On Wed, 30 Apr 2008, Nishanth Aravamudan wrote:
>
> > Sure, my point was that we already have the nid in the caller (because
> > we specify it along with GFP_THISNODE). So if we pass that nid down into
> > this new function, we shouldn't need to do the page_to_nid() call,
> > right?
>
> Right. But it's safer to do the page_to_nid() there, in case we do an alloc
> without __GFP_THISNODE in the future.
In this caller, we will require __GFP_THISNODE for the foreseeable
future. Anything that changes that will be more invasive.
The other callsite can pass the page_to_nid() result as the third
argument.
I'm pretty sure when I first created alloc_huge_page_node(), you argued
for me *not* using page_to_nid() on the returned page because we expect
__GFP_THISNODE to do the right thing.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-05-01 20:25 ` Nishanth Aravamudan
@ 2008-05-01 20:34 ` Christoph Lameter
2008-05-01 21:01 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Christoph Lameter @ 2008-05-01 20:34 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On Thu, 1 May 2008, Nishanth Aravamudan wrote:
> I'm pretty sure when I first created alloc_huge_page_node(), you argued
> for me *not* using page_to_nid() on the returned page because we expect
> __GFP_THISNODE to do the right thing.
I vaguely remember that the issue at that point was that you were trying
to compensate for __GFP_THISNODE brokenness?
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-05-01 20:34 ` Christoph Lameter
@ 2008-05-01 21:01 ` Nishanth Aravamudan
2008-05-23 5:03 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-01 21:01 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, npiggin, akpm, linux-mm, kniht, abh, wli
On 01.05.2008 [13:34:23 -0700], Christoph Lameter wrote:
> On Thu, 1 May 2008, Nishanth Aravamudan wrote:
>
> > I'm pretty sure when I first created alloc_huge_page_node(), you argued
> > for me *not* using page_to_nid() on the returned page because we expect
> > __GFP_THISNODE to do the right thing.
>
> I vaguely remember that the issue at that point was that you were trying
> to compensate for __GFP_THISNODE brokenness?
That's a good point -- it was at the time. My point is again here, this
particular callpath *is* using __GFP_THISNODE -- and always will as it's
a node-specific function call. Other callpaths may not, yes, but they
are passing the page in, which means they can call page_to_nid(). Just
seems to calculate a nid we already have.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 06/18] hugetlb: multi hstate proc files
2008-04-23 1:53 ` [patch 06/18] hugetlb: multi hstate proc files npiggin
@ 2008-05-02 19:53 ` Nishanth Aravamudan
2008-05-23 5:22 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-02 19:53 UTC (permalink / raw)
To: npiggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.04.2008 [11:53:08 +1000], npiggin@suse.de wrote:
> Convert /proc output code over to report multiple hstates
>
> I chose to just report the numbers in a row, in the hope
> to minimze breakage of existing software. The "compat" page size
> is always the first number.
Only if add_huge_hstate() is called first for the compat page size,
right? That seems bad if we depend on an ordering.
For instance, for power, I think Jon is calling huge_add_hstate() from
the arch/powerpc/mm/hugetlbpage.c init routine. Which runs before
hugetlb_init, which means that if he adds hugepages like
huge_add_hstate(64k-order);
huge_add_hstate(16m-order);
huge_add_hstate(16g-order);
We'll get 64k as the first field in meminfo.
So perhaps what we should do is:
1) architectures define HPAGE_* as the default (compat) hugepage values
2) architectures have a call into generic code at their init time to
specify what sizes they support
3) the core is the only place that actually does huge_add_hstate() and
it always does it first for the compat order?
I wonder if this might lead to issues in timing between processing
hugepagesz= (in arch code) and hugepages= (in generic code). Not sure. I
guess if we always add all hugepage sizes, we should have all the
hstates we know about ready to configure and as long as hugetlb_init
runs before hugepages= processing, we should be fine? Dunno.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 04/18] hugetlb: modular state
2008-04-25 17:13 ` Nishanth Aravamudan
@ 2008-05-23 5:02 ` Nick Piggin
2008-05-23 20:48 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:02 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 10:13:46AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:06 +1000], npiggin@suse.de wrote:
> > Large, but rather mechanical patch that converts most of the hugetlb.c
> > globals into structure members and passes them around.
> >
> > Right now there is only a single global hstate structure, but most of
> > the infrastructure to extend it is there.
>
> While going through the patches as I apply them to 2.6.25-mm1 (as none
> will apply cleanly so far :), I have a few comments. I like this patch
> overall.
Thanks for all the feedback, and sorry for the delay. I'm just rebasing
things now and getting through all the feedback.
I really do appreciate the comments and have made a lot of changes that
you've suggested...
> > Index: linux-2.6/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.orig/mm/hugetlb.c
> > +++ linux-2.6/mm/hugetlb.c
>
> <snip>
>
> > +struct hstate global_hstate;
>
> One thing I noticed throughout is that it's sort of inconsistent where a
> hstate is passed to a function and where it's locally determined in
> functions. It seems like we should obtain the hstate as early as
> possible and just pass the pointer down as needed ... except in those
> contexts that we don't control the caller, of course. That seems to be
> more flexible than the way this patch does it, especially given that the
> whole thing is a series that immediately extends this infrastructure to
> multiple hugepage sizes. That would seem to, at least, make the
> follow-on patches easier to follow.
I guess the intermediate state doesn't look pleasing, but it is
functional and gets us to the destination. I think the places
that get the global hstate should tend to be those where it is
natural in later patches to derive the (non global) hstate.
If there are particular places where the global hstate is obtained
which is subsequently moved down the stack when it is made non
global, let me know. But otherwise I hadn't noticed any glaring
problems... Ahh, I see you've found several, and I agree with all
of them.
> > /*
> > * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> > */
> > static DEFINE_SPINLOCK(hugetlb_lock);
>
> Not sure if this makes sense or not, but would it be useful to make the
> lock be per-hstate? It is designed to protect the counters and the
> freelists, but those are per-hstate, right? Would need heavy testing,
> but might be useful for varying apps both trying to use different size
> hugepages simultaneously?
Hmm, sure we could do that. Although obviously it would be another
patchset, and actually I'd be concerned about making hstate the
unit of scalability in hugetlbfs -- a single hstate should be
sufficiently scalable to handle workloads reasonably.
Good point, but at any rate I guess this patchset isn't the place
to do it.
> <snip>
>
> > @@ -98,18 +93,19 @@ static struct page *dequeue_huge_page_vm
> > struct zonelist *zonelist = huge_zonelist(vma, address,
> > htlb_alloc_mask, &mpol);
> > struct zone **z;
> > + struct hstate *h = hstate_vma(vma);
>
> Why not make dequeue_huge_page_vma() take an hstate too? All the callers
> have the vma, which means they can do this call themselves ... makes
> for a more consistent API between the two dequeue_ variants.
Agree...
> <snip>
>
> > static void free_huge_page(struct page *page)
> > {
> > + struct hstate *h = &global_hstate;
> > int nid = page_to_nid(page);
> > struct address_space *mapping;
>
> Similarly, the only caller of free_huge_page has already figured out the
> hstate to use (even if there is only one) -- why not pass it down here?
>
> Oh here it might be because free_huge_page is used as the destructor --
> perhaps add a comment?
Right. I can add a little comment if you like.
> <snip>
>
> > -static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> > - unsigned long address)
> > +static struct page *alloc_buddy_huge_page(struct hstate *h,
> > + struct vm_area_struct *vma,
> > + unsigned long address)
> > {
> > struct page *page;
> > unsigned int nid;
> > @@ -277,17 +275,17 @@ static struct page *alloc_buddy_huge_pag
> > * per-node value is checked there.
> > */
> > spin_lock(&hugetlb_lock);
> > - if (surplus_huge_pages >= nr_overcommit_huge_pages) {
> > + if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> > spin_unlock(&hugetlb_lock);
> > return NULL;
> > } else {
> > - nr_huge_pages++;
> > - surplus_huge_pages++;
> > + h->nr_huge_pages++;
> > + h->surplus_huge_pages++;
> > }
> > spin_unlock(&hugetlb_lock);
> >
> > page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > - HUGETLB_PAGE_ORDER);
> > + huge_page_order(h));
>
> Nit: odd indentation?
>
> <snip>
>
> > @@ -539,19 +546,21 @@ static unsigned int cpuset_mems_nr(unsig
> > #ifdef CONFIG_HIGHMEM
> > static void try_to_free_low(unsigned long count)
> > {
>
> Shouldn't this just take an hstate as a parameter?
It does in a subsequent patch... I'll see if that makes sense to
pull it up to here.
> > + struct hstate *h = &global_hstate;
> > int i;
> >
> > for (i = 0; i < MAX_NUMNODES; ++i) {
> > struct page *page, *next;
> > - list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> > + struct list_head *freel = &h->hugepage_freelists[i];
> > + list_for_each_entry_safe(page, next, freel, lru) {
>
> Was this done just to make the line shorter? Just want to make sure I'm
> not missing something.
AFAIKS, yes.
> <snip>
>
> > int hugetlb_report_meminfo(char *buf)
> > {
> > + struct hstate *h = &global_hstate;
> > return sprintf(buf,
> > "HugePages_Total: %5lu\n"
> > "HugePages_Free: %5lu\n"
> > "HugePages_Rsvd: %5lu\n"
> > "HugePages_Surp: %5lu\n"
> > "Hugepagesize: %5lu kB\n",
> > - nr_huge_pages,
> > - free_huge_pages,
> > - resv_huge_pages,
> > - surplus_huge_pages,
> > - HPAGE_SIZE/1024);
> > + h->nr_huge_pages,
> > + h->free_huge_pages,
> > + h->resv_huge_pages,
> > + h->surplus_huge_pages,
> > + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
>
> "- 10"? I think this should be easier to get at than this? Oh I guess
> it's to get it into kilobytes... Seems kind of odd, but I guess it's
> fine.
I agree it's not perfect, but I might just leave all these for
a subsequent patchset (or can stick improvements to the end of
this patchset).
> <snip>
>
> > Index: linux-2.6/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/hugetlb.h
> > +++ linux-2.6/include/linux/hugetlb.h
> > @@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
> >
> > /* arch callbacks */
> >
> > -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> > +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz);
> > pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
> > int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> > struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> > @@ -95,7 +95,6 @@ pte_t huge_ptep_get_and_clear(struct mm_
> > #else
> > void hugetlb_prefault_arch_hook(struct mm_struct *mm);
> > #endif
> > -
>
> Unrelated whitespace change?
Fixed.
> > #else /* !CONFIG_HUGETLB_PAGE */
> >
> > static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> > @@ -169,8 +168,6 @@ struct file *hugetlb_file_setup(const ch
> > int hugetlb_get_quota(struct address_space *mapping, long delta);
> > void hugetlb_put_quota(struct address_space *mapping, long delta);
> >
> > -#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
> > -
>
> Rather than deleting this and then putting the similar calculation in
> the two callers, perhaps use an inline to calculate it and call that in
> the two places you change?
Done.
> > static inline int is_file_hugepages(struct file *file)
> > {
> > if (file->f_op == &hugetlbfs_file_operations)
> > @@ -199,4 +196,71 @@ unsigned long hugetlb_get_unmapped_area(
> > unsigned long flags);
> > #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
> >
> > +#ifdef CONFIG_HUGETLB_PAGE
>
> Why another block of HUGETLB_PAGE? Shouldn't this go at the end of the
> other one? And the !HUGETLB_PAGE within the corresponding #else?
Hmm, possibly. As has been noted, the CONFIG_ things are a bit
broken, and they should just get merged into one. I'll steer
clear of that area for the moment, as everything is working now,
but consolidating the options and cleaning things up would be
a good idea.
> > +
> > +/* Defines one hugetlb page size */
> > +struct hstate {
> > + int hugetlb_next_nid;
> > + unsigned int order;
>
> Which is actually a shift, too, right? So why not just call it that? No
> function should be directly accessing these members, so the function
> name indicates how the shift is being used?
I don't feel strongly. If you really do, then I guess it could be
changed.
> > + unsigned long mask;
> > + unsigned long max_huge_pages;
> > + unsigned long nr_huge_pages;
> > + unsigned long free_huge_pages;
> > + unsigned long resv_huge_pages;
> > + unsigned long surplus_huge_pages;
> > + unsigned long nr_overcommit_huge_pages;
> > + struct list_head hugepage_freelists[MAX_NUMNODES];
> > + unsigned int nr_huge_pages_node[MAX_NUMNODES];
> > + unsigned int free_huge_pages_node[MAX_NUMNODES];
> > + unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> > +};
> > +
> > +extern struct hstate global_hstate;
> > +
> > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > +{
> > + return &global_hstate;
> > +}
>
> After having looked at this functions while reviewing, it does seem like
> it might be more intuitive to read vma_hstate ("vma's hstate") rather
> than hstate_vma ("hstate's vma"?). But your call.
Again I don't feel strongly. Hstate prefix has some upsides.
> <snip>
>
> > Index: linux-2.6/mm/mempolicy.c
> > ===================================================================
> > --- linux-2.6.orig/mm/mempolicy.c
> > +++ linux-2.6/mm/mempolicy.c
> > @@ -1295,7 +1295,8 @@ struct zonelist *huge_zonelist(struct vm
> > if (pol->policy == MPOL_INTERLEAVE) {
> > unsigned nid;
> >
> > - nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> > + nid = interleave_nid(pol, vma, addr,
> > + huge_page_shift(hstate_vma(vma)));
> > if (unlikely(pol != &default_policy &&
> > pol != current->mempolicy))
> > __mpol_free(pol); /* finished with pol */
> > @@ -1944,9 +1945,12 @@ static void check_huge_range(struct vm_a
> > {
> > unsigned long addr;
> > struct page *page;
> > + struct hstate *h = hstate_vma(vma);
> > + unsigned sz = huge_page_size(h);
>
> This should be unsigned long?
Thanks, missed that.
* Re: [patch 02/18] hugetlb: factor out huge_new_page
2008-05-01 21:01 ` Nishanth Aravamudan
@ 2008-05-23 5:03 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:03 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: Christoph Lameter, Andi Kleen, akpm, linux-mm, kniht, abh, wli
On Thu, May 01, 2008 at 02:01:16PM -0700, Nishanth Aravamudan wrote:
> On 01.05.2008 [13:34:23 -0700], Christoph Lameter wrote:
> > On Thu, 1 May 2008, Nishanth Aravamudan wrote:
> >
> > > I'm pretty sure when I first created alloc_huge_page_node(), you argued
> > > for me *not* using page_to_nid() on the returned page because we expect
> > > __GFP_THISNODE to do the right thing.
> >
> > I vaguely remember that the issue at that point was that you were trying
> > to compensate for __GFP_THISNODE brokenness?
>
> That's a good point -- it was at the time. My point is again here, this
> particular callpath *is* using __GFP_THISNODE -- and always will as it's
> a node-specific function call. Other callpaths may not, yes, but they
> are passing the page in, which means they can call page_to_nid(). Just
> seems to calculate a nid we already have.
Given the discussion, I'll pass in the nid argument and add a comment.
Thanks
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-25 17:38 ` Nishanth Aravamudan
2008-04-25 17:48 ` Nishanth Aravamudan
2008-04-25 17:55 ` Andi Kleen
@ 2008-05-23 5:18 ` Nick Piggin
2 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:18 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 10:38:27AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:07 +1000], npiggin@suse.de wrote:
> > +#define for_each_hstate(h) \
> > + for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
> >
> > /*
> > * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> > @@ -128,9 +138,19 @@ static void update_and_free_page(struct
> > __free_pages(page, huge_page_order(h));
> > }
> >
> > +struct hstate *size_to_hstate(unsigned long size)
> > +{
> > + struct hstate *h;
> > + for_each_hstate (h) {
>
> Extraneous space?
Tried to make the spacing and style (eg. empty lines, __init after data
type, before var name, etc that you noted to be more consistent. Let me
know if you spot any more glaring problems). Thanks.
> > + if (huge_page_size(h) == size)
> > + return h;
> > + }
> > + return NULL;
> > +}
>
> Might become annoying if we add many hugepagesizes, but I guess we'll
> never have enough to really matter. Just don't want to have to worry
> about this loop for performance reasons when only one hugepage size is
> in use? Would it make sense to cache the last value used? Probably
> overkill for now.
It can probably be added to the compound page struct if it ever
really becomes a problem.
> > static void free_huge_page(struct page *page)
> > {
> > - struct hstate *h = &global_hstate;
> > + struct hstate *h = size_to_hstate(PAGE_SIZE << compound_order(page));
>
> Perhaps this could be made a static inline function?
>
> static inline struct hstate *page_hstate(struct page *page)
> {
> 	return size_to_hstate(PAGE_SIZE << compound_order(page));
> }
>
> I guess I haven't checked yet if it's used anywhere else, but it makes
> things a little clearer, perhaps?
Done this, nice little cleanup I think. Makes it easy to stick the
hstate in struct page if we are ever so inclined.
> And this is only needed to be done actually for the destructor case?
> Technically, we have the hstate already in the set_max_huge_pages()
> path? Might be worth a cleanup down-the-road.
Could be. The dtor path is probably the fastpath, no? In which case
it probably doesn't matter too much to always derive the hstate here.
If we hit problems in the fastpath we'll put hstate into the page
maybe.
> > int nid = page_to_nid(page);
> > struct address_space *mapping;
> >
> > @@ -495,38 +515,80 @@ static struct page *alloc_huge_page(stru
> > return page;
> > }
> >
> > -static int __init hugetlb_init(void)
> > +static void __init hugetlb_init_hstate(struct hstate *h)
>
> Could this perhaps be named hugetlb_init_one_hstate()? Makes it harder
> for me to go cross-eyed as I go between the functions :)
Done.
> <snip>
>
> > +static void __init report_hugepages(void)
> > +{
> > + struct hstate *h;
> > +
> > + for_each_hstate(h) {
> > + printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> > + h->free_huge_pages,
> > + 1 << (h->order + PAGE_SHIFT - 20));
>
> This will need to be changed for 64K hugepages (which already exist in
> mainline). Perhaps we need a hugepage_units() function :)
Again, you are right, but I'll leave this out of these patches and
add some on the end to work with smaller hugepages.
> <snip>
>
> > +/* Should be called on processing a hugepagesz=... option */
> > +void __init huge_add_hstate(unsigned order)
> > +{
> > + struct hstate *h;
> > + if (size_to_hstate(PAGE_SIZE << order)) {
> > + printk("hugepagesz= specified twice, ignoring\n");
>
> Needs a KERN_ level.
Done.
> And did we decide whether specifying hugepagesz= multiple times is ok,
> or not?
Well, the kernel shouldn't crash, but there isn't much we can do other
than just register the given hugepagesz.
> > + return;
> > + }
> > + BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> > + BUG_ON(order < HPAGE_SHIFT - PAGE_SHIFT);
> > + h = &hstates[max_hstate++];
> > + h->order = order;
> > + h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
> > + hugetlb_init_hstate(h);
> > + parsed_hstate = h;
> > +}
> > +
> > static int __init hugetlb_setup(char *s)
> > {
> > - if (sscanf(s, "%lu", &max_huge_pages) <= 0)
> > - max_huge_pages = 0;
> > + if (sscanf(s, "%lu", &default_hstate_resv) <= 0)
> > + default_hstate_resv = 0;
> > return 1;
> > }
> > __setup("hugepages=", hugetlb_setup);
> > @@ -544,28 +606,27 @@ static unsigned int cpuset_mems_nr(unsig
> >
> > #ifdef CONFIG_SYSCTL
> > #ifdef CONFIG_HIGHMEM
> > -static void try_to_free_low(unsigned long count)
> > +static void try_to_free_low(struct hstate *h, unsigned long count)
> > {
> > - struct hstate *h = &global_hstate;
> > int i;
> >
> > for (i = 0; i < MAX_NUMNODES; ++i) {
> > struct page *page, *next;
> > struct list_head *freel = &h->hugepage_freelists[i];
> > list_for_each_entry_safe(page, next, freel, lru) {
> > - if (count >= nr_huge_pages)
> > + if (count >= h->nr_huge_pages)
> > return;
> > if (PageHighMem(page))
> > continue;
> > list_del(&page->lru);
> > - update_and_free_page(page);
> > + update_and_free_page(h, page);
> > h->free_huge_pages--;
> > h->free_huge_pages_node[page_to_nid(page)]--;
> > }
> > }
> > }
> > #else
> > -static inline void try_to_free_low(unsigned long count)
> > +static inline void try_to_free_low(struct hstate *h, unsigned long count)
> > {
> > }
> > #endif
> > @@ -625,7 +686,7 @@ static unsigned long set_max_huge_pages(
> > */
> > min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> > min_count = max(count, min_count);
> > - try_to_free_low(min_count);
> > + try_to_free_low(h, min_count);
> > while (min_count < persistent_huge_pages(h)) {
> > struct page *page = dequeue_huge_page(h);
> > if (!page)
> > @@ -648,6 +709,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> > {
> > proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > max_huge_pages = set_max_huge_pages(max_huge_pages);
> > + global_hstate.max_huge_pages = max_huge_pages;
>
> So this implies the sysctl still only controls the single state? Perhaps
> it would be better if this patch made set_max_huge_pages() take an
> hstate? Also, this seems to be the only place where max_huge_pages is
> still used, so can't you just do:
>
> global_hstate.max_huge_pages = set_max_huge_pages(max_huge_pages); ?
It is a little tricky because we use the contiguous array to do the
sysctl stuff, and copy it back to the appropriate hstate. It could use
some cleanup somehow, but perhaps not in this patchset.
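The contiguous-array dance can be sketched as follows. These are hypothetical, simplified stand-ins for the structures involved (field and variable names follow the patch, but this is only an illustration of the copy-back step, not the real handler):

```c
#define HUGE_MAX_HSTATE 2

/* Simplified stand-in for the kernel's struct hstate. */
struct hstate {
	unsigned long max_huge_pages;
};

static struct hstate hstates[HUGE_MAX_HSTATE];

/* Contiguous array the sysctl machinery reads and writes directly. */
static unsigned long max_huge_pages[HUGE_MAX_HSTATE];

/* After proc_doulongvec_minmax() has written the user's values into the
 * contiguous max_huge_pages[] array, propagate each slot back to the
 * hstate it belongs to. */
static void sysctl_copy_back(void)
{
	unsigned int i;

	for (i = 0; i < HUGE_MAX_HSTATE; i++)
		hstates[i].max_huge_pages = max_huge_pages[i];
}
```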
> <snip>
>
> > @@ -1296,7 +1358,7 @@ out:
> > int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> > {
> > long ret, chg;
> > - struct hstate *h = &global_hstate;
> > + struct hstate *h = hstate_inode(inode);
> >
> > chg = region_chg(&inode->i_mapping->private_list, from, to);
> > if (chg < 0)
> > @@ -1315,7 +1377,7 @@ int hugetlb_reserve_pages(struct inode *
> >
> > void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> > {
> > - struct hstate *h = &global_hstate;
> > + struct hstate *h = hstate_inode(inode);
>
> Couldn't both of these changes have been made in the previous patch?
Yes, thanks I've done that.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 05/18] hugetlb: multiple hstates
2008-04-29 17:27 ` Nishanth Aravamudan
@ 2008-05-23 5:19 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:19 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Tue, Apr 29, 2008 at 10:27:34AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:07 +1000], npiggin@suse.de wrote:
> > Add basic support for more than one hstate in hugetlbfs
> >
> > - Convert hstates to an array
> > - Add a first default entry covering the standard huge page size
> > - Add functions for architectures to register new hstates
> > - Add basic iterators over hstates
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > include/linux/hugetlb.h | 11 ++++
> > mm/hugetlb.c | 112 +++++++++++++++++++++++++++++++++++++-----------
> > 2 files changed, 97 insertions(+), 26 deletions(-)
> >
> > Index: linux-2.6/mm/hugetlb.c
> > ===================================================================
>
> <snip>
>
> > +/* Should be called on processing a hugepagesz=... option */
> > +void __init huge_add_hstate(unsigned order)
>
> For consistency's sake, can we call this hugetlb_add_hstate()?
Yes.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 06/18] hugetlb: multi hstate proc files
2008-05-02 19:53 ` Nishanth Aravamudan
@ 2008-05-23 5:22 ` Nick Piggin
2008-05-23 20:30 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:22 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, May 02, 2008 at 12:53:11PM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:08 +1000], npiggin@suse.de wrote:
> > Convert /proc output code over to report multiple hstates
> >
> > I chose to just report the numbers in a row, in the hope
> > to minimize breakage of existing software. The "compat" page size
> > is always the first number.
>
> Only if huge_add_hstate() is called first for the compat page size,
> right? That seems bad if we depend on an ordering.
>
> For instance, for power, I think Jon is calling huge_add_hstate() from
> the arch/powerpc/mm/hugetlbpage.c init routine. Which runs before
> hugetlb_init, which means that if he adds hugepages like
>
> huge_add_hstate(64k-order);
> huge_add_hstate(16m-order);
> huge_add_hstate(16g-order);
>
> We'll get 64k as the first field in meminfo.
>
> So perhaps what we should do is:
>
> 1) architectures define HPAGE_* as the default (compat) hugepage values
> 2) architectures have a call into generic code at their init time to
> specify what sizes they support
> 3) the core is the only place that actually does huge_add_hstate() and
> it always does it first for the compat order?
>
> I wonder if this might lead to issues in timing between processing
> hugepagesz= (in arch code) and hugepages= (in generic code). Not sure. I
> guess if we always add all hugepage sizes, we should have all the
> hstates we know about ready to configure and as long as hugetlb_init
> runs before hugepages= processing, we should be fine? Dunno.
You're right, I think. The other thing is that we could just have
a small map from the hstate array to reporting order for sysctls.
We could report them in the order specified on the cmdline, with
the default size first if it was not specified on the cmdline.
Hmm, I'll see how that looks.
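The small reporting-order map floated here could look something like this: hstates[] keeps registration order, while a second array records hstate indices in the order their numbers should be printed in /proc and the sysctls (default size first unless the command line said otherwise). All of the names below are hypothetical:

```c
#define HUGE_MAX_HSTATE 3

/* Hypothetical map from reporting position to hstate index. */
static unsigned int report_order[HUGE_MAX_HSTATE];
static unsigned int nr_report;

/* Record the next hstate to report, in command-line order. */
static void report_order_add(unsigned int hstate_idx)
{
	report_order[nr_report++] = hstate_idx;
}

/* Translate a reporting position back to an hstate index. */
static unsigned int report_order_get(unsigned int pos)
{
	return report_order[pos];
}
```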
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-04-25 18:09 ` Nishanth Aravamudan
2008-04-25 20:36 ` Nishanth Aravamudan
@ 2008-05-23 5:24 ` Nick Piggin
2008-05-23 20:34 ` Nishanth Aravamudan
1 sibling, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:24 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 11:09:33AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> > Add support to have individual hstates for each hugetlbfs mount
> >
> > - Add a new pagesize= option to the hugetlbfs mount that allows setting
> > the page size
> > - Set up pointers to a suitable hstate for the set page size option
> > to the super block and the inode and the vma.
> > - Change the hstate accessors to use this information
> > - Add code to the hstate init function to set parsed_hstate for command
> > line processing
> > - Handle duplicated hstate registrations to make the command line user-proof
> >
> > [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> > include/linux/hugetlb.h | 14 +++++++++-----
> > mm/hugetlb.c | 16 +++-------------
> > mm/memory.c | 18 ++++++++++++++++--
> > 4 files changed, 66 insertions(+), 30 deletions(-)
> >
> > Index: linux-2.6/include/linux/hugetlb.h
> > ===================================================================
>
> <snip>
>
> > @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
> >
> > #define global_hstate (hstates[0])
> >
> > -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > +static inline struct hstate *hstate_inode(struct inode *i)
> > {
> > - return &global_hstate;
> > + struct hugetlbfs_sb_info *hsb;
> > + hsb = HUGETLBFS_SB(i->i_sb);
> > + return hsb->hstate;
> > }
> >
> > static inline struct hstate *hstate_file(struct file *f)
> > {
> > - return &global_hstate;
> > + return hstate_inode(f->f_dentry->d_inode);
> > }
> >
> > -static inline struct hstate *hstate_inode(struct inode *i)
> > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > {
> > - return &global_hstate;
> > + return hstate_file(vma->vm_file);
>
> Odd, diff seems to think you've moved these two functions around
> (hstate_{vma,inode})...
Yep, one depends on the other...
> > static inline unsigned long huge_page_size(struct hstate *h)
> > Index: linux-2.6/fs/hugetlbfs/inode.c
> > ===================================================================
>
> <snip>
>
> > @@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
> > break;
> >
> > case Opt_size: {
> > - unsigned long long size;
> > /* memparse() will accept a K/M/G without a digit */
> > if (!isdigit(*args[0].from))
> > goto bad_val;
> > size = memparse(args[0].from, &rest);
> > - if (*rest == '%') {
> > - size <<= HPAGE_SHIFT;
> > - size *= max_huge_pages;
> > - do_div(size, 100);
> > - }
> > - pconfig->nr_blocks = (size >> HPAGE_SHIFT);
> > + setsize = SIZE_STD;
> > + if (*rest == '%')
> > + setsize = SIZE_PERCENT;
>
> This seems like a change that could be pulled into its own clean-up
> patch and merged up quicker?
>
> > @@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
> > pconfig->nr_inodes = memparse(args[0].from, &rest);
> > break;
> >
> > + case Opt_pagesize: {
> > + unsigned long ps;
> > + ps = memparse(args[0].from, &rest);
> > + pconfig->hstate = size_to_hstate(ps);
> > + if (!pconfig->hstate) {
> > + printk(KERN_ERR
> > + "hugetlbfs: Unsupported page size %lu MB\n",
> > + ps >> 20);
>
> This again will give odd output for pagesizes < 1MB (64k on power).
>
> > @@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
> > break;
> > }
> > }
> > +
> > + /* Do size after hstate is set up */
> > + if (setsize > NO_SIZE) {
> > + struct hstate *h = pconfig->hstate;
> > + if (setsize == SIZE_PERCENT) {
> > + size <<= huge_page_shift(h);
> > + size *= h->max_huge_pages;
> > + do_div(size, 100);
> > + }
> > + pconfig->nr_blocks = (size >> huge_page_shift(h));
> > + }
>
> Oh, I see. We just moved the percent calculation down here. Sorry about
> that, seems sensible to leave it in this patch then.
>
> > bad_val:
> > @@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block
> > config.uid = current->fsuid;
> > config.gid = current->fsgid;
> > config.mode = 0755;
> > + config.hstate = size_to_hstate(HPAGE_SIZE);
>
> So, we still only have one hugepage size, which is why this is written
> this way. Seems odd that an early patch adds multiple hugepage size
> support, but we don't actually need it in the series until much later...
True, but it is quite a long process and it is nice to have it working
at each step of the way... I think the overall way Andi has done
the patchset is quite nice.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 08/18] hugetlb: multi hstate sysctls
2008-04-25 18:14 ` Nishanth Aravamudan
@ 2008-05-23 5:25 ` Nick Piggin
2008-05-23 20:27 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:25 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 11:14:30AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:10 +1000], npiggin@suse.de wrote:
> > Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> > now allows the removal of global_hstate -- everything is now hstate
> > aware.
> >
> > - I didn't bother with hugetlb_shm_group and treat_as_movable,
> > these are still single global.
> > - Also improve error propagation for the sysctl handlers a bit
>
> So, I may be mis-remembering, but the hugepages that are gigantic, that
> is > MAX_ORDER, cannot be allocated or freed at run-time? If so, why do
Right.
> we need to report them in the sysctl? It's a read-only value, right?
I guess for reporting and compatibility.
> Similarly, for the sysfs interface thereto, can I just make them
> read-only? I guess it would be an arbitrary difference from the other
> files, but reflects reality?
For the sysfs interface, I think it would be a fine idea to make
them read-only if they cannot be changed.
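A minimal sketch of that read-only policy: a gigantic hstate (order >= MAX_ORDER) cannot be grown or shrunk at run time, so its sysfs nr_hugepages file would be created without write permission. The MAX_ORDER value and the helper name here are illustrative, not the kernel's real sysfs plumbing:

```c
#define MAX_ORDER 11	/* illustrative; typical x86-64 value of the era */

/* Pick the file mode for a hypothetical per-hstate nr_hugepages
 * attribute: read-only for gigantic pages, writable otherwise. */
static unsigned short nr_hugepages_mode(unsigned int order)
{
	return order >= MAX_ORDER ? 0444 : 0644;	/* r--r--r-- vs rw-r--r-- */
}
```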
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 08/18] hugetlb: multi hstate sysctls
2008-04-25 23:35 ` Nishanth Aravamudan
@ 2008-05-23 5:28 ` Nick Piggin
2008-05-23 10:40 ` Andi Kleen
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:28 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 04:35:36PM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:10 +1000], npiggin@suse.de wrote:
> > Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> > now allows the removal of global_hstate -- everything is now hstate
> > aware.
> >
> > - I didn't bother with hugetlb_shm_group and treat_as_movable,
> > these are still single global.
> > - Also improve error propagation for the sysctl handlers a bit
>
> <snip>
>
> > @@ -707,10 +717,25 @@ int hugetlb_sysctl_handler(struct ctl_ta
> > struct file *file, void __user *buffer,
> > size_t *length, loff_t *ppos)
> > {
> > - proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > - max_huge_pages = set_max_huge_pages(max_huge_pages);
> > - global_hstate.max_huge_pages = max_huge_pages;
> > - return 0;
> > + int err = 0;
> > + struct hstate *h;
> > +
> > + err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > + if (err)
> > + return err;
> > +
> > + if (write) {
> > + for_each_hstate (h) {
> > + int tmp;
> > +
> > + h->max_huge_pages = set_max_huge_pages(h,
> > + max_huge_pages[h - hstates], &tmp);
> > + max_huge_pages[h - hstates] = h->max_huge_pages;
> > + if (tmp && !err)
> > + err = tmp;
> > + }
> > + }
>
> Could this same condition be added to the overcommit handler, please?
Sure thing.
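The error-propagation pattern in the hunk quoted above (remember the first failure but keep processing the remaining hstates) can be factored out as a small sketch. The helper and its callback signature are hypothetical:

```c
/* Apply op to every index 0..n-1; return the first error seen, but do
 * not stop early, mirroring the for_each_hstate loop quoted above. */
static int apply_all(int (*op)(int idx), int n)
{
	int err = 0, i;

	for (i = 0; i < n; i++) {
		int tmp = op(i);

		if (tmp && !err)
			err = tmp;
	}
	return err;
}
```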
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 11/18] mm: export prep_compound_page to mm
2008-04-23 16:12 ` Andrew Hastings
@ 2008-05-23 5:29 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:29 UTC (permalink / raw)
To: Andrew Hastings; +Cc: akpm, linux-mm, andi, kniht, nacc, wli
On Wed, Apr 23, 2008 at 11:12:59AM -0500, Andrew Hastings wrote:
> npiggin@suse.de wrote:
> >hugetlb will need to get compound pages from bootmem to handle
> >the case of them being larger than MAX_ORDER. Export
>
> s/larger/greater than or equal to/
Good catch, thanks.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-25 18:55 ` Nishanth Aravamudan
@ 2008-05-23 5:29 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:29 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 11:55:43AM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:14 +1000], npiggin@suse.de wrote:
> > This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
> > not practical to enlarge MAX_ORDER to 1GB.
> >
> > #include <asm/page.h>
> > #include <asm/pgtable.h>
> > @@ -160,7 +161,7 @@ static void free_huge_page(struct page *
> > INIT_LIST_HEAD(&page->lru);
> >
> > spin_lock(&hugetlb_lock);
> > - if (h->surplus_huge_pages_node[nid]) {
> > + if (h->surplus_huge_pages_node[nid] && h->order < MAX_ORDER) {
>
> Shouldn't all h->order accesses actually be using the huge_page_order()
> to be consistent?
Yes, thanks.
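The consistency point can be illustrated with a minimal sketch; the struct layout and MAX_ORDER value are simplified assumptions, and the gigantic-page test is derived from the accessor rather than from raw h->order:

```c
#define MAX_ORDER 11	/* illustrative; matches x86-64 of the era */

/* Simplified stand-in for the real struct hstate. */
struct hstate {
	unsigned int order;
};

/* Accessor used consistently instead of touching h->order directly. */
static inline unsigned int huge_page_order(struct hstate *h)
{
	return h->order;
}

/* Gigantic pages fall outside the buddy allocator, so paths like the
 * surplus accounting must skip them. */
static inline int hstate_is_gigantic(struct hstate *h)
{
	return huge_page_order(h) >= MAX_ORDER;
}
```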
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 12/18] hugetlbfs: support larger than MAX_ORDER
2008-04-30 21:01 ` Dave Hansen
@ 2008-05-23 5:30 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:30 UTC (permalink / raw)
To: Dave Hansen; +Cc: akpm, linux-mm, andi, kniht, nacc, abh, wli
On Wed, Apr 30, 2008 at 02:01:03PM -0700, Dave Hansen wrote:
> On Wed, 2008-04-23 at 11:53 +1000, npiggin@suse.de wrote:
> > +static int __init alloc_bm_huge_page(struct hstate *h)
>
> I was just reading one of Jon's patches, and saw this. Could we expand
> the '_bm_' to '_boot_'? Or, maybe rename to bootmem_alloc_hpage()?
> 'bm' just doesn't seem to register in my teeny brain.
OK, I agree. They aren't called too often, so I've changed all bm
to bootmem there.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-04-25 18:40 ` Nishanth Aravamudan
2008-04-25 18:50 ` Andi Kleen
@ 2008-05-23 5:36 ` Nick Piggin
2008-05-23 6:04 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:36 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, Apr 25, 2008 at 11:40:41AM -0700, Nishanth Aravamudan wrote:
>
> So, you made max_huge_pages an array of the same size as the hstates
> array, right?
>
> So why can't we directly use h->max_huge_pages everywhere, and *only*
> touch max_huge_pages in the sysctl path.
It's just to bring up the max_huge_pages array initially for the
sysctl read path. I guess the array could be built every time the
sysctl handler runs, as another option... that might hide a bit of
the ugliness away in the sysctl code, I suppose. I'll see how
it looks.
But remember it is a necessary ugliness due to the sysctl vector
functions AFAIKS.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 14/18] hugetlb: printk cleanup
2008-04-27 3:32 ` Nishanth Aravamudan
@ 2008-05-23 5:37 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:37 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Sat, Apr 26, 2008 at 08:32:42PM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:16 +1000], npiggin@suse.de wrote:
> > - Reword sentence to clarify meaning with multiple options
> > - Add support for using GB prefixes for the page size
> > - Add extra printk to delayed > MAX_ORDER allocation code
> >
> > Signed-off-by: Andi Kleen <ak@suse.de>
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > mm/hugetlb.c | 21 +++++++++++++++++----
> > 1 file changed, 17 insertions(+), 4 deletions(-)
> >
> > Index: linux-2.6/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.orig/mm/hugetlb.c
> > +++ linux-2.6/mm/hugetlb.c
> > @@ -612,15 +612,28 @@ static void __init hugetlb_init_hstates(
> > }
> > }
> >
> > +static __init char *memfmt(char *buf, unsigned long n)
>
> Nit: this function is the only one where __init precedes the return type?
Fixed, thanks.
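For context, a sketch of what a memfmt() helper like the one quoted above might do: render a byte count with the largest whole K/M/G suffix. This is an illustration, not necessarily the patch's exact implementation:

```c
#include <stdio.h>

/* Format n bytes into buf using the largest suffix that divides it
 * evenly, so boot messages can say "1 GB" or "2 MB" instead of raw
 * byte counts. */
static char *memfmt(char *buf, unsigned long n)
{
	if (n >= (1UL << 30) && !(n & ((1UL << 30) - 1)))
		sprintf(buf, "%lu GB", n >> 30);
	else if (n >= (1UL << 20) && !(n & ((1UL << 20) - 1)))
		sprintf(buf, "%lu MB", n >> 20);
	else
		sprintf(buf, "%lu KB", n >> 10);
	return buf;
}
```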
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-04-30 20:48 ` Nishanth Aravamudan
@ 2008-05-23 5:41 ` Nick Piggin
2008-05-23 10:43 ` Andi Kleen
2008-05-23 20:39 ` Nishanth Aravamudan
0 siblings, 2 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 5:41 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Wed, Apr 30, 2008 at 01:48:41PM -0700, Nishanth Aravamudan wrote:
> On 23.04.2008 [11:53:19 +1000], npiggin@suse.de wrote:
> > Add an hugepagesz=... option similar to IA64, PPC etc. to x86-64.
> >
> > This finally allows to select GB pages for hugetlbfs in x86 now
> > that all the infrastructure is in place.
>
> Another more basic question ... how do we plan on making these hugepages
> available to applications. Obviously, an administrator can mount
> hugetlbfs with pagesize=1G or whatever and then users (with appropriate
> permissions) can mmap() files created therein. But what about
> SHM_HUGETLB? It uses a private internal mount of hugetlbfs, which I
> don't believe I saw a patch to add a pagesize= parameter for.
>
> So SHM_HUGETLB will (for now) always get the "default" hugepagesize,
> right, which should be the same as the legacy size? Given that an
> architecture may support several hugepage sizes, I haven't been able to
> come up with a good way to extend shmget() to specify the preferred
> hugepagesize when SHM_HUGETLB is specified. I think for libhugetlbfs
> purposes, we will probably add another environment variable to control
> that...
Good question. One thing I like to do in this patchset is to keep API
changes as minimal as possible, even if it means userspace doesn't
get the full functionality in all corner cases like that.
This way we can get the core work in and stabilized, then can take
more time to discuss the user apis.
For that matter, I'm almost inclined to submit the patchset allowing
only one active hstate specified on the command line, and no
changes to any sysctls... just to get the core code merged sooner ;)
However, it is very valuable for testing and proof of concept to
allow multiple active hstates to be configured and run, so I think
we have to have that at least in -mm.
We probably have a month or two before the next merge window, so we
have enough time to think about api issues I hope.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-05-23 5:36 ` Nick Piggin
@ 2008-05-23 6:04 ` Nick Piggin
2008-05-23 20:32 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 6:04 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, May 23, 2008 at 07:36:41AM +0200, Nick Piggin wrote:
> On Fri, Apr 25, 2008 at 11:40:41AM -0700, Nishanth Aravamudan wrote:
> >
> > So, you made max_huge_pages an array of the same size as the hstates
> > array, right?
> >
> > So why can't we directly use h->max_huge_pages everywhere, and *only*
> > touch max_huge_pages in the sysctl path.
>
> It's just to bring up the max_huge_pages array initially for the
> sysctl read path. I guess the array could be built every time the
> sysctl handler runs, as another option... that might hide a bit of
> the ugliness away in the sysctl code, I suppose. I'll see how
> it looks.
Hmm, I think we could get into problems with the issue of kernel parameter
passing vs hstate setup, so things might get a bit fragile. I think
it is robust at this point in time to retain the max_huge_pages array.
If the hugetlb vs arch hstate registration setup gets revamped, it
might be something to look at, but I prefer to keep it rather than tinker
at this point.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 08/18] hugetlb: multi hstate sysctls
2008-05-23 5:28 ` Nick Piggin
@ 2008-05-23 10:40 ` Andi Kleen
0 siblings, 0 replies; 123+ messages in thread
From: Andi Kleen @ 2008-05-23 10:40 UTC (permalink / raw)
To: Nick Piggin; +Cc: Nishanth Aravamudan, akpm, linux-mm, andi, kniht, abh, wli
> > Could this same condition be added to the overcommit handler, please?
>
> Sure thing.
I left that out intentionally because it didn't seem useful to me.
-Andi
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-05-23 5:41 ` Nick Piggin
@ 2008-05-23 10:43 ` Andi Kleen
2008-05-23 12:34 ` Nick Piggin
2008-05-23 20:39 ` Nishanth Aravamudan
1 sibling, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-05-23 10:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: Nishanth Aravamudan, akpm, linux-mm, andi, kniht, abh, wli
> For that matter, I'm almost inclined to submit the patchset allowing
> only one active hstate specified on the command line, and no
> changes to any sysctls... just to get the core code merged sooner ;)
If you do that you don't really need to bother with the patchset.
I had an earlier patch for GB pages in hugetlbfs that only supported
a single page size and it was much much simpler. All the work just came
from supporting multiple page sizes for binary compatibility.
-Andi
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-05-23 10:43 ` Andi Kleen
@ 2008-05-23 12:34 ` Nick Piggin
2008-05-23 14:29 ` Andi Kleen
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 12:34 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nishanth Aravamudan, akpm, linux-mm, kniht, abh, wli
On Fri, May 23, 2008 at 12:43:27PM +0200, Andi Kleen wrote:
> > For that matter, I'm almost inclined to submit the patchset allowing
> > only one active hstate specified on the command line, and no
> > changes to any sysctls... just to get the core code merged sooner ;)
>
> If you do that you don't really need to bother with the patchset.
> I had an earlier patch for GB pages in hugetlbfs that only supported
> a single page size and it was much much simpler. All the work just came
> from supporting multiple page sizes for binary compatibility.
Oh, maybe you misunderstood what I meant: I think the multiple hugepages
stuff is nice, and definitely should go in. But I think that if there is
any more disagreement over the userspace APIs, then we should just merge
the patchset anyway without any changes to the APIs -- at least that
way we'll have most of the code ready for when an agreement can be
reached.
However I say *almost*, because hopefully we can agree on the API.
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-05-23 12:34 ` Nick Piggin
@ 2008-05-23 14:29 ` Andi Kleen
2008-05-23 20:43 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Andi Kleen @ 2008-05-23 14:29 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, Nishanth Aravamudan, akpm, linux-mm, kniht, abh, wli
> Oh, maybe you misunderstand what I meant: I think the multiple hugepages
> stuff is nice, and definitely should go in. But I think that if there is
> any more disagreement over the userspace APIs, then we should just merge
What disagreement was there? (Sorry, I didn't notice it.)
AFAIK the patchkit does not change any user interfaces except for adding
a few numbers to one line of /proc/meminfo and a few other sysctls, which
hardly seems like a big change
(and calling that an "API" would be making a mountain out of a molehill).
-Andi
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 08/18] hugetlb: multi hstate sysctls
2008-05-23 5:25 ` Nick Piggin
@ 2008-05-23 20:27 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:27 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.05.2008 [07:25:46 +0200], Nick Piggin wrote:
> On Fri, Apr 25, 2008 at 11:14:30AM -0700, Nishanth Aravamudan wrote:
> > On 23.04.2008 [11:53:10 +1000], npiggin@suse.de wrote:
> > > Expand the hugetlbfs sysctls to handle arrays for all hstates. This
> > > now allows the removal of global_hstate -- everything is now hstate
> > > aware.
> > >
> > > - I didn't bother with hugetlb_shm_group and treat_as_movable,
> > > these are still single global.
> > > - Also improve error propagation for the sysctl handlers a bit
> >
> > So, I may be mis-remembering, but the hugepages that are gigantic, that
> > is > MAX_ORDER, cannot be allocated or freed at run-time? If so, why do
>
> Right.
>
> > we need to report them in the sysctl? It's a read-only value, right?
>
> I guess for reporting and compatibility.
That's fair. I was more referring to the fact that the relevant
information would be in /proc/meminfo.
> > Similarly, for the sysfs interface thereto, can I just make them
> > read-only? I guess it would be an arbitrary difference from the other
> > files, but reflects reality?
>
> For the sysfs interface, I think that would be a fine idea to make
> them readonly if they cannot be changed.
Yeah -- I will need to think of a good way for the sysfs hstate API to be
told that the given hstate is unchangeable. So for now, they may be writable,
but without any effect.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 123+ messages in thread
* Re: [patch 06/18] hugetlb: multi hstate proc files
2008-05-23 5:22 ` Nick Piggin
@ 2008-05-23 20:30 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:30 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.05.2008 [07:22:15 +0200], Nick Piggin wrote:
> On Fri, May 02, 2008 at 12:53:11PM -0700, Nishanth Aravamudan wrote:
> > On 23.04.2008 [11:53:08 +1000], npiggin@suse.de wrote:
> > > Convert /proc output code over to report multiple hstates
> > >
> > > I chose to just report the numbers in a row, in the hope
> > > to minimize breakage of existing software. The "compat" page size
> > > is always the first number.
> >
> > Only if huge_add_hstate() is called first for the compat page size,
> > right? That seems bad if we depend on an ordering.
> >
> > For instance, for power, I think Jon is calling huge_add_hstate() from
> > the arch/powerpc/mm/hugetlbpage.c init routine. Which runs before
> > hugetlb_init, which means that if he adds hugepages like
> >
> > huge_add_hstate(64k-order);
> > huge_add_hstate(16m-order);
> > huge_add_hstate(16g-order);
> >
> > We'll get 64k as the first field in meminfo.
> >
> > So perhaps what we should do is:
> >
> > 1) architectures define HPAGE_* as the default (compat) hugepage values
> > 2) architectures have a call into generic code at their init time to
> > specify what sizes they support
> > 3) the core is the only place that actually does huge_add_hstate() and
> > it always does it first for the compat order?
> >
> > I wonder if this might lead to issues in timing between processing
> > hugepagesz= (in arch code) and hugepages= (in generic code). Not sure. I
> > guess if we always add all hugepage sizes, we should have all the
> > hstates we know about ready to configure and as long as hugetlb_init
> > runs before hugepages= processing, we should be fine? Dunno.
>
> You're right I think. The other thing is that we could just have
> a small map from the hstate array to reporting order for sysctls.
> We could report them in the order specified on the cmdline, with
> the default size first if it was not specified on the cmdline.
>
> Hmm, I'll see how that looks.
Yeah, either way is fine. I just wanted to make sure any implicit
assumptions were laid out clearly (and should be spelled out in
kernel-parameters.txt and vm/hugetlbpage.txt, probably).
And if we really are worried about backwards compatibility, then we
should be careful about any ordering issues.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-05-23 6:04 ` Nick Piggin
@ 2008-05-23 20:32 ` Nishanth Aravamudan
2008-05-23 22:45 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:32 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.05.2008 [08:04:39 +0200], Nick Piggin wrote:
> On Fri, May 23, 2008 at 07:36:41AM +0200, Nick Piggin wrote:
> > On Fri, Apr 25, 2008 at 11:40:41AM -0700, Nishanth Aravamudan wrote:
> > >
> > > So, you made max_huge_pages an array of the same size as the hstates
> > > array, right?
> > >
> > > So why can't we directly use h->max_huge_pages everywhere, and *only*
> > > touch max_huge_pages in the sysctl path.
> >
> > It's just to bring up the max_huge_pages array initially for the
> > sysctl read path. I guess the array could be built every time the
> > sysctl handler runs as another option... that might hide away a
> > bit of the ugliness into the sysctl code I suppose. I'll see how
> > it looks.
>
> Hmm, I think we could get into problems with the issue of kernel
> parameter passing vs hstate setup, so things might get a bit fragile.
> I think it is robust at this point in time to retain the
> max_huge_pages array. If the hugetlb vs arch hstate registration setup
> gets revamped, it might be something to look at, but I prefer to keep
> it rather than tinker at this point.
Sure and that's fair.
But I'm approaching it from the perspective that the multi-valued
sysctl will go away with the sysfs interface. So perhaps I'll do a
cleanup then.
Also, we will want to update the linux-mm.org wiki diagrams if we change
what states hugepages can be in and the meaning of any of them. I'm not
sure that is the case now, but it might be.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-05-23 5:24 ` Nick Piggin
@ 2008-05-23 20:34 ` Nishanth Aravamudan
2008-05-23 22:49 ` Nick Piggin
0 siblings, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:34 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.05.2008 [07:24:25 +0200], Nick Piggin wrote:
> On Fri, Apr 25, 2008 at 11:09:33AM -0700, Nishanth Aravamudan wrote:
> > On 23.04.2008 [11:53:09 +1000], npiggin@suse.de wrote:
> > > Add support to have individual hstates for each hugetlbfs mount
> > >
> > > - Add a new pagesize= option to the hugetlbfs mount that allows setting
> > > the page size
> > > - Set up pointers to a suitable hstate for the set page size option
> > > to the super block and the inode and the vma.
> > > - Change the hstate accessors to use this information
> > > - Add code to the hstate init function to set parsed_hstate for command
> > > line processing
> > > - Handle duplicated hstate registrations to make the command line user-proof
> > >
> > > [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
> > >
> > > Signed-off-by: Andi Kleen <ak@suse.de>
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > ---
> > > fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> > > include/linux/hugetlb.h | 14 +++++++++-----
> > > mm/hugetlb.c | 16 +++-------------
> > > mm/memory.c | 18 ++++++++++++++++--
> > > 4 files changed, 66 insertions(+), 30 deletions(-)
> > >
> > > Index: linux-2.6/include/linux/hugetlb.h
> > > ===================================================================
> >
> > <snip>
> >
> > > @@ -226,19 +228,21 @@ extern struct hstate hstates[HUGE_MAX_HS
> > >
> > > #define global_hstate (hstates[0])
> > >
> > > -static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > +static inline struct hstate *hstate_inode(struct inode *i)
> > > {
> > > - return &global_hstate;
> > > + struct hugetlbfs_sb_info *hsb;
> > > + hsb = HUGETLBFS_SB(i->i_sb);
> > > + return hsb->hstate;
> > > }
> > >
> > > static inline struct hstate *hstate_file(struct file *f)
> > > {
> > > - return &global_hstate;
> > > + return hstate_inode(f->f_dentry->d_inode);
> > > }
> > >
> > > -static inline struct hstate *hstate_inode(struct inode *i)
> > > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > {
> > > - return &global_hstate;
> > > + return hstate_file(vma->vm_file);
> >
> > Odd, diff seems to think you've moved these two functions around
> > (hstate_{vma,inode})...
>
> Yep, one depends on the other...
Yeah, I realized that shortly after. Sorry for that noise.
> > > static inline unsigned long huge_page_size(struct hstate *h)
> > > Index: linux-2.6/fs/hugetlbfs/inode.c
> > > ===================================================================
> >
> > <snip>
> >
> > > @@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
> > > break;
> > >
> > > case Opt_size: {
> > > - unsigned long long size;
> > > /* memparse() will accept a K/M/G without a digit */
> > > if (!isdigit(*args[0].from))
> > > goto bad_val;
> > > size = memparse(args[0].from, &rest);
> > > - if (*rest == '%') {
> > > - size <<= HPAGE_SHIFT;
> > > - size *= max_huge_pages;
> > > - do_div(size, 100);
> > > - }
> > > - pconfig->nr_blocks = (size >> HPAGE_SHIFT);
> > > + setsize = SIZE_STD;
> > > + if (*rest == '%')
> > > + setsize = SIZE_PERCENT;
> >
> > This seems like a change that could be pulled into its own clean-up
> > patch and merged up quicker?
> >
> > > @@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
> > > pconfig->nr_inodes = memparse(args[0].from, &rest);
> > > break;
> > >
> > > + case Opt_pagesize: {
> > > + unsigned long ps;
> > > + ps = memparse(args[0].from, &rest);
> > > + pconfig->hstate = size_to_hstate(ps);
> > > + if (!pconfig->hstate) {
> > > + printk(KERN_ERR
> > > + "hugetlbfs: Unsupported page size %lu MB\n",
> > > + ps >> 20);
> >
> > This again will give odd output for pagesizes < 1MB (64k on power).
> >
> > > @@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
> > > break;
> > > }
> > > }
> > > +
> > > + /* Do size after hstate is set up */
> > > + if (setsize > NO_SIZE) {
> > > + struct hstate *h = pconfig->hstate;
> > > + if (setsize == SIZE_PERCENT) {
> > > + size <<= huge_page_shift(h);
> > > + size *= h->max_huge_pages;
> > > + do_div(size, 100);
> > > + }
> > > + pconfig->nr_blocks = (size >> huge_page_shift(h));
> > > + }
> >
> > Oh, I see. We just moved the percent calculation down here. Sorry about
> > that, seems sensible to leave it in this patch then.
> >
> > > bad_val:
> > > @@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block
> > > config.uid = current->fsuid;
> > > config.gid = current->fsgid;
> > > config.mode = 0755;
> > > + config.hstate = size_to_hstate(HPAGE_SIZE);
> >
> > So, we still only have one hugepage size, which is why this is written
> > this way. Seems odd that an early patch adds multiple hugepage size
> > support, but we don't actually need it in the series until much later...
>
> True, but it is quite a long process and it is nice to have it working
> each step of the way in small steps... I think the overall way Andi's
> done the patchset is quite nice.
Yeah, I'm sorry if my review came across as overly critical at the time.
I really am impressed with the amount of change and how it was
presented. But, in all honesty, given that I have not seen many patches
from Andi nor yourself for hugetlbfs code in the past few years, nor do
I expect to see many in the future, I was trying to keep the code as
sensible as possible for those of us that do interact with it regularly
(and its userspace interface, especially !SHM_HUGETLB).
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-05-23 5:41 ` Nick Piggin
2008-05-23 10:43 ` Andi Kleen
@ 2008-05-23 20:39 ` Nishanth Aravamudan
2008-05-23 22:52 ` Nick Piggin
1 sibling, 1 reply; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:39 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.05.2008 [07:41:33 +0200], Nick Piggin wrote:
> On Wed, Apr 30, 2008 at 01:48:41PM -0700, Nishanth Aravamudan wrote:
> > On 23.04.2008 [11:53:19 +1000], npiggin@suse.de wrote:
> > > Add a hugepagesz=... option similar to IA64, PPC etc. to x86-64.
> > >
> > > This finally allows selecting GB pages for hugetlbfs on x86 now
> > > that all the infrastructure is in place.
> >
> > Another more basic question ... how do we plan on making these hugepages
> > available to applications. Obviously, an administrator can mount
> > hugetlbfs with pagesize=1G or whatever and then users (with appropriate
> > permissions) can mmap() files created therein. But what about
> > SHM_HUGETLB? It uses a private internal mount of hugetlbfs, which I
> > don't believe I saw a patch to add a pagesize= parameter for.
> >
> > So SHM_HUGETLB will (for now) always get the "default" hugepagesize,
> > right, which should be the same as the legacy size? Given that an
> > architecture may support several hugepage sizes, I haven't been able to
> > come up with a good way to extend shmget() to specify the preferred
> > hugepagesize when SHM_HUGETLB is specified. I think for libhugetlbfs
> > purposes, we will probably add another environment variable to control
> > that...
>
> Good question. One thing I'd like to do in this patchset is to keep
> API changes as minimal as possible, even if it means userspace doesn't
> get the full functionality in corner cases like that.
>
> This way we can get the core work in and stabilized, then can take
> more time to discuss the user apis.
>
> For that matter, I'm almost inclined to submit the patchset with only
> one active hstate allowed on the command line, and no changes
> to any sysctls... just to get the core code merged sooner ;) However,
> it is very valuable for testing and proof of concept to allow
> multiple active hstates to be configured and run, so I think we have
> to have that at least in -mm.
>
> We probably have a month or two before the next merge window, so we
> have enough time to think about api issues I hope.
I think your plan is sensible, and is certainly how I would approach
adding this support to mainline. That is, all of the core hstate
functionality can probably go upstream rather quickly, as it's
functionally equivalent (and should be easy to verify as such with
libhuge's tests on all supported architectures).
I'm also hoping that once your patches are re-posted and hit -mm, I can
send out my sysfs patch, after updating/testing it, and that could also
go into -mm, which might allow the meminfo and sysctl patches to be
dropped from the series. Depends on your perspective on those, I
suppose, and might also need some coordination with Andrew to make the
series build in the right order (so the sysfs patch can be dropped in,
in place of both of them).
Does that seem reasonable? Also, for -mm coordination, are you going to
pull Jon's patches into your set, then?
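[Editorial note] For reference, the per-mount page size discussed in the quoted text would be exercised roughly like this (hypothetical paths; assumes a kernel carrying this patchset, booted with hugepagesz=1G hugepages=4):

```shell
# set up a hugetlbfs mount backed by 1G pages (example values)
mkdir -p /mnt/huge-1g
mount -t hugetlbfs -o pagesize=1G none /mnt/huge-1g
# files created under /mnt/huge-1g are then backed by 1G pages when mmap()ed
grep -i huge /proc/meminfo
```

SHM_HUGETLB, by contrast, keeps using the kernel's internal hugetlbfs mount and therefore the default/legacy size, exactly the gap raised above.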
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-05-23 14:29 ` Andi Kleen
@ 2008-05-23 20:43 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:43 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nick Piggin, akpm, linux-mm, kniht, abh, wli
On 23.05.2008 [16:29:56 +0200], Andi Kleen wrote:
> > Oh, maybe you misunderstand what I meant: I think the multiple
> > hugepages stuff is nice, and definitely should go in. But I think
> > that if there is any more disagreement over the userspace APIs, then
> > we should just merge
>
> What disagreement was there? (sorry didn't notice it)
Whether or not this information should be presented in /proc at all. And
I would prefer to call it a "discussion" not a disagreement :)
> AFAIK the patchkit does not change any user interfaces except for
> adding a few numbers to one line of /proc/meminfo and a few other
> sysctls, which seems hardly like a big change (and calling that an "API"
> would be making a mountain out of a molehill)
I'm somewhat ambivalent about the meminfo changes, although I do not
think they are necessary with a sysfs interface, but I really don't like
the idea of the multi-valued sysctl. Especially if, as we are talking
about, all hugepage sizes will be available in-kernel at all times. That
means any pool manipulations on modern power hardware will require
echo'ing three values, even if only one or two are to be modified (and
> the third (16G) can't be changed anyway!) Then the ordering also
becomes an issue. As I pointed out to Nick, while on x86_64, 2M would
come first as the legacy size, it's actually due to a subtle ordering
constraint, which is not guaranteed to be the case on other
architectures (and was not on power with 64k hugepages, in my testing).
The sysfs patch has been written and is really just waiting to be
reposted (it was discussed in a different thread with Greg and
others). I didn't get final confirmation from Greg that I had done
things correctly, but I'm sure he'll yell when I post the version to
merge.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 04/18] hugetlb: modular state
2008-05-23 5:02 ` Nick Piggin
@ 2008-05-23 20:48 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 20:48 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 23.05.2008 [07:02:47 +0200], Nick Piggin wrote:
> On Fri, Apr 25, 2008 at 10:13:46AM -0700, Nishanth Aravamudan wrote:
> > On 23.04.2008 [11:53:06 +1000], npiggin@suse.de wrote:
> > > Large, but rather mechanical patch that converts most of the hugetlb.c
> > > globals into structure members and passes them around.
> > >
> > > Right now there is only a single global hstate structure, but most of
> > > the infrastructure to extend it is there.
> >
> > While going through the patches as I apply them to 2.6.25-mm1 (as none
> > will apply cleanly so far :), I have a few comments. I like this patch
> > overall.
>
> Thanks for all the feedback, and sorry for the delay. I'm just
> rebasing things now and getting through all the feedback.
>
> I really do appreciate the comments and have made a lot of changes
> that you've suggested...
Great, I'm looking forward to the new series and seeing it get some
wider testing in -mm. I'll throw Acks in, when they are posted.
Let me also reiterate that your and Andi's work really does make a world
of difference for the larger hugetlb userbase. The hstate idea and
implementation really do make hugepages a lot more flexible than they were
before, and I really applaud you both for the code.
<snip>
> > > /*
> > > * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> > > */
> > > static DEFINE_SPINLOCK(hugetlb_lock);
> >
> > Not sure if this makes sense or not, but would it be useful to make the
> > lock be per-hstate? It is designed to protect the counters and the
> > freelists, but those are per-hstate, right? Would need heavy testing,
> > but might be useful for varying apps both trying to use different size
> > hugepages simultaneously?
>
> Hmm, sure we could do that. Although obviously it would be another
> patchset, and actually I'd be concerned about making hstate the
> unit of scalability in hugetlbfs -- a single hstate should be
> sufficiently scalable to handle workloads reasonably.
>
> Good point, but at any rate I guess this patchset isn't the place
> to do it.
Agreed.
<snip>
> > > int hugetlb_report_meminfo(char *buf)
> > > {
> > > + struct hstate *h = &global_hstate;
> > > return sprintf(buf,
> > > "HugePages_Total: %5lu\n"
> > > "HugePages_Free: %5lu\n"
> > > "HugePages_Rsvd: %5lu\n"
> > > "HugePages_Surp: %5lu\n"
> > > "Hugepagesize: %5lu kB\n",
> > > - nr_huge_pages,
> > > - free_huge_pages,
> > > - resv_huge_pages,
> > > - surplus_huge_pages,
> > > - HPAGE_SIZE/1024);
> > > + h->nr_huge_pages,
> > > + h->free_huge_pages,
> > > + h->resv_huge_pages,
> > > + h->surplus_huge_pages,
> > > + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
> >
> > "- 10"? I think this should be easier to get at than this? Oh I
> > guess it's to get it into kilobytes... Seems kind of odd, but I
> > guess it's fine.
>
> I agree it's not perfect, but I might just leave all these for
> a subsequent patchset (or can stick improvements to the end of
> this patchset).
I can submit a sequence of cleanup patches myself, as well, they
shouldn't block your posting.
> > > static inline int is_file_hugepages(struct file *file)
> > > {
> > > if (file->f_op == &hugetlbfs_file_operations)
> > > @@ -199,4 +196,71 @@ unsigned long hugetlb_get_unmapped_area(
> > > unsigned long flags);
> > > #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
> > >
> > > +#ifdef CONFIG_HUGETLB_PAGE
> >
> > Why another block of HUGETLB_PAGE? Shouldn't this go at the end of the
> > other one? And the !HUGETLB_PAGE within the corresponding #else?
>
> Hmm, possibly. As has been noted, the CONFIG_ things are a bit
> broken, and they should just get merged into one. I'll steer
> clear of that area for the moment, as everything is working now,
> but consolidating the options and cleaning things up would be
> a good idea.
Yep, I'll add this as a tail-cleanup. Perhaps part of the overarching
one of just getting rid of CONFIG_HUGETLBFS or CONFIG_HUGETLB_PAGE (have
one config option, not two, since they are mutually dependent).
> > > +
> > > +/* Defines one hugetlb page size */
> > > +struct hstate {
> > > + int hugetlb_next_nid;
> > > + unsigned int order;
> >
> > Which is actually a shift, too, right? So why not just call it that? No
> > function should be direclty accessing these members, so the function
> > name indicates how the shift is being used?
>
> I don't feel strongly. If you really do, then I guess it could be
> changed.
>
>
> > > + unsigned long mask;
> > > + unsigned long max_huge_pages;
> > > + unsigned long nr_huge_pages;
> > > + unsigned long free_huge_pages;
> > > + unsigned long resv_huge_pages;
> > > + unsigned long surplus_huge_pages;
> > > + unsigned long nr_overcommit_huge_pages;
> > > + struct list_head hugepage_freelists[MAX_NUMNODES];
> > > + unsigned int nr_huge_pages_node[MAX_NUMNODES];
> > > + unsigned int free_huge_pages_node[MAX_NUMNODES];
> > > + unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> > > +};
> > > +
> > > +extern struct hstate global_hstate;
> > > +
> > > +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> > > +{
> > > + return &global_hstate;
> > > +}
> >
> > After having looked at these functions while reviewing, it does seem like
> > it might be more intuitive to read vma_hstate ("vma's hstate") rather
> > than hstate_vma ("hstate's vma"?). But your call.
>
> Again I don't feel strongly. Hstate prefix has some upsides.
I think you can leave both as is.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-05-23 20:32 ` Nishanth Aravamudan
@ 2008-05-23 22:45 ` Nick Piggin
2008-05-23 22:53 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 22:45 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, May 23, 2008 at 01:32:28PM -0700, Nishanth Aravamudan wrote:
> On 23.05.2008 [08:04:39 +0200], Nick Piggin wrote:
> > On Fri, May 23, 2008 at 07:36:41AM +0200, Nick Piggin wrote:
> > > On Fri, Apr 25, 2008 at 11:40:41AM -0700, Nishanth Aravamudan wrote:
> > > >
> > > > So, you made max_huge_pages an array of the same size as the hstates
> > > > array, right?
> > > >
> > > So why can't we directly use h->max_huge_pages everywhere, and *only*
> > > > touch max_huge_pages in the sysctl path.
> > >
> > > It's just to bring up the max_huge_pages array initially for the
> > > sysctl read path. I guess the array could be built every time the
> > > sysctl handler runs as another option... that might hide away a
> > > bit of the ugliness into the sysctl code I suppose. I'll see how
> > > it looks.
> >
> > Hmm, I think we could get into problems with the issue of kernel
> > parameter passing vs hstate setup, so things might get a bit fragile.
> > I think it is robust at this point in time to retain the
> > max_huge_pages array. If the hugetlb vs arch hstate registration setup
> > gets revamped, it might be something to look at, but I prefer to keep
> > it rather than tinker at this point.
>
> Sure and that's fair.
>
> But I'm approaching it from the perspective that the multi-valued
> sysctl will go away with the sysfs interface. So perhaps I'll do a
> cleanup then.
Yes, that could be one good way to keep the proc API unchanged --
move it over to sysfs and just put a "default" hugepagesz in proc.
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-05-23 20:34 ` Nishanth Aravamudan
@ 2008-05-23 22:49 ` Nick Piggin
2008-05-23 23:24 ` Nishanth Aravamudan
0 siblings, 1 reply; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 22:49 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, May 23, 2008 at 01:34:44PM -0700, Nishanth Aravamudan wrote:
> On 23.05.2008 [07:24:25 +0200], Nick Piggin wrote:
> > On Fri, Apr 25, 2008 at 11:09:33AM -0700, Nishanth Aravamudan wrote:
> > True, but it is quite a long process and it is nice to have it working
> > each step of the way in small steps... I think the overall way Andi's
> > done the patchset is quite nice.
>
> Yeah, I'm sorry if my review came across as overly critical at the time.
> I really am impressed with the amount of change and how it was
> presented. But, in all honesty, given that I have not seen many patches
> from Andi nor yourself for hugetlbfs code in the past few years, nor do
> I expect to see many in the future, I was trying to keep the code as
> sensible as possible for those of us that do interact with it regularly
> (and its userspace interface, especially !SHM_HUGETLB).
Yes, it's important that you're happy with it for that reason. So I have
made a lot of changes you suggested, and other things can be changed if
you feel strongly about them.
* Re: [patch 17/18] x86: add hugepagesz option on 64-bit
2008-05-23 20:39 ` Nishanth Aravamudan
@ 2008-05-23 22:52 ` Nick Piggin
0 siblings, 0 replies; 123+ messages in thread
From: Nick Piggin @ 2008-05-23 22:52 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On Fri, May 23, 2008 at 01:39:08PM -0700, Nishanth Aravamudan wrote:
> On 23.05.2008 [07:41:33 +0200], Nick Piggin wrote:
> > On Wed, Apr 30, 2008 at 01:48:41PM -0700, Nishanth Aravamudan wrote:
> > > On 23.04.2008 [11:53:19 +1000], npiggin@suse.de wrote:
> > > > Add a hugepagesz=... option similar to IA64, PPC etc. to x86-64.
> > > >
> > > > This finally allows selecting GB pages for hugetlbfs on x86 now
> > > > that all the infrastructure is in place.
> > >
> > > Another more basic question ... how do we plan on making these hugepages
> > > available to applications. Obviously, an administrator can mount
> > > hugetlbfs with pagesize=1G or whatever and then users (with appropriate
> > > permissions) can mmap() files created therein. But what about
> > > SHM_HUGETLB? It uses a private internal mount of hugetlbfs, which I
> > > don't believe I saw a patch to add a pagesize= parameter for.
> > >
> > > So SHM_HUGETLB will (for now) always get the "default" hugepagesize,
> > > right, which should be the same as the legacy size? Given that an
> > > architecture may support several hugepage sizes, I haven't been able to
> > > come up with a good way to extend shmget() to specify the preferred
> > > hugepagesize when SHM_HUGETLB is specified. I think for libhugetlbfs
> > > purposes, we will probably add another environment variable to control
> > > that...
> >
> > Good question. One thing I'd like to do in this patchset is to keep
> > API changes as minimal as possible, even if it means userspace doesn't
> > get the full functionality in corner cases like that.
> >
> > This way we can get the core work in and stabilized, then can take
> > more time to discuss the user apis.
> >
> > For that matter, I'm almost inclined to submit the patchset with only
> > one active hstate allowed on the command line, and no changes
> > to any sysctls... just to get the core code merged sooner ;) However,
> > it is very valuable for testing and proof of concept to allow
> > multiple active hstates to be configured and run, so I think we have
> > to have that at least in -mm.
> >
> > We probably have a month or two before the next merge window, so we
> > have enough time to think about api issues I hope.
>
> I think your plan is sensible, and is certainly how I would approach
> adding this support to mainline. That is, all of the core hstate
> functionality can probably go upstream rather quickly, as it's
> functionally equivalent (and should be easy to verify as such with
> libhuge's tests on all supported architectures).
>
> I'm also hoping that once your patches are re-posted and hit -mm, I can
> send out my sysfs patch, after updating/testing it, and that could also
> go into -mm, which might allow the meminfo and sysctl patches to be
> dropped from the series. Depends on your perspective on those, I
> suppose, and might also need some coordination with Andrew to make the
> series build in the right order (so the sysfs patch can be dropped in,
> in place of both of them).
Yes, I think that is a good plan.
> Does that seem reasonable? Also, for -mm coordination, are you going to
> pull Jon's patches into your set, then?
Yes I have Jon's patches in my set now.
* Re: [patch 13/18] hugetlb: support boot allocate different sizes
2008-05-23 22:45 ` Nick Piggin
@ 2008-05-23 22:53 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 22:53 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 24.05.2008 [00:45:49 +0200], Nick Piggin wrote:
> On Fri, May 23, 2008 at 01:32:28PM -0700, Nishanth Aravamudan wrote:
> > On 23.05.2008 [08:04:39 +0200], Nick Piggin wrote:
> > > On Fri, May 23, 2008 at 07:36:41AM +0200, Nick Piggin wrote:
> > > > On Fri, Apr 25, 2008 at 11:40:41AM -0700, Nishanth Aravamudan wrote:
> > > > >
> > > > > So, you made max_huge_pages an array of the same size as the hstates
> > > > > array, right?
> > > > >
> > > > So why can't we directly use h->max_huge_pages everywhere, and *only*
> > > > > touch max_huge_pages in the sysctl path.
> > > >
> > > > It's just to bring up the max_huge_pages array initially for the
> > > > sysctl read path. I guess the array could be built every time the
> > > > sysctl handler runs as another option... that might hide away a
> > > > bit of the ugliness into the sysctl code I suppose. I'll see how
> > > > it looks.
> > >
> > > Hmm, I think we could get into problems with the issue of kernel
> > > parameter passing vs hstate setup, so things might get a bit fragile.
> > > I think it is robust at this point in time to retain the
> > > max_huge_pages array. If the hugetlb vs arch hstate registration setup
> > > gets revamped, it might be something to look at, but I prefer to keep
> > > it rather than tinker at this point.
> >
> > Sure and that's fair.
> >
> > But I'm approaching it from the perspective that the multi-valued
> > sysctl will go away with the sysfs interface. So perhaps I'll do a
> > cleanup then.
>
> Yes, that could be one good way to keep the proc API unchanged -- move
> it over to sysfs and just put a "default" hugepagesz in proc.
I would be fine with that approach, and can make my sysfs patch apply at
the end or in the middle (as a replacement) of your series to achieve it.
Andi, do you have any input here? It would also make keeping libhugetlbfs
backwards compatible easier, as meminfo's layout wouldn't change at all
and would still be the legacy/default page size.
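[Editor's note: the multi-valued sysctl arrangement discussed above can be
sketched roughly as follows. This is a minimal user-space C model, not the
actual kernel patch; the array size, field names, and the sync helper are
all assumptions loosely modelled on the hugetlb code under discussion.]

```c
/* Sketch: a flat max_huge_pages[] array mirroring the hstates array,
 * rebuilt around the sysctl handler rather than kept in sync everywhere.
 * HUGE_MAX_HSTATE and all identifiers here are illustrative assumptions. */
#define HUGE_MAX_HSTATE 2

struct hstate {
	unsigned int order;		/* page order of this huge page size */
	unsigned long max_huge_pages;	/* per-hstate authoritative value */
};

static struct hstate hstates[HUGE_MAX_HSTATE];
static int nr_hstates;

/* Flat array exposed to the (multi-valued) sysctl. */
static unsigned long max_huge_pages[HUGE_MAX_HSTATE];

/* Fill the flat array from the per-hstate values before a sysctl read
 * (write == 0), or propagate user-written values back into the hstates
 * after a sysctl write (write == 1). */
static void sync_max_huge_pages(int write)
{
	int i;

	for (i = 0; i < nr_hstates; i++) {
		if (write)
			hstates[i].max_huge_pages = max_huge_pages[i];
		else
			max_huge_pages[i] = hstates[i].max_huge_pages;
	}
}
```

With this shape, h->max_huge_pages stays the single source of truth and
the flat array only exists at the sysctl boundary, which is the cleanup
direction the thread is converging on.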
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [patch 07/18] hugetlbfs: per mount hstates
2008-05-23 22:49 ` Nick Piggin
@ 2008-05-23 23:24 ` Nishanth Aravamudan
0 siblings, 0 replies; 123+ messages in thread
From: Nishanth Aravamudan @ 2008-05-23 23:24 UTC (permalink / raw)
To: Nick Piggin; +Cc: akpm, linux-mm, andi, kniht, abh, wli
On 24.05.2008 [00:49:58 +0200], Nick Piggin wrote:
> On Fri, May 23, 2008 at 01:34:44PM -0700, Nishanth Aravamudan wrote:
> > On 23.05.2008 [07:24:25 +0200], Nick Piggin wrote:
> > > On Fri, Apr 25, 2008 at 11:09:33AM -0700, Nishanth Aravamudan wrote:
> > > True, but it is quite a long process and it is nice to have it
> > > working at each step along the way... I think the overall way Andi's
> > > done the patchset is quite nice.
> >
> > Yeah, I'm sorry if my review came across as overly critical at the time.
> > I really am impressed with the amount of change and how it was
> > presented. But, in all honesty, given that I have not seen many patches
> > from Andi nor yourself for hugetlbfs code in the past few years, nor do
> > I expect to see many in the future, I was trying to keep the code as
> > sensible as possible for those of us that do interact with it regularly
> > (and its userspace interface, especially !SHM_HUGETLB).
>
> Yes, it's important that you're happy with it for that reason. So I have
> made a lot of changes you suggested, and other things could be changed
> if you feel strongly about them.
Which I also greatly appreciate :) I think everything else is probably
in a good enough state to be in -mm now, and any clean-ups/add-ons can
happen there.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
end of thread, other threads: [~2008-05-23 23:21 UTC | newest]
Thread overview: 123+ messages
-- links below jump to the message on this page --
2008-04-23 1:53 [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
2008-04-23 1:53 ` [patch 01/18] hugetlb: fix lockdep spew npiggin
2008-04-23 13:06 ` KOSAKI Motohiro
2008-04-23 1:53 ` [patch 02/18] hugetlb: factor out huge_new_page npiggin
2008-04-24 23:49 ` Nishanth Aravamudan
2008-04-24 23:54 ` Nishanth Aravamudan
2008-04-24 23:58 ` Nishanth Aravamudan
2008-04-25 7:10 ` Andi Kleen
2008-04-25 16:54 ` Nishanth Aravamudan
2008-04-25 19:13 ` Christoph Lameter
2008-04-25 19:29 ` Nishanth Aravamudan
2008-04-30 19:16 ` Christoph Lameter
2008-04-30 20:44 ` Nishanth Aravamudan
2008-05-01 19:23 ` Christoph Lameter
2008-05-01 20:25 ` Nishanth Aravamudan
2008-05-01 20:34 ` Christoph Lameter
2008-05-01 21:01 ` Nishanth Aravamudan
2008-05-23 5:03 ` Nick Piggin
2008-04-23 1:53 ` [patch 03/18] mm: offset align in alloc_bootmem npiggin, Yinghai Lu
2008-04-23 1:53 ` [patch 04/18] hugetlb: modular state npiggin
2008-04-23 15:21 ` Jon Tollefson
2008-04-23 15:38 ` Nick Piggin
2008-04-25 17:13 ` Nishanth Aravamudan
2008-05-23 5:02 ` Nick Piggin
2008-05-23 20:48 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 05/18] hugetlb: multiple hstates npiggin
2008-04-25 17:38 ` Nishanth Aravamudan
2008-04-25 17:48 ` Nishanth Aravamudan
2008-04-25 17:55 ` Andi Kleen
2008-04-25 17:52 ` Nishanth Aravamudan
2008-04-25 18:10 ` Andi Kleen
2008-04-28 10:13 ` Andy Whitcroft
2008-05-23 5:18 ` Nick Piggin
2008-04-29 17:27 ` Nishanth Aravamudan
2008-05-23 5:19 ` Nick Piggin
2008-04-23 1:53 ` [patch 06/18] hugetlb: multi hstate proc files npiggin
2008-05-02 19:53 ` Nishanth Aravamudan
2008-05-23 5:22 ` Nick Piggin
2008-05-23 20:30 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 07/18] hugetlbfs: per mount hstates npiggin
2008-04-25 18:09 ` Nishanth Aravamudan
2008-04-25 20:36 ` Nishanth Aravamudan
2008-04-25 22:39 ` Nishanth Aravamudan
2008-04-28 18:20 ` Adam Litke
2008-04-28 18:46 ` Nishanth Aravamudan
2008-05-23 5:24 ` Nick Piggin
2008-05-23 20:34 ` Nishanth Aravamudan
2008-05-23 22:49 ` Nick Piggin
2008-05-23 23:24 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 08/18] hugetlb: multi hstate sysctls npiggin
2008-04-25 18:14 ` Nishanth Aravamudan
2008-05-23 5:25 ` Nick Piggin
2008-05-23 20:27 ` Nishanth Aravamudan
2008-04-25 23:35 ` Nishanth Aravamudan
2008-05-23 5:28 ` Nick Piggin
2008-05-23 10:40 ` Andi Kleen
2008-04-23 1:53 ` [patch 09/18] hugetlb: abstract numa round robin selection npiggin
2008-04-23 1:53 ` [patch 10/18] mm: introduce non panic alloc_bootmem npiggin
2008-04-23 1:53 ` [patch 11/18] mm: export prep_compound_page to mm npiggin
2008-04-23 16:12 ` Andrew Hastings
2008-05-23 5:29 ` Nick Piggin
2008-04-23 1:53 ` [patch 12/18] hugetlbfs: support larger than MAX_ORDER npiggin
2008-04-23 16:15 ` Andrew Hastings
2008-04-23 16:25 ` Andi Kleen
2008-04-25 18:55 ` Nishanth Aravamudan
2008-05-23 5:29 ` Nick Piggin
2008-04-30 21:01 ` Dave Hansen
2008-05-23 5:30 ` Nick Piggin
2008-04-23 1:53 ` [patch 13/18] hugetlb: support boot allocate different sizes npiggin
2008-04-23 16:15 ` Andrew Hastings
2008-04-25 18:40 ` Nishanth Aravamudan
2008-04-25 18:50 ` Andi Kleen
2008-04-25 20:05 ` Nishanth Aravamudan
2008-05-23 5:36 ` Nick Piggin
2008-05-23 6:04 ` Nick Piggin
2008-05-23 20:32 ` Nishanth Aravamudan
2008-05-23 22:45 ` Nick Piggin
2008-05-23 22:53 ` Nishanth Aravamudan
2008-04-23 1:53 ` [patch 14/18] hugetlb: printk cleanup npiggin
2008-04-27 3:32 ` Nishanth Aravamudan
2008-05-23 5:37 ` Nick Piggin
2008-04-23 1:53 ` [patch 15/18] hugetlb: introduce huge_pud npiggin
2008-04-23 1:53 ` [patch 16/18] x86: support GB hugepages on 64-bit npiggin
2008-04-23 1:53 ` [patch 17/18] x86: add hugepagesz option " npiggin
2008-04-30 19:34 ` Nishanth Aravamudan
2008-04-30 19:52 ` Andi Kleen
2008-04-30 20:02 ` Nishanth Aravamudan
2008-04-30 20:19 ` Andi Kleen
2008-04-30 20:23 ` Nishanth Aravamudan
2008-04-30 20:45 ` Andi Kleen
2008-04-30 20:51 ` Nishanth Aravamudan
2008-04-30 20:40 ` Jon Tollefson
2008-04-30 20:48 ` Nishanth Aravamudan
2008-05-23 5:41 ` Nick Piggin
2008-05-23 10:43 ` Andi Kleen
2008-05-23 12:34 ` Nick Piggin
2008-05-23 14:29 ` Andi Kleen
2008-05-23 20:43 ` Nishanth Aravamudan
2008-05-23 20:39 ` Nishanth Aravamudan
2008-05-23 22:52 ` Nick Piggin
2008-04-23 1:53 ` [patch 18/18] hugetlb: my fixes 2 npiggin
2008-04-23 10:48 ` Andi Kleen
2008-04-23 15:36 ` Nick Piggin
2008-04-23 18:49 ` Nishanth Aravamudan
2008-04-23 19:37 ` Andi Kleen
2008-04-23 21:11 ` Nishanth Aravamudan
2008-04-23 21:38 ` Nishanth Aravamudan
2008-04-23 22:06 ` Dave Hansen
2008-04-23 15:20 ` Jon Tollefson
2008-04-23 15:44 ` Nick Piggin
2008-04-23 8:05 ` [patch 00/18] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Andi Kleen
2008-04-23 15:34 ` Nick Piggin
2008-04-23 15:46 ` Andi Kleen
2008-04-23 15:53 ` Nick Piggin
2008-04-23 16:02 ` Andi Kleen
2008-04-23 16:02 ` Nick Piggin
2008-04-23 18:54 ` Nishanth Aravamudan
2008-04-23 18:52 ` Nishanth Aravamudan
2008-04-24 2:08 ` Nick Piggin
2008-04-24 6:43 ` Nishanth Aravamudan
2008-04-24 7:06 ` Nick Piggin
2008-04-24 17:08 ` Nishanth Aravamudan
2008-04-23 18:43 ` Nishanth Aravamudan