[RFC][PATCH 0/5] hugetlb NUMA improvements

* [RFC][PATCH 0/5] hugetlb NUMA improvements
@ 2007-08-06 16:32 Nishanth Aravamudan
  2007-08-06 16:37 ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Nishanth Aravamudan
  2007-08-06 16:39 ` [RFC][PATCH 0/5] hugetlb NUMA improvements Nishanth Aravamudan
  0 siblings, 2 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:32 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

The following stack of 5 patches gives hugetlbfs improved NUMA support.

1/5: Fix hugetlb pool allocation with empty nodes V9
	The most important of the patches: fixes hugetlb pool allocation
	in the presence of memoryless nodes.

2/5: hugetlb: numafy several functions
3/5: hugetlb: add per-node nr_hugepages sysfs attribute
	Together, add a per-node sysfs attribute for the number of
	hugepages allocated on the node.  This gives system
	administrators more fine-grained control of the global pool's
	distribution.

4/5: hugetlb: fix cpuset-constrained pool resizing
	Fix cpuset-constrained resizing in the presence of the previous
	3 patches.

5/5: hugetlb: interleave dequeueing of huge pages
	Add interleaving to the dequeue path for hugetlb, so that
	hugepages are removed from all available nodes when the pool
	shrinks. Given the sysfs attribute, the current node-at-a-time
	dequeueing is still possible.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 16:32 [RFC][PATCH 0/5] hugetlb NUMA improvements Nishanth Aravamudan
@ 2007-08-06 16:37 ` Nishanth Aravamudan
  2007-08-06 16:38   ` [RFC][PATCH 2/5] hugetlb: numafy several functions Nishanth Aravamudan
  2007-08-06 18:00   ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Christoph Lameter
  2007-08-06 16:39 ` [RFC][PATCH 0/5] hugetlb NUMA improvements Nishanth Aravamudan
  1 sibling, 2 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:37 UTC (permalink / raw)
  To: clameter; +Cc: anton, lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

Fix hugetlb pool allocation with empty nodes V9

Anton found a problem with the hugetlb pool allocation when some nodes
have no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2). Lee
worked on versions that tried to fix it, but none were accepted.
Christoph has created a set of patches which allow GFP_THISNODE
allocations to fail if the node has no memory and which export a
nodemask (node_states[N_HIGH_MEMORY] in this version) indicating which
nodes have memory. Since mempolicy.c already has a number of functions
which support interleaving, create a mempolicy when we invoke
alloc_fresh_huge_page() that specifies interleaving across all the nodes
in that nodemask, rather than using custom interleaving code in
hugetlb.c. This requires adding some dummy functions, and some
declarations, in mempolicy.h to compile with NUMA or !NUMA. Since
interleave_nodes() assumes that il_next has been set properly (and it
usually has by a syscall), make sure the interleaving starts on a valid
node.
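
To make the difference concrete, here is a minimal userspace sketch of
the two growth strategies. It is only a model, not kernel code: the
model_* names, the bitmask representation of nodemasks and the
fallback[] table (which node ends up serving a request aimed at a
memoryless node) are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_NODES 4

/* Next node set in mask after nid, wrapping around (round-robin step). */
static int model_next_node(int nid, unsigned mask)
{
	int i;

	assert(mask & ((1u << MAX_NODES) - 1));
	for (i = 1; i <= MAX_NODES; i++) {
		int cand = (nid + i) % MAX_NODES;

		if (mask & (1u << cand))
			return cand;
	}
	return nid; /* unreachable: the assert guarantees a set bit */
}

/*
 * Old behaviour: rotate over every online node. A request aimed at a
 * memoryless node is satisfied elsewhere; fallback[] says where
 * (hypothetical table, an assumption of this sketch).
 */
static void model_grow_old(unsigned online, const int fallback[MAX_NODES],
			   unsigned long count,
			   unsigned long pages[MAX_NODES])
{
	int nid = 0;

	while (count--) {
		nid = model_next_node(nid, online);
		pages[fallback[nid]]++;
	}
}

/* New behaviour: interleave only over the nodes that have memory. */
static void model_grow_new(unsigned has_memory, unsigned long count,
			   unsigned long pages[MAX_NODES])
{
	/* start on the first valid node, as the patch does for il_next */
	int nid = model_next_node(MAX_NODES - 1, has_memory);

	while (count--) {
		pages[nid]++;
		nid = model_next_node(nid, has_memory);
	}
}
```

With nodes 0-3 online, memory only on nodes 0 and 1, and a fallback
table of {0, 1, 1, 1}, growing the pool by 100 pages gives 25/75 under
the old rotation and 50/50 under the new one, which is the shape of the
output shown below.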

On a 4-node ppc64 box with 2 memoryless nodes:

Before:

Trying to clear the hugetlb pool
Done.       0 free
Trying to resize the pool to 100
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:     75
Node 0 HugePages_Free:     25
Done. Initially     100 free

After:

Trying to clear the hugetlb pool
Done.       0 free
Trying to resize the pool to 100
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     50
Done. Initially     100 free

Tested on: 2-node IA64, 4-node ppc64 (2 memoryless nodes), 4-node ppc64
(no memoryless nodes), 4-node x86_64, !NUMA x86, 1-node x86 (NUMA-Q)

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 3930de2..6848072 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -76,6 +76,8 @@ struct mempolicy {
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
  */
 
+extern struct mempolicy *mpol_new(int mode, nodemask_t *nodes);
+
 extern void __mpol_free(struct mempolicy *pol);
 static inline void mpol_free(struct mempolicy *pol)
 {
@@ -161,6 +163,10 @@ static inline void check_highest_zone(enum zone_type k)
 		policy_zone = k;
 }
 
+extern void set_first_interleave_node(nodemask_t mask);
+
+extern unsigned interleave_nodes(struct mempolicy *policy);
+
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
@@ -176,6 +182,11 @@ static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
 
 #define mpol_set_vma_default(vma) do {} while(0)
 
+static inline struct mempolicy *mpol_new(int mode, nodemask_t *nodes)
+{
+	return NULL;
+}
+
 static inline void mpol_free(struct mempolicy *p)
 {
 }
@@ -253,6 +264,15 @@ static inline int do_migrate_pages(struct mm_struct *mm,
 static inline void check_highest_zone(int k)
 {
 }
+
+static inline void set_first_interleave_node(nodemask_t mask)
+{
+}
+
+static inline unsigned interleave_nodes(struct mempolicy *policy)
+{
+	return 0;
+}
 #endif /* CONFIG_NUMA */
 #endif /* __KERNEL__ */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7ca59d..4f320b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -101,26 +101,23 @@ static void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct mempolicy *policy)
 {
-	static int prev_nid;
 	struct page *page;
 	int nid;
+	int start_nid = interleave_nodes(policy);
 
-	/*
-	 * Copy static prev_nid to local nid, work on that, then copy it
-	 * back to prev_nid afterwards: otherwise there's a window in which
-	 * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
-	 * But we don't need to use a spin_lock here: it really doesn't
-	 * matter if occasionally a racer chooses the same nid as we do.
-	 */
-	nid = next_node(prev_nid, node_online_map);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(node_online_map);
-	prev_nid = nid;
+	nid = start_nid;
+
+	do {
+		page = alloc_pages_node(nid,
+				htlb_alloc_mask|__GFP_COMP|GFP_THISNODE,
+				HUGETLB_PAGE_ORDER);
+		if (page)
+			break;
+		nid = interleave_nodes(policy);
+	} while (nid != start_nid);
 
-	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
-					HUGETLB_PAGE_ORDER);
 	if (page) {
 		set_compound_page_dtor(page, free_huge_page);
 		spin_lock(&hugetlb_lock);
@@ -162,18 +159,30 @@ fail:
 static int __init hugetlb_init(void)
 {
 	unsigned long i;
+	struct mempolicy *pol;
 
 	if (HPAGE_SHIFT == 0)
 		return 0;
 
-	for (i = 0; i < MAX_NUMNODES; ++i)
+	for_each_node_state(i, N_HIGH_MEMORY)
 		INIT_LIST_HEAD(&hugepage_freelists[i]);
 
+	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
+	if (IS_ERR(pol))
+		goto quit;
+	/*
+	 * since the mempolicy we are using was not specified by a
+	 * process, we need to make sure il_next has a good starting
+	 * value
+	 */
+	set_first_interleave_node(node_states[N_HIGH_MEMORY]);
 	for (i = 0; i < max_huge_pages; ++i) {
-		if (!alloc_fresh_huge_page())
+		if (!alloc_fresh_huge_page(pol))
 			break;
 	}
+	mpol_free(pol);
 	max_huge_pages = free_huge_pages = nr_huge_pages = i;
+quit:
 	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
 	return 0;
 }
@@ -219,7 +228,7 @@ static void try_to_free_low(unsigned long count)
 {
 	int i;
 
-	for (i = 0; i < MAX_NUMNODES; ++i) {
+	for_each_node_state(i, N_HIGH_MEMORY) {
 		struct page *page, *next;
 		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
 			if (PageHighMem(page))
@@ -241,10 +250,22 @@ static inline void try_to_free_low(unsigned long count)
 
 static unsigned long set_max_huge_pages(unsigned long count)
 {
+	struct mempolicy *pol;
+
+	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
+	if (IS_ERR(pol))
+		return nr_huge_pages;
+	/*
+	 * since the mempolicy we are using was not specified by a
+	 * process, we need to make sure il_next has a good starting
+	 * value
+	 */
+	set_first_interleave_node(node_states[N_HIGH_MEMORY]);
 	while (count > nr_huge_pages) {
-		if (!alloc_fresh_huge_page())
-			return nr_huge_pages;
+		if (!alloc_fresh_huge_page(pol))
+			break;
 	}
+	mpol_free(pol);
 	if (count >= nr_huge_pages)
 		return nr_huge_pages;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 87eb69e..c069891 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -171,7 +171,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
 }
 
 /* Create a new policy */
-static struct mempolicy *mpol_new(int mode, nodemask_t *nodes)
+struct mempolicy *mpol_new(int mode, nodemask_t *nodes)
 {
 	struct mempolicy *policy;
 
@@ -1125,8 +1125,13 @@ static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
 }
 
+void set_first_interleave_node(nodemask_t mask)
+{
+	current->il_next = first_node(mask);
+}
+
 /* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
+unsigned interleave_nodes(struct mempolicy *policy)
 {
 	unsigned nid, next;
 	struct task_struct *me = current;

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

* [RFC][PATCH 2/5] hugetlb: numafy several functions
  2007-08-06 16:37 ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Nishanth Aravamudan
@ 2007-08-06 16:38   ` Nishanth Aravamudan
  2007-08-06 16:40     ` [RFC][PATCH 3/5] hugetlb: add per-node nr_hugepages sysfs attribute Nishanth Aravamudan
  2007-08-06 17:59     ` [RFC][PATCH 2/5] hugetlb: numafy several functions Christoph Lameter
  2007-08-06 18:00   ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Christoph Lameter
  1 sibling, 2 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:38 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

Add node-parameterized helpers for dequeue_huge_page,
alloc_fresh_huge_page and try_to_free_low. Also have
update_and_free_page() take a nid parameter. This is necessary to add a
per-node sysfs attribute to specify the number of hugepages on that
node.
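
As a rough illustration of the refactoring pattern, the following
userspace sketch splits a "take a page from anywhere" operation into a
node-parameterized helper plus a thin wrapper. The struct and the
model_* names are hypothetical; the explicit empty-node check is a
choice of the sketch and is not claimed to match the kernel helper.

```c
#include <assert.h>

#define MAX_NODES 4

struct pool_model {
	unsigned long free_total;           /* models free_huge_pages */
	unsigned long free_node[MAX_NODES]; /* models free_huge_pages_node[] */
};

/*
 * Node-parameterized helper: take one free page from node nid and keep
 * the global and per-node counters in step.
 * Returns 1 on success, 0 if that node has no free page.
 */
static int model_dequeue_node(struct pool_model *p, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (p->free_node[nid] == 0)
		return 0;
	p->free_node[nid]--;
	p->free_total--;
	return 1;
}

/*
 * Generic wrapper: pick the first node with a free page and delegate.
 * Returns the node used, or -1 if the pool is empty.
 */
static int model_dequeue_any(struct pool_model *p)
{
	int nid;

	for (nid = 0; nid < MAX_NODES; nid++)
		if (model_dequeue_node(p, nid))
			return nid;
	return -1;
}
```

Keeping all counter updates in the per-node helper means a per-node
caller (such as a sysfs attribute handler) and the generic path cannot
drift apart in their accounting.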

Tested on: 2-node IA64, 4-node ppc64 (2 memoryless nodes), 4-node ppc64
(no memoryless nodes), 4-node x86_64, !NUMA x86, 1-node x86 (NUMA-Q)

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1cd3118..31c4359 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -66,11 +66,22 @@ static void enqueue_huge_page(struct page *page)
 	free_huge_pages_node[nid]++;
 }
 
+static struct page *dequeue_huge_page_node(int nid)
+{
+	struct page *page;
+
+	page = list_entry(hugepage_freelists[nid].next,
+					  struct page, lru);
+	list_del(&page->lru);
+	free_huge_pages--;
+	free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page(struct vm_area_struct *vma,
 				unsigned long address)
 {
 	int nid;
-	struct page *page = NULL;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
 						htlb_alloc_mask);
 	struct zone **z;
@@ -82,14 +93,9 @@ static struct page *dequeue_huge_page(struct vm_area_struct *vma,
 			break;
 	}
 
-	if (*z) {
-		page = list_entry(hugepage_freelists[nid].next,
-				  struct page, lru);
-		list_del(&page->lru);
-		free_huge_pages--;
-		free_huge_pages_node[nid]--;
-	}
-	return page;
+	if (*z)
+		return dequeue_huge_page_node(nid);
+	return NULL;
 }
 
 static void free_huge_page(struct page *page)
@@ -103,6 +109,25 @@ static void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+static struct page *alloc_fresh_huge_page_node(int nid)
+{
+	struct page *page;
+
+	page = alloc_pages_node(nid,
+			GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
+			HUGETLB_PAGE_ORDER);
+	if (page) {
+		set_compound_page_dtor(page, free_huge_page);
+		spin_lock(&hugetlb_lock);
+		nr_huge_pages++;
+		nr_huge_pages_node[nid]++;
+		spin_unlock(&hugetlb_lock);
+		put_page(page); /* free it into the hugepage allocator */
+	}
+
+	return page;
+}
+
 static int alloc_fresh_huge_page(struct mempolicy *policy)
 {
 	int nid;
@@ -112,22 +137,12 @@ static int alloc_fresh_huge_page(struct mempolicy *policy)
 	nid = start_nid;
 
 	do {
-		page = alloc_pages_node(nid,
-				htlb_alloc_mask|__GFP_COMP|GFP_THISNODE,
-				HUGETLB_PAGE_ORDER);
+		page = alloc_fresh_huge_page_node(nid);
 		if (page)
-			break;
+			return 1;
 		nid = interleave_nodes(policy);
 	} while (nid != start_nid);
-	if (page) {
-		set_compound_page_dtor(page, free_huge_page);
-		spin_lock(&hugetlb_lock);
-		nr_huge_pages++;
-		nr_huge_pages_node[page_to_nid(page)]++;
-		spin_unlock(&hugetlb_lock);
-		put_page(page); /* free it into the hugepage allocator */
-		return 1;
-	}
+
 	return 0;
 }
 
@@ -203,11 +218,11 @@ static unsigned int cpuset_mems_nr(unsigned int *array)
 }
 
 #ifdef CONFIG_SYSCTL
-static void update_and_free_page(struct page *page)
+static void update_and_free_page(int nid, struct page *page)
 {
 	int i;
 	nr_huge_pages--;
-	nr_huge_pages_node[page_to_nid(page)]--;
+	nr_huge_pages_node[nid]--;
 	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
@@ -219,25 +234,37 @@ static void update_and_free_page(struct page *page)
 }
 
 #ifdef CONFIG_HIGHMEM
+static void try_to_free_low_node(int nid, unsigned long count)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next,
+				&hugepage_freelists[nid], lru) {
+		if (PageHighMem(page))
+			continue;
+		list_del(&page->lru);
+		update_and_free_page(nid, page);
+		free_huge_pages--;
+		free_huge_pages_node[nid]--;
+		if (count >= nr_huge_pages_node[nid])
+			return;
+	}
+}
+
 static void try_to_free_low(unsigned long count)
 {
 	int i;
 
 	for (i = 0; i < MAX_NUMNODES; ++i) {
-		struct page *page, *next;
-		list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
-			if (PageHighMem(page))
-				continue;
-			list_del(&page->lru);
-			update_and_free_page(page);
-			free_huge_pages--;
-			free_huge_pages_node[page_to_nid(page)]--;
-			if (count >= nr_huge_pages)
-				return;
-		}
+		try_to_free_low_node(i, count);
+		if (count >= nr_huge_pages)
+			break;
 	}
 }
 #else
+static inline void try_to_free_low_node(int nid, unsigned long count)
+{
+}
 static inline void try_to_free_low(unsigned long count)
 {
 }
@@ -265,7 +292,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
 		struct page *page = dequeue_huge_page(NULL, 0);
 		if (!page)
 			break;
-		update_and_free_page(page);
+		update_and_free_page(page_to_nid(page), page);
 	}
 	spin_unlock(&hugetlb_lock);
 	return nr_huge_pages;

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

* Re: [RFC][PATCH 0/5] hugetlb NUMA improvements
  2007-08-06 16:32 [RFC][PATCH 0/5] hugetlb NUMA improvements Nishanth Aravamudan
  2007-08-06 16:37 ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Nishanth Aravamudan
@ 2007-08-06 16:39 ` Nishanth Aravamudan
  1 sibling, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:39 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On 06.08.2007 [09:32:54 -0700], Nishanth Aravamudan wrote:
> The following stack of 5 patches gives hugetlbfs improved NUMA support.
> 
> 1/5: Fix hugetlb pool allocation with empty nodes V9
> 	The most important of the patches, fix hugetlb pool allocation
> 	in the presence of memoryless nodes.
> 
> 2/5: hugetlb: numafy several functions
> 3/5: hugetlb: add per-node nr_hugepages sysfs attribute
> 	Together, add a per-node sysfs attribute for the number of
> 	hugepages allocated on the node.  This gives system
> 	administrators more fine-grained control of the global pool's
> 	distribution.
> 
> 4/5: hugetlb: fix cpuset-constrained pool resizing
> 	fix cpuset-constrained resizing in the presence of the previous
> 	3 patches.
> 
> 5/5: hugetlb: interleave dequeueing of huge pages
> 	add interleaving to the dequeue path for hugetlb, so that
> 	hugepages are removed from all available nodes when the pool
> 	shrinks. Given the sysfs attribute the current node-at-a-time
> 	dequeueing is still possible.

I forgot to mention that this stack depends on Christoph's set of
memoryless nodes patches. In particular the node_states nodemask array
and the fix for GFP_THISNODE allocations.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

* [RFC][PATCH 3/5] hugetlb: add per-node nr_hugepages sysfs attribute
  2007-08-06 16:38   ` [RFC][PATCH 2/5] hugetlb: numafy several functions Nishanth Aravamudan
@ 2007-08-06 16:40     ` Nishanth Aravamudan
  2007-08-06 16:44       ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Nishanth Aravamudan
  2007-08-06 17:59     ` [RFC][PATCH 2/5] hugetlb: numafy several functions Christoph Lameter
  1 sibling, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:40 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

Allow specifying the number of hugepages to allocate on a particular
node. Our current global sysctl will try its best to put hugepages
equally on each node, but that may not always be desired. This allows
the admin to control the layout of hugepage allocation at a finer level
(while not breaking the existing interface).  Add callbacks from the
sysfs node registration and unregistration functions into hugetlb to add
and remove the nr_hugepages attribute; the callbacks are no-ops if !NUMA
or !HUGETLB.
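
When a write to the per-node attribute shrinks a node, the handler in
this patch avoids freeing pages that are needed to cover the reserve
(resv_huge_pages). A minimal userspace sketch of that target
calculation follows; model_shrink_target() and its parameter names are
made up for illustration and only mirror the arithmetic in
hugetlb_write_nr_hugepages_node().

```c
#include <assert.h>

/*
 * Lowest page count a node may be shrunk to.
 *
 * requested:    value written to the per-node attribute
 * free_total:   free huge pages across all nodes
 * free_on_node: free huge pages on the node being shrunk
 * reserved:     huge pages reserved but not yet faulted in
 */
static unsigned long model_shrink_target(unsigned long requested,
					 unsigned long free_total,
					 unsigned long free_on_node,
					 unsigned long reserved)
{
	unsigned long free_elsewhere;
	unsigned long needed_here;

	assert(free_on_node <= free_total);
	free_elsewhere = free_total - free_on_node;

	/* other nodes can satisfy the reserve on their own */
	if (free_elsewhere >= reserved)
		return requested;

	/* this node must keep enough free pages to cover the shortfall */
	needed_here = reserved - free_elsewhere;
	return needed_here > requested ? needed_here : requested;
}
```

For example, with 100 free pages in total, 80 of them on the node being
shrunk and 50 reserved, a request to go to 0 is raised to 30, because
the other nodes only hold 20 free pages.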

Tested on: 2-node IA64, 4-node ppc64 (2 memoryless nodes), 4-node ppc64
(no memoryless nodes), 4-node x86_64, !NUMA x86, 1-node x86 (NUMA-Q)

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/drivers/base/node.c b/drivers/base/node.c
index cae346e..c9d531f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -151,6 +151,7 @@ int register_node(struct node *node, int num, struct node *parent)
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+		hugetlb_register_node(node);
 	}
 	return error;
 }
@@ -168,6 +169,7 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_meminfo);
 	sysdev_remove_file(&node->sysdev, &attr_numastat);
 	sysdev_remove_file(&node->sysdev, &attr_distance);
+	hugetlb_unregister_node(node);
 
 	sysdev_unregister(&node->sysdev);
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 49b7053..2fc188a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,7 +4,9 @@
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include <linux/mempolicy.h>
+#include <linux/node.h>
 #include <linux/shm.h>
+#include <linux/sysdev.h>
 #include <asm/tlbflush.h>
 
 struct ctl_table;
@@ -23,6 +25,13 @@ void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned lon
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
+#ifdef CONFIG_NUMA
+int hugetlb_register_node(struct node *);
+void hugetlb_unregister_node(struct node *);
+#else
+#define hugetlb_register_node(node)		0
+#define hugetlb_unregister_node(node)		((void)0)
+#endif
 unsigned long hugetlb_total_pages(void);
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access);
@@ -114,6 +123,8 @@ static inline unsigned long hugetlb_total_pages(void)
 #define unmap_hugepage_range(vma, start, end)	BUG()
 #define hugetlb_report_meminfo(buf)		0
 #define hugetlb_report_node_meminfo(n, buf)	0
+#define hugetlb_register_node(node)		0
+#define hugetlb_unregister_node(node)		((void)0)
 #define follow_huge_pmd(mm, addr, pmd, write)	NULL
 #define prepare_hugepage_range(addr,len,pgoff)	(-EINVAL)
 #define pmd_huge(x)	0
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31c4359..3f3df46 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -217,7 +217,6 @@ static unsigned int cpuset_mems_nr(unsigned int *array)
 	return nr;
 }
 
-#ifdef CONFIG_SYSCTL
 static void update_and_free_page(int nid, struct page *page)
 {
 	int i;
@@ -270,6 +269,7 @@ static inline void try_to_free_low(unsigned long count)
 }
 #endif
 
+#ifdef CONFIG_SYSCTL
 static unsigned long set_max_huge_pages(unsigned long count)
 {
 	struct mempolicy *pol;
@@ -343,6 +343,67 @@ int hugetlb_report_node_meminfo(int nid, char *buf)
 		nid, free_huge_pages_node[nid]);
 }
 
+#ifdef CONFIG_NUMA
+static ssize_t hugetlb_read_nr_hugepages_node(struct sys_device *dev,
+							char *buf)
+{
+	return sprintf(buf, "%u\n", nr_huge_pages_node[dev->id]);
+}
+
+static ssize_t hugetlb_write_nr_hugepages_node(struct sys_device *dev,
+					const char *buf, size_t count)
+{
+	int nid = dev->id;
+	unsigned long target;
+	unsigned long free_on_other_nodes;
+	unsigned long nr_huge_pages_req = simple_strtoul(buf, NULL, 10);
+
+	while (nr_huge_pages_req > nr_huge_pages_node[nid]) {
+		if (!alloc_fresh_huge_page_node(nid))
+			return count;
+	}
+	if (nr_huge_pages_req >= nr_huge_pages_node[nid])
+		return count;
+
+	/* need to ensure that our counts are accurate */
+	spin_lock(&hugetlb_lock);
+	free_on_other_nodes = free_huge_pages - free_huge_pages_node[nid];
+	if (free_on_other_nodes >= resv_huge_pages) {
+		/* other nodes can satisfy reserve */
+		target = nr_huge_pages_req;
+	} else {
+		/* this node needs some free to satisfy reserve */
+		target = max((resv_huge_pages - free_on_other_nodes),
+						nr_huge_pages_req);
+	}
+	try_to_free_low_node(nid, target);
+	while (target < nr_huge_pages_node[nid]) {
+		struct page *page = dequeue_huge_page_node(nid);
+		if (!page)
+			break;
+		update_and_free_page(nid, page);
+	}
+	spin_unlock(&hugetlb_lock);
+
+	return count;
+}
+
+static SYSDEV_ATTR(nr_hugepages, S_IRUGO | S_IWUSR,
+			hugetlb_read_nr_hugepages_node,
+			hugetlb_write_nr_hugepages_node);
+
+int hugetlb_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_nr_hugepages);
+}
+
+void hugetlb_unregister_node(struct node *node)
+{
+	sysdev_remove_file(&node->sysdev, &attr_nr_hugepages);
+}
+
+#endif
+
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

* [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 16:40     ` [RFC][PATCH 3/5] hugetlb: add per-node nr_hugepages sysfs attribute Nishanth Aravamudan
@ 2007-08-06 16:44       ` Nishanth Aravamudan
  2007-08-06 16:45         ` Nishanth Aravamudan
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:44 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

With the previous 3 patches in this series applied, if a process is in a
constrained cpuset, and tries to grow the hugetlb pool, hugepages may be
allocated on nodes outside of the process' cpuset. More concretely,
growing the pool via

echo some_value > /proc/sys/vm/nr_hugepages

interleaves across all nodes with memory such that hugepage allocations
occur on nodes outside the cpuset. Similarly, this process is able to
change the values in
/sys/devices/system/node/nodeX/nr_hugepages, even when X is not in the
cpuset. This directly violates the isolation that cpusets are supposed
to guarantee.

For pool growth: fix the sysctl case by only interleaving across the
nodes in current's cpuset; fix the sysfs attribute case by verifying the
requested node is in current's cpuset. For pool shrinking: both cases
are mostly already covered by the cpuset_zone_allowed_softwall() check
in dequeue_huge_page_node(), but make sure that we only iterate over the
cpuset's nodes in try_to_free_low().
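
The two growth fixes can be modelled in userspace as follows. This is
only a sketch: nodemasks are plain bitmasks, and the model_* functions
are hypothetical stand-ins for the sysctl path (interleaved growth) and
the sysfs path (per-node growth), not the kernel functions themselves.

```c
#include <assert.h>

#define MAX_NODES 4

/* Models node_isset(): is nid part of the given node mask? */
static int model_node_isset(int nid, unsigned mask)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return (mask >> nid) & 1u;
}

/* sysctl-style growth: interleave count new pages across nodes in mask. */
static void model_grow_interleaved(unsigned mask, unsigned long count,
				   unsigned long pages[MAX_NODES])
{
	int nid = 0;

	assert(mask & ((1u << MAX_NODES) - 1));
	while (count) {
		if (model_node_isset(nid, mask)) {
			pages[nid]++;
			count--;
		}
		nid = (nid + 1) % MAX_NODES;
	}
}

/*
 * sysfs-style growth of one node up to target. The request is ignored
 * when nid is outside the allowed mask. Returns the pages added.
 */
static unsigned long model_grow_node(int nid, unsigned long target,
				     unsigned allowed,
				     unsigned long pages[MAX_NODES])
{
	unsigned long added = 0;

	if (!model_node_isset(nid, allowed))
		return 0;
	while (pages[nid] < target) {
		pages[nid]++;
		added++;
	}
	return added;
}
```

Starting from 100 pages on node 1 and growing by 100, interleaving over
all nodes with memory (mask 0x3) yields 50/150, while interleaving over
a cpuset restricted to node 1 (mask 0x2) yields 0/200; a per-node
request for node 0 from that cpuset changes nothing.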

Before:

Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:      0
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    150
Node 0 HugePages_Free:     50
Done.     200 free
Trying to shrink the pool on node 0 down to 0 from a cpuset restricted
to node 1
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    150
Node 0 HugePages_Free:      0
Done.     150 free

After:

Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:      0
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    200
Node 0 HugePages_Free:      0
Done.     200 free
Trying to grow the pool on node 0 up to 50 from a cpuset restricted to
node 1
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    200
Node 0 HugePages_Free:      0
Done.     200 free

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 09ad639..af07a0b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -181,6 +181,10 @@ static int __init hugetlb_init(void)
 	for_each_node_state(i, N_HIGH_MEMORY)
 		INIT_LIST_HEAD(&hugepage_freelists[i]);
 
+	/*
+	 * at boot-time, interleave across all available nodes as there
+	 * is not any corresponding cpuset/process
+	 */
 	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
 	if (IS_ERR(pol))
 		goto quit;
@@ -258,7 +262,7 @@ static void try_to_free_low(unsigned long count)
 {
 	int i;
 
-	for_each_node_state(i, N_HIGH_MEMORY) {
+	for_each_node_mask(i, cpuset_current_mems_allowed) {
 		try_to_free_low_node(i, count);
 		if (count >= nr_huge_pages)
 			return;
@@ -278,7 +282,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
 {
 	struct mempolicy *pol;
 
-	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
+	pol = mpol_new(MPOL_INTERLEAVE, &cpuset_current_mems_allowed);
 	if (IS_ERR(pol))
 		return nr_huge_pages;
 	/*
@@ -286,7 +290,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
 	 * process, we need to make sure il_next has a good starting
 	 * value
 	 */
-	set_first_interleave_node(node_states[N_HIGH_MEMORY]);
+	set_first_interleave_node(cpuset_current_mems_allowed);
 	while (count > nr_huge_pages) {
 		if (!alloc_fresh_huge_page(pol))
 			break;
@@ -368,6 +372,10 @@ static ssize_t hugetlb_write_nr_hugepages_node(struct sys_device *dev,
 	unsigned long free_on_other_nodes;
 	unsigned long nr_huge_pages_req = simple_strtoul(buf, NULL, 10);
 
+	/* prevent per-node allocations from outside the allowed cpuset */
+	if (!node_isset(nid, cpuset_current_mems_allowed))
+		return count;
+
 	while (nr_huge_pages_req > nr_huge_pages_node[nid]) {
 		if (!alloc_fresh_huge_page_node(nid))
 			return count;

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 16:44       ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Nishanth Aravamudan
@ 2007-08-06 16:45         ` Nishanth Aravamudan
  2007-08-06 16:48         ` [RFC][PATCH 5/5] hugetlb: interleave dequeueing of huge pages Nishanth Aravamudan
  2007-08-06 18:04         ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Christoph Lameter
  2 siblings, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:45 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

On 06.08.2007 [09:44:10 -0700], Nishanth Aravamudan wrote:
> hugetlb: fix cpuset-constrained pool resizing
> 
> With the previous 3 patches in this series applied, if a process is in a
> constrained cpuset, and tries to grow the hugetlb pool, hugepages may be
> allocated on nodes outside of the process' cpuset. More concretely,
> growing the pool via
> 
> echo some_value > /proc/sys/vm/nr_hugepages
> 
> interleaves across all nodes with memory such that hugepage allocations
> occur on nodes outside the cpuset. Similarly, this process is able to
> change the values in
> /sys/devices/system/node/nodeX/nr_hugepages, even when X is not in the
> cpuset. This directly violates the isolation that cpusets are supposed
> to guarantee.
> 
> For pool growth: fix the sysctl case by only interleaving across the
> nodes in current's cpuset; fix the sysfs attribute case by verifying the
> requested node is in current's cpuset. For pool shrinking: both cases
> are mostly already covered by the cpuset_zone_allowed_softwall() check
> in dequeue_huge_page_node(), but make sure that we only iterate over the
> cpuset's nodes in try_to_free_low().
> 
> Before:
> 
> Trying to resize the pool back to     100 from the top cpuset
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    100
> Node 0 HugePages_Free:      0
> Done.     100 free
> /cpuset/set1 /cpuset ~
> Trying to resize the pool to     200 from a cpuset restricted to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    150
> Node 0 HugePages_Free:     50
> Done.     200 free
> Trying to shrink the pool on node 0 down to 0 from a cpuset restricted
> to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    150
> Node 0 HugePages_Free:      0
> Done.     150 free
> 
> After:
> 
> Trying to resize the pool back to     100 from the top cpuset
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    100
> Node 0 HugePages_Free:      0
> Done.     100 free
> /cpuset/set1 /cpuset ~
> Trying to resize the pool to     200 from a cpuset restricted to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    200
> Node 0 HugePages_Free:      0
> Done.     200 free
> Trying to grow the pool on node 0 up to 50 from a cpuset restricted to
> node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    200
> Node 0 HugePages_Free:      0
> Done.     200 free

This patch was also tested on: 2-node IA64, 4-node ppc64 (2 memoryless
nodes), 4-node ppc64 (no memoryless nodes), 4-node x86_64, !NUMA x86,
1-node x86 (NUMA-Q)

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* [RFC][PATCH 5/5] hugetlb: interleave dequeueing of huge pages
  2007-08-06 16:44       ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Nishanth Aravamudan
  2007-08-06 16:45         ` Nishanth Aravamudan
@ 2007-08-06 16:48         ` Nishanth Aravamudan
  2007-08-06 18:04         ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Christoph Lameter
  2 siblings, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 16:48 UTC (permalink / raw)
  To: clameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

Currently, when shrinking the hugetlb pool, we free all of the pages on
node 0, then all the pages on node 1, etc. Instead, interleave over
the valid nodes, as constrained by the enclosing cpuset (or the populated
nodes if !CPUSETS). If a particular node should be cleared first,
the per-node sysfs attribute can be used for finer-grained control. This
also helps keep the pool balanced as we change its size at run-time.

Before:

Trying to clear the hugetlb pool
Done.       0 free
Trying to resize the pool to 100
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     50
Done. Initially     100 free
Trying to resize the pool to 200
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:    100
Done.     200 free
Trying to resize the pool back to     100
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:      0
Done.     100 free

After:

Trying to clear the hugetlb pool
Done.       0 free
Trying to resize the pool to 100
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     50
Done. Initially     100 free
Trying to resize the pool to 200
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:    100
Done.     200 free
Trying to resize the pool back to     100
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     50
Done.     100 free
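
The before/after difference can be modelled in userspace. The following
is only a sketch, not the kernel code: NR_NODES, shrink_node_at_a_time()
and shrink_interleaved() are hypothetical names, and the per-node free
lists are reduced to plain counters. It assumes no cpuset restriction.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4	/* hypothetical: matches the 4-node output above */

/* Old behaviour: drain node 0 completely, then node 1, and so on. */
static void shrink_node_at_a_time(unsigned long free_pages[NR_NODES],
				  unsigned long target)
{
	unsigned long total = 0;
	size_t nid;

	assert(free_pages != NULL);
	for (nid = 0; nid < NR_NODES; nid++)
		total += free_pages[nid];

	for (nid = 0; nid < NR_NODES && total > target; nid++) {
		while (free_pages[nid] > 0 && total > target) {
			free_pages[nid]--;
			total--;
		}
	}
}

/* New behaviour: take one page per node in turn, skipping empty nodes. */
static void shrink_interleaved(unsigned long free_pages[NR_NODES],
			       unsigned long target)
{
	unsigned long total = 0;
	size_t nid, empty_in_a_row = 0;

	assert(free_pages != NULL);
	for (nid = 0; nid < NR_NODES; nid++)
		total += free_pages[nid];

	nid = 0;
	/* stop once a full pass over the nodes found nothing to free */
	while (total > target && empty_in_a_row < NR_NODES) {
		if (free_pages[nid] > 0) {
			free_pages[nid]--;
			total--;
			empty_in_a_row = 0;
		} else {
			empty_in_a_row++;
		}
		nid = (nid + 1) % NR_NODES;
	}
}
```

Starting from 100 free pages on each of nodes 0 and 1 and shrinking to
100, the first function leaves everything on node 1, while the second
leaves 50 on each node.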

Tested on: 2-node IA64, 4-node ppc64 (2 memoryless nodes), 4-node ppc64
(no memoryless nodes), 4-node x86_64, !NUMA x86, 1-node x86 (NUMA-Q)

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index af07a0b..f6d1811 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -78,7 +78,27 @@ static struct page *dequeue_huge_page_node(int nid)
 	return page;
 }
 
-static struct page *dequeue_huge_page(struct vm_area_struct *vma,
+static struct page *dequeue_huge_page(struct mempolicy *policy)
+{
+	struct page *page;
+	int nid;
+	int start_nid = interleave_nodes(policy);
+
+	nid = start_nid;
+
+	do {
+		if (!list_empty(&hugepage_freelists[nid])) {
+			page = dequeue_huge_page_node(nid);
+			if (page)
+				return page;
+		}
+		nid = interleave_nodes(policy);
+	} while (nid != start_nid);
+
+	return NULL;
+}
+
+static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma,
 				unsigned long address)
 {
 	int nid;
@@ -155,7 +175,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	else if (free_huge_pages <= resv_huge_pages)
 		goto fail;
 
-	page = dequeue_huge_page(vma, addr);
+	page = dequeue_huge_page_vma(vma, addr);
 	if (!page)
 		goto fail;
 
@@ -295,20 +315,23 @@ static unsigned long set_max_huge_pages(unsigned long count)
 		if (!alloc_fresh_huge_page(pol))
 			break;
 	}
-	mpol_free(pol);
-	if (count >= nr_huge_pages)
+	if (count >= nr_huge_pages) {
+		mpol_free(pol);
 		return nr_huge_pages;
+	}
 
 	spin_lock(&hugetlb_lock);
 	count = max(count, resv_huge_pages);
 	try_to_free_low(count);
+	set_first_interleave_node(cpuset_current_mems_allowed);
 	while (count < nr_huge_pages) {
-		struct page *page = dequeue_huge_page(NULL, 0);
+		struct page *page = dequeue_huge_page(pol);
 		if (!page)
 			break;
 		update_and_free_page(page_to_nid(page), page);
 	}
 	spin_unlock(&hugetlb_lock);
+	mpol_free(pol);
 	return nr_huge_pages;
 }
 
-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 2/5] hugetlb: numafy several functions
  2007-08-06 16:38   ` [RFC][PATCH 2/5] hugetlb: numafy several functions Nishanth Aravamudan
  2007-08-06 16:40     ` [RFC][PATCH 3/5] hugetlb: add per-node nr_hugepages sysfs attribute Nishanth Aravamudan
@ 2007-08-06 17:59     ` Christoph Lameter
  2007-08-06 18:15       ` Nishanth Aravamudan
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2007-08-06 17:59 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:

> +	page = alloc_pages_node(nid,
> +			GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> +			HUGETLB_PAGE_ORDER);

GFP_THISNODE disables reclaim. With Mel Gorman's ZONE_MOVABLE you may want 
to enable reclaim here. Use __GFP_THISNODE?


* Re: [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 16:37 ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Nishanth Aravamudan
  2007-08-06 16:38   ` [RFC][PATCH 2/5] hugetlb: numafy several functions Nishanth Aravamudan
@ 2007-08-06 18:00   ` Christoph Lameter
  2007-08-06 18:19     ` Nishanth Aravamudan
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:00 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: anton, lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:

> +	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
> +	if (IS_ERR(pol))
> +		goto quit;


You are hardcoding a policy here. Is that really necessary? You could call 
the interleave node functions yourself to generate the node distribution. 


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 16:44       ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Nishanth Aravamudan
  2007-08-06 16:45         ` Nishanth Aravamudan
  2007-08-06 16:48         ` [RFC][PATCH 5/5] hugetlb: interleave dequeueing of huge pages Nishanth Aravamudan
@ 2007-08-06 18:04         ` Christoph Lameter
  2007-08-06 18:26           ` Nishanth Aravamudan
                             ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:04 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:

> hugetlb: fix cpuset-constrained pool resizing
> 
> With the previous 3 patches in this series applied, if a process is in a
> constrained cpuset, and tries to grow the hugetlb pool, hugepages may be
> allocated on nodes outside of the process' cpuset. More concretely,
> growing the pool via
> 
> echo some_value > /proc/sys/vm/nr_hugepages
> 
> interleaves across all nodes with memory such that hugepage allocations
> occur on nodes outside the cpuset. Similarly, this process is able to
> change the values in
> /sys/devices/system/node/nodeX/nr_hugepages, even when X is not in the
> cpuset. This directly violates the isolation that cpusets are supposed to
> guarantee.

No it does not. Cpusets do not affect the administrative rights of users.
 
> For pool growth: fix the sysctl case by only interleaving across the
> nodes in current's cpuset; fix the sysfs attribute case by verifying the
> requested node is in current's cpuset. For pool shrinking: both cases
> are mostly already covered by the cpuset_zone_allowed_softwall() check
> in dequeue_huge_page_node(), but make sure that we only iterate over the
> cpuset's nodes in try_to_free_low().

In that case the number of huge pages is a cpuset attribute. Create 
nr_hugepages under /dev/cpuset/ ...? The sysctl is global and should not 
be cpuset relative.
 
Otherwise /proc/sys/vm/nr_hugepages and the sysctl become dependent on
the cpuset context, which will be a bit strange.



> 
> Before:
> 
> Trying to resize the pool back to     100 from the top cpuset
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    100
> Node 0 HugePages_Free:      0
> Done.     100 free
> /cpuset/set1 /cpuset ~
> Trying to resize the pool to     200 from a cpuset restricted to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    150
> Node 0 HugePages_Free:     50
> Done.     200 free
> Trying to shrink the pool on node 0 down to 0 from a cpuset restricted
> to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    150
> Node 0 HugePages_Free:      0
> Done.     150 free
> 
> After:
> 
> Trying to resize the pool back to     100 from the top cpuset
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    100
> Node 0 HugePages_Free:      0
> Done.     100 free
> /cpuset/set1 /cpuset ~
> Trying to resize the pool to     200 from a cpuset restricted to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    200
> Node 0 HugePages_Free:      0
> Done.     200 free
> Trying to grow the pool on node 0 up to 50 from a cpuset restricted to
> node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:    200
> Node 0 HugePages_Free:      0
> Done.     200 free
> 
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 09ad639..af07a0b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -181,6 +181,10 @@ static int __init hugetlb_init(void)
>  	for_each_node_state(i, N_HIGH_MEMORY)
>  		INIT_LIST_HEAD(&hugepage_freelists[i]);
>  
> +	/*
> +	 * at boot-time, interleave across all available nodes as there
> +	 * is not any corresponding cpuset/process
> +	 */
>  	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
>  	if (IS_ERR(pol))
>  		goto quit;
> @@ -258,7 +262,7 @@ static void try_to_free_low(unsigned long count)
>  {
>  	int i;
>  
> -	for_each_node_state(i, N_HIGH_MEMORY) {
> +	for_each_node_mask(i, cpuset_current_mems_allowed) {
>  		try_to_free_low_node(i, count);
>  		if (count >= nr_huge_pages)
>  			return;
> @@ -278,7 +282,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
>  {
>  	struct mempolicy *pol;
>  
> -	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
> +	pol = mpol_new(MPOL_INTERLEAVE, &cpuset_current_mems_allowed);
>  	if (IS_ERR(pol))
>  		return nr_huge_pages;
>  	/*
> @@ -286,7 +290,7 @@ static unsigned long set_max_huge_pages(unsigned long count)
>  	 * process, we need to make sure il_next has a good starting
>  	 * value
>  	 */
> -	set_first_interleave_node(node_states[N_HIGH_MEMORY]);
> +	set_first_interleave_node(cpuset_current_mems_allowed);
>  	while (count > nr_huge_pages) {
>  		if (!alloc_fresh_huge_page(pol))
>  			break;
> @@ -368,6 +372,10 @@ static ssize_t hugetlb_write_nr_hugepages_node(struct sys_device *dev,
>  	unsigned long free_on_other_nodes;
>  	unsigned long nr_huge_pages_req = simple_strtoul(buf, NULL, 10);
>  
> +	/* prevent per-node allocations from outside the allowed cpuset */
> +	if (!node_isset(nid, cpuset_current_mems_allowed))
> +		return count;
> +
>  	while (nr_huge_pages_req > nr_huge_pages_node[nid]) {
>  		if (!alloc_fresh_huge_page_node(nid))
>  			return count;
> 
> -- 
> Nishanth Aravamudan <nacc@us.ibm.com>
> IBM Linux Technology Center
> 


* Re: [RFC][PATCH 2/5] hugetlb: numafy several functions
  2007-08-06 17:59     ` [RFC][PATCH 2/5] hugetlb: numafy several functions Christoph Lameter
@ 2007-08-06 18:15       ` Nishanth Aravamudan
  2007-08-07  0:34         ` Nishanth Aravamudan
  0 siblings, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 18:15 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On 06.08.2007 [10:59:20 -0700], Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > +	page = alloc_pages_node(nid,
> > +			GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> > +			HUGETLB_PAGE_ORDER);
> 
> GFP_THISNODE disables reclaim. With Mel Gorman's ZONE_MOVABLE you may
> want to enable reclaim here. Use __GFP_THISNODE?

It is GFP_THISNODE currently. That seems like a separate logical change
which I'll have to consider separately.

Thanks,
Nish


-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 18:00   ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Christoph Lameter
@ 2007-08-06 18:19     ` Nishanth Aravamudan
  2007-08-06 18:37       ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 18:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: anton, lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On 06.08.2007 [11:00:53 -0700], Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > +	pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_HIGH_MEMORY]);
> > +	if (IS_ERR(pol))
> > +		goto quit;
> 
> 
> You are hardcoding a policy here. Is that really necessary? You could
> call the interleave node functions yourself to generate the node
> distribution. 

Uh, interleave_nodes() takes a policy. Hence I need a policy to use.
This was your suggestion, Christoph and I'm doing exactly what you
asked.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 18:04         ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Christoph Lameter
@ 2007-08-06 18:26           ` Nishanth Aravamudan
  2007-08-06 18:41             ` Christoph Lameter
  2007-08-06 19:37           ` Lee Schermerhorn
  2007-08-08  1:50           ` Nishanth Aravamudan
  2 siblings, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-06 18:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

On 06.08.2007 [11:04:48 -0700], Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > hugetlb: fix cpuset-constrained pool resizing
> > 
> > With the previous 3 patches in this series applied, if a process is in a
> > constrained cpuset, and tries to grow the hugetlb pool, hugepages may be
> > allocated on nodes outside of the process' cpuset. More concretely,
> > growing the pool via
> > 
> > echo some_value > /proc/sys/vm/nr_hugepages
> > 
> > interleaves across all nodes with memory such that hugepage allocations
> > occur on nodes outside the cpuset. Similarly, this process is able to
> > change the values in
> > /sys/devices/system/node/nodeX/nr_hugepages, even when X is not in the
> > cpuset. This directly violates the isolation that cpusets are supposed to
> > guarantee.
> 
> No it does not. Cpusets do not affect the administrative rights of
> users.

A process is limited to nodes 1 and 2.

You think said process should be able to remove hugepages from nodes 0
and 4?

That sounds like a violation of isolation to me.

I understand what you mean, that root should be able to do whatever it
wants, but at the same time, if a root-owned process is running in a
cpuset, it's constrained for a reason.

More importantly, let's say your process (owned by root or not) is
running in a restricted cpuset on nodes 2 and 3 of a 4-node system and
wants to use 100 hugepages. Using the global sysctl, presuming an equal
distribution of free memory on all nodes, said process would need to
allocate 200 hugepages on the system (50 on each node), to get 100
hugepages on nodes 2 and 3. With this patch, it only needs to allocate
100 hugepages.
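
The arithmetic above can be checked with a small userspace sketch. The
names mask_weight() and pages_to_request() are hypothetical, node masks
are simplified to one bit per node in an unsigned long, and the model
assumes a perfectly even interleave in which every node can satisfy its
share.

```c
#include <assert.h>

/* Count the nodes set in a mask (one bit per node). */
static unsigned int mask_weight(unsigned long mask)
{
	unsigned int weight = 0;

	for (; mask; mask &= mask - 1)
		weight++;
	return weight;
}

/*
 * Number of pages that must be requested so that 'wanted' of them land
 * on the nodes in 'allowed', when allocations are spread evenly over
 * the nodes in 'interleave_mask'.
 */
static unsigned long pages_to_request(unsigned long interleave_mask,
				      unsigned long allowed,
				      unsigned long wanted)
{
	unsigned int span = mask_weight(interleave_mask);
	unsigned int usable = mask_weight(allowed & interleave_mask);

	assert(usable > 0);
	/* round up so the allowed nodes receive at least 'wanted' pages */
	return (wanted * span + usable - 1) / usable;
}
```

With all four nodes in the interleave mask (0xf) and a cpuset covering
nodes 2 and 3 (0xc), asking for 100 usable pages requires a request of
200. Interleaving only over the cpuset's nodes brings that down to 100.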

Seems far more sane to me that an intentionally restricted process
(i.e., cpusets) can only affect the bits of the system it's restricted
to.

> > For pool growth: fix the sysctl case by only interleaving across the
> > nodes in current's cpuset; fix the sysfs attribute case by verifying the
> > requested node is in current's cpuset. For pool shrinking: both cases
> > are mostly already covered by the cpuset_zone_allowed_softwall() check
> > in dequeue_huge_page_node(), but make sure that we only iterate over the
> > cpuset's nodes in try_to_free_low().
> 
> In that case the number of huge pages is a cpuset attribute. Create
> nr_hugepages under /dev/cpuset/ ...? The sysctl is global and should
> not be cpuset relative.

No, the number of huge pages is still global. But the huge pages a
*process* has access to are defined by its enclosing cpuset (or memory
policy, I suppose). I think you're confusing the two. Or I am, I don't
know which.

> Otherwise /proc/sys/vm/nr_hugepages and the sysctl become dependent
> on the cpuset context, which will be a bit strange.

Become dependent on the *process* context, which is, to me, what would
be expected. If a process is restricted in some way, I would expect it
to be restricted in that way across the board.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 18:19     ` Nishanth Aravamudan
@ 2007-08-06 18:37       ` Christoph Lameter
  2007-08-06 19:52         ` Lee Schermerhorn
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:37 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: anton, lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:

> Uh, interleave_nodes() takes a policy. Hence I need a policy to use.
> This was your suggestion, Christoph and I'm doing exactly what you
> asked.

That would make sense if the policy can be overridden. You may be able to 
avoid exporting mpol_new by calling just the functions that generate the 
interleave nodes.


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 18:26           ` Nishanth Aravamudan
@ 2007-08-06 18:41             ` Christoph Lameter
  2007-08-07  0:03               ` Nishanth Aravamudan
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2007-08-06 18:41 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:

> I understand what you mean, that root should be able to do whatever it
> wants, but at the same time, if a root-owned process is running in a
> cpuset, it's constrained for a reason.

Yes but the constraint is for an application running under a regular 
user id not for the root user.
 
> More importantly, let's say your process (owned by root or not) is
> running in a restricted cpuset on  nodes 2 and 3 of a 4-node system and
> wants to use 100 hugepages. Using the global sysctl, presuming an equal
> distribution of free memory on all nodes, said process would need to
> allocate 200 hugepages on the system (50 on each node), to get 100
> hugepages on nodes 2 and 3. With this patch, it only needs to allocate
> 100 hugepages.

The app is not able to use the sysctl. The root user must be able to do 
whatever desired. Does not make sense to impose restrictions on sysctls.

> Become dependent on the *process* context, which is, to me, what would
> be expected. If a process is restricted in some way, I would expect it
> to be restricted in that way across the board.

Nope these values are global. Cpuset relative data belongs in /dev/cpuset.


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 18:04         ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Christoph Lameter
  2007-08-06 18:26           ` Nishanth Aravamudan
@ 2007-08-06 19:37           ` Lee Schermerhorn
  2007-08-08  1:50           ` Nishanth Aravamudan
  2 siblings, 0 replies; 24+ messages in thread
From: Lee Schermerhorn @ 2007-08-06 19:37 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, wli, melgor, akpm, linux-mm, agl, pj,
	Kenneth W. Chen

On Mon, 2007-08-06 at 11:04 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > hugetlb: fix cpuset-constrained pool resizing
> > 
> > With the previous 3 patches in this series applied, if a process is in a
> > constrained cpuset, and tries to grow the hugetlb pool, hugepages may be
> > allocated on nodes outside of the process' cpuset. More concretely,
> > growing the pool via
> > 
> > echo some_value > /proc/sys/vm/nr_hugepages
> > 
> > interleaves across all nodes with memory such that hugepage allocations
> > occur on nodes outside the cpuset. Similarly, this process is able to
> > change the values in
> > /sys/devices/system/node/nodeX/nr_hugepages, even when X is not in the
> > cpuset. This directly violates the isolation that cpusets are supposed to
> > guarantee.
> 
> No it does not. Cpusets do not affect the administrative rights of users.

I agree.  nr_hugepages allocates fresh pages for the system-wide pool.
I don't think this should be constrained by cpusets.  I suppose
that if there is a need for this feature, we could document the behavior
and warn admins to only modify nr_hugepages from a program/shell in the
top-level cpuset to achieve the current system-wide behavior.

>  
> > For pool growth: fix the sysctl case by only interleaving across the
> > nodes in current's cpuset; fix the sysfs attribute case by verifying the
> > requested node is in current's cpuset. For pool shrinking: both cases
> > are mostly already covered by the cpuset_zone_allowed_softwall() check
> > in dequeue_huge_page_node(), but make sure that we only iterate over the
> > cpuset's nodes in try_to_free_low().
> 
> In that case the number of huge pages is a cpuset attribute. Create 
> nr_hugepages under /dev/cpuset/ ...? The sysctl is global and should not 
> be cpuset relative.
>  
> Otherwise /proc/sys/vm/nr_hugepages and the sysctl become dependent on
> the cpuset context, which will be a bit strange.

I'd like to see it stay a system-wide attribute to preserve current
behavior--with the fixes for memoryless nodes, of course.

I'll queue these up for testing atop Christoph's v5 memoryless nodes
patches.


Lee


* Re: [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 18:37       ` Christoph Lameter
@ 2007-08-06 19:52         ` Lee Schermerhorn
  2007-08-06 20:15           ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Lee Schermerhorn @ 2007-08-06 19:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, anton, wli, melgor, akpm, linux-mm, agl

On Mon, 2007-08-06 at 11:37 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > Uh, interleave_nodes() takes a policy. Hence I need a policy to use.
> > This was your suggestion, Christoph and I'm doing exactly what you
> > asked.
> 
> That would make sense if the policy can be overridden. You may be able to 
> avoid exporting mpol_new by calling just the functions that generate the 
> interleave nodes.

I don't understand what you're asking either.  The function that Nish is
changing allocates the initial free huge page pool.  I thought that the
intended behavior of this function was to distribute newly allocated huge
pages evenly across the nodes.  It was broken, in that for systems with
memoryless nodes, the allocation would immediately fall back to the next
node in the zonelist, overloading that node with huge pages.

IMO, we should try to preserve the current behavior of nr_hugepages, as
"fixed" by Nish, and use the new per node sysfs attributes to handle or
fixup asymmetric allocation of hugepages, if required.

That being said, I was never a fan of using mempolicy for this.  Not
strongly opposed, just not a fan.  I'd like to see modification to
nr_hugepages, including incremental increase or decrease, try to keep
the number of huge pages balanced across the nodes.  Without breaking
any extra per node additions or deletions via the sysfs attribute.  I
had something in mind like remembering where the last change in
nr_hugepages left off [like the unpatched code with the static node id
variable did].  Then scan the mask of nodes with memory in one
direction when increasing nr_hugepages and in the opposite direction
when decreasing.  It'll be a while before I can put together a patch,
tho'.  In any case, I'd want to wait for the current memoryless node and
hugetlb patch streams to settle down.
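
As a rough illustration of the idea above (not a proposed patch), here
is a userspace sketch that remembers where the last change stopped,
scans forward when growing and backward when shrinking. struct hpool,
grow_one() and shrink_one() are hypothetical names, and the set of
nodes with memory is simplified to a small bitmask.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical node count */

struct hpool {
	unsigned long pages[NR_NODES];	/* huge pages per node */
	unsigned int mem_mask;		/* bit n set: node n has memory */
	int cursor;			/* node the next growth starts at */
};

/* Add one page on the next node with memory, scanning forward. */
static bool grow_one(struct hpool *p)
{
	int i;

	assert(p->cursor >= 0 && p->cursor < NR_NODES);
	for (i = 0; i < NR_NODES; i++) {
		int nid = (p->cursor + i) % NR_NODES;

		if (p->mem_mask & (1u << nid)) {
			p->pages[nid]++;
			p->cursor = (nid + 1) % NR_NODES;
			return true;
		}
	}
	return false;	/* no node with memory */
}

/* Remove one page, scanning backward from where growth left off. */
static bool shrink_one(struct hpool *p)
{
	int i;

	assert(p->cursor >= 0 && p->cursor < NR_NODES);
	for (i = 1; i <= NR_NODES; i++) {
		int nid = (p->cursor - i + NR_NODES) % NR_NODES;

		if ((p->mem_mask & (1u << nid)) && p->pages[nid] > 0) {
			p->pages[nid]--;
			p->cursor = nid;
			return true;
		}
	}
	return false;	/* pool is empty */
}
```

Because shrinking walks the nodes in the reverse of the order growth
used, the pool stays within one page of balanced across the nodes with
memory, at least as long as nothing else changes the per-node counts.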

Lee


* Re: [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 19:52         ` Lee Schermerhorn
@ 2007-08-06 20:15           ` Christoph Lameter
  2007-08-07  0:04             ` Nishanth Aravamudan
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2007-08-06 20:15 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nishanth Aravamudan, anton, wli, melgor, akpm, linux-mm, agl

On Mon, 6 Aug 2007, Lee Schermerhorn wrote:

> I don't understand what you're asking either.  The function that Nish is
> changing allocates the initial free huge page pool.  I thought that the
> intended behavior of this function was to distribute newly allocated huge
> pages evenly across the nodes.  It was broken, in that for systems with
> memoryless nodes, the allocation would immediately fall back to the next
> node in the zonelist, overloading that node with huge pages.

I am all for distributing the pages evenly. The problem is that new 
functions are now exported from the memory policy layer. Exporting 
mpol_new() may be avoided by not using a policy. If we are just doing a 
round robin over a nodemask then this may be done in a different way.
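
A policy-free round robin over a node mask could look roughly like the
sketch below. It is a userspace model only: next_node_in_mask() and
MAX_NODES are hypothetical names, and the mask is a plain 64-bit
integer rather than the kernel's nodemask_t. In the kernel, the
first_node()/next_node() helpers from <linux/nodemask.h> would be the
natural building blocks.

```c
#include <assert.h>

#define MAX_NODES 64	/* hypothetical: one bit per node in the mask */

/*
 * Return the next node after 'prev' that is set in 'mask', wrapping
 * around at MAX_NODES. Pass prev == -1 to start from the lowest node.
 * Returns -1 if the mask is empty.
 */
static int next_node_in_mask(int prev, unsigned long long mask)
{
	int i;

	assert(prev >= -1 && prev < MAX_NODES);
	if (mask == 0)
		return -1;

	for (i = 1; i <= MAX_NODES; i++) {
		int nid = (prev + i) % MAX_NODES;

		if (mask & (1ULL << nid))
			return nid;
	}
	return -1;	/* not reached for a non-empty mask */
}
```

The caller keeps the last node it used and feeds it back in, so that
memoryless nodes (bits not set in the mask) are simply skipped.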


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 18:41             ` Christoph Lameter
@ 2007-08-07  0:03               ` Nishanth Aravamudan
  0 siblings, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07  0:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

On 06.08.2007 [11:41:12 -0700], Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > I understand what you mean, that root should be able to do whatever it
> > wants, but at the same time, if a root-owned process is running in a
> > cpuset, it's constrained for a reason.
> 
> Yes but the constraint is for an application running under a regular 
> user id not for the root user.
> 
> > More importantly, let's say your process (owned by root or not) is
> > running in a restricted cpuset on  nodes 2 and 3 of a 4-node system and
> > wants to use 100 hugepages. Using the global sysctl, presuming an equal
> > distribution of free memory on all nodes, said process would need to
> > allocate 200 hugepages on the system (50 on each node), to get 100
> > hugepages on nodes 2 and 3. With this patch, it only needs to allocate
> > 100 hugepages.
> 
> The app is not able to use the sysctl. The root user must be able to do 
> whatever desired. Does not make sense to impose restrictions on sysctls.
> 
> > Become dependent on the *process* context, which is, to me, what would
> > be expected. If a process is restricted in some way, I would expect it
> > to be restricted in that way across the board.
> 
> Nope these values are global. Cpuset relative data belongs in /dev/cpuset.

Ok, I'll respin the patches with this in mind and resubmit.

Thanks for the feedback,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9
  2007-08-06 20:15           ` Christoph Lameter
@ 2007-08-07  0:04             ` Nishanth Aravamudan
  0 siblings, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07  0:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, anton, wli, melgor, akpm, linux-mm, agl

On 06.08.2007 [13:15:36 -0700], Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Lee Schermerhorn wrote:
> 
> > I don't understand what you're asking either.  The function that Nish is
> > changing allocates the initial free huge page pool.  I thought that the
> > intended behavior of this function was to distribute newly allocated huge
> > pages evenly across the nodes.  It was broken, in that for systems with
> > memoryless nodes, the allocation would immediately fall back to the next
> > node in the zonelist, overloading that node with huge pages.
> 
> I am all for distributing the pages evenly. The problem is that new
> functions are now exported from the memory policy layer. Exporting
> mpol_new() may be avoided by not using a policy. If we are just doing
> a round robin over a nodemask then this may be done in a different
> way.

How about this -- I'll respin this patch to keep the 'custom'
interleaving in hugetlb.c, while we discuss how best to do interleaving
independent of a process (which is really the issue at hand here, I
think).
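
As a rough illustration of what a round robin over a nodemask independent of any process policy could look like, here is a userspace sketch. It is not the hugetlb.c code: the mask is a plain unsigned int, MAX_NODES is an illustrative limit, and next_node_rr(), spread_pages() and pages_on_node() are hypothetical names. Memoryless nodes are modelled simply as bits that are clear in the mask.

```c
#include <assert.h>

#define MAX_NODES 32	/* illustrative limit, not the kernel's MAX_NUMNODES */

/*
 * Return the next node after `prev` whose bit is set in `mask`, wrapping
 * around; -1 if the mask is empty.  Pass prev = -1 to start at node 0.
 */
static int next_node_rr(unsigned int mask, int prev)
{
	int i;

	assert(prev >= -1 && prev < MAX_NODES);
	for (i = 1; i <= MAX_NODES; i++) {
		int nid = (prev + i) % MAX_NODES;

		if (mask & (1u << nid))
			return nid;
	}
	return -1;
}

/*
 * Spread `count` pages over the nodes in `mask`, one page per node per
 * turn.  per_node[] is incremented in place.  Returns the pages placed.
 */
static unsigned long spread_pages(unsigned int mask, unsigned long count,
				  unsigned long per_node[MAX_NODES])
{
	unsigned long placed = 0;
	int nid = -1;

	while (placed < count) {
		nid = next_node_rr(mask, nid);
		if (nid < 0)
			break;	/* no node with memory at all */
		per_node[nid]++;
		placed++;
	}
	return placed;
}

/* Convenience wrapper: pages that end up on node `nid`. */
static unsigned long pages_on_node(unsigned int mask, unsigned long count,
				   int nid)
{
	unsigned long per_node[MAX_NODES] = { 0 };

	assert(nid >= 0 && nid < MAX_NODES);
	spread_pages(mask, count, per_node);
	return per_node[nid];
}
```

For example, with mask 0xD (nodes 0, 2 and 3 have memory, node 1 does not) and count 9, each populated node receives 3 pages and node 1 receives none, rather than node 1's share falling back onto a neighbour.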

That will affect the other patches, too, so I'll rebase them and
resubmit.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 2/5] hugetlb: numafy several functions
  2007-08-06 18:15       ` Nishanth Aravamudan
@ 2007-08-07  0:34         ` Nishanth Aravamudan
  0 siblings, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07  0:34 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl

On 06.08.2007 [11:15:32 -0700], Nishanth Aravamudan wrote:
> On 06.08.2007 [10:59:20 -0700], Christoph Lameter wrote:
> > On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> > 
> > > +	page = alloc_pages_node(nid,
> > > +			GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> > > +			HUGETLB_PAGE_ORDER);
> > 
> > GFP_THISNODE disables reclaim. With Mel Gorman's ZONE_MOVABLE you may
> > want to enable reclaim here. Use __GFP_THISNODE?
> 
> It is GFP_THISNODE currently. That seems like a separate logical
> change which I'll have to consider separately.

Bah, sorry, I'm confused. You're right and I'll make this change.
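
To make the distinction concrete, here is a small userspace model. It rests on two assumptions that should be checked against include/linux/gfp.h and mm/page_alloc.c in the tree at hand: that on NUMA builds GFP_THISNODE is the composite of __GFP_THISNODE, __GFP_NOWARN and __GFP_NORETRY, and that the allocator skips reclaim only when all three bits are set. The MODEL_* names and bit values are placeholders, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration only. */
#define MODEL_NORETRY		0x1u
#define MODEL_NOWARN		0x2u
#define MODEL_THISNODE_BIT	0x4u	/* stands in for __GFP_THISNODE */

/* Assumed composition of GFP_THISNODE on NUMA builds. */
#define MODEL_GFP_THISNODE \
	(MODEL_THISNODE_BIT | MODEL_NOWARN | MODEL_NORETRY)

/*
 * Model of the assumed allocator check: reclaim is skipped only when
 * every bit of the composite flag is present in the request.
 */
static bool model_skips_reclaim(unsigned int flags)
{
	return (flags & MODEL_GFP_THISNODE) == MODEL_GFP_THISNODE;
}
```

Under these assumptions, a request carrying only the single node-restriction bit stays on the requested node but may still reclaim, which is the behaviour suggested for use with ZONE_MOVABLE.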

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-06 18:04         ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Christoph Lameter
  2007-08-06 18:26           ` Nishanth Aravamudan
  2007-08-06 19:37           ` Lee Schermerhorn
@ 2007-08-08  1:50           ` Nishanth Aravamudan
  2007-08-08 13:26             ` Lee Schermerhorn
  2 siblings, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2007-08-08  1:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: lee.schermerhorn, wli, melgor, akpm, linux-mm, agl, pj

On 06.08.2007 [11:04:48 -0700], Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Nishanth Aravamudan wrote:
> 
> > hugetlb: fix cpuset-constrained pool resizing
> > 
> > With the previous 3 patches in this series applied, if a process is in a
> > constrained cpuset, and tries to grow the hugetlb pool, hugepages may be
> > allocated on nodes outside of the process' cpuset. More concretely,
> > growing the pool via
> > 
> > echo some_value > /proc/sys/vm/nr_hugepages
> > 
> > interleaves across all nodes with memory such that hugepage allocations
> > occur on nodes outside the cpuset. Similarly, this process is able to
> > change the values in
> > /sys/devices/system/node/nodeX/nr_hugepages, even when X is not in the
> > cpuset. This directly violates the isolation that cpusets is supposed to
> > guarantee.
> 
> No it does not. Cpusets do not affect the administrative rights of users.

For reference here (as I just ran my simple script against
2.6.23-rc1-mm2, 2.6.23-rc1-mm2 + your patches, 2.6.23-rc1-mm2 + your
patches + each of my patches in turn), this is completely untrue with
-mm2 and your patches. I was actually trying to restore this behavior
with this patch. I realize I didn't mention this earlier... On a 4-node
x86_64:

2.6.23-rc1-mm2:

/cpuset ~
Trying to resize the pool to     200 from the top cpuset
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     75
Node 1 HugePages_Free:     25
Node 0 HugePages_Free:     25
Done.     200 free
Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:      0
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:      0
Done.     200 free
Trying to shrink the pool down to 0 from a cpuset restricted to node 1
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:      0
Done.     100 free

2.6.23-rc1-mm2 + your patches:

/cpuset ~
Trying to resize the pool to     200 from the top cpuset
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     75
Node 1 HugePages_Free:     25
Node 0 HugePages_Free:     25
Done.     200 free
Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:      0
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:    100
Node 0 HugePages_Free:      0
Done.     200 free
Trying to shrink the pool down to 0 from a cpuset restricted to node 1
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:      0
Done.     100 free

After my patch 1/2 (try harder) from this morning:

/cpuset ~
Trying to resize the pool to     200 from the top cpuset
Node 3 HugePages_Free:     25
Node 2 HugePages_Free:     75
Node 1 HugePages_Free:     75
Node 0 HugePages_Free:     25
Done.     200 free
Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:     75
Node 1 HugePages_Free:     25
Node 0 HugePages_Free:      0
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:     25
Node 2 HugePages_Free:    100
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     25
Done.     200 free
Trying to shrink the pool down to 0 from a cpuset restricted to node 1
Node 3 HugePages_Free:     25
Node 2 HugePages_Free:    100
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:     25
Done.     150 free

After patch 2/2 (memoryless nodes) from this morning (the results are
actually the same as the above, just that the values are shifted around
the nodes a bit):

/cpuset ~
Trying to resize the pool to     200 from the top cpuset
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     75
Node 1 HugePages_Free:     25
Node 0 HugePages_Free:     25
Done.     200 free
Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:     75
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:      0
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:    100
Node 2 HugePages_Free:     50
Node 1 HugePages_Free:     25
Node 0 HugePages_Free:     25
Done.     200 free
Trying to shrink the pool down to 0 from a cpuset restricted to node 1
Node 3 HugePages_Free:    100
Node 2 HugePages_Free:     50
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:     25
Done.     175 free

Finally, after my hugetlb interleave dequeue patch is applied:

/cpuset ~
Trying to resize the pool to     200 from the top cpuset
Node 3 HugePages_Free:     50
Node 2 HugePages_Free:     50
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     50
Done.     200 free
Trying to resize the pool back to     100 from the top cpuset
Node 3 HugePages_Free:     25
Node 2 HugePages_Free:     25
Node 1 HugePages_Free:     25
Node 0 HugePages_Free:     25
Done.     100 free
/cpuset/set1 /cpuset ~
Trying to resize the pool to     200 from a cpuset restricted to node 1
Node 3 HugePages_Free:     50
Node 2 HugePages_Free:     50
Node 1 HugePages_Free:     50
Node 0 HugePages_Free:     50
Done.     200 free
Trying to shrink the pool down to 0 from a cpuset restricted to node 1
Node 3 HugePages_Free:      0
Node 2 HugePages_Free:      0
Node 1 HugePages_Free:      0
Node 0 HugePages_Free:      0
Done.       0 free

So, it would appear that, in your opinion, this set of patches
constitutes a pseudo-bug-fix? Without the last patch, it seems, cpusets
are able to constrain what nodes a root process can remove hugepages
from.
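
The per-node figures above can be cross-checked against the "Done. N free" totals with a few lines of C. This is a sketch, not the script used for the runs: sum_node_free() is a hypothetical helper that assumes each per-node line has exactly the "Node <id> HugePages_Free: <count>" form shown in the output and skips everything else.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum the per-node counts in text made of lines such as
 * "Node 3 HugePages_Free:     75".  Lines in any other form are skipped.
 */
static long sum_node_free(const char *text)
{
	long total = 0;
	const char *line = text;

	assert(text != NULL);
	while (*line) {
		int nid;
		long count;
		const char *eol = strchr(line, '\n');

		/* Only lines that start with the expected prefix match. */
		if (sscanf(line, "Node %d HugePages_Free: %ld",
			   &nid, &count) == 2)
			total += count;
		if (!eol)
			break;
		line = eol + 1;
	}
	return total;
}
```

Feeding it one block of the output (the four "Node" lines plus the "Done." line) should return the same number the "Done." line reports, which makes it easy to spot a run where pages went missing or were left on nodes outside the cpuset.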

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing
  2007-08-08  1:50           ` Nishanth Aravamudan
@ 2007-08-08 13:26             ` Lee Schermerhorn
  0 siblings, 0 replies; 24+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 13:26 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, wli, melgor, akpm, linux-mm, agl, pj

On Tue, 2007-08-07 at 18:50 -0700, Nishanth Aravamudan wrote:
<snip>
> Finally, after my hugetlb interleave dequeue patch is applied:
> 
> /cpuset ~
> Trying to resize the pool to     200 from the top cpuset
> Node 3 HugePages_Free:     50
> Node 2 HugePages_Free:     50
> Node 1 HugePages_Free:     50
> Node 0 HugePages_Free:     50
> Done.     200 free
> Trying to resize the pool back to     100 from the top cpuset
> Node 3 HugePages_Free:     25
> Node 2 HugePages_Free:     25
> Node 1 HugePages_Free:     25
> Node 0 HugePages_Free:     25
> Done.     100 free
> /cpuset/set1 /cpuset ~
> Trying to resize the pool to     200 from a cpuset restricted to node 1
> Node 3 HugePages_Free:     50
> Node 2 HugePages_Free:     50
> Node 1 HugePages_Free:     50
> Node 0 HugePages_Free:     50
> Done.     200 free
> Trying to shrink the pool down to 0 from a cpuset restricted to node 1
> Node 3 HugePages_Free:      0
> Node 2 HugePages_Free:      0
> Node 1 HugePages_Free:      0
> Node 0 HugePages_Free:      0
> Done.       0 free

That's the behavior I'd like to see!!!

> 
> So, it would appear that, in your opinion, this set of patches
> constitutes a pseudo-bug-fix? Without the last patch, it seems, cpusets
> are able to constrain what nodes a root process can remove hugepages
> from.
> 
> Thanks,
> Nish
> 


end of thread, other threads:[~2007-08-08 13:26 UTC | newest]

Thread overview: 24+ messages
2007-08-06 16:32 [RFC][PATCH 0/5] hugetlb NUMA improvements Nishanth Aravamudan
2007-08-06 16:37 ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Nishanth Aravamudan
2007-08-06 16:38   ` [RFC][PATCH 2/5] hugetlb: numafy several functions Nishanth Aravamudan
2007-08-06 16:40     ` [RFC][PATCH 3/5] hugetlb: add per-node nr_hugepages sysfs attribute Nishanth Aravamudan
2007-08-06 16:44       ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Nishanth Aravamudan
2007-08-06 16:45         ` Nishanth Aravamudan
2007-08-06 16:48         ` [RFC][PATCH 5/5] hugetlb: interleave dequeueing of huge pages Nishanth Aravamudan
2007-08-06 18:04         ` [RFC][PATCH 4/5] hugetlb: fix cpuset-constrained pool resizing Christoph Lameter
2007-08-06 18:26           ` Nishanth Aravamudan
2007-08-06 18:41             ` Christoph Lameter
2007-08-07  0:03               ` Nishanth Aravamudan
2007-08-06 19:37           ` Lee Schermerhorn
2007-08-08  1:50           ` Nishanth Aravamudan
2007-08-08 13:26             ` Lee Schermerhorn
2007-08-06 17:59     ` [RFC][PATCH 2/5] hugetlb: numafy several functions Christoph Lameter
2007-08-06 18:15       ` Nishanth Aravamudan
2007-08-07  0:34         ` Nishanth Aravamudan
2007-08-06 18:00   ` [RFC][PATCH 1/5] Fix hugetlb pool allocation with empty nodes V9 Christoph Lameter
2007-08-06 18:19     ` Nishanth Aravamudan
2007-08-06 18:37       ` Christoph Lameter
2007-08-06 19:52         ` Lee Schermerhorn
2007-08-06 20:15           ` Christoph Lameter
2007-08-07  0:04             ` Nishanth Aravamudan
2007-08-06 16:39 ` [RFC][PATCH 0/5] hugetlb NUMA improvements Nishanth Aravamudan
