linux-mm.kvack.org archive mirror
* [PATCH 1/3] mm: fix misleading __GFP_REPEAT related comments
@ 2008-04-11 23:35 Nishanth Aravamudan
  2008-04-11 23:35 ` [PATCH] Smarter retry of costly-order allocations Nishanth Aravamudan
  0 siblings, 1 reply; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-11 23:35 UTC (permalink / raw)
  To: akpm; +Cc: mel, clameter, apw, linux-mm

The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY
in the core VM have somewhat differing comments as to their actual
semantics. Annoyingly, the flags definition has inline and header
comments, which might be interpreted as not being equivalent. Just add
references to the header comments in the inline ones so they don't go
out of sync in the future. In their use in __alloc_pages() clarify that
the current implementation treats low-order allocations and __GFP_REPEAT
allocations as distinct cases.

To clarify, the flags' semantics are:

__GFP_NORETRY means try no harder than one run through __alloc_pages

__GFP_REPEAT means __GFP_NOFAIL

__GFP_NOFAIL means repeat forever

order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>

---
Andrew, would it be possible to give this patch and the following two a
spin in the next -mm?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 898aa9d..b46b861 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,9 +40,9 @@ struct vm_area_struct;
 #define __GFP_FS	((__force gfp_t)0x80u)	/* Can call down to low-level FS? */
 #define __GFP_COLD	((__force gfp_t)0x100u)	/* Cache-cold page required */
 #define __GFP_NOWARN	((__force gfp_t)0x200u)	/* Suppress page allocation failure warning */
-#define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
-#define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
-#define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_REPEAT	((__force gfp_t)0x400u)	/* See above */
+#define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* See above */
+#define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 935ae16..1db36da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1691,8 +1691,9 @@ nofail_alloc:
 	 * Don't let big-order allocations loop unless the caller explicitly
 	 * requests that.  Wait for some write requests to complete then retry.
 	 *
-	 * In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
-	 * <= 3, but that may not be true in other implementations.
+	 * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or
+	 * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other
+	 * implementations.
 	 */
 	do_retry = 0;
 	if (!(gfp_mask & __GFP_NORETRY)) {

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH] Smarter retry of costly-order allocations
  2008-04-11 23:35 [PATCH 1/3] mm: fix misleading __GFP_REPEAT related comments Nishanth Aravamudan
@ 2008-04-11 23:35 ` Nishanth Aravamudan
  2008-04-11 23:36   ` [PATCH 3/3] Explicitly retry hugepage allocations Nishanth Aravamudan
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-11 23:35 UTC (permalink / raw)
  To: akpm; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (or until reclaim makes no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate.

Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation when we've reclaimed an order of pages equal to
or greater than the allocation's order. This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be required. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1db36da..1a0cc4d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1541,7 +1541,8 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
 	struct task_struct *p = current;
 	int do_retry;
 	int alloc_flags;
-	int did_some_progress;
+	unsigned long did_some_progress;
+	unsigned long pages_reclaimed = 0;
 
 	might_sleep_if(wait);
 
@@ -1691,15 +1692,26 @@ nofail_alloc:
 	 * Don't let big-order allocations loop unless the caller explicitly
 	 * requests that.  Wait for some write requests to complete then retry.
 	 *
-	 * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or
-	 * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other
+	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+	 * means __GFP_NOFAIL, but that may not be true in other
 	 * implementations.
+	 *
+	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+	 * specified, then we retry until we no longer reclaim any pages
+	 * (above), or we've reclaimed an order of pages at least as
+	 * large as the allocation's order. In both cases, if the
+	 * allocation still fails, we stop retrying.
 	 */
+	pages_reclaimed += did_some_progress;
 	do_retry = 0;
 	if (!(gfp_mask & __GFP_NORETRY)) {
-		if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
-						(gfp_mask & __GFP_REPEAT))
+		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
 			do_retry = 1;
+		} else {
+			if (gfp_mask & __GFP_REPEAT &&
+				pages_reclaimed < (1 << order))
+					do_retry = 1;
+		}
 		if (gfp_mask & __GFP_NOFAIL)
 			do_retry = 1;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 83f42c9..d106b2c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1319,6 +1319,9 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
  * hope that some of these pages can be written.  But if the allocating task
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
+ *
+ * returns:	0, if no pages reclaimed
+ * 		else, the number of pages reclaimed
  */
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					struct scan_control *sc)
@@ -1368,7 +1371,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		}
 		total_scanned += sc->nr_scanned;
 		if (nr_reclaimed >= sc->swap_cluster_max) {
-			ret = 1;
+			ret = nr_reclaimed;
 			goto out;
 		}
 
@@ -1391,7 +1394,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	}
 	/* top priority shrink_caches still had more to do? don't OOM, then */
 	if (!sc->all_unreclaimable && scan_global_lru(sc))
-		ret = 1;
+		ret = nr_reclaimed;
 out:
 	/*
 	 * Now that we've scanned all the zones at this priority level, note

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* [PATCH 3/3] Explicitly retry hugepage allocations
  2008-04-11 23:35 ` [PATCH] Smarter retry of costly-order allocations Nishanth Aravamudan
@ 2008-04-11 23:36   ` Nishanth Aravamudan
  2008-04-15  8:56     ` Mel Gorman
  2008-04-15  7:07   ` [PATCH] Smarter retry of costly-order allocations Andrew Morton
  2008-04-15  8:51   ` [PATCH] " Mel Gorman
  2 siblings, 1 reply; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-11 23:36 UTC (permalink / raw)
  To: akpm; +Cc: mel, clameter, apw, wli, linux-mm

Add __GFP_REPEAT to hugepage allocations. Do so to not necessitate
userspace putting pressure on the VM by repeated echoes into
/proc/sys/vm/nr_hugepages to grow the pool. With the previous patch to
allow for large-order __GFP_REPEAT attempts to loop for a bit (as
opposed to indefinitely), this increases the likelihood of getting
hugepages when the system experiences (or recently experienced) load.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df28c17..e13a7b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -199,7 +199,8 @@ static struct page *alloc_fresh_huge_page_node(int nid)
 	struct page *page;
 
 	page = alloc_pages_node(nid,
-		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+						__GFP_REPEAT|__GFP_NOWARN,
 		HUGETLB_PAGE_ORDER);
 	if (page) {
 		if (arch_prepare_hugepage(page)) {
@@ -294,7 +295,8 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
+	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+					__GFP_REPEAT|__GFP_NOWARN,
 					HUGETLB_PAGE_ORDER);
 
 	spin_lock(&hugetlb_lock);

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-11 23:35 ` [PATCH] Smarter retry of costly-order allocations Nishanth Aravamudan
  2008-04-11 23:36   ` [PATCH 3/3] Explicitly retry hugepage allocations Nishanth Aravamudan
@ 2008-04-15  7:07   ` Andrew Morton
  2008-04-15 17:26     ` Nishanth Aravamudan
  2008-04-15  8:51   ` [PATCH] " Mel Gorman
  2 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-04-15  7:07 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

On Fri, 11 Apr 2008 16:35:53 -0700 Nishanth Aravamudan <nacc@us.ibm.com> wrote:

> Because of page order checks in __alloc_pages(), hugepage (and similarly
> large order) allocations will not retry unless explicitly marked
> __GFP_REPEAT. However, the current retry logic is nearly an infinite
> loop (or until reclaim makes no progress whatsoever). For these costly
> allocations, that seems like overkill and could potentially never
> terminate.
> 
> Modify try_to_free_pages() to indicate how many pages were reclaimed.
> Use that information in __alloc_pages() to eventually fail a large
> __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
> or greater than the allocation's order. This relies on lumpy reclaim
> functioning as advertised. Due to fragmentation, lumpy reclaim may not
> be able to free up the order needed in one invocation, so multiple
> iterations may be required. In other words, the more fragmented memory
> is, the more retry attempts __GFP_REPEAT will make (particularly for
> higher order allocations).
> 

hm, there's rather a lot of speculation and wishful thinking in that
changelog.

If we put this through -mm and into mainline then nobody will test it 
and we won't discover whether it's good or bad until late -rc at best.

So... would like to see some firmer-looking testing results, please.

I _assume_ this patch was inspired by some observed problem?  What was that
problem, and what effect did the patch have?

And what scenarios might be damaged by this patch, and how do we test for
them?

The "repeat until we've reclaimed 1<<order pages" thing is in fact a magic
number, and its value is "1".  How did we arrive at this magic number and
why isn't "2" a better one?  Or "0.5"?


* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-11 23:35 ` [PATCH] Smarter retry of costly-order allocations Nishanth Aravamudan
  2008-04-11 23:36   ` [PATCH 3/3] Explicitly retry hugepage allocations Nishanth Aravamudan
  2008-04-15  7:07   ` [PATCH] Smarter retry of costly-order allocations Andrew Morton
@ 2008-04-15  8:51   ` Mel Gorman
  2008-04-15  9:02     ` Andrew Morton
  2 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2008-04-15  8:51 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: akpm, clameter, apw, kosaki.motohiro, linux-mm

On (11/04/08 16:35), Nishanth Aravamudan didst pronounce:
> Because of page order checks in __alloc_pages(), hugepage (and similarly
> large order) allocations will not retry unless explicitly marked
> __GFP_REPEAT. However, the current retry logic is nearly an infinite
> loop (or until reclaim makes no progress whatsoever). For these costly
> allocations, that seems like overkill and could potentially never
> terminate.
> 
> Modify try_to_free_pages() to indicate how many pages were reclaimed.
> Use that information in __alloc_pages() to eventually fail a large
> __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
> or greater than the allocation's order. This relies on lumpy reclaim
> functioning as advertised. Due to fragmentation, lumpy reclaim may not
> be able to free up the order needed in one invocation, so multiple
> iterations may be required. In other words, the more fragmented memory
> is, the more retry attempts __GFP_REPEAT will make (particularly for
> higher order allocations).
> 
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

Changelog is a lot clearer now. Thanks.

Tested-by: Mel Gorman <mel@csn.ul.ie>

> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1db36da..1a0cc4d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1541,7 +1541,8 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
>  	struct task_struct *p = current;
>  	int do_retry;
>  	int alloc_flags;
> -	int did_some_progress;
> +	unsigned long did_some_progress;
> +	unsigned long pages_reclaimed = 0;
>  
>  	might_sleep_if(wait);
>  
> @@ -1691,15 +1692,26 @@ nofail_alloc:
>  	 * Don't let big-order allocations loop unless the caller explicitly
>  	 * requests that.  Wait for some write requests to complete then retry.
>  	 *
> -	 * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or
> -	 * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other
> +	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> +	 * means __GFP_NOFAIL, but that may not be true in other
>  	 * implementations.
> +	 *
> +	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> +	 * specified, then we retry until we no longer reclaim any pages
> +	 * (above), or we've reclaimed an order of pages at least as
> +	 * large as the allocation's order. In both cases, if the
> +	 * allocation still fails, we stop retrying.
>  	 */
> +	pages_reclaimed += did_some_progress;
>  	do_retry = 0;
>  	if (!(gfp_mask & __GFP_NORETRY)) {
> -		if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
> -						(gfp_mask & __GFP_REPEAT))
> +		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
>  			do_retry = 1;
> +		} else {
> +			if (gfp_mask & __GFP_REPEAT &&
> +				pages_reclaimed < (1 << order))
> +					do_retry = 1;
> +		}
>  		if (gfp_mask & __GFP_NOFAIL)
>  			do_retry = 1;
>  	}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 83f42c9..d106b2c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1319,6 +1319,9 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
>   * hope that some of these pages can be written.  But if the allocating task
>   * holds filesystem locks which prevent writeout this might not work, and the
>   * allocation attempt will fail.
> + *
> + * returns:	0, if no pages reclaimed
> + * 		else, the number of pages reclaimed
>   */
>  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  					struct scan_control *sc)
> @@ -1368,7 +1371,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		}
>  		total_scanned += sc->nr_scanned;
>  		if (nr_reclaimed >= sc->swap_cluster_max) {
> -			ret = 1;
> +			ret = nr_reclaimed;
>  			goto out;
>  		}
>  
> @@ -1391,7 +1394,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  	}
>  	/* top priority shrink_caches still had more to do? don't OOM, then */
>  	if (!sc->all_unreclaimable && scan_global_lru(sc))
> -		ret = 1;
> +		ret = nr_reclaimed;
>  out:
>  	/*
>  	 * Now that we've scanned all the zones at this priority level, note
> 
> -- 
> Nishanth Aravamudan <nacc@us.ibm.com>
> IBM Linux Technology Center
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 3/3] Explicitly retry hugepage allocations
  2008-04-11 23:36   ` [PATCH 3/3] Explicitly retry hugepage allocations Nishanth Aravamudan
@ 2008-04-15  8:56     ` Mel Gorman
  2008-04-17  1:40       ` [UPDATED][PATCH " Nishanth Aravamudan
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2008-04-15  8:56 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: akpm, clameter, apw, wli, linux-mm

On (11/04/08 16:36), Nishanth Aravamudan didst pronounce:
> Add __GFP_REPEAT to hugepage allocations. Do so to not necessitate
> userspace putting pressure on the VM by repeated echoes into
> /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch to
> allow for large-order __GFP_REPEAT attempts to loop for a bit (as
> opposed to indefinitely), this increases the likelihood of getting
> hugepages when the system experiences (or recently experienced) load.
> 
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

I tested the patchset on an x86_32 laptop. With the patches, it was easier to
use the proc interface to grow the hugepage pool. The following is the output
of a script that grows the pool as much as possible running on 2.6.25-rc9

Allocating hugepages test
-------------------------
Disabling OOM Killer for current test process
Starting page count: 0
Attempt 1: 57 pages Progress made with 57 pages
Attempt 2: 73 pages Progress made with 16 pages
Attempt 3: 74 pages Progress made with 1 pages
Attempt 4: 75 pages Progress made with 1 pages
Attempt 5: 77 pages Progress made with 2 pages

77 pages was the most it allocated but it took 5 attempts from userspace
to get it. With your 3 patches applied,

Allocating hugepages test
-------------------------
Disabling OOM Killer for current test process
Starting page count: 0
Attempt 1: 75 pages Progress made with 75 pages
Attempt 2: 76 pages Progress made with 1 pages
Attempt 3: 79 pages Progress made with 3 pages

And 79 pages was the most it got. Your patches were able to allocate the
bulk of possible pages on the first attempt.

Tested-by: Mel Gorman <mel@csn.ul.ie>

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index df28c17..e13a7b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -199,7 +199,8 @@ static struct page *alloc_fresh_huge_page_node(int nid)
>  	struct page *page;
>  
>  	page = alloc_pages_node(nid,
> -		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> +		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> +						__GFP_REPEAT|__GFP_NOWARN,
>  		HUGETLB_PAGE_ORDER);
>  	if (page) {
>  		if (arch_prepare_hugepage(page)) {
> @@ -294,7 +295,8 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
>  	}
>  	spin_unlock(&hugetlb_lock);
>  
> -	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> +	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> +					__GFP_REPEAT|__GFP_NOWARN,
>  					HUGETLB_PAGE_ORDER);
>  
>  	spin_lock(&hugetlb_lock);
> 
> -- 
> Nishanth Aravamudan <nacc@us.ibm.com>
> IBM Linux Technology Center
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-15  8:51   ` [PATCH] " Mel Gorman
@ 2008-04-15  9:02     ` Andrew Morton
  2008-04-15  9:27       ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-04-15  9:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nishanth Aravamudan, clameter, apw, kosaki.motohiro, linux-mm

On Tue, 15 Apr 2008 09:51:55 +0100 Mel Gorman <mel@csn.ul.ie> wrote:

> On (11/04/08 16:35), Nishanth Aravamudan didst pronounce:
> > Because of page order checks in __alloc_pages(), hugepage (and similarly
> > large order) allocations will not retry unless explicitly marked
> > __GFP_REPEAT. However, the current retry logic is nearly an infinite
> > loop (or until reclaim makes no progress whatsoever). For these costly
> > allocations, that seems like overkill and could potentially never
> > terminate.
> > 
> > Modify try_to_free_pages() to indicate how many pages were reclaimed.
> > Use that information in __alloc_pages() to eventually fail a large
> > __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
> > or greater than the allocation's order. This relies on lumpy reclaim
> > functioning as advertised. Due to fragmentation, lumpy reclaim may not
> > be able to free up the order needed in one invocation, so multiple
> > iterations may be required. In other words, the more fragmented memory
> > is, the more retry attempts __GFP_REPEAT will make (particularly for
> > higher order allocations).
> > 
> > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> 
> Changelog is a lot clearer now. Thanks.
> 
> Tested-by: Mel Gorman <mel@csn.ul.ie>

Tested in what way though?


* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-15  9:02     ` Andrew Morton
@ 2008-04-15  9:27       ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2008-04-15  9:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nishanth Aravamudan, clameter, apw, kosaki.motohiro, linux-mm

On (15/04/08 02:02), Andrew Morton didst pronounce:
> On Tue, 15 Apr 2008 09:51:55 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On (11/04/08 16:35), Nishanth Aravamudan didst pronounce:
> > > Because of page order checks in __alloc_pages(), hugepage (and similarly
> > > large order) allocations will not retry unless explicitly marked
> > > __GFP_REPEAT. However, the current retry logic is nearly an infinite
> > > loop (or until reclaim makes no progress whatsoever). For these costly
> > > allocations, that seems like overkill and could potentially never
> > > terminate.
> > > 
> > > Modify try_to_free_pages() to indicate how many pages were reclaimed.
> > > Use that information in __alloc_pages() to eventually fail a large
> > > __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
> > > or greater than the allocation's order. This relies on lumpy reclaim
> > > functioning as advertised. Due to fragmentation, lumpy reclaim may not
> > > be able to free up the order needed in one invocation, so multiple
> > > iterations may be required. In other words, the more fragmented memory
> > > is, the more retry attempts __GFP_REPEAT will make (particularly for
> > > higher order allocations).
> > > 
> > > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> > 
> > Changelog is a lot clearer now. Thanks.
> > 
> > Tested-by: Mel Gorman <mel@csn.ul.ie>
> 
> Tested in what way though?
> 

It was tested as part of the full patchset, as hugepage allocations were
the easiest trigger for __GFP_REPEAT usage. It was based on 2.6.25-rc9.
The test was as follows

1. kernbench as a smoke-test
2. hugetlbcap test
	1. Build 6 trees simultaneously on a 512MB laptop
		(should have caught if pagetable allocations getting broken
		 by the change in __GFP_REPEAT semantics)
	2. Allocate hugepages via proc under load
	3. Kill all compile jobs
	4. Allocate hugepages at rest
3. Run the hugepages_get test, whose output I posted as part of patch 3

The main check was to see if pagetable allocations were getting messed
up. I didn't notice a problem on the laptop, but it's 1-way so I've
started tests on larger machines just in case.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-15  7:07   ` [PATCH] Smarter retry of costly-order allocations Andrew Morton
@ 2008-04-15 17:26     ` Nishanth Aravamudan
  2008-04-15 19:18       ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-15 17:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

On 15.04.2008 [00:07:45 -0700], Andrew Morton wrote:
> On Fri, 11 Apr 2008 16:35:53 -0700 Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> 
> > Because of page order checks in __alloc_pages(), hugepage (and similarly
> > large order) allocations will not retry unless explicitly marked
> > __GFP_REPEAT. However, the current retry logic is nearly an infinite
> > loop (or until reclaim makes no progress whatsoever). For these costly
> > allocations, that seems like overkill and could potentially never
> > terminate.
> > 
> > Modify try_to_free_pages() to indicate how many pages were reclaimed.
> > Use that information in __alloc_pages() to eventually fail a large
> > __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
> > or greater than the allocation's order. This relies on lumpy reclaim
> > functioning as advertised. Due to fragmentation, lumpy reclaim may not
> > be able to free up the order needed in one invocation, so multiple
> > iterations may be required. In other words, the more fragmented memory
> > is, the more retry attempts __GFP_REPEAT will make (particularly for
> > higher order allocations).
> > 
> 
> hm, there's rather a lot of speculation and wishful thinking in that
> changelog.

Sorry about that -- I realized after sending (and reading your other
e-mails on LKML/linux-mm about changelogs) that I should have referred
to Mel's previous testing results, at a minimum.

> If we put this through -mm and into mainline then nobody will test it 
> and we won't discover whether it's good or bad until late -rc at best.
> 
> So... would like to see some firmer-looking testing results, please.

Do Mel's e-mails cover this sufficiently?

> I _assume_ this patch was inspired by some observed problem?  What was that
> problem, and what effect did the patch have?

To make it explicit, the problem is in the userspace interface to
growing the static hugepage pool (/proc/sys/vm/nr_hugepages). An
administrator may request 100 hugepages, but due to fragmentation, load,
etc. on the system, only 60 are granted. The administrator could,
however, try to request 100 hugepages again, and be granted 70 on the
second request. Then 72 on the third, then 73 on the fourth, 73 still on
the fifth, and then 74 on the sixth, etc. Numbers are made up here, but
similar patterns are observed in practice. Rather than force user space
to keep trying until some stopping point (which user space is not
really in a position to determine, given patterns like {72, 73, 73,
74}, that is, no growth followed by growth), I think putting the
smarts in the kernel to
leverage reclaim is a better approach. And Mel's results indicate the
/proc interface "performs" better (achieves a larger number of hugepages
on first try) than before.

> And what scenarios might be damaged by this patch, and how do we test
> for them?

This is a good question -- Mel's testing does cover some of this by
verifying the reclaim path is not destroyed by the extra checks.
However, unless there is quite serious fragmentation, I think most of
the lower-order allocations (which implicitly are __GFP_NOFAIL) succeed
on one iteration through __alloc_pages anyway. The impact to
lower-order allocations should just be the changed return value, though,
as we don't look at the reclaim success to determine if we should quit
reclaiming in that case.

> The "repeat until we've reclaimed 1<<order pages" thing is in fact a
> magic number, and its value is "1".  How did we arrive at this magic
> number and why isn't "2" a better one?  Or "0.5"?

Another good question, and one that should have been answered in my
changelog, I'm sorry.

We have some order of allocation to satisfy. Relying on lumpy reclaim to
attempt to free up lumps of memory, if we have reclaimed an order
greater than or equal to the order of the requested allocation, we
should be able to satisfy the allocation. If we can't at that point, we
fail the allocation. I believe this is a good balance between trying to
succeed large allocations when possible and looping in the core VM
forever.

"2" may be a better value in one sense, because we should be even more
likely to succeed the allocation if we've freed twice as many pages as
we needed, but we'd try longer at the tail end of the reclaim loop
(having gone through several times not getting a large enough contiguous
region free), even though we probably should have succeeded earlier.

"0.5" won't work, I don't think, because that would imply reclaiming
half as many pages as the original request. Unless there were already
about half the number of pages free (but no more), the allocation would
fail early, even though it might succeed a few more times down the road.
More importantly, "1" subsumes the case where half the pages are free
now, and we need to reclaim the other half -- as we'll succeed the
allocation at some point and stop reclaiming. Really, that's the same
reason that "2" would be better -- or really __GFP_NOFAIL would be. But
given that hugepage orders are very large (and this is all of
PAGE_ALLOC_COSTLY_ORDER to begin with), I don't think we want them to be
NOFAIL.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-15 17:26     ` Nishanth Aravamudan
@ 2008-04-15 19:18       ` Andrew Morton
  2008-04-16  0:00         ` Nishanth Aravamudan
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-04-15 19:18 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

On Tue, 15 Apr 2008 10:26:14 -0700
Nishanth Aravamudan <nacc@us.ibm.com> wrote:

> > So... would like to see some firmer-looking testing results, please.
> 
> Do Mel's e-mails cover this sufficiently?

I guess so.

Could you please pull together a new set of changelogs sometime?

The big-picture change here is that we now use GFP_REPEAT for hugepages,
which makes the allocations work better.  But I assume that you hit some
problem with that which led you to reduce the amount of effort which
GFP_REPEAT will expend for larger pages, yes?

If so, a description of that problem would be appropriate as well.



* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-15 19:18       ` Andrew Morton
@ 2008-04-16  0:00         ` Nishanth Aravamudan
  2008-04-16  0:09           ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-16  0:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

On 15.04.2008 [12:18:34 -0700], Andrew Morton wrote:
> On Tue, 15 Apr 2008 10:26:14 -0700
> Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> 
> > > So... would like to see some firmer-looking testing results, please.
> > 
> > Do Mel's e-mails cover this sufficiently?
> 
> I guess so.
> 
> Could you please pull together a new set of changelogs sometime?

Will do it tomorrow, for sure.

> The big-picture change here is that we now use GFP_REPEAT for hugepages,
> which makes the allocations work better.  But I assume that you hit some
> problem with that which led you to reduce the amount of effort which
> GFP_REPEAT will expend for larger pages, yes?
> 
> If so, a description of that problem would be appropriate as well.

Will add that, as well.

Would you like me to repost the patch with the new changelog and just
ask you therein to drop and replace? Patches 1/3 and 3/3 should be
unchanged.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center



* Re: [PATCH] Smarter retry of costly-order allocations
  2008-04-16  0:00         ` Nishanth Aravamudan
@ 2008-04-16  0:09           ` Andrew Morton
  2008-04-17  1:39             ` [UPDATED][PATCH 2/3] " Nishanth Aravamudan
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-04-16  0:09 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

On Tue, 15 Apr 2008 17:00:10 -0700
Nishanth Aravamudan <nacc@us.ibm.com> wrote:

> On 15.04.2008 [12:18:34 -0700], Andrew Morton wrote:
> > On Tue, 15 Apr 2008 10:26:14 -0700
> > Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> > 
> > > > So... would like to see some firmer-looking testing results, please.
> > > 
> > > Do Mel's e-mails cover this sufficiently?
> > 
> > I guess so.
> > 
> > Could you please pull together a new set of changelogs sometime?
> 
> Will do it tomorrow, for sure.
> 
> > The big-picture change here is that we now use GFP_REPEAT for hugepages,
> > which makes the allocations work better.  But I assume that you hit some
> > problem with that which led you to reduce the amount of effort which
> > GFP_REPEAT will expend for larger pages, yes?
> > 
> > If so, a description of that problem would be appropriate as well.
> 
> Will add that, as well.
> 
> Would you like me to repost the patch with the new changelog and just
> ask you therein to drop and replace? Patches 1/3 and 3/3 should be
> unchanged.
> 

Sure, whatever, I'll work it out ;)



* [UPDATED][PATCH 2/3] Smarter retry of costly-order allocations
  2008-04-16  0:09           ` Andrew Morton
@ 2008-04-17  1:39             ` Nishanth Aravamudan
  0 siblings, 0 replies; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-17  1:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mel, clameter, apw, kosaki.motohiro, linux-mm

On 15.04.2008 [17:09:02 -0700], Andrew Morton wrote:
> On Tue, 15 Apr 2008 17:00:10 -0700
> Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> 
> > On 15.04.2008 [12:18:34 -0700], Andrew Morton wrote:
> > > On Tue, 15 Apr 2008 10:26:14 -0700
> > > Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> > > 
> > > > > So... would like to see some firmer-looking testing results, please.
> > > > 
> > > > Do Mel's e-mails cover this sufficiently?
> > > 
> > > I guess so.
> > > 
> > > Could you please pull together a new set of changelogs sometime?
> > 
> > Will do it tomorrow, for sure.
> > 
> > > The big-picture change here is that we now use GFP_REPEAT for hugepages,
> > > which makes the allocations work better.  But I assume that you hit some
> > > problem with that which led you to reduce the amount of effort which
> > > GFP_REPEAT will expend for larger pages, yes?
> > > 
> > > If so, a description of that problem would be appropriate as well.
> > 
> > Will add that, as well.
> > 
> > Would you like me to repost the patch with the new changelog and just
> > ask you therein to drop and replace? Patches 1/3 and 3/3 should be
> > unchanged.
> > 
> 
> Sure, whatever, I'll work it out ;)

Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (it runs until reclaim makes no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate. Mel observed that allowing current __GFP_REPEAT semantics for
hugepage allocations essentially killed the system. I believe this is
because we may continue to reclaim small orders of pages all over, but
never have enough to satisfy the hugepage allocation request. This is
clearly only a problem for large order allocations, of which hugepages
are the most obvious (to me).

Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation when we've reclaimed an order of pages equal to
or greater than the allocation's order. This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be required. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).

This changes the semantics of __GFP_REPEAT subtly, but *only* for
allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for
allocations of those sizes, we will retry up to a point (until at
least 1<<order pages have been reclaimed), rather than forever (which
remains the case for allocations <= PAGE_ALLOC_COSTLY_ORDER).

This change improves the /proc/sys/vm/nr_hugepages interface with a
follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
than administrators repeatedly echo'ing a particular value into the
sysctl, and forcing reclaim into action manually, this change allows for
the sysctl to attempt a reasonable effort itself. Similarly, dynamic
pool growth should be more successful under load, as lumpy reclaim can
try to free up pages, rather than failing right away.
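As a concrete sketch of the admin flow this improves -- writing the
desired pool size and reading back what the kernel actually allocated
-- here is a runnable stand-in (on a real system the file would be
/proc/sys/vm/nr_hugepages and writing it requires root; a temp file is
substituted so the sketch runs anywhere):

```shell
# Sketch of growing the hugepage pool via the sysctl file. NR_FILE is a
# stand-in created with mktemp; on real hardware you would use
# /proc/sys/vm/nr_hugepages directly (as root).
NR_FILE="$(mktemp)"
want=64
echo "$want" > "$NR_FILE"   # real hw: echo 64 > /proc/sys/vm/nr_hugepages
got="$(cat "$NR_FILE")"     # read back the pool size actually achieved
echo "requested $want, got $got hugepages"
```

Before this series, an administrator might have had to repeat the echo
several times, comparing `got` against `want` after each attempt; with
__GFP_REPEAT the first write does most of the work.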

Choosing to reclaim only up to the order of the requested allocation
strikes a balance between not failing hugepage allocations and returning
to the caller when it's unlikely to ever succeed. Because of lumpy
reclaim, if we have freed the order requested, hopefully it has been in
big chunks and those chunks will allow our allocation to succeed. If
that isn't the case after freeing up the current order, I don't think it
is likely to succeed in the future, although it is possible given a
particular fragmentation pattern.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Mel Gorman <mel@csn.ul.ie>

---
Not sure if this is any better, Andrew. I'll update 3/3 as well, to
include Mel's testing results.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1db36da..1a0cc4d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1541,7 +1541,8 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
 	struct task_struct *p = current;
 	int do_retry;
 	int alloc_flags;
-	int did_some_progress;
+	unsigned long did_some_progress;
+	unsigned long pages_reclaimed = 0;
 
 	might_sleep_if(wait);
 
@@ -1691,15 +1692,26 @@ nofail_alloc:
 	 * Don't let big-order allocations loop unless the caller explicitly
 	 * requests that.  Wait for some write requests to complete then retry.
 	 *
-	 * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or
-	 * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other
+	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+	 * means __GFP_NOFAIL, but that may not be true in other
 	 * implementations.
+	 *
+	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+	 * specified, then we retry until we no longer reclaim any pages
+	 * (above), or we've reclaimed an order of pages at least as
+	 * large as the allocation's order. In both cases, if the
+	 * allocation still fails, we stop retrying.
 	 */
+	pages_reclaimed += did_some_progress;
 	do_retry = 0;
 	if (!(gfp_mask & __GFP_NORETRY)) {
-		if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
-						(gfp_mask & __GFP_REPEAT))
+		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
 			do_retry = 1;
+		} else {
+			if (gfp_mask & __GFP_REPEAT &&
+				pages_reclaimed < (1 << order))
+					do_retry = 1;
+		}
 		if (gfp_mask & __GFP_NOFAIL)
 			do_retry = 1;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 83f42c9..d106b2c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1319,6 +1319,9 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
  * hope that some of these pages can be written.  But if the allocating task
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
+ *
+ * returns:	0, if no pages reclaimed
+ * 		else, the number of pages reclaimed
  */
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					struct scan_control *sc)
@@ -1368,7 +1371,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		}
 		total_scanned += sc->nr_scanned;
 		if (nr_reclaimed >= sc->swap_cluster_max) {
-			ret = 1;
+			ret = nr_reclaimed;
 			goto out;
 		}
 
@@ -1391,7 +1394,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	}
 	/* top priority shrink_caches still had more to do? don't OOM, then */
 	if (!sc->all_unreclaimable && scan_global_lru(sc))
-		ret = 1;
+		ret = nr_reclaimed;
 out:
 	/*
 	 * Now that we've scanned all the zones at this priority level, note

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center



* [UPDATED][PATCH 3/3] Explicitly retry hugepage allocations
  2008-04-15  8:56     ` Mel Gorman
@ 2008-04-17  1:40       ` Nishanth Aravamudan
  0 siblings, 0 replies; 14+ messages in thread
From: Nishanth Aravamudan @ 2008-04-17  1:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: akpm, clameter, apw, wli, linux-mm

On 15.04.2008 [09:56:08 +0100], Mel Gorman wrote:
> On (11/04/08 16:36), Nishanth Aravamudan didst pronounce:
> > Add __GFP_REPEAT to hugepage allocations. Do so to not necessitate
> > userspace putting pressure on the VM by repeated echo's into
> > /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch to
> > allow for large-order __GFP_REPEAT attempts to loop for a bit (as
> > opposed to indefinitely), this increases the likelihood of getting
> > hugepages when the system experiences (or recently experienced) load.
> > 
> > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> 
> I tested the patchset on an x86_32 laptop. With the patches, it was easier to
> use the proc interface to grow the hugepage pool. The following is the output
> of a script that grows the pool as much as possible running on 2.6.25-rc9
> 
> Allocating hugepages test
> -------------------------
> Disabling OOM Killer for current test process
> Starting page count: 0
> Attempt 1: 57 pages Progress made with 57 pages
> Attempt 2: 73 pages Progress made with 16 pages
> Attempt 3: 74 pages Progress made with 1 pages
> Attempt 4: 75 pages Progress made with 1 pages
> Attempt 5: 77 pages Progress made with 2 pages
> 
> 77 pages was the most it allocated but it took 5 attempts from userspace
> to get it. With your 3 patches applied,
> 
> Allocating hugepages test
> -------------------------
> Disabling OOM Killer for current test process
> Starting page count: 0
> Attempt 1: 75 pages Progress made with 75 pages
> Attempt 2: 76 pages Progress made with 1 pages
> Attempt 3: 79 pages Progress made with 3 pages
> 
> And 79 pages was the most it got. Your patches were able to allocate the
> bulk of possible pages on the first attempt.

Add __GFP_REPEAT to hugepage allocations. Do so to not necessitate
userspace putting pressure on the VM by repeated echo's into
/proc/sys/vm/nr_hugepages to grow the pool. With the previous patch to
allow for large-order __GFP_REPEAT attempts to loop for a bit (as
opposed to indefinitely), this increases the likelihood of getting
hugepages when the system experiences (or recently experienced) load.

Mel tested the patchset on an x86_32 laptop. With the patches, it was
easier to use the proc interface to grow the hugepage pool. The
following is the output of a script that grows the pool as much as
possible running on 2.6.25-rc9.

Allocating hugepages test
-------------------------
Disabling OOM Killer for current test process
Starting page count: 0
Attempt 1: 57 pages Progress made with 57 pages
Attempt 2: 73 pages Progress made with 16 pages
Attempt 3: 74 pages Progress made with 1 pages
Attempt 4: 75 pages Progress made with 1 pages
Attempt 5: 77 pages Progress made with 2 pages

77 pages was the most it allocated but it took 5 attempts from userspace
to get it. With the 3 patches in this series applied,

Allocating hugepages test
-------------------------
Disabling OOM Killer for current test process
Starting page count: 0
Attempt 1: 75 pages Progress made with 75 pages
Attempt 2: 76 pages Progress made with 1 pages
Attempt 3: 79 pages Progress made with 3 pages

And 79 pages was the most it got. The patches were able to allocate
the bulk of possible pages on the first attempt.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Mel Gorman <mel@csn.ul.ie>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df28c17..e13a7b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -199,7 +199,8 @@ static struct page *alloc_fresh_huge_page_node(int nid)
 	struct page *page;
 
 	page = alloc_pages_node(nid,
-		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+						__GFP_REPEAT|__GFP_NOWARN,
 		HUGETLB_PAGE_ORDER);
 	if (page) {
 		if (arch_prepare_hugepage(page)) {
@@ -294,7 +295,8 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
+	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+					__GFP_REPEAT|__GFP_NOWARN,
 					HUGETLB_PAGE_ORDER);
 
 	spin_lock(&hugetlb_lock);

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center



end of thread, other threads:[~2008-04-17  1:40 UTC | newest]

Thread overview: 14+ messages
2008-04-11 23:35 [PATCH 1/3] mm: fix misleading __GFP_REPEAT related comments Nishanth Aravamudan
2008-04-11 23:35 ` [PATCH] Smarter retry of costly-order allocations Nishanth Aravamudan
2008-04-11 23:36   ` [PATCH 3/3] Explicitly retry hugepage allocations Nishanth Aravamudan
2008-04-15  8:56     ` Mel Gorman
2008-04-17  1:40       ` [UPDATED][PATCH " Nishanth Aravamudan
2008-04-15  7:07   ` [PATCH] Smarter retry of costly-order allocations Andrew Morton
2008-04-15 17:26     ` Nishanth Aravamudan
2008-04-15 19:18       ` Andrew Morton
2008-04-16  0:00         ` Nishanth Aravamudan
2008-04-16  0:09           ` Andrew Morton
2008-04-17  1:39             ` [UPDATED][PATCH 2/3] " Nishanth Aravamudan
2008-04-15  8:51   ` [PATCH] " Mel Gorman
2008-04-15  9:02     ` Andrew Morton
2008-04-15  9:27       ` Mel Gorman
