[PATCH 00/25] Cleanup and optimise the page allocator V6

* [PATCH 00/25] Cleanup and optimise the page allocator V6
@ 2009-04-20 22:19 Mel Gorman
  2009-04-20 22:19 ` [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
                   ` (25 more replies)
  0 siblings, 26 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Here is V6 of the cleanup and optimisation of the page allocator and it
should be ready for wider testing. Please consider it for merging as
Pass 1 at making the page allocator faster. Other passes will occur later
when this one has had a bit of exercise. This patchset is based on
mmotm-2009-04-17 but I haven't widely tested it myself due to problems
I'm encountering with the test grid I use (mostly unrelated to the
kernel). It doesn't apply cleanly to linux-next due to dependencies on
patches in -mm but the conflicts are fairly straightforward to resolve.
I'm working on getting three local test machines built to test there but
it'll take a while and I wanted to get these patches out.

Hence, the following report is the same as in V5 and is based on an older
kernel. However, I expect similar results on a newer kernel.

======== Old Report ========

Performance is improved in a variety of cases but note it's not universal due
to lock contention, which I'll explain later. Text is reduced by 497 bytes on
the x86-64 config I checked. 18.78% fewer clock cycles were sampled in the page
allocator paths, excluding zeroing, which is roughly the same in either kernel.
Cache misses incurred within the allocator itself are also reduced: L1 cache
misses by about 7.36% and L2 cache misses by 17.91%.

The lock contention on some machines goes up for the zone->lru_lock
and zone->lock locks, which can regress some workloads even though others on
the same machine still go faster. For netperf, a lock called slock-AF_INET
seemed very important although I didn't look too closely other than noting
contention went up. The zone->lock gets hammered a lot by high-order allocs
and frees coming from SLUB, which are not covered by the PCP allocator in
this patchset. Why zone->lru_lock goes up is less clear as it's page cache
releases, but overall contention may be up because CPUs are spending less
time with interrupts disabled and more time trying to do real work but
contending on the locks.

============

Changes since V5
  o Rebase to mmotm-2009-04-17

Changes since V4
  o Drop the more controversial patches for now and focus on the "obvious win"
    material.
  o Add reviewed-by notes
  o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
  o Add unlikely() for the clearMlocked check
  o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
    with __free_pages_ok()
  o Convert num_online_nodes() to use a static value so that callers do
    not have to be individually updated
  o Rebase to mmotm-2009-03-13

Changes since V3
  o Drop the more controversial patches for now and focus on the "obvious win"
    material
  o Add reviewed-by notes
  o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
  o Add unlikely() for the clearMlocked check
  o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
    with __free_pages_ok()

Changes since V2
  o Remove branches by treating watermark flags as array indices
  o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
  o Do not check for compound on every page free
  o Remove branch by always ensuring the migratetype is known on free
  o Simplify buffered_rmqueue further
  o Reintroduce improved version of batched bulk free of pcp pages
  o Use allocation flags as an index to zone watermarks
  o Work out __GFP_COLD only once
  o Reduce the number of times zone stats are updated
  o Do not dump reserve pages back into the allocator. Instead treat them
    as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
    boundaries without causing trouble
  o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
    order-1 allocations in particular are frequent enough to justify this
  o Rearrange inlining such that the hot-path is inlined but not in a way
    that increases the text size of the page allocator
  o Make the check for needing additional zonelist filtering due to NUMA
    or cpusets as light as possible
  o Do not destroy compound pages going to the PCP lists
  o Delay the merging of buddies until a high-order allocation needs them
    or anti-fragmentation is being forced to fallback
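
As a rough illustration of the first V2 item ("treating watermark flags as
array indices"), the following user-space sketch contrasts a branchy lookup
with an indexed one. It is not taken from the patches in this series: the
struct, the flag values and every name ending in _sketch or starting with
OLD_ are assumptions made up for the example.

```c
#include <assert.h>

/* Illustrative watermark slots; the names are assumptions for this sketch. */
enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Old style (assumed): one bit per watermark, decoded with branches. */
#define OLD_WMARK_MIN   0x02
#define OLD_WMARK_LOW   0x04
#define OLD_WMARK_HIGH  0x08

/* New style: the flag values are chosen to equal the array indices. */
#define ALLOC_WMARK_MIN     WMARK_MIN
#define ALLOC_WMARK_LOW     WMARK_LOW
#define ALLOC_WMARK_HIGH    WMARK_HIGH
#define ALLOC_WMARK_MASK    0x03
#define ALLOC_OTHER_SKETCH  0x10	/* hypothetical unrelated flag bit */

struct zone_sketch {
	unsigned long watermark[NR_WMARK];
};

/* Branchy lookup: up to two compare-and-branch steps per call. */
static inline unsigned long wmark_branchy(const struct zone_sketch *z,
					  int old_flags)
{
	if (old_flags & OLD_WMARK_MIN)
		return z->watermark[WMARK_MIN];
	else if (old_flags & OLD_WMARK_LOW)
		return z->watermark[WMARK_LOW];
	return z->watermark[WMARK_HIGH];
}

/* Indexed lookup: mask off the low bits and use them as the index. */
static inline unsigned long wmark_indexed(const struct zone_sketch *z,
					  int alloc_flags)
{
	return z->watermark[alloc_flags & ALLOC_WMARK_MASK];
}
```

The indexed form only works if the watermark selection occupies the low bits
of the flags and no caller sets a value outside the array, which is why the
mask is kept as small as the number of slots allows.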

Changes since V1
  o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
  o Use non-lock bit operations for clearing the mlock flag
  o Factor out alloc_flags calculation so it is only done once (Peter)
  o Make gfp.h a bit prettier and clear-cut (Peter)
  o Instead of deleting a debugging check, replace page_count() in the
    free path with a version that does not check for compound pages (Nick)
  o Drop the alteration for hot/cold page freeing until we know if it
    helps or not

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask()
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  1:44   ` KOSAKI Motohiro
  2009-04-21  5:55   ` Pekka Enberg
  2009-04-20 22:19 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
                   ` (24 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

__alloc_pages_internal is the core page allocator function but
essentially it is an alias of __alloc_pages_nodemask. Naming a publicly
available and exported function "internal" is also a bit ugly. This
patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
deletes the old nodemask function.

Warning - This patch renames an exported symbol. No in-kernel driver is
affected, but external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/gfp.h |   12 ++----------
 mm/page_alloc.c     |    4 ++--
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0bbc15f..556c840 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -169,24 +169,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		       struct zonelist *zonelist, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
-	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
-static inline struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask)
-{
-	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
-}
-
-
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4ea469..dcc4f05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ try_next_zone:
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -1671,7 +1671,7 @@ nopage:
 got_pg:
 	return page;
 }
-EXPORT_SYMBOL(__alloc_pages_internal);
+EXPORT_SYMBOL(__alloc_pages_nodemask);
 
 /*
  * Common helper functions.
-- 
1.5.6.5

* Re: [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask()
  2009-04-20 22:19 ` [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
@ 2009-04-21  1:44   ` KOSAKI Motohiro
  2009-04-21  5:55   ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  1:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton


>  include/linux/gfp.h |   12 ++----------
>  mm/page_alloc.c     |    4 ++--
>  2 files changed, 4 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 0bbc15f..556c840 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -169,24 +169,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
>  
>  struct page *
> -__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  		       struct zonelist *zonelist, nodemask_t *nodemask);
>  
>  static inline struct page *
>  __alloc_pages(gfp_t gfp_mask, unsigned int order,
>  		struct zonelist *zonelist)
>  {
> -	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> +	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>  }
>  
> -static inline struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -		struct zonelist *zonelist, nodemask_t *nodemask)
> -{
> -	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
> -}
> -
> -
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e4ea469..dcc4f05 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1462,7 +1462,7 @@ try_next_zone:
>   * This is the 'heart' of the zoned buddy allocator.
>   */
>  struct page *
> -__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  			struct zonelist *zonelist, nodemask_t *nodemask)
>  {

sorry, late review.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



* Re: [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask()
  2009-04-20 22:19 ` [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
  2009-04-21  1:44   ` KOSAKI Motohiro
@ 2009-04-21  5:55   ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  5:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Mel Gorman wrote:
> __alloc_pages_internal is the core page allocator function but
> essentially it is an alias of __alloc_pages_nodemask. Naming a publicly
> available and exported function "internal" is also a bit ugly. This
> patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
> deletes the old nodemask function.
> 
> Warning - This patch renames an exported symbol. No in-kernel driver is
> affected, but external drivers calling __alloc_pages_internal() should
> change the call to __alloc_pages_nodemask() without any alteration of
> parameters.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

* [PATCH 02/25] Do not sanity check order in the fast path
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
  2009-04-20 22:19 ` [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  1:45   ` KOSAKI Motohiro
  2009-04-21  5:55   ` Pekka Enberg
  2009-04-20 22:19 ` [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
                   ` (23 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.
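
The idea can be sketched in user space as follows. This is only an
illustration under assumptions, not the kernel code: MAX_ORDER is given an
arbitrary value, VM_BUG_ON() is mocked with assert() behind a made-up
DEBUG_VM_SKETCH switch standing in for CONFIG_DEBUG_VM, and every function
and type ending in _sketch, _checked or _unchecked is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 11	/* assumed value, for illustration only */

/*
 * Stand-in for VM_BUG_ON(): only active when DEBUG_VM_SKETCH is defined
 * (a hypothetical switch playing the role of CONFIG_DEBUG_VM).
 */
#ifdef DEBUG_VM_SKETCH
#define VM_BUG_ON(cond) assert(!(cond))
#else
#define VM_BUG_ON(cond) ((void)0)
#endif

struct page_sketch {
	unsigned int order;
};

static struct page_sketch pool_sketch[MAX_ORDER];

/* Deeper in the call path: the check costs nothing in non-debug builds. */
static inline struct page_sketch *get_page_sketch(unsigned int order)
{
	VM_BUG_ON(order >= MAX_ORDER);
	pool_sketch[order].order = order;
	return &pool_sketch[order];
}

/* Before: every caller pays for a compare and branch in the inline wrapper. */
static inline struct page_sketch *alloc_pages_checked(unsigned int order)
{
	if (order >= MAX_ORDER)
		return NULL;
	return get_page_sketch(order);
}

/* After: the wrapper trusts the caller; misuse is caught in debug builds. */
static inline struct page_sketch *alloc_pages_unchecked(unsigned int order)
{
	return get_page_sketch(order);
}
```

The trade-off is that an out-of-range order is no longer turned into a NULL
return for the caller to handle; it is treated as a bug that debug builds
are expected to catch.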

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/gfp.h |    6 ------
 mm/page_alloc.c     |    2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 556c840..760f6c0 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -182,9 +182,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
@@ -198,9 +195,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 static inline struct page *
 alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	return alloc_pages_current(gfp_mask, order);
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcc4f05..5028f40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1405,6 +1405,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	classzone_idx = zone_idx(preferred_zone);
 
+	VM_BUG_ON(order >= MAX_ORDER);
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
-- 
1.5.6.5

* Re: [PATCH 02/25] Do not sanity check order in the fast path
  2009-04-20 22:19 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
@ 2009-04-21  1:45   ` KOSAKI Motohiro
  2009-04-21  5:55   ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  1:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> @@ -182,9 +182,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> -	if (unlikely(order >= MAX_ORDER))
> -		return NULL;
> -
>  	/* Unknown node is current node */
>  	if (nid < 0)
>  		nid = numa_node_id();
> @@ -198,9 +195,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
>  static inline struct page *
>  alloc_pages(gfp_t gfp_mask, unsigned int order)
>  {
> -	if (unlikely(order >= MAX_ORDER))
> -		return NULL;
> -
>  	return alloc_pages_current(gfp_mask, order);
>  }
>  extern struct page *alloc_page_vma(gfp_t gfp_mask,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dcc4f05..5028f40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1405,6 +1405,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  
>  	classzone_idx = zone_idx(preferred_zone);
>  
> +	VM_BUG_ON(order >= MAX_ORDER);
> +

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




* Re: [PATCH 02/25] Do not sanity check order in the fast path
  2009-04-20 22:19 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
  2009-04-21  1:45   ` KOSAKI Motohiro
@ 2009-04-21  5:55   ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  5:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Mel Gorman wrote:
> No user of the allocator API should be passing in an order >= MAX_ORDER
> but we check for it on each and every allocation. Delete this check and
> make it a VM_BUG_ON check further down the call path.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

* [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
  2009-04-20 22:19 ` [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
  2009-04-20 22:19 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  2:44   ` KOSAKI Motohiro
                     ` (2 more replies)
  2009-04-20 22:19 ` [PATCH 04/25] Check only once if the zonelist is suitable for the allocation Mel Gorman
                   ` (22 subsequent siblings)
  25 siblings, 3 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in fast
paths know for a fact their node is valid. To avoid a comparison and branch,
this patch adds alloc_pages_exact_node() that only checks the nid with
VM_BUG_ON(). Callers that know their node is valid are then converted.
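
A minimal user-space sketch of the difference between the two entry points
is shown below. It assumes a fixed "current node" and replaces the real
allocation with a function that simply returns the node it was asked for;
MAX_NUMNODES is given an arbitrary value, VM_BUG_ON() is mocked with
assert(), and all names ending in _sketch are hypothetical.

```c
#include <assert.h>

#define MAX_NUMNODES 4			/* assumed value for the sketch */
#define VM_BUG_ON(cond) assert(!(cond))	/* stand-in, always enabled here */

/* Hypothetical stand-in for numa_node_id(): pretend we run on node 2. */
static inline int numa_node_id_sketch(void)
{
	return 2;
}

/* Stand-in for the real allocation: report which node would be used. */
static inline int alloc_from_node_sketch(int nid)
{
	return nid;
}

/* Accepts -1 meaning "current node", at the cost of a compare and branch. */
static inline int alloc_pages_node_sketch(int nid)
{
	if (nid < 0)
		nid = numa_node_id_sketch();
	return alloc_from_node_sketch(nid);
}

/* Caller guarantees a valid node; only a debug assertion remains. */
static inline int alloc_pages_exact_node_sketch(int nid)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	return alloc_from_node_sketch(nid);
}
```

A caller that may legitimately hold -1 has to keep using the first form or
resolve the node itself before calling the exact variant.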

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 arch/ia64/hp/common/sba_iommu.c   |    2 +-
 arch/ia64/kernel/mca.c            |    3 +--
 arch/ia64/kernel/uncached.c       |    3 ++-
 arch/ia64/sn/pci/pci_dma.c        |    3 ++-
 arch/powerpc/platforms/cell/ras.c |    2 +-
 arch/x86/kvm/vmx.c                |    2 +-
 drivers/misc/sgi-gru/grufile.c    |    2 +-
 drivers/misc/sgi-xp/xpc_uv.c      |    2 +-
 include/linux/gfp.h               |    9 +++++++++
 include/linux/mm.h                |    1 -
 kernel/profile.c                  |    8 ++++----
 mm/filemap.c                      |    2 +-
 mm/hugetlb.c                      |    4 ++--
 mm/mempolicy.c                    |    2 +-
 mm/migrate.c                      |    2 +-
 mm/slab.c                         |    4 ++--
 mm/slob.c                         |    4 ++--
 17 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 56ceb68..fe63b2d 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1131,7 +1131,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp
 #ifdef CONFIG_NUMA
 	{
 		struct page *page;
-		page = alloc_pages_node(ioc->node == MAX_NUMNODES ?
+		page = alloc_pages_exact_node(ioc->node == MAX_NUMNODES ?
 		                        numa_node_id() : ioc->node, flags,
 		                        get_order(size));
 
diff --git a/arch/ia64/kernel/mca.c b/arch/ia64/kernel/mca.c
index 8f33a88..5b17bd4 100644
--- a/arch/ia64/kernel/mca.c
+++ b/arch/ia64/kernel/mca.c
@@ -1829,8 +1829,7 @@ ia64_mca_cpu_init(void *cpu_data)
 			data = mca_bootmem();
 			first_time = 0;
 		} else
-			data = page_address(alloc_pages_node(numa_node_id(),
-					GFP_KERNEL, get_order(sz)));
+			data = __get_free_pages(GFP_KERNEL, get_order(sz));
 		if (!data)
 			panic("Could not allocate MCA memory for cpu %d\n",
 					cpu);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 8eff8c1..6ba72ab 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 
 	/* attempt to allocate a granule's worth of cached memory pages */
 
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid,
+				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				IA64_GRANULE_SHIFT-PAGE_SHIFT);
 	if (!page) {
 		mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d876423..98b6849 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -90,7 +90,8 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
 	 */
 	node = pcibus_to_node(pdev->bus);
 	if (likely(node >=0)) {
-		struct page *p = alloc_pages_node(node, flags, get_order(size));
+		struct page *p = alloc_pages_exact_node(node,
+						flags, get_order(size));
 
 		if (likely(p))
 			cpuaddr = page_address(p);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5f961c4..16ba671 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -122,7 +122,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
 
 	area->nid = nid;
 	area->order = order;
-	area->pages = alloc_pages_node(area->nid, GFP_KERNEL, area->order);
+	area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL, area->order);
 
 	if (!area->pages)
 		goto out_free_area;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c6997c0..eaa149f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1258,7 +1258,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
 	struct page *pages;
 	struct vmcs *vmcs;
 
-	pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+	pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
 	if (!pages)
 		return NULL;
 	vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 9ba90f3..796ac70 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -306,7 +306,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
 		pnode = uv_node_to_pnode(nid);
 		if (bid < 0 || gru_base[bid])
 			continue;
-		page = alloc_pages_node(nid, GFP_KERNEL, order);
+		page = alloc_pages_exact_node(nid, GFP_KERNEL, order);
 		if (!page)
 			goto fail;
 		gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 9172fcd..c76677a 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -232,7 +232,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
 	mq->mmr_blade = uv_cpu_to_blade_id(cpu);
 
 	nid = cpu_to_node(cpu);
-	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+	page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				pg_order);
 	if (page == NULL) {
 		dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 760f6c0..c7429b8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -5,6 +5,7 @@
 #include <linux/stddef.h>
 #include <linux/linkage.h>
 #include <linux/topology.h>
+#include <linux/mmdebug.h>
 
 struct vm_area_struct;
 
@@ -189,6 +190,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
+static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+						unsigned int order)
+{
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c916e4..2d86bc8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -7,7 +7,6 @@
 
 #include <linux/gfp.h>
 #include <linux/list.h>
-#include <linux/mmdebug.h>
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/prio_tree.h>
diff --git a/kernel/profile.c b/kernel/profile.c
index 7724e04..62e08db 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -371,7 +371,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -379,7 +379,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node,
+			page = alloc_pages_exact_node(node,
 					GFP_KERNEL | __GFP_ZERO,
 					0);
 			if (!page)
@@ -570,14 +570,14 @@ static int create_hash_tables(void)
 		int node = cpu_to_node(cpu);
 		struct page *page;
 
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
 			goto out_cleanup;
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node,
+		page = alloc_pages_exact_node(node,
 				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
 				0);
 		if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5256582..0ad1eb1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -521,7 +521,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
 {
 	if (cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
-		return alloc_pages_node(n, gfp, 0);
+		return alloc_pages_exact_node(n, gfp, 0);
 	}
 	return alloc_pages(gfp, 0);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..1234486 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -630,7 +630,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 	if (h->order >= MAX_ORDER)
 		return NULL;
 
-	page = alloc_pages_node(nid,
+	page = alloc_pages_exact_node(nid,
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
@@ -649,7 +649,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
  * Use a helper variable to find the next node and then
  * copy it back to hugetlb_next_nid afterwards:
  * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
  * doesn't matter if occasionally a racer chooses the
  * same nid as we do.  Move nid forward in the mask even
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8a5d2b8..c32bc15 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -796,7 +796,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
+	return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index 068655d..5a24923 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 
 	*result = &pm->status;
 
-	return alloc_pages_node(pm->node,
+	return alloc_pages_exact_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
 }
 
diff --git a/mm/slab.c b/mm/slab.c
index 3da2640..1c680e8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1699,7 +1699,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
-	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);
 	if (!page)
 		return NULL;
 
@@ -3254,7 +3254,7 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		obj = kmem_getpages(cache, local_flags, -1);
+		obj = kmem_getpages(cache, local_flags, numa_node_id());
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (obj) {
diff --git a/mm/slob.c b/mm/slob.c
index 3e7acbc..ffeb218 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -46,7 +46,7 @@
  * NUMA support in SLOB is fairly simplistic, pushing most of the real
  * logic down to the page allocator, and simply doing the node accounting
  * on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_node() with the specified node id is used
+ * provided, alloc_pages_exact_node() with the specified node id is used
  * instead. The common case (or when the node id isn't explicitly provided)
  * will default to the current node, as per numa_node_id().
  *
@@ -243,7 +243,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 
 #ifdef CONFIG_NUMA
 	if (node != -1)
-		page = alloc_pages_node(node, gfp, order);
+		page = alloc_pages_exact_node(node, gfp, order);
 	else
 #endif
 		page = alloc_pages(gfp, order);
-- 
1.5.6.5

* Re: [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid
  2009-04-20 22:19 ` [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
@ 2009-04-21  2:44   ` KOSAKI Motohiro
  2009-04-21  6:00   ` Pekka Enberg
  2009-04-21  6:33   ` Paul Mundt
  2 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  2:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> Callers of alloc_pages_node() can optionally specify -1 as a node to mean
> "allocate from the current node". However, a number of the callers in fast
> paths know for a fact their node is valid. To avoid a comparison and branch,
> this patch adds alloc_pages_exact_node() that only checks the nid with
> VM_BUG_ON(). Callers that know their node is valid are then converted.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

I think there are still some convertible callers (the number of callers has
increased a bit recently), but that isn't important. Anyway,

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
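
For readers following the thread, the distinction drawn in the quoted
description can be modelled in user space roughly as follows. This is a
minimal sketch, not the kernel code: MODEL_MAX_NUMNODES,
model_numa_node_id() and the model_* allocators are hypothetical stand-ins,
the functions return the chosen node instead of a page, and assert() stands
in for VM_BUG_ON().

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel state; not real kernel symbols. */
#define MODEL_MAX_NUMNODES 4

static int model_current_node = 2;

static int model_numa_node_id(void)
{
	return model_current_node;
}

/* Models alloc_pages_node(): a nid of -1 means "use the current node". */
static int model_alloc_pages_node(int nid)
{
	if (nid < 0)	/* the comparison and branch the patch avoids */
		nid = model_numa_node_id();
	return nid;	/* node the allocation would be served from */
}

/* Models alloc_pages_exact_node(): the caller guarantees a valid node. */
static int model_alloc_pages_exact_node(int nid)
{
	/* assert() stands in for VM_BUG_ON(); compiled out with NDEBUG */
	assert(nid >= 0 && nid < MODEL_MAX_NUMNODES);
	return nid;
}
```

A caller that used to pass -1, like the mm/slab.c hunk, would pass
model_numa_node_id() to the exact variant instead, so the node is resolved
once at the call site.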




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid
  2009-04-20 22:19 ` [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
  2009-04-21  2:44   ` KOSAKI Motohiro
@ 2009-04-21  6:00   ` Pekka Enberg
  2009-04-21  6:33   ` Paul Mundt
  2 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  6:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Mel Gorman wrote:
> Callers of alloc_pages_node() can optionally specify -1 as a node to mean
> "allocate from the current node". However, a number of the callers in fast
> paths know for a fact their node is valid. To avoid a comparison and branch,
> this patch adds alloc_pages_exact_node() that only checks the nid with
> VM_BUG_ON(). Callers that know their node is valid are then converted.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid
  2009-04-20 22:19 ` [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
  2009-04-21  2:44   ` KOSAKI Motohiro
  2009-04-21  6:00   ` Pekka Enberg
@ 2009-04-21  6:33   ` Paul Mundt
  2 siblings, 0 replies; 105+ messages in thread
From: Paul Mundt @ 2009-04-21  6:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Mon, Apr 20, 2009 at 11:19:49PM +0100, Mel Gorman wrote:
> Callers of alloc_pages_node() can optionally specify -1 as a node to mean
> "allocate from the current node". However, a number of the callers in fast
> paths know for a fact their node is valid. To avoid a comparison and branch,
> this patch adds alloc_pages_exact_node() that only checks the nid with
> VM_BUG_ON(). Callers that know their node is valid are then converted.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

For the SLOB NUMA bits:

Acked-by: Paul Mundt <lethal@linux-sh.org>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 04/25] Check only once if the zonelist is suitable for the allocation
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (2 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  3:03   ` KOSAKI Motohiro
  2009-04-21  7:09   ` Pekka Enberg
  2009-04-20 22:19 ` [PATCH 05/25] Break up the allocator entry point into fast and slow paths Mel Gorman
                   ` (21 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

It is possible with __GFP_THISNODE that no zones are suitable. This
patch makes sure the check is only made once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)
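
The effect of moving the restart label below the check can be illustrated
with a small user-space model. This is a sketch under stated assumptions:
struct model_zonelist, model_checks and model_alloc() are hypothetical names
invented for illustration, and the "allocation" is reduced to a counter of
attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a zonelist; NULL or zero length models "empty". */
struct model_zonelist {
	const int *zones;
	size_t nr_zones;
};

/* Counts how often the empty-zonelist check is executed. */
static int model_checks;

static int model_zonelist_empty(const struct model_zonelist *zl)
{
	model_checks++;
	return zl->zones == NULL || zl->nr_zones == 0;
}

/*
 * Models the patched flow: the check sits above the restart label, so it
 * runs once no matter how many attempts follow.
 * Returns the attempt number that succeeded, or -1 on failure.
 */
static int model_alloc(const struct model_zonelist *zl, int succeed_on,
		       int max_attempts)
{
	int attempt = 0;

	assert(max_attempts > 0);

	if (model_zonelist_empty(zl))
		return -1;

restart:
	attempt++;
	if (attempt == succeed_on)
		return attempt;
	if (attempt < max_attempts)
		goto restart;
	return -1;
}
```

With the label above the check, as in the old code, model_checks would grow
by one for every pass through restart rather than once per call.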

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5028f40..3bed856 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1486,9 +1486,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
-restart:
-	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
-
+	/* the list of zones suitable for gfp_mask */
+	z = zonelist->_zonerefs;
 	if (unlikely(!z->zone)) {
 		/*
 		 * Happens if we have an empty zonelist as a result of
@@ -1497,6 +1496,7 @@ restart:
 		return NULL;
 	}
 
+restart:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 04/25] Check only once if the zonelist is suitable for the allocation
  2009-04-20 22:19 ` [PATCH 04/25] Check only once if the zonelist is suitable for the allocation Mel Gorman
@ 2009-04-21  3:03   ` KOSAKI Motohiro
  2009-04-21  7:09   ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  3:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> -restart:
> -	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
> -
> +	/* the list of zones suitable for gfp_mask */
> +	z = zonelist->_zonerefs;
>  	if (unlikely(!z->zone)) {
>  		/*
>  		 * Happens if we have an empty zonelist as a result of
> @@ -1497,6 +1496,7 @@ restart:
>  		return NULL;
>  	}
>  
> +restart:
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
>  	if (page)

looks good.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 04/25] Check only once if the zonelist is suitable for the allocation
  2009-04-20 22:19 ` [PATCH 04/25] Check only once if the zonelist is suitable for the allocation Mel Gorman
  2009-04-21  3:03   ` KOSAKI Motohiro
@ 2009-04-21  7:09   ` Pekka Enberg
  1 sibling, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> It is possible with __GFP_THISNODE that no zones are suitable. This
> patch makes sure the check is only made once.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 05/25] Break up the allocator entry point into fast and slow paths
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (3 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 04/25] Check only once if the zonelist is suitable for the allocation Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  6:35   ` KOSAKI Motohiro
  2009-04-20 22:19 ` [PATCH 06/25] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

The core of the page allocator is one giant function which allocates memory
on the stack and makes calculations that may not be needed for every
allocation. This patch breaks up the allocator path into fast and slow
paths for clarity. Note the slow paths are still inlined but the entry is
marked unlikely.  If they were not inlined, text size would actually increase
as there is only one call site.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |  356 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 222 insertions(+), 134 deletions(-)
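
The shape of the split can be sketched in user space as below. This is an
illustrative model only: struct model_zone, the watermark arithmetic and the
"reclaim frees 8 pages" step are assumptions made for the example, and
model_unlikely() mirrors the kernel's unlikely() using the GCC/Clang builtin
__builtin_expect.

```c
#include <assert.h>

/* Mirrors the kernel's unlikely(); __builtin_expect is a GCC/Clang builtin. */
#define model_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical zone with a free-page count and a low watermark. */
struct model_zone {
	int free_pages;
	int wmark_low;
};

/* Counts how often the slow path is entered. */
static int model_slowpath_calls;

/* Fast attempt: succeed only if the zone stays at or above its watermark. */
static int model_get_page(struct model_zone *z)
{
	assert(z->wmark_low >= 0);

	if (z->free_pages - 1 < z->wmark_low)
		return 0;
	z->free_pages--;
	return 1;
}

/* Slow path: rarely needed work and locals are kept out of the entry point. */
static inline int model_alloc_slowpath(struct model_zone *z)
{
	model_slowpath_calls++;
	z->free_pages += 8;	/* stand-in for reclaim making progress */
	return model_get_page(z);
}

/* Entry point: one cheap attempt, then the slow path marked unlikely. */
static int model_alloc_page(struct model_zone *z)
{
	int ok = model_get_page(z);

	if (model_unlikely(!ok))
		ok = model_alloc_slowpath(z);
	return ok;
}
```

Keeping the slow path as a static inline with a single call site lets the
compiler lay it out away from the hot instructions without adding a call.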

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bed856..13b4d11 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1460,47 +1460,172 @@ try_next_zone:
 	return page;
 }
 
-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+static inline int
+should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+				unsigned long pages_reclaimed)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
-	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	struct zoneref *z;
-	struct zone *zone;
-	struct page *page;
-	struct reclaim_state reclaim_state;
-	struct task_struct *p = current;
-	int do_retry;
-	int alloc_flags;
-	unsigned long did_some_progress;
-	unsigned long pages_reclaimed = 0;
+	/* Do not loop if specifically requested */
+	if (gfp_mask & __GFP_NORETRY)
+		return 0;
 
-	lockdep_trace_alloc(gfp_mask);
+	/*
+	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+	 * means __GFP_NOFAIL, but that may not be true in other
+	 * implementations.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
+
+	/*
+	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+	 * specified, then we retry until we no longer reclaim any pages
+	 * (above), or we've reclaimed an order of pages at least as
+	 * large as the allocation's order. In both cases, if the
+	 * allocation still fails, we stop retrying.
+	 */
+	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+		return 1;
 
-	might_sleep_if(wait);
+	/*
+	 * Don't let big-order allocations loop unless the caller
+	 * explicitly requests that.
+	 */
+	if (gfp_mask & __GFP_NOFAIL)
+		return 1;
 
-	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+	return 0;
+}
 
-	/* the list of zones suitable for gfp_mask */
-	z = zonelist->_zonerefs;
-	if (unlikely(!z->zone)) {
-		/*
-		 * Happens if we have an empty zonelist as a result of
-		 * GFP_THISNODE being used on a memoryless node
-		 */
+static inline struct page *
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	struct page *page;
+
+	/* Acquire the OOM killer lock for the zones in zonelist */
+	if (!try_set_zone_oom(zonelist, gfp_mask)) {
+		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
-restart:
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	/*
+	 * Go through the zonelist yet one more time, keep very high watermark
+	 * here, this is only to catch a parallel oom killing, we must fail if
+	 * we're still under heavy pressure.
+	 */
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+		order, zonelist, high_zoneidx,
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 	if (page)
-		goto got_pg;
+		goto out;
+
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+
+	/* Exhausted what can be done so it's blamo time */
+	out_of_memory(zonelist, gfp_mask, order);
+
+out:
+	clear_zonelist_oom(zonelist, gfp_mask);
+	return page;
+}
+
+/* The really slow allocator path where we enter direct reclaim */
+static inline struct page *
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+{
+	struct page *page = NULL;
+	struct reclaim_state reclaim_state;
+	struct task_struct *p = current;
+
+	cond_resched();
+
+	/* We now go into synchronous reclaim */
+	cpuset_memory_pressure_bump();
+
+	/*
+	 * The task's cpuset might have expanded its set of allowable nodes
+	 */
+	p->flags |= PF_MEMALLOC;
+	lockdep_set_current_reclaim_state(gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	p->reclaim_state = &reclaim_state;
+
+	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+
+	p->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+	p->flags &= ~PF_MEMALLOC;
+
+	cond_resched();
+
+	if (order != 0)
+		drain_all_pages();
+
+	if (likely(*did_some_progress))
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+					zonelist, high_zoneidx, alloc_flags);
+	return page;
+}
+
+static inline int
+is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
+{
+	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt())
+		if (!(gfp_mask & __GFP_NOMEMALLOC))
+			return 1;
+	return 0;
+}
+
+/*
+ * This is called in the allocator slow-path if the allocation request is of
+ * sufficient urgency to ignore watermarks and take other desperate measures
+ */
+static inline struct page *
+__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	struct page *page;
+
+	do {
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+
+		if (!page && gfp_mask & __GFP_NOFAIL)
+			congestion_wait(WRITE, HZ/50);
+	} while (!page && (gfp_mask & __GFP_NOFAIL));
+
+	return page;
+}
+
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+						enum zone_type high_zoneidx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
+}
+
+static inline struct page *
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask)
+{
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	struct page *page = NULL;
+	int alloc_flags;
+	unsigned long pages_reclaimed = 0;
+	unsigned long did_some_progress;
+	struct task_struct *p = current;
 
 	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
@@ -1513,8 +1638,7 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+	wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1534,6 +1658,7 @@ restart:
 	if (wait)
 		alloc_flags |= ALLOC_CPUSET;
 
+restart:
 	/*
 	 * Go through the zonelist again. Let __GFP_HIGH and allocations
 	 * coming from realtime tasks go deeper into reserves.
@@ -1547,119 +1672,47 @@ restart:
 	if (page)
 		goto got_pg;
 
-	/* This allocation should allow future memory freeing. */
-
-rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, nodemask, order,
-				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
-		}
-		goto nopage;
-	}
+	/* Allocate without watermarks if the context allows */
+	if (is_allocation_high_priority(p, gfp_mask))
+		page = __alloc_pages_high_priority(gfp_mask, order,
+			zonelist, high_zoneidx, nodemask);
+	if (page)
+		goto got_pg;
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
-	cond_resched();
-
-	/* We now go into synchronous reclaim */
-	cpuset_memory_pressure_bump();
-
-	p->flags |= PF_MEMALLOC;
-
-	lockdep_set_current_reclaim_state(gfp_mask);
-	reclaim_state.reclaimed_slab = 0;
-	p->reclaim_state = &reclaim_state;
-
-	did_some_progress = try_to_free_pages(zonelist, order,
-						gfp_mask, nodemask);
-
-	p->reclaim_state = NULL;
-	lockdep_clear_current_reclaim_state();
-	p->flags &= ~PF_MEMALLOC;
-
-	cond_resched();
+	/* Try direct reclaim and then allocating */
+	page = __alloc_pages_direct_reclaim(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask,
+					alloc_flags, &did_some_progress);
+	if (page)
+		goto got_pg;
 
-	if (order != 0)
-		drain_all_pages();
+	/*
+	 * If we failed to make any progress reclaiming, then we are
+	 * running out of options and have to consider going OOM
+	 */
+	if (!did_some_progress) {
+		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+			page = __alloc_pages_may_oom(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask);
+			if (page)
+				goto got_pg;
 
-	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
-		if (page)
-			goto got_pg;
-	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (!try_set_zone_oom(zonelist, gfp_mask)) {
-			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
-
-		/*
-		 * Go through the zonelist yet one more time, keep
-		 * very high watermark here, this is only to catch
-		 * a parallel oom killing, we must fail if we're still
-		 * under heavy pressure.
-		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
-			order, zonelist, high_zoneidx,
-			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
-		if (page) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto got_pg;
-		}
-
-		/* The OOM killer will not help higher order allocs so fail */
-		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			clear_zonelist_oom(zonelist, gfp_mask);
-			goto nopage;
-		}
-
-		out_of_memory(zonelist, gfp_mask, order);
-		clear_zonelist_oom(zonelist, gfp_mask);
-		goto restart;
 	}
 
-	/*
-	 * Don't let big-order allocations loop unless the caller explicitly
-	 * requests that.  Wait for some write requests to complete then retry.
-	 *
-	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
-	 *
-	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-	 * specified, then we retry until we no longer reclaim any pages
-	 * (above), or we've reclaimed an order of pages at least as
-	 * large as the allocation's order. In both cases, if the
-	 * allocation still fails, we stop retrying.
-	 */
+	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
-	do_retry = 0;
-	if (!(gfp_mask & __GFP_NORETRY)) {
-		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
-			do_retry = 1;
-		} else {
-			if (gfp_mask & __GFP_REPEAT &&
-				pages_reclaimed < (1 << order))
-					do_retry = 1;
-		}
-		if (gfp_mask & __GFP_NOFAIL)
-			do_retry = 1;
-	}
-	if (do_retry) {
+	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+		/* Wait for some write requests to complete then retry */
 		congestion_wait(WRITE, HZ/50);
-		goto rebalance;
+		goto restart;
 	}
 
 nopage:
@@ -1672,6 +1725,41 @@ nopage:
 	}
 got_pg:
 	return page;
+
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct page *page;
+
+	lockdep_trace_alloc(gfp_mask);
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	if (should_fail_alloc_page(gfp_mask, order))
+		return NULL;
+
+	/*
+	 * Check the zones suitable for the gfp_mask contain at least one
+	 * valid zone. It's possible to have an empty zonelist as a result
+	 * of GFP_THISNODE and a memoryless node
+	 */
+	if (unlikely(!zonelist->_zonerefs->zone))
+		return NULL;
+
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	if (unlikely(!page))
+		page = __alloc_pages_slowpath(gfp_mask, order,
+				zonelist, high_zoneidx, nodemask);
+
+	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 05/25] Break up the allocator entry point into fast and slow paths
  2009-04-20 22:19 ` [PATCH 05/25] Break up the allocator entry point into fast and slow paths Mel Gorman
@ 2009-04-21  6:35   ` KOSAKI Motohiro
  2009-04-21  7:13     ` Pekka Enberg
  2009-04-21  9:29     ` Mel Gorman
  0 siblings, 2 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  6:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> The core of the page allocator is one giant function which allocates memory
> on the stack and makes calculations that may not be needed for every
> allocation. This patch breaks up the allocator path into fast and slow
> paths for clarity. Note the slow paths are still inlined but the entry is
> marked unlikely.  If they were not inlined, text size would actually increase
> as there is only one call site.

Hmm.

This patch contains a few behavior changes.
Please separate the big cleanup patch from the behavior-changing patch.

I would like this patch to make no functional change. I'm not sure whether
these changes are intentional or not, and that makes reviewing harder.


> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  mm/page_alloc.c |  356 ++++++++++++++++++++++++++++++++++---------------------
>  1 files changed, 222 insertions(+), 134 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3bed856..13b4d11 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1460,47 +1460,172 @@ try_next_zone:
>  	return page;
>  }
>  
> -/*
> - * This is the 'heart' of the zoned buddy allocator.
> - */
> -struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -			struct zonelist *zonelist, nodemask_t *nodemask)
> +static inline int
> +should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +				unsigned long pages_reclaimed)
>  {
> -	const gfp_t wait = gfp_mask & __GFP_WAIT;
> -	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> -	struct zoneref *z;
> -	struct zone *zone;
> -	struct page *page;
> -	struct reclaim_state reclaim_state;
> -	struct task_struct *p = current;
> -	int do_retry;
> -	int alloc_flags;
> -	unsigned long did_some_progress;
> -	unsigned long pages_reclaimed = 0;
> +	/* Do not loop if specifically requested */
> +	if (gfp_mask & __GFP_NORETRY)
> +		return 0;
>  
> -	lockdep_trace_alloc(gfp_mask);
> +	/*
> +	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> +	 * means __GFP_NOFAIL, but that may not be true in other
> +	 * implementations.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return 1;
> +
> +	/*
> +	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> +	 * specified, then we retry until we no longer reclaim any pages
> +	 * (above), or we've reclaimed an order of pages at least as
> +	 * large as the allocation's order. In both cases, if the
> +	 * allocation still fails, we stop retrying.
> +	 */
> +	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> +		return 1;
>  
> -	might_sleep_if(wait);
> +	/*
> +	 * Don't let big-order allocations loop unless the caller
> +	 * explicitly requests that.
> +	 */
> +	if (gfp_mask & __GFP_NOFAIL)
> +		return 1;
>  
> -	if (should_fail_alloc_page(gfp_mask, order))
> -		return NULL;
> +	return 0;
> +}
>  
> -	/* the list of zones suitable for gfp_mask */
> -	z = zonelist->_zonerefs;
> -	if (unlikely(!z->zone)) {
> -		/*
> -		 * Happens if we have an empty zonelist as a result of
> -		 * GFP_THISNODE being used on a memoryless node
> -		 */
> +static inline struct page *
> +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> +	nodemask_t *nodemask)
> +{
> +	struct page *page;
> +
> +	/* Acquire the OOM killer lock for the zones in zonelist */
> +	if (!try_set_zone_oom(zonelist, gfp_mask)) {
> +		schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
>  
> -restart:
> -	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> -			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> +	/*
> +	 * Go through the zonelist yet one more time, keep very high watermark
> +	 * here, this is only to catch a parallel oom killing, we must fail if
> +	 * we're still under heavy pressure.
> +	 */
> +	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> +		order, zonelist, high_zoneidx,
> +		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
>  	if (page)
> -		goto got_pg;
> +		goto out;
> +
> +	/* The OOM killer will not help higher order allocs */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		goto out;
> +
> +	/* Exhausted what can be done so it's blamo time */
> +	out_of_memory(zonelist, gfp_mask, order);
> +
> +out:
> +	clear_zonelist_oom(zonelist, gfp_mask);
> +	return page;
> +}
> +
> +/* The really slow allocator path where we enter direct reclaim */
> +static inline struct page *
> +__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> +	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
> +{
> +	struct page *page = NULL;
> +	struct reclaim_state reclaim_state;
> +	struct task_struct *p = current;
> +
> +	cond_resched();
> +
> +	/* We now go into synchronous reclaim */
> +	cpuset_memory_pressure_bump();
> +
> +	/*
> +	 * The task's cpuset might have expanded its set of allowable nodes
> +	 */
> +	p->flags |= PF_MEMALLOC;
> +	lockdep_set_current_reclaim_state(gfp_mask);
> +	reclaim_state.reclaimed_slab = 0;
> +	p->reclaim_state = &reclaim_state;
> +
> +	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
> +
> +	p->reclaim_state = NULL;
> +	lockdep_clear_current_reclaim_state();
> +	p->flags &= ~PF_MEMALLOC;
> +
> +	cond_resched();
> +
> +	if (order != 0)
> +		drain_all_pages();
> +
> +	if (likely(*did_some_progress))
> +		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +					zonelist, high_zoneidx, alloc_flags);
> +	return page;
> +}
> +
> +static inline int
> +is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
> +{
> +	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> +			&& !in_interrupt())
> +		if (!(gfp_mask & __GFP_NOMEMALLOC))
> +			return 1;
> +	return 0;
> +}
> +
> +/*
> + * This is called in the allocator slow-path if the allocation request is of
> + * sufficient urgency to ignore watermarks and take other desperate measures
> + */
> +static inline struct page *
> +__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> +	nodemask_t *nodemask)
> +{
> +	struct page *page;
> +
> +	do {
> +		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> +
> +		if (!page && gfp_mask & __GFP_NOFAIL)
> +			congestion_wait(WRITE, HZ/50);
> +	} while (!page && (gfp_mask & __GFP_NOFAIL));
> +
> +	return page;
> +}
> +
> +static inline
> +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> +						enum zone_type high_zoneidx)
> +{
> +	struct zoneref *z;
> +	struct zone *zone;
> +
> +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> +		wakeup_kswapd(zone, order);
> +}
> +
> +static inline struct page *
> +__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> +	nodemask_t *nodemask)
> +{
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	struct page *page = NULL;
> +	int alloc_flags;
> +	unsigned long pages_reclaimed = 0;
> +	unsigned long did_some_progress;
> +	struct task_struct *p = current;
>  
>  	/*
>  	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> @@ -1513,8 +1638,7 @@ restart:
>  	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>  		goto nopage;
>  
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		wakeup_kswapd(zone, order);
> +	wake_all_kswapd(order, zonelist, high_zoneidx);
>  
>  	/*
>  	 * OK, we're below the kswapd watermark and have kicked background
> @@ -1534,6 +1658,7 @@ restart:
>  	if (wait)
>  		alloc_flags |= ALLOC_CPUSET;
>  
> +restart:
>  	/*
>  	 * Go through the zonelist again. Let __GFP_HIGH and allocations
>  	 * coming from realtime tasks go deeper into reserves.
> @@ -1547,119 +1672,47 @@ restart:
>  	if (page)
>  		goto got_pg;
>  
> -	/* This allocation should allow future memory freeing. */
> -
> -rebalance:
> -	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -			&& !in_interrupt()) {
> -		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> -nofail_alloc:
> -			/* go through the zonelist yet again, ignoring mins */
> -			page = get_page_from_freelist(gfp_mask, nodemask, order,
> -				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> -			if (page)
> -				goto got_pg;
> -			if (gfp_mask & __GFP_NOFAIL) {
> -				congestion_wait(WRITE, HZ/50);
> -				goto nofail_alloc;
> -			}
> -		}
> -		goto nopage;
> -	}
> +	/* Allocate without watermarks if the context allows */
> +	if (is_allocation_high_priority(p, gfp_mask))
> +		page = __alloc_pages_high_priority(gfp_mask, order,
> +			zonelist, high_zoneidx, nodemask);
> +	if (page)
> +		goto got_pg;
>  
>  	/* Atomic allocations - we can't balance anything */
>  	if (!wait)
>  		goto nopage;
>  

The old code is below.
When PF_MEMALLOC is set, !in_interrupt() holds and __GFP_NOMEMALLOC is
passed, the old code jumps to nopage, whereas yours goes on to call reclaim.

I think a task that has PF_MEMALLOC set should not call reclaim;
otherwise, endless reclaim recursion can happen.

--------------------------------------------------------------------
rebalance:
        if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
                        && !in_interrupt()) {
                if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
                        /* go through the zonelist yet again, ignoring mins */
                        page = get_page_from_freelist(gfp_mask, nodemask, order,
                                zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
                        if (page)
                                goto got_pg;
                        if (gfp_mask & __GFP_NOFAIL) {
                                congestion_wait(WRITE, HZ/50);
                                goto nofail_alloc;
                        }
                }
                goto nopage;
        }
--------------------------------------------------------------------
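
To make the difference concrete, here is a minimal user-space sketch of the
two control flows. It is only a model: struct ctx, enum next_step, old_path()
and v6_path() are hypothetical names, the booleans stand in for the real flag
tests, and __GFP_NOFAIL is assumed to be clear so the retry loop is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; these are not real kernel types. */
struct ctx {
	bool pf_memalloc;	/* p->flags & PF_MEMALLOC */
	bool tif_memdie;	/* test_thread_flag(TIF_MEMDIE) */
	bool in_interrupt;	/* in_interrupt() */
	bool gfp_nomemalloc;	/* gfp_mask & __GFP_NOMEMALLOC */
	bool alloc_succeeds;	/* would the ALLOC_NO_WATERMARKS attempt succeed? */
};

/* FALL_THROUGH: continue to the !wait check and then direct reclaim */
enum next_step { GOT_PG, NOPAGE, FALL_THROUGH };

/* Old inline block under the rebalance label (assumes no __GFP_NOFAIL) */
enum next_step old_path(const struct ctx *c)
{
	if ((c->pf_memalloc || c->tif_memdie) && !c->in_interrupt) {
		if (!c->gfp_nomemalloc && c->alloc_succeeds)
			return GOT_PG;
		return NOPAGE;
	}
	return FALL_THROUGH;
}

/* V6: __GFP_NOMEMALLOC is folded into is_allocation_high_priority() */
enum next_step v6_path(const struct ctx *c)
{
	bool high_prio = (c->pf_memalloc || c->tif_memdie) &&
			 !c->in_interrupt && !c->gfp_nomemalloc;

	if (high_prio && c->alloc_succeeds)
		return GOT_PG;
	return FALL_THROUGH;
}

/* The case described above: PF_MEMALLOC task passing __GFP_NOMEMALLOC */
void check_reported_case(void)
{
	struct ctx c = { .pf_memalloc = true, .gfp_nomemalloc = true };

	assert(old_path(&c) == NOPAGE);
	assert(v6_path(&c) == FALL_THROUGH);
}
```

In this model the old code stops at nopage, while the V6 code falls through
towards direct reclaim, which is the recursion risk mentioned above.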




>  
>  	/* Atomic allocations - we can't balance anything */
>  	if (!wait)
>  		goto nopage;
>  
> -	cond_resched();
> -
> -	/* We now go into synchronous reclaim */
> -	cpuset_memory_pressure_bump();
> -
> -	p->flags |= PF_MEMALLOC;
> -
> -	lockdep_set_current_reclaim_state(gfp_mask);
> -	reclaim_state.reclaimed_slab = 0;
> -	p->reclaim_state = &reclaim_state;
> -
> -	did_some_progress = try_to_free_pages(zonelist, order,
> -						gfp_mask, nodemask);
> -
> -	p->reclaim_state = NULL;
> -	lockdep_clear_current_reclaim_state();
> -	p->flags &= ~PF_MEMALLOC;
> -
> -	cond_resched();
> +	/* Try direct reclaim and then allocating */
> +	page = __alloc_pages_direct_reclaim(gfp_mask, order,
> +					zonelist, high_zoneidx,
> +					nodemask,
> +					alloc_flags, &did_some_progress);
> +	if (page)
> +		goto got_pg;
>  
> -	if (order != 0)
> -		drain_all_pages();
> +	/*
> +	 * If we failed to make any progress reclaiming, then we are
> +	 * running out of options and have to consider going OOM
> +	 */
> +	if (!did_some_progress) {
> +		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> +			page = __alloc_pages_may_oom(gfp_mask, order,
> +					zonelist, high_zoneidx,
> +					nodemask);
> +			if (page)
> +				goto got_pg;

The old code is here:

------------------------------------------------------------------------
        } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                if (!try_set_zone_oom(zonelist, gfp_mask)) {
                        schedule_timeout_uninterruptible(1);
                        goto restart;
                }

                /*
                 * Go through the zonelist yet one more time, keep
                 * very high watermark here, this is only to catch
                 * a parallel oom killing, we must fail if we're still
                 * under heavy pressure.
                 */
                page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
                        order, zonelist, high_zoneidx,
                        ALLOC_WMARK_HIGH|ALLOC_CPUSET);
                if (page) {
                        clear_zonelist_oom(zonelist, gfp_mask);
                        goto got_pg;
                }

                /* The OOM killer will not help higher order allocs so fail */
                if (order > PAGE_ALLOC_COSTLY_ORDER) {
                        clear_zonelist_oom(zonelist, gfp_mask);
                        goto nopage;
                }

                out_of_memory(zonelist, gfp_mask, order);
                clear_zonelist_oom(zonelist, gfp_mask);
                goto restart;
        }
------------------------------------------------------------------------

If get_page_from_freelist() returns NULL and order > PAGE_ALLOC_COSTLY_ORDER,
the old code jumps to nopage, whereas yours jumps to restart.
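
A small user-space model of the two OOM branches may help here. It is a
sketch under assumptions: old_oom_branch() and v6_oom_branch() are
hypothetical names, the booleans stand in for try_set_zone_oom() and
get_page_from_freelist(), and PAGE_ALLOC_COSTLY_ORDER is taken to be 3.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3	/* assumed value for this sketch */

enum next_step { GOT_PG, NOPAGE, RESTART };

/* Old inline branch, entered when reclaim made no progress */
enum next_step old_oom_branch(bool got_oom_lock, bool freelist_page,
			      unsigned int order)
{
	if (!got_oom_lock)
		return RESTART;		/* after schedule_timeout_uninterruptible(1) */
	if (freelist_page)
		return GOT_PG;
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return NOPAGE;		/* OOM killer will not help, so fail */
	return RESTART;			/* after out_of_memory() */
}

/*
 * V6: __alloc_pages_may_oom() hands back a page or NULL, and the caller
 * jumps to restart whenever it gets NULL.
 */
enum next_step v6_oom_branch(bool got_oom_lock, bool freelist_page,
			     unsigned int order)
{
	bool page = got_oom_lock && freelist_page;

	(void)order;	/* only decides whether out_of_memory() runs in the helper */
	if (page)
		return GOT_PG;
	return RESTART;
}

/* The case described above: lock taken, no page, costly order */
void check_costly_order_case(void)
{
	assert(old_oom_branch(true, false, 4) == NOPAGE);
	assert(v6_oom_branch(true, false, 4) == RESTART);
}
```

For orders at or below PAGE_ALLOC_COSTLY_ORDER the two models agree; they
only diverge for the costly-order case.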




> -	if (likely(did_some_progress)) {
> -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> -					zonelist, high_zoneidx, alloc_flags);
> -		if (page)
> -			goto got_pg;
> -	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> -		if (!try_set_zone_oom(zonelist, gfp_mask)) {
> -			schedule_timeout_uninterruptible(1);
>  			goto restart;
>  		}
> -
> -		/*
> -		 * Go through the zonelist yet one more time, keep
> -		 * very high watermark here, this is only to catch
> -		 * a parallel oom killing, we must fail if we're still
> -		 * under heavy pressure.
> -		 */
> -		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> -			order, zonelist, high_zoneidx,
> -			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> -		if (page) {
> -			clear_zonelist_oom(zonelist, gfp_mask);
> -			goto got_pg;
> -		}
> -
> -		/* The OOM killer will not help higher order allocs so fail */
> -		if (order > PAGE_ALLOC_COSTLY_ORDER) {
> -			clear_zonelist_oom(zonelist, gfp_mask);
> -			goto nopage;
> -		}
> -
> -		out_of_memory(zonelist, gfp_mask, order);
> -		clear_zonelist_oom(zonelist, gfp_mask);
> -		goto restart;
>  	}
>  
> -	/*
> -	 * Don't let big-order allocations loop unless the caller explicitly
> -	 * requests that.  Wait for some write requests to complete then retry.
> -	 *
> -	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> -	 * means __GFP_NOFAIL, but that may not be true in other
> -	 * implementations.
> -	 *
> -	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> -	 * specified, then we retry until we no longer reclaim any pages
> -	 * (above), or we've reclaimed an order of pages at least as
> -	 * large as the allocation's order. In both cases, if the
> -	 * allocation still fails, we stop retrying.
> -	 */
> +	/* Check if we should retry the allocation */
>  	pages_reclaimed += did_some_progress;
> -	do_retry = 0;
> -	if (!(gfp_mask & __GFP_NORETRY)) {
> -		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> -			do_retry = 1;
> -		} else {
> -			if (gfp_mask & __GFP_REPEAT &&
> -				pages_reclaimed < (1 << order))
> -					do_retry = 1;
> -		}
> -		if (gfp_mask & __GFP_NOFAIL)
> -			do_retry = 1;
> -	}
> -	if (do_retry) {
> +	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> +		/* Wait for some write requests to complete then retry */
>  		congestion_wait(WRITE, HZ/50);
> -		goto rebalance;
> +		goto restart;

This changes the jump target from rebalance to restart.
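
The retry decision itself, as opposed to the jump target, looks unchanged by
the move into should_alloc_retry(). The following user-space sketch compares
the two forms exhaustively over a small input range. The SK_* bit values and
function names are hypothetical, and PAGE_ALLOC_COSTLY_ORDER is assumed to
be 3.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; the real __GFP_* bits differ. */
#define SK_NORETRY	0x1u
#define SK_REPEAT	0x2u
#define SK_NOFAIL	0x4u
#define PAGE_ALLOC_COSTLY_ORDER	3	/* assumed value for this sketch */

/* Retry decision as written inline in the old allocator */
int old_do_retry(unsigned int gfp, unsigned int order, unsigned long reclaimed)
{
	int do_retry = 0;

	if (!(gfp & SK_NORETRY)) {
		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
			do_retry = 1;
		} else {
			if ((gfp & SK_REPEAT) && reclaimed < (1UL << order))
				do_retry = 1;
		}
		if (gfp & SK_NOFAIL)
			do_retry = 1;
	}
	return do_retry;
}

/* The same decision in the early-return form of should_alloc_retry() */
int new_should_retry(unsigned int gfp, unsigned int order,
		     unsigned long reclaimed)
{
	if (gfp & SK_NORETRY)
		return 0;
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;
	if ((gfp & SK_REPEAT) && reclaimed < (1UL << order))
		return 1;
	if (gfp & SK_NOFAIL)
		return 1;
	return 0;
}

/* Count the inputs on which the two forms disagree */
int count_retry_mismatches(void)
{
	int mismatches = 0;
	unsigned int gfp, order;
	unsigned long reclaimed;

	for (gfp = 0; gfp < 8; gfp++)
		for (order = 0; order <= 10; order++)
			for (reclaimed = 0; reclaimed <= 2048; reclaimed += 16)
				if (old_do_retry(gfp, order, reclaimed) !=
				    new_should_retry(gfp, order, reclaimed))
					mismatches++;
	return mismatches;
}

void check_retry_equivalence(void)
{
	assert(count_retry_mismatches() == 0);
}
```

If the model is faithful, the only behavior change in this hunk is the label
that the retry jumps to.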



>  	}
>  
>  nopage:
> @@ -1672,6 +1725,41 @@ nopage:
>  	}
>  got_pg:
>  	return page;
> +
> +}
> +
> +/*
> + * This is the 'heart' of the zoned buddy allocator.
> + */
> +struct page *
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, nodemask_t *nodemask)
> +{
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	struct page *page;
> +
> +	lockdep_trace_alloc(gfp_mask);
> +
> +	might_sleep_if(gfp_mask & __GFP_WAIT);
> +
> +	if (should_fail_alloc_page(gfp_mask, order))
> +		return NULL;
> +
> +	/*
> +	 * Check the zones suitable for the gfp_mask contain at least one
> +	 * valid zone. It's possible to have an empty zonelist as a result
> +	 * of GFP_THISNODE and a memoryless node
> +	 */
> +	if (unlikely(!zonelist->_zonerefs->zone))
> +		return NULL;
> +
> +	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> +			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> +	if (unlikely(!page))
> +		page = __alloc_pages_slowpath(gfp_mask, order,
> +				zonelist, high_zoneidx, nodemask);
> +
> +	return page;
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
>  
> -- 
> 1.5.6.5
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 05/25] Break up the allocator entry point into fast and slow paths
  2009-04-21  6:35   ` KOSAKI Motohiro
@ 2009-04-21  7:13     ` Pekka Enberg
  2009-04-21  9:30       ` Mel Gorman
  2009-04-21  9:29     ` Mel Gorman
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Hi!

On Tue, 2009-04-21 at 15:35 +0900, KOSAKI Motohiro wrote:
> > The core of the page allocator is one giant function which allocates memory
> > on the stack and makes calculations that may not be needed for every
> > allocation. This patch breaks up the allocator path into fast and slow
> > paths for clarity. Note the slow paths are still inlined but the entry is
> > marked unlikely.  If they were not inlined, it would actually increase text
> > size, as there is only one call site.
> 
> Hmm..
> 
> This patch has a few behavior changes.
> Please separate the big cleanup patch from the behavior changes.
> 
> I would like this patch to make no functional change. I'm not sure whether
> these changes are intentional or not, and that makes reviewing harder...

Agreed, splitting this patch into smaller chunks would make it easier to review.

			Pekka


* Re: [PATCH 05/25] Break up the allocator entry point into fast and slow paths
  2009-04-21  7:13     ` Pekka Enberg
@ 2009-04-21  9:30       ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  9:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: KOSAKI Motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:13:04AM +0300, Pekka Enberg wrote:
> Hi!
> 
> On Tue, 2009-04-21 at 15:35 +0900, KOSAKI Motohiro wrote:
> > > The core of the page allocator is one giant function which allocates memory
> > > on the stack and makes calculations that may not be needed for every
> > > allocation. This patch breaks up the allocator path into fast and slow
> > > paths for clarity. Note the slow paths are still inlined but the entry is
> > > > marked unlikely.  If they were not inlined, it would actually increase text
> > > > size, as there is only one call site.
> > > 
> > > Hmm..
> > > 
> > > This patch has a few behavior changes.
> > > Please separate the big cleanup patch from the behavior changes.
> > > 
> > > I would like this patch to make no functional change. I'm not sure whether
> > > these changes are intentional or not, and that makes reviewing harder...
> 
> Agreed, splitting this patch into smaller chunks would make it easier to review.
> 

Chunking this doesn't make it easier to review. As it is, it's possible
to go through the old path once and compare it to the new path. I had
this split out at one time, but it meant comparing old and new paths
multiple times instead of once.

However, there were functional changes in here and they needed to be
taken out.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 05/25] Break up the allocator entry point into fast and slow paths
  2009-04-21  6:35   ` KOSAKI Motohiro
  2009-04-21  7:13     ` Pekka Enberg
@ 2009-04-21  9:29     ` Mel Gorman
  2009-04-21 10:44       ` KOSAKI Motohiro
  1 sibling, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  9:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 03:35:11PM +0900, KOSAKI Motohiro wrote:
> > The core of the page allocator is one giant function which allocates memory
> > on the stack and makes calculations that may not be needed for every
> > allocation. This patch breaks up the allocator path into fast and slow
> > paths for clarity. Note the slow paths are still inlined but the entry is
> > marked unlikely.  If they were not inlined, it would actually increase text
> > size, as there is only one call site.
> 
> Hmm..
> 
> This patch has a few behavior changes.
> Please separate the big cleanup patch from the behavior changes.
> 

The change in behavior is unintentional.

> I would like this patch to make no functional change. I'm not sure whether
> these changes are intentional or not, and that makes reviewing harder...
> 

Agreed, this should not make functional changes.

> 
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/page_alloc.c |  356 ++++++++++++++++++++++++++++++++++---------------------
> >  1 files changed, 222 insertions(+), 134 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3bed856..13b4d11 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1460,47 +1460,172 @@ try_next_zone:
> >  	return page;
> >  }
> >  
> > -/*
> > - * This is the 'heart' of the zoned buddy allocator.
> > - */
> > -struct page *
> > -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > -			struct zonelist *zonelist, nodemask_t *nodemask)
> > +static inline int
> > +should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > +				unsigned long pages_reclaimed)
> >  {
> > -	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > -	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > -	struct zoneref *z;
> > -	struct zone *zone;
> > -	struct page *page;
> > -	struct reclaim_state reclaim_state;
> > -	struct task_struct *p = current;
> > -	int do_retry;
> > -	int alloc_flags;
> > -	unsigned long did_some_progress;
> > -	unsigned long pages_reclaimed = 0;
> > +	/* Do not loop if specifically requested */
> > +	if (gfp_mask & __GFP_NORETRY)
> > +		return 0;
> >  
> > -	lockdep_trace_alloc(gfp_mask);
> > +	/*
> > +	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > +	 * means __GFP_NOFAIL, but that may not be true in other
> > +	 * implementations.
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return 1;
> > +
> > +	/*
> > +	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> > +	 * specified, then we retry until we no longer reclaim any pages
> > +	 * (above), or we've reclaimed an order of pages at least as
> > +	 * large as the allocation's order. In both cases, if the
> > +	 * allocation still fails, we stop retrying.
> > +	 */
> > +	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> > +		return 1;
> >  
> > -	might_sleep_if(wait);
> > +	/*
> > +	 * Don't let big-order allocations loop unless the caller
> > +	 * explicitly requests that.
> > +	 */
> > +	if (gfp_mask & __GFP_NOFAIL)
> > +		return 1;
> >  
> > -	if (should_fail_alloc_page(gfp_mask, order))
> > -		return NULL;
> > +	return 0;
> > +}
> >  
> > -	/* the list of zones suitable for gfp_mask */
> > -	z = zonelist->_zonerefs;
> > -	if (unlikely(!z->zone)) {
> > -		/*
> > -		 * Happens if we have an empty zonelist as a result of
> > -		 * GFP_THISNODE being used on a memoryless node
> > -		 */
> > +static inline struct page *
> > +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > +	nodemask_t *nodemask)
> > +{
> > +	struct page *page;
> > +
> > +	/* Acquire the OOM killer lock for the zones in zonelist */
> > +	if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > +		schedule_timeout_uninterruptible(1);
> >  		return NULL;
> >  	}
> >  
> > -restart:
> > -	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> > -			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> > +	/*
> > +	 * Go through the zonelist yet one more time, keep very high watermark
> > +	 * here, this is only to catch a parallel oom killing, we must fail if
> > +	 * we're still under heavy pressure.
> > +	 */
> > +	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> > +		order, zonelist, high_zoneidx,
> > +		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> >  	if (page)
> > -		goto got_pg;
> > +		goto out;
> > +
> > +	/* The OOM killer will not help higher order allocs */
> > +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +		goto out;
> > +
> > +	/* Exhausted what can be done so it's blamo time */
> > +	out_of_memory(zonelist, gfp_mask, order);
> > +
> > +out:
> > +	clear_zonelist_oom(zonelist, gfp_mask);
> > +	return page;
> > +}
> > +
> > +/* The really slow allocator path where we enter direct reclaim */
> > +static inline struct page *
> > +__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > +	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
> > +{
> > +	struct page *page = NULL;
> > +	struct reclaim_state reclaim_state;
> > +	struct task_struct *p = current;
> > +
> > +	cond_resched();
> > +
> > +	/* We now go into synchronous reclaim */
> > +	cpuset_memory_pressure_bump();
> > +
> > +	/*
> > +	 * The task's cpuset might have expanded its set of allowable nodes
> > +	 */
> > +	p->flags |= PF_MEMALLOC;
> > +	lockdep_set_current_reclaim_state(gfp_mask);
> > +	reclaim_state.reclaimed_slab = 0;
> > +	p->reclaim_state = &reclaim_state;
> > +
> > +	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
> > +
> > +	p->reclaim_state = NULL;
> > +	lockdep_clear_current_reclaim_state();
> > +	p->flags &= ~PF_MEMALLOC;
> > +
> > +	cond_resched();
> > +
> > +	if (order != 0)
> > +		drain_all_pages();
> > +
> > +	if (likely(*did_some_progress))
> > +		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > +					zonelist, high_zoneidx, alloc_flags);
> > +	return page;
> > +}
> > +
> > +static inline int
> > +is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
> > +{
> > +	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > +			&& !in_interrupt())
> > +		if (!(gfp_mask & __GFP_NOMEMALLOC))
> > +			return 1;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * This is called in the allocator slow-path if the allocation request is of
> > + * sufficient urgency to ignore watermarks and take other desperate measures
> > + */
> > +static inline struct page *
> > +__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> > +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > +	nodemask_t *nodemask)
> > +{
> > +	struct page *page;
> > +
> > +	do {
> > +		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > +			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> > +
> > +		if (!page && gfp_mask & __GFP_NOFAIL)
> > +			congestion_wait(WRITE, HZ/50);
> > +	} while (!page && (gfp_mask & __GFP_NOFAIL));
> > +
> > +	return page;
> > +}
> > +
> > +static inline
> > +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> > +						enum zone_type high_zoneidx)
> > +{
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +
> > +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > +		wakeup_kswapd(zone, order);
> > +}
> > +
> > +static inline struct page *
> > +__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > +	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > +	nodemask_t *nodemask)
> > +{
> > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +	struct page *page = NULL;
> > +	int alloc_flags;
> > +	unsigned long pages_reclaimed = 0;
> > +	unsigned long did_some_progress;
> > +	struct task_struct *p = current;
> >  
> >  	/*
> >  	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> > @@ -1513,8 +1638,7 @@ restart:
> >  	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> >  		goto nopage;
> >  
> > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > -		wakeup_kswapd(zone, order);
> > +	wake_all_kswapd(order, zonelist, high_zoneidx);
> >  
> >  	/*
> >  	 * OK, we're below the kswapd watermark and have kicked background
> > @@ -1534,6 +1658,7 @@ restart:
> >  	if (wait)
> >  		alloc_flags |= ALLOC_CPUSET;
> >  
> > +restart:
> >  	/*
> >  	 * Go through the zonelist again. Let __GFP_HIGH and allocations
> >  	 * coming from realtime tasks go deeper into reserves.
> > @@ -1547,119 +1672,47 @@ restart:
> >  	if (page)
> >  		goto got_pg;
> >  
> > -	/* This allocation should allow future memory freeing. */
> > -
> > -rebalance:
> > -	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > -			&& !in_interrupt()) {
> > -		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> > -nofail_alloc:
> > -			/* go through the zonelist yet again, ignoring mins */
> > -			page = get_page_from_freelist(gfp_mask, nodemask, order,
> > -				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> > -			if (page)
> > -				goto got_pg;
> > -			if (gfp_mask & __GFP_NOFAIL) {
> > -				congestion_wait(WRITE, HZ/50);
> > -				goto nofail_alloc;
> > -			}
> > -		}
> > -		goto nopage;
> > -	}
> > +	/* Allocate without watermarks if the context allows */
> > +	if (is_allocation_high_priority(p, gfp_mask))
> > +		page = __alloc_pages_high_priority(gfp_mask, order,
> > +			zonelist, high_zoneidx, nodemask);
> > +	if (page)
> > +		goto got_pg;
> >  
> >  	/* Atomic allocations - we can't balance anything */
> >  	if (!wait)
> >  		goto nopage;
> >  
> 
> The old code is below.
> When PF_MEMALLOC is set, !in_interrupt() holds and __GFP_NOMEMALLOC is
> passed, the old code jumps to nopage, whereas yours goes on to call reclaim.
> 
> I think a task that has PF_MEMALLOC set should not call reclaim;
> otherwise, endless reclaim recursion can happen.
> 
> --------------------------------------------------------------------
> rebalance:
>         if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
>                         && !in_interrupt()) {
>                 if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> nofail_alloc:
>                         /* go through the zonelist yet again, ignoring mins */
>                         page = get_page_from_freelist(gfp_mask, nodemask, order,
>                                 zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
>                         if (page)
>                                 goto got_pg;
>                         if (gfp_mask & __GFP_NOFAIL) {
>                                 congestion_wait(WRITE, HZ/50);
>                                 goto nofail_alloc;
>                         }
>                 }
>                 goto nopage;
>         }
> --------------------------------------------------------------------
> 

I altered the modified version to look like

static inline int
is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
{
        if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
                        && !in_interrupt())
                return 1;
        return 0;
}

Note that the check for __GFP_NOMEMALLOC is no longer there.

....

        /* Allocate without watermarks if the context allows */
        if (is_allocation_high_priority(p, gfp_mask)) {
                /* Do not dip into emergency reserves if specified */
                if (!(gfp_mask & __GFP_NOMEMALLOC)) {
                        page = __alloc_pages_high_priority(gfp_mask, order,
                                zonelist, high_zoneidx, nodemask);
                        if (page)
                                goto got_pg;
                }

                /* Ensure no recursion into the allocator */
                goto nopage;
        }


Is that better?
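
One way to sanity-check the revised version against the old code is a small
user-space model that enumerates every combination of the relevant inputs.
This is a sketch, not kernel code: struct ctx, enum next_step and the
function names are hypothetical, and LOOP_FOREVER models a __GFP_NOFAIL
request whose allocation keeps failing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the inputs; these are not real kernel types. */
struct ctx {
	bool memalloc_ctx;	/* (PF_MEMALLOC || TIF_MEMDIE) && !in_interrupt() */
	bool nomemalloc;	/* gfp_mask & __GFP_NOMEMALLOC */
	bool nofail;		/* gfp_mask & __GFP_NOFAIL */
	bool alloc_succeeds;	/* does the ALLOC_NO_WATERMARKS attempt succeed? */
};

/* FALL_THROUGH: continue to the !wait check and then direct reclaim */
enum next_step { GOT_PG, NOPAGE, FALL_THROUGH, LOOP_FOREVER };

/* Old inline block under the rebalance label */
enum next_step old_path(struct ctx c)
{
	if (c.memalloc_ctx) {
		if (!c.nomemalloc) {
			if (c.alloc_succeeds)
				return GOT_PG;
			if (c.nofail)
				return LOOP_FOREVER;	/* goto nofail_alloc */
		}
		return NOPAGE;
	}
	return FALL_THROUGH;
}

/* Revised version: __GFP_NOMEMALLOC is checked by the caller again */
enum next_step revised_path(struct ctx c)
{
	if (c.memalloc_ctx) {		/* is_allocation_high_priority() */
		if (!c.nomemalloc) {
			/* __alloc_pages_high_priority(): do/while on NOFAIL */
			if (c.alloc_succeeds)
				return GOT_PG;
			if (c.nofail)
				return LOOP_FOREVER;
		}
		return NOPAGE;		/* no recursion into the allocator */
	}
	return FALL_THROUGH;
}

/* Count the input combinations on which the two models disagree */
int count_path_mismatches(void)
{
	int mismatches = 0;
	unsigned int bits;

	for (bits = 0; bits < 16; bits++) {
		struct ctx c = {
			.memalloc_ctx = bits & 1,
			.nomemalloc = bits & 2,
			.nofail = bits & 4,
			.alloc_succeeds = bits & 8,
		};

		if (old_path(c) != revised_path(c))
			mismatches++;
	}
	return mismatches;
}

void check_revised_matches_old(void)
{
	assert(count_path_mismatches() == 0);
}
```

Under these assumptions the revised flow reaches the same next step as the
old code for all sixteen combinations, including the PF_MEMALLOC plus
__GFP_NOMEMALLOC case, which ends at nopage again.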

> >  
> >  	/* Atomic allocations - we can't balance anything */
> >  	if (!wait)
> >  		goto nopage;
> >  
> > -	cond_resched();
> > -
> > -	/* We now go into synchronous reclaim */
> > -	cpuset_memory_pressure_bump();
> > -
> > -	p->flags |= PF_MEMALLOC;
> > -
> > -	lockdep_set_current_reclaim_state(gfp_mask);
> > -	reclaim_state.reclaimed_slab = 0;
> > -	p->reclaim_state = &reclaim_state;
> > -
> > -	did_some_progress = try_to_free_pages(zonelist, order,
> > -						gfp_mask, nodemask);
> > -
> > -	p->reclaim_state = NULL;
> > -	lockdep_clear_current_reclaim_state();
> > -	p->flags &= ~PF_MEMALLOC;
> > -
> > -	cond_resched();
> > +	/* Try direct reclaim and then allocating */
> > +	page = __alloc_pages_direct_reclaim(gfp_mask, order,
> > +					zonelist, high_zoneidx,
> > +					nodemask,
> > +					alloc_flags, &did_some_progress);
> > +	if (page)
> > +		goto got_pg;
> >  
> > -	if (order != 0)
> > -		drain_all_pages();
> > +	/*
> > +	 * If we failed to make any progress reclaiming, then we are
> > +	 * running out of options and have to consider going OOM
> > +	 */
> > +	if (!did_some_progress) {
> > +		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > +			page = __alloc_pages_may_oom(gfp_mask, order,
> > +					zonelist, high_zoneidx,
> > +					nodemask);
> > +			if (page)
> > +				goto got_pg;
> 
> The old code is here:
> 
> ------------------------------------------------------------------------
>         } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>                 if (!try_set_zone_oom(zonelist, gfp_mask)) {
>                         schedule_timeout_uninterruptible(1);
>                         goto restart;
>                 }
> 
>                 /*
>                  * Go through the zonelist yet one more time, keep
>                  * very high watermark here, this is only to catch
>                  * a parallel oom killing, we must fail if we're still
>                  * under heavy pressure.
>                  */
>                 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
>                         order, zonelist, high_zoneidx,
>                         ALLOC_WMARK_HIGH|ALLOC_CPUSET);
>                 if (page) {
>                         clear_zonelist_oom(zonelist, gfp_mask);
>                         goto got_pg;
>                 }
> 
>                 /* The OOM killer will not help higher order allocs so fail */
>                 if (order > PAGE_ALLOC_COSTLY_ORDER) {
>                         clear_zonelist_oom(zonelist, gfp_mask);
>                         goto nopage;
>                 }
> 
>                 out_of_memory(zonelist, gfp_mask, order);
>                 clear_zonelist_oom(zonelist, gfp_mask);
>                 goto restart;
>         }
> ------------------------------------------------------------------------
> 
> If get_page_from_freelist() returns NULL and order > PAGE_ALLOC_COSTLY_ORDER,
> the old code jumps to nopage, whereas yours jumps to restart.
> 

Good spot. The new section now looks like

                        page = __alloc_pages_may_oom(gfp_mask, order,
                                        zonelist, high_zoneidx,
                                        nodemask);
                        if (page)
                                goto got_pg;

                        /*
                         * The OOM killer does not trigger for high-order allocations
                         * but if no progress is being made, there are no other
                         * options and retrying is unlikely to help
                         */
                        if (order > PAGE_ALLOC_COSTLY_ORDER)
                                goto nopage;

Better?

> 
> 
> 
> > -	if (likely(did_some_progress)) {
> > -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > -					zonelist, high_zoneidx, alloc_flags);
> > -		if (page)
> > -			goto got_pg;
> > -	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > -		if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > -			schedule_timeout_uninterruptible(1);
> >  			goto restart;
> >  		}
> > -
> > -		/*
> > -		 * Go through the zonelist yet one more time, keep
> > -		 * very high watermark here, this is only to catch
> > -		 * a parallel oom killing, we must fail if we're still
> > -		 * under heavy pressure.
> > -		 */
> > -		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> > -			order, zonelist, high_zoneidx,
> > -			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> > -		if (page) {
> > -			clear_zonelist_oom(zonelist, gfp_mask);
> > -			goto got_pg;
> > -		}
> > -
> > -		/* The OOM killer will not help higher order allocs so fail */
> > -		if (order > PAGE_ALLOC_COSTLY_ORDER) {
> > -			clear_zonelist_oom(zonelist, gfp_mask);
> > -			goto nopage;
> > -		}
> > -
> > -		out_of_memory(zonelist, gfp_mask, order);
> > -		clear_zonelist_oom(zonelist, gfp_mask);
> > -		goto restart;
> >  	}
> >  
> > -	/*
> > -	 * Don't let big-order allocations loop unless the caller explicitly
> > -	 * requests that.  Wait for some write requests to complete then retry.
> > -	 *
> > -	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > -	 * means __GFP_NOFAIL, but that may not be true in other
> > -	 * implementations.
> > -	 *
> > -	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> > -	 * specified, then we retry until we no longer reclaim any pages
> > -	 * (above), or we've reclaimed an order of pages at least as
> > -	 * large as the allocation's order. In both cases, if the
> > -	 * allocation still fails, we stop retrying.
> > -	 */
> > +	/* Check if we should retry the allocation */
> >  	pages_reclaimed += did_some_progress;
> > -	do_retry = 0;
> > -	if (!(gfp_mask & __GFP_NORETRY)) {
> > -		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> > -			do_retry = 1;
> > -		} else {
> > -			if (gfp_mask & __GFP_REPEAT &&
> > -				pages_reclaimed < (1 << order))
> > -					do_retry = 1;
> > -		}
> > -		if (gfp_mask & __GFP_NOFAIL)
> > -			do_retry = 1;
> > -	}
> > -	if (do_retry) {
> > +	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > +		/* Wait for some write requests to complete then retry */
> >  		congestion_wait(WRITE, HZ/50);
> > -		goto rebalance;
> > +		goto restart;
> 
> This changes the jump target from rebalance to restart.
> 

True, it makes more sense to me to goto restart at that point after
waiting on IO to complete, but it's a functional change and doesn't
belong in this patch. I've fixed it up.

Very well spotted.

> 
> >  	}
> >  
> >  nopage:
> > @@ -1672,6 +1725,41 @@ nopage:
> >  	}
> >  got_pg:
> >  	return page;
> > +
> > +}
> > +
> > +/*
> > + * This is the 'heart' of the zoned buddy allocator.
> > + */
> > +struct page *
> > +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > +			struct zonelist *zonelist, nodemask_t *nodemask)
> > +{
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +	struct page *page;
> > +
> > +	lockdep_trace_alloc(gfp_mask);
> > +
> > +	might_sleep_if(gfp_mask & __GFP_WAIT);
> > +
> > +	if (should_fail_alloc_page(gfp_mask, order))
> > +		return NULL;
> > +
> > +	/*
> > +	 * Check the zones suitable for the gfp_mask contain at least one
> > +	 * valid zone. It's possible to have an empty zonelist as a result
> > +	 * of GFP_THISNODE and a memoryless node
> > +	 */
> > +	if (unlikely(!zonelist->_zonerefs->zone))
> > +		return NULL;
> > +
> > +	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> > +			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> > +	if (unlikely(!page))
> > +		page = __alloc_pages_slowpath(gfp_mask, order,
> > +				zonelist, high_zoneidx, nodemask);
> > +
> > +	return page;
> >  }
> >  EXPORT_SYMBOL(__alloc_pages_nodemask);
> >  
> > -- 
> > 1.5.6.5
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 05/25] Break up the allocator entry point into fast and slow paths
  2009-04-21  9:29     ` Mel Gorman
@ 2009-04-21 10:44       ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> > >  
> > > -	/* This allocation should allow future memory freeing. */
> > > -
> > > -rebalance:
> > > -	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > > -			&& !in_interrupt()) {
> > > -		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> > > -nofail_alloc:
> > > -			/* go through the zonelist yet again, ignoring mins */
> > > -			page = get_page_from_freelist(gfp_mask, nodemask, order,
> > > -				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> > > -			if (page)
> > > -				goto got_pg;
> > > -			if (gfp_mask & __GFP_NOFAIL) {
> > > -				congestion_wait(WRITE, HZ/50);
> > > -				goto nofail_alloc;
> > > -			}
> > > -		}
> > > -		goto nopage;
> > > -	}
> > > +	/* Allocate without watermarks if the context allows */
> > > +	if (is_allocation_high_priority(p, gfp_mask))
> > > +		page = __alloc_pages_high_priority(gfp_mask, order,
> > > +			zonelist, high_zoneidx, nodemask);
> > > +	if (page)
> > > +		goto got_pg;
> > >  
> > >  	/* Atomic allocations - we can't balance anything */
> > >  	if (!wait)
> > >  		goto nopage;
> > >  
> > 
> > The old code is below.
> > In the PF_MEMALLOC, !in_interrupt() and __GFP_NOMEMALLOC case,
> > the old code jumps to nopage, but yours calls reclaim.
> > 
> > I think that if the task has PF_MEMALLOC, it shouldn't call reclaim.
> > Otherwise, endless reclaim recursion happens.
> > 
> > --------------------------------------------------------------------
> > rebalance:
> >         if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> >                         && !in_interrupt()) {
> >                 if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> > nofail_alloc:
> >                         /* go through the zonelist yet again, ignoring mins */
> >                         page = get_page_from_freelist(gfp_mask, nodemask, order,
> >                                 zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> >                         if (page)
> >                                 goto got_pg;
> >                         if (gfp_mask & __GFP_NOFAIL) {
> >                                 congestion_wait(WRITE, HZ/50);
> >                                 goto nofail_alloc;
> >                         }
> >                 }
> >                 goto nopage;
> >         }
> > --------------------------------------------------------------------
> > 
> 
> I altered the modified version to look like
> 
> static inline int
> is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
> {
>         if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
>                         && !in_interrupt())
>                 return 1;
>         return 0;
> }
> 
> Note the check to __GFP_NOMEMALLOC is no longer there.
> 
> ....
> 
>         /* Allocate without watermarks if the context allows */
>         if (is_allocation_high_priority(p, gfp_mask)) {
> 		/* Do not dip into emergency reserves if specified */
>                 if (!(gfp_mask & __GFP_NOMEMALLOC)) {
>                         page = __alloc_pages_high_priority(gfp_mask, order,
>                                 zonelist, high_zoneidx, nodemask);
>                         if (page)
>                                 goto got_pg;
>                 }
> 
> 		/* Ensure no recursion into the allocator */
>                 goto nopage;
>         }
> 
> 
> Is that better?

nice.
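
For readers of the archive, the control flow agreed on above can be modelled
in userspace. This is only a sketch under stated assumptions: the enum, the
function names and the three boolean inputs are hypothetical stand-ins, the
__GFP_NOFAIL retry loop is folded into a single "did the reserve allocation
succeed" input, the !wait check is ignored, and posted_flow() follows
KOSAKI's description of the V6 behaviour rather than code shown in this
thread.

```c
/* Hypothetical outcomes of the high-priority step in the slow path. */
enum outcome { GOT_PAGE, NOPAGE, TRY_RECLAIM };

/*
 * Revised flow from the reply: a PF_MEMALLOC/TIF_MEMDIE context never
 * falls through to direct reclaim, whether or not __GFP_NOMEMALLOC is set.
 */
static enum outcome revised_flow(int high_priority, int nomemalloc,
				 int reserve_alloc_ok)
{
	if (high_priority) {
		/* Do not dip into emergency reserves if specified */
		if (!nomemalloc && reserve_alloc_ok)
			return GOT_PAGE;
		/* Ensure no recursion into the allocator */
		return NOPAGE;
	}
	return TRY_RECLAIM;
}

/*
 * Flow as posted in V6, per the review comment: when the reserve
 * allocation is skipped or fails, execution continues towards reclaim.
 */
static enum outcome posted_flow(int high_priority, int nomemalloc,
				int reserve_alloc_ok)
{
	if (high_priority && !nomemalloc && reserve_alloc_ok)
		return GOT_PAGE;
	return TRY_RECLAIM;
}
```

The case the review is about is high_priority=1 with nomemalloc=1: the
posted flow reaches reclaim, while the revised flow bails out to nopage as
the old code did.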



> > >  	/* Atomic allocations - we can't balance anything */
> > >  	if (!wait)
> > >  		goto nopage;
> > >  
> > > -	cond_resched();
> > > -
> > > -	/* We now go into synchronous reclaim */
> > > -	cpuset_memory_pressure_bump();
> > > -
> > > -	p->flags |= PF_MEMALLOC;
> > > -
> > > -	lockdep_set_current_reclaim_state(gfp_mask);
> > > -	reclaim_state.reclaimed_slab = 0;
> > > -	p->reclaim_state = &reclaim_state;
> > > -
> > > -	did_some_progress = try_to_free_pages(zonelist, order,
> > > -						gfp_mask, nodemask);
> > > -
> > > -	p->reclaim_state = NULL;
> > > -	lockdep_clear_current_reclaim_state();
> > > -	p->flags &= ~PF_MEMALLOC;
> > > -
> > > -	cond_resched();
> > > +	/* Try direct reclaim and then allocating */
> > > +	page = __alloc_pages_direct_reclaim(gfp_mask, order,
> > > +					zonelist, high_zoneidx,
> > > +					nodemask,
> > > +					alloc_flags, &did_some_progress);
> > > +	if (page)
> > > +		goto got_pg;
> > >  
> > > -	if (order != 0)
> > > -		drain_all_pages();
> > > +	/*
> > > +	 * If we failed to make any progress reclaiming, then we are
> > > +	 * running out of options and have to consider going OOM
> > > +	 */
> > > +	if (!did_some_progress) {
> > > +		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > > +			page = __alloc_pages_may_oom(gfp_mask, order,
> > > +					zonelist, high_zoneidx,
> > > +					nodemask);
> > > +			if (page)
> > > +				goto got_pg;
> > 
> > The old code is here.
> > 
> > ------------------------------------------------------------------------
> >         } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >                 if (!try_set_zone_oom(zonelist, gfp_mask)) {
> >                         schedule_timeout_uninterruptible(1);
> >                         goto restart;
> >                 }
> > 
> >                 /*
> >                  * Go through the zonelist yet one more time, keep
> >                  * very high watermark here, this is only to catch
> >                  * a parallel oom killing, we must fail if we're still
> >                  * under heavy pressure.
> >                  */
> >                 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> >                         order, zonelist, high_zoneidx,
> >                         ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> >                 if (page) {
> >                         clear_zonelist_oom(zonelist, gfp_mask);
> >                         goto got_pg;
> >                 }
> > 
> >                 /* The OOM killer will not help higher order allocs so fail */
> >                 if (order > PAGE_ALLOC_COSTLY_ORDER) {
> >                         clear_zonelist_oom(zonelist, gfp_mask);
> >                         goto nopage;
> >                 }
> > 
> >                 out_of_memory(zonelist, gfp_mask, order);
> >                 clear_zonelist_oom(zonelist, gfp_mask);
> >                 goto restart;
> >         }
> > ------------------------------------------------------------------------
> > 
> > If get_page_from_freelist() returns NULL and order > PAGE_ALLOC_COSTLY_ORDER,
> > the old code jumps to nopage, but yours jumps to restart.
> > 
> 
> Good spot. The new section now looks like
> 
>                         page = __alloc_pages_may_oom(gfp_mask, order,
>                                         zonelist, high_zoneidx,
>                                         nodemask);
>                         if (page)
>                                 goto got_pg;
> 
>                         /*
>                          * The OOM killer does not trigger for high-order allocations
>                          * but if no progress is being made, there are no other
>                          * options and retrying is unlikely to help
>                          */
>                         if (order > PAGE_ALLOC_COSTLY_ORDER)
>                                 goto nopage;
> 
> Better?

very good.




> > > -	/*
> > > -	 * Don't let big-order allocations loop unless the caller explicitly
> > > -	 * requests that.  Wait for some write requests to complete then retry.
> > > -	 *
> > > -	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > > -	 * means __GFP_NOFAIL, but that may not be true in other
> > > -	 * implementations.
> > > -	 *
> > > -	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> > > -	 * specified, then we retry until we no longer reclaim any pages
> > > -	 * (above), or we've reclaimed an order of pages at least as
> > > -	 * large as the allocation's order. In both cases, if the
> > > -	 * allocation still fails, we stop retrying.
> > > -	 */
> > > +	/* Check if we should retry the allocation */
> > >  	pages_reclaimed += did_some_progress;
> > > -	do_retry = 0;
> > > -	if (!(gfp_mask & __GFP_NORETRY)) {
> > > -		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> > > -			do_retry = 1;
> > > -		} else {
> > > -			if (gfp_mask & __GFP_REPEAT &&
> > > -				pages_reclaimed < (1 << order))
> > > -					do_retry = 1;
> > > -		}
> > > -		if (gfp_mask & __GFP_NOFAIL)
> > > -			do_retry = 1;
> > > -	}
> > > -	if (do_retry) {
> > > +	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > > +		/* Wait for some write requests to complete then retry */
> > >  		congestion_wait(WRITE, HZ/50);
> > > -		goto rebalance;
> > > +		goto restart;
> > 
> > This changes rebalance to restart.
> > 
> 
> True, it makes more sense to me to goto restart at that point
> after waiting on IO to complete but it's a functional change and doesn't
> belong in this patch. I've fixed it up.
> 
> Very well spotted.

excellent.

thanks.
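
The retry policy described by the removed comment, which the patch moves
behind should_alloc_retry(), can be written as a small standalone function.
This is a sketch reconstructed from the deleted lines quoted above, not the
helper's actual body (which is not shown in this message). The SKETCH_ flag
values are made up for illustration; only PAGE_ALLOC_COSTLY_ORDER uses the
kernel's value of 3.

```c
/* Hypothetical flag bits; the kernel's __GFP_* values differ. */
#define SKETCH_GFP_NORETRY	0x1u
#define SKETCH_GFP_REPEAT	0x2u
#define SKETCH_GFP_NOFAIL	0x4u

#define PAGE_ALLOC_COSTLY_ORDER	3

/* Returns 1 if the allocation should be retried, 0 otherwise. */
static int sketch_should_alloc_retry(unsigned int gfp_mask,
				     unsigned int order,
				     unsigned long pages_reclaimed)
{
	int do_retry = 0;

	if (!(gfp_mask & SKETCH_GFP_NORETRY)) {
		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
			/* Low orders are implicitly retried */
			do_retry = 1;
		} else {
			/* Costly orders retry only until enough is reclaimed */
			if ((gfp_mask & SKETCH_GFP_REPEAT) &&
			    pages_reclaimed < (1UL << order))
				do_retry = 1;
		}
		if (gfp_mask & SKETCH_GFP_NOFAIL)
			do_retry = 1;
	}
	return do_retry;
}
```

Note that, as in the removed code, the NOFAIL test sits inside the
!NORETRY branch, so NORETRY wins when both are passed.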



^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 06/25] Move check for disabled anti-fragmentation out of fastpath
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (4 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 05/25] Break up the allocator entry point into fast and slow paths Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  6:37   ` KOSAKI Motohiro
  2009-04-20 22:19 ` [PATCH 07/25] Check in advance if the zonelist needs additional filtering Mel Gorman
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On low-memory systems, anti-fragmentation gets disabled as there is nothing
it can do and it would just incur overhead shuffling pages between lists
constantly. Currently the check is made in the free page fast path for every
page. This patch moves it to a slow path. On machines with low memory,
there will be a small amount of additional overhead as pages get shuffled
between lists but it should quickly settle.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/mmzone.h |    3 ---
 mm/page_alloc.c        |    4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..f82bdba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;
 
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	if (unlikely(page_group_by_mobility_disabled))
-		return MIGRATE_UNMOVABLE;
-
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13b4d11..c8465d0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -172,6 +172,10 @@ int page_group_by_mobility_disabled __read_mostly;
 
 static void set_pageblock_migratetype(struct page *page, int migratetype)
 {
+
+	if (unlikely(page_group_by_mobility_disabled))
+		migratetype = MIGRATE_UNMOVABLE;
+
 	set_pageblock_flags_group(page, (unsigned long)migratetype,
 					PB_migrate, PB_migrate_end);
 }
-- 
1.5.6.5
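
For readers of the archive, the effect of moving the check can be shown with
a toy userspace model. This is a sketch only: struct pageblock and the enum
values are hypothetical stand-ins (the kernel keeps the type in pageblock
flag bits, not a struct field), and the functions merely borrow the kernel
names.

```c
/* Hypothetical stand-ins for the kernel's migrate types. */
enum { MIGRATE_UNMOVABLE = 0, MIGRATE_RECLAIMABLE = 1, MIGRATE_MOVABLE = 2 };

/* Mirrors the kernel global of the same name; set on low-memory systems. */
static int page_group_by_mobility_disabled;

/* Toy pageblock holding only its migrate type. */
struct pageblock {
	int migratetype;
};

/* Slow path: clamp the type once, when the block is (re)labelled. */
static void set_pageblock_migratetype(struct pageblock *pb, int migratetype)
{
	if (page_group_by_mobility_disabled)
		migratetype = MIGRATE_UNMOVABLE;
	pb->migratetype = migratetype;
}

/* Fast path: a plain read with no branch on the global. */
static int get_pageblock_migratetype(const struct pageblock *pb)
{
	return pb->migratetype;
}
```

In this model a block labelled before the flag was set keeps its old type
until it is labelled again, which is one way to read the "quickly settle"
remark in the changelog.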


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 06/25] Move check for disabled anti-fragmentation out of fastpath
  2009-04-20 22:19 ` [PATCH 06/25] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
@ 2009-04-21  6:37   ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  6:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> @@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;
>  
>  static inline int get_pageblock_migratetype(struct page *page)
>  {
> -	if (unlikely(page_group_by_mobility_disabled))
> -		return MIGRATE_UNMOVABLE;
> -
>  	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 13b4d11..c8465d0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -172,6 +172,10 @@ int page_group_by_mobility_disabled __read_mostly;
>  
>  static void set_pageblock_migratetype(struct page *page, int migratetype)
>  {
> +
> +	if (unlikely(page_group_by_mobility_disabled))
> +		migratetype = MIGRATE_UNMOVABLE;
> +
>  	set_pageblock_flags_group(page, (unsigned long)migratetype,
>  					PB_migrate, PB_migrate_end);

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 07/25] Check in advance if the zonelist needs additional filtering
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (5 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 06/25] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  6:52   ` KOSAKI Motohiro
  2009-04-21  7:21   ` Pekka Enberg
  2009-04-20 22:19 ` [PATCH 08/25] Calculate the preferred zone for allocation only once Mel Gorman
                   ` (18 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Zonelists are normally filtered based on nodemasks for memory policies.
They can additionally be filtered on cpusets, if any exist, as well as
by noting when zones are full. These simple checks are expensive enough to
be noticed in profiles. This patch checks in advance if zonelist
filtering will ever be needed. If not, then the bulk of the checks are
skipped.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/cpuset.h |    2 ++
 mm/page_alloc.c        |   37 ++++++++++++++++++++++++++-----------
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a5740fc..978e2f1 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -97,6 +97,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 #else /* !CONFIG_CPUSETS */
 
+#define number_of_cpusets (0)
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c8465d0..3613ba4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1137,7 +1137,11 @@ failed:
 #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#ifdef CONFIG_CPUSETS
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#else
+#define ALLOC_CPUSET		0x00
+#endif /* CONFIG_CPUSETS */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -1401,6 +1405,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	int zonelist_filter = 0;
 
 	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
 							&preferred_zone);
@@ -1411,6 +1416,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	VM_BUG_ON(order >= MAX_ORDER);
 
+	/* Determine in advance if the zonelist needs filtering */
+	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
+		zonelist_filter = 1;
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -1418,12 +1427,16 @@ zonelist_scan:
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						high_zoneidx, nodemask) {
-		if (NUMA_BUILD && zlc_active &&
-			!zlc_zone_worth_trying(zonelist, z, allowednodes))
-				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
-				goto try_next_zone;
+
+		/* Ignore the additional zonelist filter checks if possible */
+		if (zonelist_filter) {
+			if (NUMA_BUILD && zlc_active &&
+				!zlc_zone_worth_trying(zonelist, z, allowednodes))
+					continue;
+			if ((alloc_flags & ALLOC_CPUSET) &&
+				!cpuset_zone_allowed_softwall(zone, gfp_mask))
+					goto try_next_zone;
+		}
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
@@ -1445,13 +1458,15 @@ zonelist_scan:
 		if (page)
 			break;
 this_zone_full:
-		if (NUMA_BUILD)
+		if (NUMA_BUILD && zonelist_filter)
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
-		if (NUMA_BUILD && !did_zlc_setup) {
-			/* we do zlc_setup after the first zone is tried */
-			allowednodes = zlc_setup(zonelist, alloc_flags);
-			zlc_active = 1;
+		if (NUMA_BUILD && zonelist_filter) {
+			if (!did_zlc_setup) {
+				/* do zlc_setup after the first zone is tried */
+				allowednodes = zlc_setup(zonelist, alloc_flags);
+				zlc_active = 1;
+			}
 			did_zlc_setup = 1;
 		}
 	}
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 07/25] Check in advance if the zonelist needs additional filtering
  2009-04-20 22:19 ` [PATCH 07/25] Check in advance if the zonelist needs additional filtering Mel Gorman
@ 2009-04-21  6:52   ` KOSAKI Motohiro
  2009-04-21  9:47     ` Mel Gorman
  2009-04-21  7:21   ` Pekka Enberg
  1 sibling, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  6:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> Zonelists are normally filtered based on nodemasks for memory policies.
> They can additionally be filtered on cpusets, if any exist, as well as
> by noting when zones are full. These simple checks are expensive enough to
> be noticed in profiles. This patch checks in advance if zonelist
> filtering will ever be needed. If not, then the bulk of the checks are
> skipped.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/cpuset.h |    2 ++
>  mm/page_alloc.c        |   37 ++++++++++++++++++++++++++-----------
>  2 files changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a5740fc..978e2f1 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -97,6 +97,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  
>  #else /* !CONFIG_CPUSETS */
>  
> +#define number_of_cpusets (0)
> +
>  static inline int cpuset_init(void) { return 0; }
>  static inline void cpuset_init_smp(void) {}
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c8465d0..3613ba4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1137,7 +1137,11 @@ failed:
>  #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
>  #define ALLOC_HARDER		0x10 /* try to alloc harder */
>  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> +#ifdef CONFIG_CPUSETS
>  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> +#else
> +#define ALLOC_CPUSET		0x00
> +#endif /* CONFIG_CPUSETS */
>  
>  #ifdef CONFIG_FAIL_PAGE_ALLOC
>  
> @@ -1401,6 +1405,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
>  	int zlc_active = 0;		/* set if using zonelist_cache */
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> +	int zonelist_filter = 0;
>  
>  	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
>  							&preferred_zone);
> @@ -1411,6 +1416,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  
>  	VM_BUG_ON(order >= MAX_ORDER);
>  
> +	/* Determine in advance if the zonelist needs filtering */
> +	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
> +		zonelist_filter = 1;
> +
>  zonelist_scan:
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> @@ -1418,12 +1427,16 @@ zonelist_scan:
>  	 */
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  						high_zoneidx, nodemask) {
> -		if (NUMA_BUILD && zlc_active &&
> -			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> -				continue;
> -		if ((alloc_flags & ALLOC_CPUSET) &&
> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> -				goto try_next_zone;
> +
> +		/* Ignore the additional zonelist filter checks if possible */
> +		if (zonelist_filter) {
> +			if (NUMA_BUILD && zlc_active &&
> +				!zlc_zone_worth_trying(zonelist, z, allowednodes))
> +					continue;
> +			if ((alloc_flags & ALLOC_CPUSET) &&
> +				!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +					goto try_next_zone;
> +		}

If number_of_cpusets == 1, the old code calls zlc_zone_worth_trying(), but
yours never does. This looks like a regression.


>  
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;
> @@ -1445,13 +1458,15 @@ zonelist_scan:
>  		if (page)
>  			break;
>  this_zone_full:
> -		if (NUMA_BUILD)
> +		if (NUMA_BUILD && zonelist_filter)
>  			zlc_mark_zone_full(zonelist, z);
>  try_next_zone:
> -		if (NUMA_BUILD && !did_zlc_setup) {
> -			/* we do zlc_setup after the first zone is tried */
> -			allowednodes = zlc_setup(zonelist, alloc_flags);
> -			zlc_active = 1;
> +		if (NUMA_BUILD && zonelist_filter) {
> +			if (!did_zlc_setup) {
> +				/* do zlc_setup after the first zone is tried */
> +				allowednodes = zlc_setup(zonelist, alloc_flags);
> +				zlc_active = 1;
> +			}
>  			did_zlc_setup = 1;
>  		}
>  	}
> -- 
> 1.5.6.5
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 07/25] Check in advance if the zonelist needs additional filtering
  2009-04-21  6:52   ` KOSAKI Motohiro
@ 2009-04-21  9:47     ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  9:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 03:52:48PM +0900, KOSAKI Motohiro wrote:
> > Zonelists are normally filtered based on nodemasks for memory policies.
> > They can additionally be filtered on cpusets, if any exist, as well as
> > by noting when zones are full. These simple checks are expensive enough to
> > be noticed in profiles. This patch checks in advance if zonelist
> > filtering will ever be needed. If not, then the bulk of the checks are
> > skipped.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/cpuset.h |    2 ++
> >  mm/page_alloc.c        |   37 ++++++++++++++++++++++++++-----------
> >  2 files changed, 28 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index a5740fc..978e2f1 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -97,6 +97,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >  
> >  #else /* !CONFIG_CPUSETS */
> >  
> > +#define number_of_cpusets (0)
> > +
> >  static inline int cpuset_init(void) { return 0; }
> >  static inline void cpuset_init_smp(void) {}
> >  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c8465d0..3613ba4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1137,7 +1137,11 @@ failed:
> >  #define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> > +#ifdef CONFIG_CPUSETS
> >  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> > +#else
> > +#define ALLOC_CPUSET		0x00
> > +#endif /* CONFIG_CPUSETS */
> >  
> >  #ifdef CONFIG_FAIL_PAGE_ALLOC
> >  
> > @@ -1401,6 +1405,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> >  	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
> >  	int zlc_active = 0;		/* set if using zonelist_cache */
> >  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> > +	int zonelist_filter = 0;
> >  
> >  	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
> >  							&preferred_zone);
> > @@ -1411,6 +1416,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> >  
> >  	VM_BUG_ON(order >= MAX_ORDER);
> >  
> > +	/* Determine in advance if the zonelist needs filtering */
> > +	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
> > +		zonelist_filter = 1;
> > +
> >  zonelist_scan:
> >  	/*
> >  	 * Scan zonelist, looking for a zone with enough free.
> > @@ -1418,12 +1427,16 @@ zonelist_scan:
> >  	 */
> >  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >  						high_zoneidx, nodemask) {
> > -		if (NUMA_BUILD && zlc_active &&
> > -			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> > -				continue;
> > -		if ((alloc_flags & ALLOC_CPUSET) &&
> > -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -				goto try_next_zone;
> > +
> > +		/* Ignore the additional zonelist filter checks if possible */
> > +		if (zonelist_filter) {
> > +			if (NUMA_BUILD && zlc_active &&
> > +				!zlc_zone_worth_trying(zonelist, z, allowednodes))
> > +					continue;
> > +			if ((alloc_flags & ALLOC_CPUSET) &&
> > +				!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > +					goto try_next_zone;
> > +		}
> 
> If number_of_cpusets == 1, the old code calls zlc_zone_worth_trying(), but
> yours never does. This looks like a regression.
> 

True, but once fixed, the patch becomes a lot less useful. The intention was
to avoid the zlc_setup() function which is pretty heavy and hits on HIGHMEM
machines quite easily but I did it wrong. I should have made the
decision to only call zlc_setup() when there were online NUMA nodes to
care about.

I'll drop this patch and try again.
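
For readers of the archive, the regression conceded here can be reduced to
a few lines. This is a sketch under assumptions: the helper names are
hypothetical, "uses the zonelist cache" means "would call zlc_setup() once
the first zone has been tried", and ALLOC_CPUSET takes the value from the
CONFIG_CPUSETS branch of the patch.

```c
#define ALLOC_CPUSET	0x40	/* value with CONFIG_CPUSETS, per the patch */

/* The precomputed flag introduced by the patch. */
static int zonelist_filter(int alloc_flags, int number_of_cpusets)
{
	return (alloc_flags & ALLOC_CPUSET) && number_of_cpusets > 1;
}

/* Before the patch: any NUMA build sets up the zonelist cache. */
static int old_uses_zlc(int numa_build)
{
	return numa_build;
}

/* With the patch: the cache is only set up when the filter flag is set. */
static int new_uses_zlc(int numa_build, int alloc_flags,
			int number_of_cpusets)
{
	return numa_build && zonelist_filter(alloc_flags, number_of_cpusets);
}
```

With a single cpuset on a NUMA build, old_uses_zlc() is true and
new_uses_zlc() is false, which is the behaviour change pointed out in the
review.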

> 
> >  
> >  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> >  			unsigned long mark;
> > @@ -1445,13 +1458,15 @@ zonelist_scan:
> >  		if (page)
> >  			break;
> >  this_zone_full:
> > -		if (NUMA_BUILD)
> > +		if (NUMA_BUILD && zonelist_filter)
> >  			zlc_mark_zone_full(zonelist, z);
> >  try_next_zone:
> > -		if (NUMA_BUILD && !did_zlc_setup) {
> > -			/* we do zlc_setup after the first zone is tried */
> > -			allowednodes = zlc_setup(zonelist, alloc_flags);
> > -			zlc_active = 1;
> > +		if (NUMA_BUILD && zonelist_filter) {
> > +			if (!did_zlc_setup) {
> > +				/* do zlc_setup after the first zone is tried */
> > +				allowednodes = zlc_setup(zonelist, alloc_flags);
> > +				zlc_active = 1;
> > +			}
> >  			did_zlc_setup = 1;
> >  		}
> >  	}
> > -- 
> > 1.5.6.5
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 07/25] Check in advance if the zonelist needs additional filtering
  2009-04-20 22:19 ` [PATCH 07/25] Check in advance if the zonelist needs additional filtering Mel Gorman
  2009-04-21  6:52   ` KOSAKI Motohiro
@ 2009-04-21  7:21   ` Pekka Enberg
  2009-04-21  9:49     ` Mel Gorman
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Hi Mel,

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> Zonelists are normally filtered based on nodemasks for memory policies.
> They can additionally be filtered on cpusets, if any exist, as well as
> by noting when zones are full. These simple checks are expensive enough to
> be noticed in profiles. This patch checks in advance if zonelist
> filtering will ever be needed. If not, then the bulk of the checks are
> skipped.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> @@ -1401,6 +1405,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
>  	int zlc_active = 0;		/* set if using zonelist_cache */
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> +	int zonelist_filter = 0;
>  
>  	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
>  							&preferred_zone);
> @@ -1411,6 +1416,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  
>  	VM_BUG_ON(order >= MAX_ORDER);
>  
> +	/* Determine in advance if the zonelist needs filtering */
> +	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
> +		zonelist_filter = 1;
> +
>  zonelist_scan:
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> @@ -1418,12 +1427,16 @@ zonelist_scan:
>  	 */
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  						high_zoneidx, nodemask) {
> -		if (NUMA_BUILD && zlc_active &&
> -			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> -				continue;
> -		if ((alloc_flags & ALLOC_CPUSET) &&
> -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> -				goto try_next_zone;
> +
> +		/* Ignore the additional zonelist filter checks if possible */
> +		if (zonelist_filter) {
> +			if (NUMA_BUILD && zlc_active &&
> +				!zlc_zone_worth_trying(zonelist, z, allowednodes))
> +					continue;
> +			if ((alloc_flags & ALLOC_CPUSET) &&

The above expression is always true here because of the earlier
zonelist_filter check, no?
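
The redundancy can be checked with a small userspace model. This is only a
sketch under stated assumptions: the ALLOC_CPUSET value is a placeholder,
number_of_cpusets is passed as a plain parameter, and the helper names
compute_zonelist_filter() and inner_test_is_redundant() are invented for
illustration; none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder value for illustration; not the kernel's definition. */
#define ALLOC_CPUSET 0x40

/* Mirrors the hoisted test that the patch adds before the zonelist scan. */
static bool compute_zonelist_filter(int alloc_flags, int number_of_cpusets)
{
	return (alloc_flags & ALLOC_CPUSET) && number_of_cpusets > 1;
}

/*
 * Returns true if, for every combination tried, a set zonelist_filter
 * implies that ALLOC_CPUSET is set, i.e. re-testing the flag inside the
 * "if (zonelist_filter)" block can never evaluate to false.
 */
static bool inner_test_is_redundant(void)
{
	for (int flags = 0; flags <= 0xff; flags++) {
		for (int ncpusets = 0; ncpusets <= 4; ncpusets++) {
			if (compute_zonelist_filter(flags, ncpusets) &&
			    !(flags & ALLOC_CPUSET))
				return false;
		}
	}
	return true;
}
```

Because zonelist_filter is only set when ALLOC_CPUSET is set, the inner
re-test adds no information in this model.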

> +				!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +					goto try_next_zone;
> +		}
>  
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 07/25] Check in advance if the zonelist needs additional filtering
  2009-04-21  7:21   ` Pekka Enberg
@ 2009-04-21  9:49     ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  9:49 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:21:12AM +0300, Pekka Enberg wrote:
> Hi Mel,
> 
> On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> > Zonelists are normally filtered based on nodemasks for memory policies.
> > They can additionally be filtered on cpusets, if they exist, as well as
> > by noting when zones are full. These simple checks are expensive enough
> > to be noticed in profiles. This patch checks in advance if zonelist
> > filtering will ever be needed. If not, then the bulk of the checks are
> > skipped.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > @@ -1401,6 +1405,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> >  	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
> >  	int zlc_active = 0;		/* set if using zonelist_cache */
> >  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
> > +	int zonelist_filter = 0;
> >  
> >  	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
> >  							&preferred_zone);
> > @@ -1411,6 +1416,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> >  
> >  	VM_BUG_ON(order >= MAX_ORDER);
> >  
> > +	/* Determine in advance if the zonelist needs filtering */
> > +	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
> > +		zonelist_filter = 1;
> > +
> >  zonelist_scan:
> >  	/*
> >  	 * Scan zonelist, looking for a zone with enough free.
> > @@ -1418,12 +1427,16 @@ zonelist_scan:
> >  	 */
> >  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >  						high_zoneidx, nodemask) {
> > -		if (NUMA_BUILD && zlc_active &&
> > -			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> > -				continue;
> > -		if ((alloc_flags & ALLOC_CPUSET) &&
> > -			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -				goto try_next_zone;
> > +
> > +		/* Ignore the additional zonelist filter checks if possible */
> > +		if (zonelist_filter) {
> > +			if (NUMA_BUILD && zlc_active &&
> > +				!zlc_zone_worth_trying(zonelist, z, allowednodes))
> > +					continue;
> > +			if ((alloc_flags & ALLOC_CPUSET) &&
> 
> The above expression is always true here because of the earlier
> zonelist_filter check, no?
> 

Yeah, silly. I've dropped the patch altogether though because it was
avoiding zonelist filtering for the wrong reasons.

> > +				!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > +					goto try_next_zone;
> > +		}
> >  
> >  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> >  			unsigned long mark;
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 08/25] Calculate the preferred zone for allocation only once
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (6 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 07/25] Check in advance if the zonelist needs additional filtering Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  7:03   ` KOSAKI Motohiro
  2009-04-21  7:37   ` Pekka Enberg
  2009-04-20 22:19 ` [PATCH 09/25] Calculate the migratetype " Mel Gorman
                   ` (17 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable
zone in the zonelist. This patch calculates preferred_zone once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   53 ++++++++++++++++++++++++++++++++---------------------
 1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3613ba4..b27bcde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1396,24 +1396,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
-		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+		struct zone *preferred_zone)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
-	struct zone *zone, *preferred_zone;
+	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	int zonelist_filter = 0;
 
-	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
-							&preferred_zone);
-	if (!preferred_zone)
-		return NULL;
-
 	classzone_idx = zone_idx(preferred_zone);
-
 	VM_BUG_ON(order >= MAX_ORDER);
 
 	/* Determine in advance if the zonelist needs filtering */
@@ -1518,7 +1513,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
@@ -1535,7 +1530,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
-		ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+		preferred_zone);
 	if (page)
 		goto out;
 
@@ -1555,7 +1551,8 @@ out:
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1587,7 +1584,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx, alloc_flags);
+					zonelist, high_zoneidx,
+					alloc_flags, preferred_zone);
 	return page;
 }
 
@@ -1608,13 +1606,14 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
 static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+			preferred_zone);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1637,7 +1636,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask)
+	nodemask_t *nodemask, struct zone *preferred_zone)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1687,14 +1686,15 @@ restart:
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags);
+						high_zoneidx, alloc_flags,
+						preferred_zone);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
 	if (is_allocation_high_priority(p, gfp_mask))
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask);
+			zonelist, high_zoneidx, nodemask, preferred_zone);
 	if (page)
 		goto got_pg;
 
@@ -1706,7 +1706,8 @@ restart:
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask,
-					alloc_flags, &did_some_progress);
+					alloc_flags, preferred_zone,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1718,7 +1719,7 @@ restart:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask);
+					nodemask, preferred_zone);
 			if (page)
 				goto got_pg;
 
@@ -1755,6 +1756,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct zone *preferred_zone;
 	struct page *page;
 
 	lockdep_trace_alloc(gfp_mask);
@@ -1772,11 +1774,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
+	/* The preferred zone is used for statistics later */
+	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+							&preferred_zone);
+	if (!preferred_zone)
+		return NULL;
+
+	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+			preferred_zone);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
-				zonelist, high_zoneidx, nodemask);
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone);
 
 	return page;
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 08/25] Calculate the preferred zone for allocation only once
  2009-04-20 22:19 ` [PATCH 08/25] Calculate the preferred zone for allocation only once Mel Gorman
@ 2009-04-21  7:03   ` KOSAKI Motohiro
  2009-04-21  8:23     ` Mel Gorman
  2009-04-21  7:37   ` Pekka Enberg
  1 sibling, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  7:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> get_page_from_freelist() can be called multiple times for an allocation.
> Part of this calculates the preferred_zone which is the first usable
> zone in the zonelist. This patch calculates preferred_zone once.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

I'm not sure whether this patch improves performance much,
but I can't find any bug.


	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 08/25] Calculate the preferred zone for allocation only once
  2009-04-21  7:03   ` KOSAKI Motohiro
@ 2009-04-21  8:23     ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 04:03:59PM +0900, KOSAKI Motohiro wrote:
> > get_page_from_freelist() can be called multiple times for an allocation.
> > Part of this calculates the preferred_zone which is the first usable
> > zone in the zonelist. This patch calculates preferred_zone once.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> I'm not sure whether this patch improves performance much,
> but I can't find any bug.
> 

It's pretty small. In most cases, the preferred zone is going to be the
first one but in cases where it's not, this avoids walking the beginning
of the zonelist multiple times. How much time is saved depends on the
number of times get_page_from_freelist() is called. It would be twice
for most of the system's early lifetime, as the slower paths are not
entered, but the repeated walk costs more when the system is lower on
memory.
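
The saving described above can be sketched in a userspace model. This is a
rough illustration, not the kernel code: struct zone is reduced to a single
index, first_usable_zone() is a hypothetical stand-in for
first_zones_zonelist(), and the walk counter exists only to make the
difference visible.

```c
#include <assert.h>
#include <stddef.h>

struct zone {
	int idx;	/* zone type index; lower means more restrictive */
};

/* Counts how many times the start of the zonelist is walked. */
static int zonelist_walks;

/* Return the first zone whose index is allowed, or NULL if none is. */
static struct zone *first_usable_zone(struct zone *zones, size_t n,
				      int high_zoneidx)
{
	zonelist_walks++;
	for (size_t i = 0; i < n; i++)
		if (zones[i].idx <= high_zoneidx)
			return &zones[i];
	return NULL;
}

/* Before: every allocation attempt looks up the preferred zone again. */
static int attempts_recomputing(struct zone *zones, size_t n,
				int high_zoneidx, int attempts)
{
	zonelist_walks = 0;
	for (int i = 0; i < attempts; i++)
		(void)first_usable_zone(zones, n, high_zoneidx);
	return zonelist_walks;
}

/* After: the caller looks it up once and every attempt reuses it. */
static int attempts_precomputed(struct zone *zones, size_t n,
				int high_zoneidx, int attempts)
{
	zonelist_walks = 0;
	struct zone *preferred = first_usable_zone(zones, n, high_zoneidx);

	for (int i = 0; i < attempts && preferred; i++)
		(void)preferred->idx;	/* stand-in for an allocation attempt */
	return zonelist_walks;
}
```

In this model the number of walks drops from one per attempt to one per
allocation, so the benefit scales with how often the slower paths retry.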

> 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 

Thanks very much for these reviews.

> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 08/25] Calculate the preferred zone for allocation only once
  2009-04-20 22:19 ` [PATCH 08/25] Calculate the preferred zone for allocation only once Mel Gorman
  2009-04-21  7:03   ` KOSAKI Motohiro
@ 2009-04-21  7:37   ` Pekka Enberg
  2009-04-21  8:27     ` Mel Gorman
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Hi Mel,

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> get_page_from_freelist() can be called multiple times for an allocation.
> Part of this calculates the preferred_zone which is the first usable
> zone in the zonelist. This patch calculates preferred_zone once.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

> @@ -1772,11 +1774,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	if (unlikely(!zonelist->_zonerefs->zone))
>  		return NULL;
>  
> +	/* The preferred zone is used for statistics later */
> +	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
> +							&preferred_zone);
> +	if (!preferred_zone)
> +		return NULL;

You might want to add an explanation to the changelog why this change is
safe. It looked like a functional change at first glance and it was
pretty difficult to convince myself that __alloc_pages_slowpath() will
always return NULL when there's no preferred zone because of the other
cleanups in this patch series.

> +
> +	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> -			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> +			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> +			preferred_zone);
>  	if (unlikely(!page))
>  		page = __alloc_pages_slowpath(gfp_mask, order,
> -				zonelist, high_zoneidx, nodemask);
> +				zonelist, high_zoneidx, nodemask,
> +				preferred_zone);
>  
>  	return page;
>  }


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 08/25] Calculate the preferred zone for allocation only once
  2009-04-21  7:37   ` Pekka Enberg
@ 2009-04-21  8:27     ` Mel Gorman
  2009-04-21  8:29       ` Pekka Enberg
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:27 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:37:37AM +0300, Pekka Enberg wrote:
> Hi Mel,
> 
> On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> > get_page_from_freelist() can be called multiple times for an allocation.
> > Part of this calculates the preferred_zone which is the first usable
> > zone in the zonelist. This patch calculates preferred_zone once.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> 

Thanks

> > @@ -1772,11 +1774,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >  	if (unlikely(!zonelist->_zonerefs->zone))
> >  		return NULL;
> >  
> > +	/* The preferred zone is used for statistics later */
> > +	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
> > +							&preferred_zone);
> > +	if (!preferred_zone)
> > +		return NULL;
> 
> You might want to add an explanation to the changelog why this change is
> safe. It looked like a functional change at first glance and it was
> pretty difficult to convince myself that __alloc_pages_slowpath() will
> always return NULL when there's no preferred zone because of the other
> cleanups in this patch series.
> 

Is this better?

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable zone in
the zonelist but the zone depends on the GFP flags specified at the beginning
of the allocation call. This patch calculates preferred_zone once. It's safe
to do this because if preferred_zone is NULL at the start of the call, no
amount of direct reclaim or other actions will change the fact the allocation
will fail.
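
The safety argument can be illustrated with a small userspace model. It is a
sketch under assumptions: struct zone is cut down to three fields,
first_allowed_zone() is a hypothetical stand-in for first_zones_zonelist(),
and fake_reclaim() merely bumps a free-page counter. The point it shows is
that the lookup reads only the zone index limit and the node mask, so freeing
memory cannot turn a NULL result into a usable zone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct zone {
	int idx;		/* zone type index */
	int node;		/* NUMA node the zone belongs to */
	long free_pages;	/* changes as memory is reclaimed */
};

/*
 * Hypothetical stand-in for first_zones_zonelist(): the result depends
 * only on high_zoneidx and nodemask, never on free_pages.
 */
static struct zone *first_allowed_zone(struct zone *zones, size_t n,
				       int high_zoneidx,
				       unsigned long nodemask)
{
	for (size_t i = 0; i < n; i++) {
		assert(zones[i].node >= 0 && zones[i].node < 32);
		if (zones[i].idx <= high_zoneidx &&
		    (nodemask & (1UL << zones[i].node)))
			return &zones[i];
	}
	return NULL;
}

/* Stand-in for direct reclaim: frees pages but cannot alter idx or node. */
static void fake_reclaim(struct zone *zones, size_t n)
{
	for (size_t i = 0; i < n; i++)
		zones[i].free_pages += 1024;
}

/* True if the lookup is NULL both before and after reclaim. */
static bool still_null_after_reclaim(struct zone *zones, size_t n,
				     int high_zoneidx,
				     unsigned long nodemask)
{
	if (first_allowed_zone(zones, n, high_zoneidx, nodemask))
		return false;
	fake_reclaim(zones, n);
	return first_allowed_zone(zones, n, high_zoneidx, nodemask) == NULL;
}
```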

> > +
> > +	/* First allocation attempt */
> >  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> > -			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> > +			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> > +			preferred_zone);
> >  	if (unlikely(!page))
> >  		page = __alloc_pages_slowpath(gfp_mask, order,
> > -				zonelist, high_zoneidx, nodemask);
> > +				zonelist, high_zoneidx, nodemask,
> > +				preferred_zone);
> >  
> >  	return page;
> >  }
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 08/25] Calculate the preferred zone for allocation only once
  2009-04-21  8:27     ` Mel Gorman
@ 2009-04-21  8:29       ` Pekka Enberg
  0 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  8:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, 2009-04-21 at 09:27 +0100, Mel Gorman wrote:
> > You might want to add an explanation to the changelog why this change is
> > safe. It looked like a functional change at first glance and it was
> > pretty difficult to convince myself that __alloc_pages_slowpath() will
> > always return NULL when there's no preferred zone because of the other
> > cleanups in this patch series.
> > 
> 
> Is this better?
> 
> get_page_from_freelist() can be called multiple times for an allocation.
> Part of this calculates the preferred_zone which is the first usable zone in
> the zonelist but the zone depends on the GFP flags specified at the beginning
> of the allocation call. This patch calculates preferred_zone once. It's safe
> to do this because if preferred_zone is NULL at the start of the call, no
> amount of direct reclaim or other actions will change the fact the allocation
> will fail.

Perfect!


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 09/25] Calculate the migratetype for allocation only once
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (7 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 08/25] Calculate the preferred zone for allocation only once Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  7:37   ` KOSAKI Motohiro
  2009-04-20 22:19 ` [PATCH 10/25] Calculate the alloc_flags " Mel Gorman
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

GFP mask is converted into a migratetype when deciding which pagelist to
take a page from. However, it is happening multiple times per
allocation, at least once per zone traversed. Calculate it once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
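
The idea of the change can be sketched outside the kernel as follows. This is
only an illustrative model: the flag bits are placeholders rather than the
kernel's values, to_migratetype() copies just the bit-packing step of
allocflags_to_migratetype(), and the conversion counter is added purely so
the two variants can be compared.

```c
#include <assert.h>

/* Placeholder flag bits for illustration; not the kernel's values. */
#define GFP_RECLAIMABLE_BIT 0x1u
#define GFP_MOVABLE_BIT     0x2u

static int conversions;	/* how many times the mask was converted */

/* Same packing as the kernel helper: movable in bit 1, reclaimable in bit 0. */
static int to_migratetype(unsigned int gfp_flags)
{
	conversions++;
	return (((gfp_flags & GFP_MOVABLE_BIT) != 0) << 1) |
		((gfp_flags & GFP_RECLAIMABLE_BIT) != 0);
}

/* Before: the per-zone helper converts the mask itself on every call. */
static int scan_converting_per_zone(unsigned int gfp_flags, int nr_zones)
{
	conversions = 0;
	for (int z = 0; z < nr_zones; z++)
		(void)to_migratetype(gfp_flags);
	return conversions;
}

/* After: convert once up front and pass the result down to each zone. */
static int scan_converting_once(unsigned int gfp_flags, int nr_zones)
{
	conversions = 0;
	int migratetype = to_migratetype(gfp_flags);

	for (int z = 0; z < nr_zones; z++)
		(void)migratetype;	/* stand-in for the per-zone call */
	return conversions;
}
```

Whether this is measurable in practice depends on how cheap the conversion
is and on what the compiler inlines; the model only counts calls.
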
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b27bcde..f960cf5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
  * or two.
  */
 static struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags)
+			struct zone *zone, int order, gfp_t gfp_flags,
+			int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
-	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
 	cpu  = get_cpu();
@@ -1397,7 +1397,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone)
+		struct zone *preferred_zone, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1449,7 +1449,8 @@ zonelist_scan:
 			}
 		}
 
-		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
+		page = buffered_rmqueue(preferred_zone, zone, order,
+						gfp_mask, migratetype);
 		if (page)
 			break;
 this_zone_full:
@@ -1513,7 +1514,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
@@ -1531,7 +1533,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone);
+		preferred_zone, migratetype);
 	if (page)
 		goto out;
 
@@ -1552,7 +1554,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	unsigned long *did_some_progress)
+	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1585,7 +1587,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (likely(*did_some_progress))
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-					alloc_flags, preferred_zone);
+					alloc_flags, preferred_zone,
+					migratetype);
 	return page;
 }
 
@@ -1606,14 +1609,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
 static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone);
+			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1636,7 +1640,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone)
+	nodemask_t *nodemask, struct zone *preferred_zone,
+	int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1687,14 +1692,16 @@ restart:
 	 */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags,
-						preferred_zone);
+						preferred_zone,
+						migratetype);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
 	if (is_allocation_high_priority(p, gfp_mask))
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask, preferred_zone);
+			zonelist, high_zoneidx, nodemask, preferred_zone,
+			migratetype);
 	if (page)
 		goto got_pg;
 
@@ -1707,7 +1714,7 @@ restart:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					&did_some_progress);
+					migratetype, &did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1719,7 +1726,8 @@ restart:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
-					nodemask, preferred_zone);
+					nodemask, preferred_zone,
+					migratetype);
 			if (page)
 				goto got_pg;
 
@@ -1758,6 +1766,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
 	struct page *page;
+	int migratetype = allocflags_to_migratetype(gfp_mask);
 
 	lockdep_trace_alloc(gfp_mask);
 
@@ -1783,11 +1792,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone);
+			preferred_zone, migratetype);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone);
+				preferred_zone, migratetype);
 
 	return page;
 }
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 09/25] Calculate the migratetype for allocation only once
  2009-04-20 22:19 ` [PATCH 09/25] Calculate the migratetype " Mel Gorman
@ 2009-04-21  7:37   ` KOSAKI Motohiro
  2009-04-21  8:35     ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  7:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> GFP mask is converted into a migratetype when deciding which pagelist to
> take a page from. However, it is happening multiple times per
> allocation, at least once per zone traversed. Calculate it once.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
>  1 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b27bcde..f960cf5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
>   * or two.
>   */
>  static struct page *buffered_rmqueue(struct zone *preferred_zone,
> -			struct zone *zone, int order, gfp_t gfp_flags)
> +			struct zone *zone, int order, gfp_t gfp_flags,
> +			int migratetype)
>  {
>  	unsigned long flags;
>  	struct page *page;
>  	int cold = !!(gfp_flags & __GFP_COLD);
>  	int cpu;
> -	int migratetype = allocflags_to_migratetype(gfp_flags);

hmmm....

allocflags_to_migratetype() is a very cheap function, and buffered_rmqueue()
and other non-inline static functions aren't guaranteed to be inlined.

I don't think this patch improves performance on x86.
Also, I have one comment on allocflags_to_migratetype().

-------------------------------------------------------------------
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
        WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);

        if (unlikely(page_group_by_mobility_disabled))
                return MIGRATE_UNMOVABLE;

        /* Group based on mobility */
        return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
                ((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
-------------------------------------------------------------------

Would s/WARN_ON/VM_BUG_ON/ be better?

With both GFP_MOVABLE_MASK bits set the function returns 3, and 3 means
MIGRATE_RESERVE. That seems to be an obvious bug.
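
The arithmetic behind this can be worked through in a userspace model. It is
a sketch under assumptions: the flag bits are placeholders rather than the
kernel's values, the enum values 0 to 2 are inferred from the bit packing in
the function quoted above (only 3 for MIGRATE_RESERVE is stated in this
mail), and model_allocflags_to_migratetype() omits the WARN_ON and the
page_group_by_mobility_disabled check.

```c
#include <assert.h>

/* Placeholder flag bits for illustration; not the kernel's values. */
#define GFP_RECLAIMABLE  0x1u
#define GFP_MOVABLE      0x2u
#define GFP_MOVABLE_MASK (GFP_RECLAIMABLE | GFP_MOVABLE)

/* Assumed numbering, matching the packing used by the function. */
enum {
	MIGRATE_UNMOVABLE   = 0,
	MIGRATE_RECLAIMABLE = 1,
	MIGRATE_MOVABLE     = 2,
	MIGRATE_RESERVE     = 3
};

/* Only the "group based on mobility" step of the quoted function. */
static int model_allocflags_to_migratetype(unsigned int gfp_flags)
{
	return (((gfp_flags & GFP_MOVABLE) != 0) << 1) |
		((gfp_flags & GFP_RECLAIMABLE) != 0);
}
```

With both bits set the packing gives (1 << 1) | 1 = 3, which in this
numbering is the reserve type rather than a mobility type a caller can
legitimately ask for.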

>  
>  again:
>  	cpu  = get_cpu();
> @@ -1397,7 +1397,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
>  static struct page *
>  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> -		struct zone *preferred_zone)
> +		struct zone *preferred_zone, int migratetype)
>  {
>  	struct zoneref *z;
>  	struct page *page = NULL;
> @@ -1449,7 +1449,8 @@ zonelist_scan:
>  			}
>  		}
>  
> -		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
> +		page = buffered_rmqueue(preferred_zone, zone, order,
> +						gfp_mask, migratetype);
>  		if (page)
>  			break;
>  this_zone_full:
> @@ -1513,7 +1514,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, struct zone *preferred_zone)
> +	nodemask_t *nodemask, struct zone *preferred_zone,
> +	int migratetype)
>  {
>  	struct page *page;
>  
> @@ -1531,7 +1533,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
>  		order, zonelist, high_zoneidx,
>  		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
> -		preferred_zone);
> +		preferred_zone, migratetype);
>  	if (page)
>  		goto out;
>  
> @@ -1552,7 +1554,7 @@ static inline struct page *
>  __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
>  	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> -	unsigned long *did_some_progress)
> +	int migratetype, unsigned long *did_some_progress)
>  {
>  	struct page *page = NULL;
>  	struct reclaim_state reclaim_state;
> @@ -1585,7 +1587,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	if (likely(*did_some_progress))
>  		page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
> -					alloc_flags, preferred_zone);
> +					alloc_flags, preferred_zone,
> +					migratetype);
>  	return page;
>  }
>  
> @@ -1606,14 +1609,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
>  static inline struct page *
>  __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, struct zone *preferred_zone)
> +	nodemask_t *nodemask, struct zone *preferred_zone,
> +	int migratetype)
>  {
>  	struct page *page;
>  
>  	do {
>  		page = get_page_from_freelist(gfp_mask, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> -			preferred_zone);
> +			preferred_zone, migratetype);
>  
>  		if (!page && gfp_mask & __GFP_NOFAIL)
>  			congestion_wait(WRITE, HZ/50);
> @@ -1636,7 +1640,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, struct zone *preferred_zone)
> +	nodemask_t *nodemask, struct zone *preferred_zone,
> +	int migratetype)
>  {
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
>  	struct page *page = NULL;
> @@ -1687,14 +1692,16 @@ restart:
>  	 */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>  						high_zoneidx, alloc_flags,
> -						preferred_zone);
> +						preferred_zone,
> +						migratetype);
>  	if (page)
>  		goto got_pg;
>  
>  	/* Allocate without watermarks if the context allows */
>  	if (is_allocation_high_priority(p, gfp_mask))
>  		page = __alloc_pages_high_priority(gfp_mask, order,
> -			zonelist, high_zoneidx, nodemask, preferred_zone);
> +			zonelist, high_zoneidx, nodemask, preferred_zone,
> +			migratetype);
>  	if (page)
>  		goto got_pg;
>  
> @@ -1707,7 +1714,7 @@ restart:
>  					zonelist, high_zoneidx,
>  					nodemask,
>  					alloc_flags, preferred_zone,
> -					&did_some_progress);
> +					migratetype, &did_some_progress);
>  	if (page)
>  		goto got_pg;
>  
> @@ -1719,7 +1726,8 @@ restart:
>  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>  			page = __alloc_pages_may_oom(gfp_mask, order,
>  					zonelist, high_zoneidx,
> -					nodemask, preferred_zone);
> +					nodemask, preferred_zone,
> +					migratetype);
>  			if (page)
>  				goto got_pg;
>  
> @@ -1758,6 +1766,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>  	struct zone *preferred_zone;
>  	struct page *page;
> +	int migratetype = allocflags_to_migratetype(gfp_mask);
>  
>  	lockdep_trace_alloc(gfp_mask);
>  
> @@ -1783,11 +1792,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> -			preferred_zone);
> +			preferred_zone, migratetype);
>  	if (unlikely(!page))
>  		page = __alloc_pages_slowpath(gfp_mask, order,
>  				zonelist, high_zoneidx, nodemask,
> -				preferred_zone);
> +				preferred_zone, migratetype);
>  
>  	return page;
>  }
> -- 
> 1.5.6.5
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 09/25] Calculate the migratetype for allocation only once
  2009-04-21  7:37   ` KOSAKI Motohiro
@ 2009-04-21  8:35     ` Mel Gorman
  2009-04-21 10:19       ` KOSAKI Motohiro
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 04:37:28PM +0900, KOSAKI Motohiro wrote:
> > GFP mask is converted into a migratetype when deciding which pagelist to
> > take a page from. However, it is happening multiple times per
> > allocation, at least once per zone traversed. Calculate it once.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
> >  1 files changed, 26 insertions(+), 17 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b27bcde..f960cf5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
> >   * or two.
> >   */
> >  static struct page *buffered_rmqueue(struct zone *preferred_zone,
> > -			struct zone *zone, int order, gfp_t gfp_flags)
> > +			struct zone *zone, int order, gfp_t gfp_flags,
> > +			int migratetype)
> >  {
> >  	unsigned long flags;
> >  	struct page *page;
> >  	int cold = !!(gfp_flags & __GFP_COLD);
> >  	int cpu;
> > -	int migratetype = allocflags_to_migratetype(gfp_flags);
> 
> hmmm....
> 
> allocflags_to_migratetype() is a very cheap function, and buffered_rmqueue()
> and the other non-inline static functions aren't guaranteed to be inlined.
> 

A later patch inlines them because there is only one call
site.

> I don't think this patch improves performance on x86.
> Also, I have one comment on allocflags_to_migratetype().
> 
> -------------------------------------------------------------------
> /* Convert GFP flags to their corresponding migrate type */
> static inline int allocflags_to_migratetype(gfp_t gfp_flags)
> {
>         WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> 
>         if (unlikely(page_group_by_mobility_disabled))
>                 return MIGRATE_UNMOVABLE;
> 
>         /* Group based on mobility */
>         return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
>                 ((gfp_flags & __GFP_RECLAIMABLE) != 0);
> }
> -------------------------------------------------------------------
> 
> Would s/WARN_ON/VM_BUG_ON/ be better?
> 

I wanted to catch out-of-tree drivers, but it's been a while, so maybe VM_BUG_ON
wouldn't hurt. I can add a patch that does that in pass 2 of improving the
allocator, or would you prefer to see it now?

> A mask with both GFP_MOVABLE_MASK bits set yields 3, and 3 means
> MIGRATE_RESERVE. That seems to be an obvious bug.
> 

Short answer:
No, GFP flags that result in MIGRATE_RESERVE are a bug. The caller should
never want to allocate from there.

Longer answer:
The size of MIGRATE_RESERVE depends on the number of free pages that
must be kept in the zone. Because GFP flags never map to it, the
area is only used when the alternative is to fail the allocation and the
watermarks are still met. The intention is that short-lived high-order
atomic allocations may be allocated from here. This was
to preserve a behaviour the allocator had before MIGRATE_RESERVE was
introduced. It makes no sense for a caller to allocate directly out of
here, and in fact the fallback list for MIGRATE_RESERVE is useless.
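
To make the mapping under discussion concrete, here is a minimal userspace
sketch of the same bit arithmetic. The DEMO_ flag values and enum names are
assumptions chosen for illustration only; the real values come from the
kernel headers. It shows why a mask with both mobility bits set lands on 3,
the slot this thread refers to as MIGRATE_RESERVE.

```c
/* Illustrative stand-ins only; not the kernel's actual flag values. */
#define DEMO_GFP_RECLAIMABLE 0x1u
#define DEMO_GFP_MOVABLE     0x2u

/* Numbering follows the thread: 3 corresponds to MIGRATE_RESERVE. */
enum demo_migratetype {
	DEMO_MIGRATE_UNMOVABLE   = 0,
	DEMO_MIGRATE_RECLAIMABLE = 1,
	DEMO_MIGRATE_MOVABLE     = 2,
	DEMO_MIGRATE_RESERVE     = 3,
};

/* Same expression as the quoted function: movable is bit 1, reclaimable bit 0. */
static inline int demo_flags_to_migratetype(unsigned int gfp_flags)
{
	return (((gfp_flags & DEMO_GFP_MOVABLE) != 0) << 1) |
		((gfp_flags & DEMO_GFP_RECLAIMABLE) != 0);
}
```

With these stand-in values, passing DEMO_GFP_MOVABLE | DEMO_GFP_RECLAIMABLE
returns 3, which is the combination the WARN_ON in the quoted function is
there to catch.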


> >  
> >  again:
> >  	cpu  = get_cpu();
> > @@ -1397,7 +1397,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
> >  static struct page *
> >  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> >  		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> > -		struct zone *preferred_zone)
> > +		struct zone *preferred_zone, int migratetype)
> >  {
> >  	struct zoneref *z;
> >  	struct page *page = NULL;
> > @@ -1449,7 +1449,8 @@ zonelist_scan:
> >  			}
> >  		}
> >  
> > -		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
> > +		page = buffered_rmqueue(preferred_zone, zone, order,
> > +						gfp_mask, migratetype);
> >  		if (page)
> >  			break;
> >  this_zone_full:
> > @@ -1513,7 +1514,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> >  static inline struct page *
> >  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > -	nodemask_t *nodemask, struct zone *preferred_zone)
> > +	nodemask_t *nodemask, struct zone *preferred_zone,
> > +	int migratetype)
> >  {
> >  	struct page *page;
> >  
> > @@ -1531,7 +1533,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> >  		order, zonelist, high_zoneidx,
> >  		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
> > -		preferred_zone);
> > +		preferred_zone, migratetype);
> >  	if (page)
> >  		goto out;
> >  
> > @@ -1552,7 +1554,7 @@ static inline struct page *
> >  __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> >  	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> > -	unsigned long *did_some_progress)
> > +	int migratetype, unsigned long *did_some_progress)
> >  {
> >  	struct page *page = NULL;
> >  	struct reclaim_state reclaim_state;
> > @@ -1585,7 +1587,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	if (likely(*did_some_progress))
> >  		page = get_page_from_freelist(gfp_mask, nodemask, order,
> >  					zonelist, high_zoneidx,
> > -					alloc_flags, preferred_zone);
> > +					alloc_flags, preferred_zone,
> > +					migratetype);
> >  	return page;
> >  }
> >  
> > @@ -1606,14 +1609,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
> >  static inline struct page *
> >  __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> >  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > -	nodemask_t *nodemask, struct zone *preferred_zone)
> > +	nodemask_t *nodemask, struct zone *preferred_zone,
> > +	int migratetype)
> >  {
> >  	struct page *page;
> >  
> >  	do {
> >  		page = get_page_from_freelist(gfp_mask, nodemask, order,
> >  			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> > -			preferred_zone);
> > +			preferred_zone, migratetype);
> >  
> >  		if (!page && gfp_mask & __GFP_NOFAIL)
> >  			congestion_wait(WRITE, HZ/50);
> > @@ -1636,7 +1640,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> >  static inline struct page *
> >  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> > -	nodemask_t *nodemask, struct zone *preferred_zone)
> > +	nodemask_t *nodemask, struct zone *preferred_zone,
> > +	int migratetype)
> >  {
> >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> >  	struct page *page = NULL;
> > @@ -1687,14 +1692,16 @@ restart:
> >  	 */
> >  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> >  						high_zoneidx, alloc_flags,
> > -						preferred_zone);
> > +						preferred_zone,
> > +						migratetype);
> >  	if (page)
> >  		goto got_pg;
> >  
> >  	/* Allocate without watermarks if the context allows */
> >  	if (is_allocation_high_priority(p, gfp_mask))
> >  		page = __alloc_pages_high_priority(gfp_mask, order,
> > -			zonelist, high_zoneidx, nodemask, preferred_zone);
> > +			zonelist, high_zoneidx, nodemask, preferred_zone,
> > +			migratetype);
> >  	if (page)
> >  		goto got_pg;
> >  
> > @@ -1707,7 +1714,7 @@ restart:
> >  					zonelist, high_zoneidx,
> >  					nodemask,
> >  					alloc_flags, preferred_zone,
> > -					&did_some_progress);
> > +					migratetype, &did_some_progress);
> >  	if (page)
> >  		goto got_pg;
> >  
> > @@ -1719,7 +1726,8 @@ restart:
> >  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >  			page = __alloc_pages_may_oom(gfp_mask, order,
> >  					zonelist, high_zoneidx,
> > -					nodemask, preferred_zone);
> > +					nodemask, preferred_zone,
> > +					migratetype);
> >  			if (page)
> >  				goto got_pg;
> >  
> > @@ -1758,6 +1766,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> >  	struct zone *preferred_zone;
> >  	struct page *page;
> > +	int migratetype = allocflags_to_migratetype(gfp_mask);
> >  
> >  	lockdep_trace_alloc(gfp_mask);
> >  
> > @@ -1783,11 +1792,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >  	/* First allocation attempt */
> >  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> >  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> > -			preferred_zone);
> > +			preferred_zone, migratetype);
> >  	if (unlikely(!page))
> >  		page = __alloc_pages_slowpath(gfp_mask, order,
> >  				zonelist, high_zoneidx, nodemask,
> > -				preferred_zone);
> > +				preferred_zone, migratetype);
> >  
> >  	return page;
> >  }
> > -- 
> > 1.5.6.5
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 09/25] Calculate the migratetype for allocation only once
  2009-04-21  8:35     ` Mel Gorman
@ 2009-04-21 10:19       ` KOSAKI Motohiro
  2009-04-21 10:30         ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> On Tue, Apr 21, 2009 at 04:37:28PM +0900, KOSAKI Motohiro wrote:
> > > GFP mask is converted into a migratetype when deciding which pagelist to
> > > take a page from. However, it is happening multiple times per
> > > allocation, at least once per zone traversed. Calculate it once.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
> > >  1 files changed, 26 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index b27bcde..f960cf5 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
> > >   * or two.
> > >   */
> > >  static struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > -			struct zone *zone, int order, gfp_t gfp_flags)
> > > +			struct zone *zone, int order, gfp_t gfp_flags,
> > > +			int migratetype)
> > >  {
> > >  	unsigned long flags;
> > >  	struct page *page;
> > >  	int cold = !!(gfp_flags & __GFP_COLD);
> > >  	int cpu;
> > > -	int migratetype = allocflags_to_migratetype(gfp_flags);
> > 
> > hmmm....
> > 
> > allocflags_to_migratetype() is a very cheap function, and buffered_rmqueue()
> > and the other non-inline static functions aren't guaranteed to be inlined.
> > 
> 
> A later patch inlines them because there is only one call
> site.

Oh, I see.
I withdraw my objection. Thanks.



> > -------------------------------------------------------------------
> > /* Convert GFP flags to their corresponding migrate type */
> > static inline int allocflags_to_migratetype(gfp_t gfp_flags)
> > {
> >         WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> > 
> >         if (unlikely(page_group_by_mobility_disabled))
> >                 return MIGRATE_UNMOVABLE;
> > 
> >         /* Group based on mobility */
> >         return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
> >                 ((gfp_flags & __GFP_RECLAIMABLE) != 0);
> > }
> > -------------------------------------------------------------------
> > 
> > Would s/WARN_ON/VM_BUG_ON/ be better?
> > 
> 
> I wanted to catch out-of-tree drivers, but it's been a while, so maybe VM_BUG_ON
> wouldn't hurt. I can add a patch that does that in pass 2 of improving the
> allocator, or would you prefer to see it now?

No, a separate patch is better :)


> > A mask with both GFP_MOVABLE_MASK bits set yields 3, and 3 means
> > MIGRATE_RESERVE. That seems to be an obvious bug.
> > 
> 
> Short answer:
> No, GFP flags that result in MIGRATE_RESERVE are a bug. The caller should
> never want to allocate from there.
> 
> Longer answer:
> The size of MIGRATE_RESERVE depends on the number of free pages that
> must be kept in the zone. Because GFP flags never map to it, the
> area is only used when the alternative is to fail the allocation and the
> watermarks are still met. The intention is that short-lived high-order
> atomic allocations may be allocated from here. This was
> to preserve a behaviour the allocator had before MIGRATE_RESERVE was
> introduced. It makes no sense for a caller to allocate directly out of
> here, and in fact the fallback list for MIGRATE_RESERVE is useless.

Yeah.
My previous mail was poorly worded. I agree it is the caller's bug.
I meant that VM_BUG_ON is better because
  - it is obviously a caller bug
  - VM_BUG_ON has no runtime impact when CONFIG_DEBUG_VM is off.
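
The second point can be sketched in userspace as follows. This is only a
model under stated assumptions: DEMO_DEBUG_VM stands in for the kernel's
debug-VM config option (CONFIG_DEBUG_VM), and DEMO_VM_BUG_ON, demo_check()
and demo_count_evaluations() are hypothetical names, not kernel APIs. The
idea shown is that with the option off the macro expands to an empty
statement, so the condition is never evaluated.

```c
#include <stdio.h>
#include <stdlib.h>

/* DEMO_DEBUG_VM is a stand-in for the kernel's debug-VM config option. */
#ifdef DEMO_DEBUG_VM
#define DEMO_VM_BUG_ON(cond)						\
	do {								\
		if (cond) {						\
			fprintf(stderr, "BUG: %s\n", #cond);		\
			abort();					\
		}							\
	} while (0)
#else
/* Debugging off: the check compiles away and cond is not evaluated. */
#define DEMO_VM_BUG_ON(cond) do { } while (0)
#endif

static int demo_checked_calls;

/* Counts how often the condition is actually evaluated. */
int demo_check(int bad)
{
	demo_checked_calls++;
	return bad;
}

/* Returns the number of times the assertion condition ran. */
int demo_count_evaluations(void)
{
	demo_checked_calls = 0;
	DEMO_VM_BUG_ON(demo_check(0));
	return demo_checked_calls;
}
```

Built without -DDEMO_DEBUG_VM, demo_count_evaluations() reports zero
evaluations; built with it, the condition runs once and a true condition
would abort, which is the trade-off being argued for here.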




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 09/25] Calculate the migratetype for allocation only once
  2009-04-21 10:19       ` KOSAKI Motohiro
@ 2009-04-21 10:30         ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 10:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 07:19:00PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 21, 2009 at 04:37:28PM +0900, KOSAKI Motohiro wrote:
> > > > GFP mask is converted into a migratetype when deciding which pagelist to
> > > > take a page from. However, it is happening multiple times per
> > > > allocation, at least once per zone traversed. Calculate it once.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > >  mm/page_alloc.c |   43 ++++++++++++++++++++++++++-----------------
> > > >  1 files changed, 26 insertions(+), 17 deletions(-)
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index b27bcde..f960cf5 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
> > > >   * or two.
> > > >   */
> > > >  static struct page *buffered_rmqueue(struct zone *preferred_zone,
> > > > -			struct zone *zone, int order, gfp_t gfp_flags)
> > > > +			struct zone *zone, int order, gfp_t gfp_flags,
> > > > +			int migratetype)
> > > >  {
> > > >  	unsigned long flags;
> > > >  	struct page *page;
> > > >  	int cold = !!(gfp_flags & __GFP_COLD);
> > > >  	int cpu;
> > > > -	int migratetype = allocflags_to_migratetype(gfp_flags);
> > > 
> > > hmmm....
> > > 
> > > allocflags_to_migratetype() is a very cheap function, and buffered_rmqueue()
> > > and the other non-inline static functions aren't guaranteed to be inlined.
> > > 
> > 
> > A later patch inlines them because there is only one call
> > site.
> 
> Oh, I see.
> I withdraw my objection. Thanks.
> 
> > > -------------------------------------------------------------------
> > > /* Convert GFP flags to their corresponding migrate type */
> > > static inline int allocflags_to_migratetype(gfp_t gfp_flags)
> > > {
> > >         WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> > > 
> > >         if (unlikely(page_group_by_mobility_disabled))
> > >                 return MIGRATE_UNMOVABLE;
> > > 
> > >         /* Group based on mobility */
> > >         return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
> > >                 ((gfp_flags & __GFP_RECLAIMABLE) != 0);
> > > }
> > > -------------------------------------------------------------------
> > > 
> > > Would s/WARN_ON/VM_BUG_ON/ be better?
> > > 
> > 
> > I wanted to catch out-of-tree drivers, but it's been a while, so maybe VM_BUG_ON
> > wouldn't hurt. I can add a patch that does that in pass 2 of improving the
> > allocator, or would you prefer to see it now?
> 
> No, a separate patch is better :)
> 
> 
> > > A mask with both GFP_MOVABLE_MASK bits set yields 3, and 3 means
> > > MIGRATE_RESERVE. That seems to be an obvious bug.
> > > 
> > 
> > Short answer:
> > No, GFP flags that result in MIGRATE_RESERVE are a bug. The caller should
> > never want to allocate from there.
> > 
> > Longer answer:
> > The size of MIGRATE_RESERVE depends on the number of free pages that
> > must be kept in the zone. Because GFP flags never map to it, the
> > area is only used when the alternative is to fail the allocation and the
> > watermarks are still met. The intention is that short-lived high-order
> > atomic allocations may be allocated from here. This was
> > to preserve a behaviour the allocator had before MIGRATE_RESERVE was
> > introduced. It makes no sense for a caller to allocate directly out of
> > here, and in fact the fallback list for MIGRATE_RESERVE is useless.
> 
> Yeah.
> My previous mail was poorly worded. I agree it is the caller's bug.
> I meant that VM_BUG_ON is better because
>   - it is obviously a caller bug
>   - VM_BUG_ON has no runtime impact when CONFIG_DEBUG_VM is off.
> 

Ah, I misread what you were saying. This is certainly a caller bug and I
think it's safe to change it to a VM_BUG_ON() at this stage. I've made a
note to send a patch to that effect in a later patch series.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 10/25] Calculate the alloc_flags for allocation only once
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (8 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 09/25] Calculate the migratetype " Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  9:03   ` KOSAKI Motohiro
  2009-04-20 22:19 ` [PATCH 11/25] Calculate the cold parameter " Mel Gorman
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Factor out the mapping between GFP flags and alloc_flags. Once factored
out, it only needs to be calculated once, but some care must be taken.

[neilb@suse.de says]
As the test:

-       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-                       && !in_interrupt()) {
-               if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+       if (alloc_flags & ALLOC_NO_WATERMARKS) {

we need to ensure we don't recurse when PF_MEMALLOC is set.

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/page_alloc.c |   88 +++++++++++++++++++++++++++++++-----------------------
 1 files changed, 50 insertions(+), 38 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f960cf5..1506cd5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1592,16 +1592,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline int
-is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
-{
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt())
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			return 1;
-	return 0;
-}
-
 /*
  * This is called in the allocator slow-path if the allocation request is of
  * sufficient urgency to ignore watermarks and take other desperate measures
@@ -1637,6 +1627,42 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 		wakeup_kswapd(zone, order);
 }
 
+static inline int
+gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1667,48 +1693,34 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 restart:
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
+	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
-						high_zoneidx, alloc_flags,
-						preferred_zone,
-						migratetype);
+			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+			preferred_zone, migratetype);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
-	if (is_allocation_high_priority(p, gfp_mask))
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 		page = __alloc_pages_high_priority(gfp_mask, order,
-			zonelist, high_zoneidx, nodemask, preferred_zone,
-			migratetype);
-	if (page)
-		goto got_pg;
+				zonelist, high_zoneidx, nodemask,
+				preferred_zone, migratetype);
+		if (page)
+			goto got_pg;
+	}
 
 	/* Atomic allocations - we can't balance anything */
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 10/25] Calculate the alloc_flags for allocation only once
  2009-04-20 22:19 ` [PATCH 10/25] Calculate the alloc_flags " Mel Gorman
@ 2009-04-21  9:03   ` KOSAKI Motohiro
  2009-04-21 10:05     ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  9:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> Factor out the mapping between GFP flags and alloc_flags. Once factored
> out, it only needs to be calculated once, but some care must be taken.
> 
> [neilb@suse.de says]
> As the test:
> 
> -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -                       && !in_interrupt()) {
> -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> 
> has been replaced with a slightly weaker one:
> 
> +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> 
> we need to ensure we don't recurse when PF_MEMALLOC is set.

This seems like a good idea.




> +static inline int
> +gfp_to_alloc_flags(gfp_t gfp_mask)
> +{
> +	struct task_struct *p = current;
> +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +
> +	/*
> +	 * The caller may dip into page reserves a bit more if the caller
> +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> +	 */
> +	if (gfp_mask & __GFP_HIGH)
> +		alloc_flags |= ALLOC_HIGH;
> +
> +	if (!wait) {
> +		alloc_flags |= ALLOC_HARDER;
> +		/*
> +		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> +		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> +		 */
> +		alloc_flags &= ~ALLOC_CPUSET;
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())

wait==1 and in_interrupt()==1 never occur together.
I think the in_interrupt() check can be removed.
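
For reference while reading the flag logic, here is a minimal userspace
model of the quoted gfp_to_alloc_flags(). All DEMO_ constants and the
demo_ctx structure are assumptions made for illustration: the real function
reads the task state from `current`, whereas this sketch takes it as an
explicit argument so the combinations can be tried by hand.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the GFP bits used by the quoted function. */
#define DEMO_GFP_WAIT        0x1u
#define DEMO_GFP_HIGH        0x2u
#define DEMO_GFP_NOMEMALLOC  0x4u

/* Illustrative stand-ins for the ALLOC_* flags. */
#define DEMO_ALLOC_WMARK_MIN     0x01
#define DEMO_ALLOC_HARDER        0x02
#define DEMO_ALLOC_HIGH          0x04
#define DEMO_ALLOC_CPUSET        0x08
#define DEMO_ALLOC_NO_WATERMARKS 0x10

/* Hypothetical snapshot of the caller's context. */
struct demo_ctx {
	bool rt_task;      /* models rt_task(p) */
	bool in_interrupt; /* models in_interrupt() */
	bool memalloc;     /* models p->flags & PF_MEMALLOC */
	bool memdie;       /* models test_thread_flag(TIF_MEMDIE) */
};

static inline int demo_gfp_to_alloc_flags(unsigned int gfp_mask,
					  const struct demo_ctx *c)
{
	int alloc_flags = DEMO_ALLOC_WMARK_MIN | DEMO_ALLOC_CPUSET;
	bool wait = gfp_mask & DEMO_GFP_WAIT;

	if (gfp_mask & DEMO_GFP_HIGH)
		alloc_flags |= DEMO_ALLOC_HIGH;

	if (!wait) {
		/* Cannot reclaim: try harder and ignore the cpuset. */
		alloc_flags |= DEMO_ALLOC_HARDER;
		alloc_flags &= ~DEMO_ALLOC_CPUSET;
	} else if (c->rt_task && !c->in_interrupt) {
		alloc_flags |= DEMO_ALLOC_HARDER;
	}

	if (!(gfp_mask & DEMO_GFP_NOMEMALLOC)) {
		if (!c->in_interrupt && (c->memalloc || c->memdie))
			alloc_flags |= DEMO_ALLOC_NO_WATERMARKS;
	}

	return alloc_flags;
}
```

In this model the rt_task branch is only reached when the wait bit is set,
which is the case where the point above says in_interrupt() cannot be true.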


>  	/* Atomic allocations - we can't balance anything */
>  	if (!wait)
>  		goto nopage;
>  
> +	/* Avoid recursion of direct reclaim */
> +	if (p->flags & PF_MEMALLOC)
> +		goto nopage;
> +

Again, the old code checks not only PF_MEMALLOC but also TIF_MEMDIE.


>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order,
>  					zonelist, high_zoneidx,
> -- 
> 1.5.6.5
> 





* Re: [PATCH 10/25] Calculate the alloc_flags for allocation only once
  2009-04-21  9:03   ` KOSAKI Motohiro
@ 2009-04-21 10:05     ` Mel Gorman
  2009-04-21 10:12       ` KOSAKI Motohiro
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 10:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 06:03:25PM +0900, KOSAKI Motohiro wrote:
> > Factor out the mapping between GFP and alloc_flags only once. Once factored
> > out, it only needs to be calculated once but some care must be taken.
> > 
> > [neilb@suse.de says]
> > As the test:
> > 
> > -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > -                       && !in_interrupt()) {
> > -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> > 
> > has been replaced with a slightly weaker one:
> > 
> > +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> > 
> > we need to ensure we don't recurse when PF_MEMALLOC is set.
> 
> It seems like a good idea.
> 
> > +static inline int
> > +gfp_to_alloc_flags(gfp_t gfp_mask)
> > +{
> > +	struct task_struct *p = current;
> > +	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +
> > +	/*
> > +	 * The caller may dip into page reserves a bit more if the caller
> > +	 * cannot run direct reclaim, or if the caller has realtime scheduling
> > +	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> > +	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> > +	 */
> > +	if (gfp_mask & __GFP_HIGH)
> > +		alloc_flags |= ALLOC_HIGH;
> > +
> > +	if (!wait) {
> > +		alloc_flags |= ALLOC_HARDER;
> > +		/*
> > +		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> > +		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> > +		 */
> > +		alloc_flags &= ~ALLOC_CPUSET;
> > +	} else if (unlikely(rt_task(p)) && !in_interrupt())
> 
> wait==1 and in_interrupt()==1 never occur together.
> I think the in_interrupt() check can be removed.
> 

Looks like it. I removed it now.

> >  	/* Atomic allocations - we can't balance anything */
> >  	if (!wait)
> >  		goto nopage;
> >  
> > +	/* Avoid recursion of direct reclaim */
> > +	if (p->flags & PF_MEMALLOC)
> > +		goto nopage;
> > +
> 
> Again, the old code checks not only PF_MEMALLOC but also TIF_MEMDIE.
> 

But a direct reclaim will have PF_MEMALLOC set and doesn't care about
the value of TIF_MEMDIE with respect to recursion.

There is still a check made for TIF_MEMDIE for setting ALLOC_NO_WATERMARKS
in gfp_to_alloc_flags() so that flag is still being taken care of.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 10/25] Calculate the alloc_flags for allocation only once
  2009-04-21 10:05     ` Mel Gorman
@ 2009-04-21 10:12       ` KOSAKI Motohiro
  2009-04-21 10:37         ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> > > +	/* Avoid recursion of direct reclaim */
> > > +	if (p->flags & PF_MEMALLOC)
> > > +		goto nopage;
> > > +
> > 
> > Again, the old code checks not only PF_MEMALLOC but also TIF_MEMDIE.
> > 
> 
> But a direct reclaim will have PF_MEMALLOC set and doesn't care about
> the value of TIF_MEMDIE with respect to recursion.
> 
> There is still a check made for TIF_MEMDIE for setting ALLOC_NO_WATERMARKS
> in gfp_to_alloc_flags() so that flag is still being taken care of.

Do you mean this is an intentional change?
I only meant that there is a behavior change the changelog does not mention.

The old code is below.
Either PF_MEMALLOC or TIF_MEMDIE leads to goto nopage, which avoids reclaim.
-------------------------------------------------------------------------
rebalance:
        if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
                        && !in_interrupt()) {
                if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
                        /* go through the zonelist yet again, ignoring mins */
                        page = get_page_from_freelist(gfp_mask, nodemask, order,
                                zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
                        if (page)
                                goto got_pg;
                        if (gfp_mask & __GFP_NOFAIL) {
                                congestion_wait(WRITE, HZ/50);
                                goto nofail_alloc;
                        }
                }
                goto nopage;
        }
-------------------------------------------------------------------------


But I don't oppose this change if it is intentional.





* Re: [PATCH 10/25] Calculate the alloc_flags for allocation only once
  2009-04-21 10:12       ` KOSAKI Motohiro
@ 2009-04-21 10:37         ` Mel Gorman
  2009-04-21 10:40           ` KOSAKI Motohiro
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 10:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 07:12:57PM +0900, KOSAKI Motohiro wrote:
> > > > +	/* Avoid recursion of direct reclaim */
> > > > +	if (p->flags & PF_MEMALLOC)
> > > > +		goto nopage;
> > > > +
> > > 
> > > Again, the old code checks not only PF_MEMALLOC but also TIF_MEMDIE.
> > > 
> > 
> > But a direct reclaim will have PF_MEMALLOC set and doesn't care about
> > the value of TIF_MEMDIE with respect to recursion.
> > 
> > There is still a check made for TIF_MEMDIE for setting ALLOC_NO_WATERMARKS
> > in gfp_to_alloc_flags() so that flag is still being taken care of.
> 
> Do you mean this is an intentional change?
> I only meant that there is a behavior change the changelog does not mention.
> 

Yes, it's intentional.

> The old code is below.
> Either PF_MEMALLOC or TIF_MEMDIE leads to goto nopage, which avoids reclaim.

PF_MEMALLOC avoiding reclaim makes sense but TIF_MEMDIE should be
allowed to reclaim. I called it out a bit better in the changelog now.

> -------------------------------------------------------------------------
> rebalance:
>         if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
>                         && !in_interrupt()) {
>                 if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> nofail_alloc:
>                         /* go through the zonelist yet again, ignoring mins */
>                         page = get_page_from_freelist(gfp_mask, nodemask, order,
>                                 zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
>                         if (page)
>                                 goto got_pg;
>                         if (gfp_mask & __GFP_NOFAIL) {
>                                 congestion_wait(WRITE, HZ/50);
>                                 goto nofail_alloc;
>                         }
>                 }
>                 goto nopage;
>         }
> -------------------------------------------------------------------------
> 
> 
> But I don't oppose this change if it is intentional.
> 

The changelog now reads
=====

Factor out the mapping between GFP flags and alloc_flags. Once factored
out, it only needs to be calculated once, but some care must be taken.

[neilb@suse.de says]
As the test:

-       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-                       && !in_interrupt()) {
-               if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+       if (alloc_flags & ALLOC_NO_WATERMARKS) {

Without care, this would allow recursion into the allocator via direct
reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
but TIF_MEMDIE callers are now allowed to directly reclaim where they
would have been prevented in the past.
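
The behaviour change described in this changelog can be sketched as a
small userspace model. This is only an illustration under assumptions:
struct task_model, enum next_step and the helper names are hypothetical
and not part of the patch, and the model assumes process context without
__GFP_NOMEMALLOC. It shows the step taken after the no-watermarks
attempt has failed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of task state under discussion. */
struct task_model {
	bool pf_memalloc;	/* models p->flags & PF_MEMALLOC */
	bool tif_memdie;	/* models test_thread_flag(TIF_MEMDIE) */
};

enum next_step { STEP_NOPAGE, STEP_DIRECT_RECLAIM };

/* Either flag still lets the caller ignore the watermarks. */
static bool may_ignore_watermarks(const struct task_model *t)
{
	return t->pf_memalloc || t->tif_memdie;
}

/* Old behaviour: either flag ends the attempt before reclaim. */
static enum next_step old_next_step(const struct task_model *t, bool wait)
{
	if (t->pf_memalloc || t->tif_memdie)
		return STEP_NOPAGE;
	return wait ? STEP_DIRECT_RECLAIM : STEP_NOPAGE;
}

/* New behaviour: only PF_MEMALLOC prevents direct reclaim. */
static enum next_step new_next_step(const struct task_model *t, bool wait)
{
	if (!wait)
		return STEP_NOPAGE;	/* atomic: cannot balance anything */
	if (t->pf_memalloc)
		return STEP_NOPAGE;	/* avoid recursion of direct reclaim */
	return STEP_DIRECT_RECLAIM;
}

/* Usage: an OOM-killed task that is not itself reclaiming. */
static void model_self_check(void)
{
	struct task_model oom_victim = { .pf_memalloc = false, .tif_memdie = true };

	assert(may_ignore_watermarks(&oom_victim));
	assert(old_next_step(&oom_victim, true) == STEP_NOPAGE);
	assert(new_next_step(&oom_victim, true) == STEP_DIRECT_RECLAIM);
}
```

In this model the only case where the two versions differ is a task with
TIF_MEMDIE set, PF_MEMALLOC clear and __GFP_WAIT allowed.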

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 10/25] Calculate the alloc_flags for allocation only once
  2009-04-21 10:37         ` Mel Gorman
@ 2009-04-21 10:40           ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> The changelog now reads
> =====
> 
> Factor out the mapping between GFP flags and alloc_flags. Once factored
> out, it only needs to be calculated once, but some care must be taken.
> 
> [neilb@suse.de says]
> As the test:
> 
> -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> -                       && !in_interrupt()) {
> -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> 
> has been replaced with a slightly weaker one:
> 
> +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
> 
> Without care, this would allow recursion into the allocator via direct
> reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
> but TIF_MEMDIE callers are now allowed to directly reclaim where they
> would have been prevented in the past.

Excellent. :)




* [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (9 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 10/25] Calculate the alloc_flags " Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  7:43   ` Pekka Enberg
                     ` (2 more replies)
  2009-04-20 22:19 ` [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Mel Gorman
                   ` (14 subsequent siblings)
  25 siblings, 3 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

The GFP mask is checked for __GFP_COLD when deciding which end of the PCP
lists to use. However, the check happens multiple times per allocation,
at least once per zone traversed. Calculate it once.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   35 ++++++++++++++++++-----------------
 1 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1506cd5..51e1ded 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1066,11 +1066,10 @@ void split_page(struct page *page, unsigned int order)
  */
 static struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags,
-			int migratetype)
+			int migratetype, int cold)
 {
 	unsigned long flags;
 	struct page *page;
-	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
 
 again:
@@ -1397,7 +1396,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int migratetype, int cold)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1450,7 +1449,7 @@ zonelist_scan:
 		}
 
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+						gfp_mask, migratetype, cold);
 		if (page)
 			break;
 this_zone_full:
@@ -1515,7 +1514,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, int cold)
 {
 	struct page *page;
 
@@ -1533,7 +1532,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, migratetype, cold);
 	if (page)
 		goto out;
 
@@ -1554,7 +1553,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int migratetype, int cold, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
@@ -1588,7 +1587,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
-					migratetype);
+					migratetype, cold);
 	return page;
 }
 
@@ -1600,14 +1599,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, int cold)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, cold);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			congestion_wait(WRITE, HZ/50);
@@ -1667,7 +1666,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, int cold)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -1700,7 +1699,7 @@ restart:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, cold);
 	if (page)
 		goto got_pg;
 
@@ -1708,7 +1707,7 @@ restart:
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, cold);
 		if (page)
 			goto got_pg;
 	}
@@ -1726,7 +1725,8 @@ restart:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					migratetype, cold,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -1739,7 +1739,7 @@ restart:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					migratetype, cold);
 			if (page)
 				goto got_pg;
 
@@ -1779,6 +1779,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct zone *preferred_zone;
 	struct page *page;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
+	int cold = gfp_mask & __GFP_COLD;
 
 	lockdep_trace_alloc(gfp_mask);
 
@@ -1804,11 +1805,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, cold);
 	if (unlikely(!page))
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, cold);
 
 	return page;
 }
-- 
1.5.6.5


* Re: [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-20 22:19 ` [PATCH 11/25] Calculate the cold parameter " Mel Gorman
@ 2009-04-21  7:43   ` Pekka Enberg
  2009-04-21  8:41     ` Mel Gorman
  2009-04-21  9:07   ` KOSAKI Motohiro
  2009-04-21 14:58   ` Christoph Lameter
  2 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> The GFP mask is checked for __GFP_COLD when deciding which end of the PCP
> lists to use. However, the check happens multiple times per allocation,
> at least once per zone traversed. Calculate it once.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |   35 ++++++++++++++++++-----------------
>  1 files changed, 18 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1506cd5..51e1ded 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1066,11 +1066,10 @@ void split_page(struct page *page, unsigned int order)
>   */
>  static struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, int order, gfp_t gfp_flags,
> -			int migratetype)
> +			int migratetype, int cold)
>  {
>  	unsigned long flags;
>  	struct page *page;
> -	int cold = !!(gfp_flags & __GFP_COLD);
>  	int cpu;
>  
>  again:

Is this a measurable win? And does gcc inline all this nicely or does
this change actually increase kernel text size?


* Re: [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-21  7:43   ` Pekka Enberg
@ 2009-04-21  8:41     ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:41 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:43:25AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> > The GFP mask is checked for __GFP_COLD when deciding which end of the PCP
> > lists to use. However, the check happens multiple times per allocation,
> > at least once per zone traversed. Calculate it once.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |   35 ++++++++++++++++++-----------------
> >  1 files changed, 18 insertions(+), 17 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1506cd5..51e1ded 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1066,11 +1066,10 @@ void split_page(struct page *page, unsigned int order)
> >   */
> >  static struct page *buffered_rmqueue(struct zone *preferred_zone,
> >  			struct zone *zone, int order, gfp_t gfp_flags,
> > -			int migratetype)
> > +			int migratetype, int cold)
> >  {
> >  	unsigned long flags;
> >  	struct page *page;
> > -	int cold = !!(gfp_flags & __GFP_COLD);
> >  	int cpu;
> >  
> >  again:
> 
> Is this a measurable win? And does gcc inline all this nicely or does
> this change actually increase kernel text size?
> 

It gets inlined later so it shouldn't affect code size as part of the overall
patchset although it probably adds a small bit here. It showed up on profiles
as a place where cache misses were hitting hard although it's probable that
the misses are being incurred anyway but not as obvious due to sampling.

The win overall is small but it's the same mask that is applied to gfp_mask
over and over again and could be considered a loop invariant in a weird sort
of way.
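
To illustrate the loop-invariant point, here is a minimal userspace
sketch. It is not the kernel code: MODEL_GFP_COLD, is_cold() and the two
scan functions are hypothetical names, and the counter exists only to
make the number of mask tests visible.

```c
#include <assert.h>

/* Hypothetical flag value; the real __GFP_COLD is defined by the kernel. */
#define MODEL_GFP_COLD 0x100u

static int mask_evaluations;	/* how many times the mask was tested */

static int is_cold(unsigned int gfp_mask)
{
	mask_evaluations++;
	return !!(gfp_mask & MODEL_GFP_COLD);
}

/* Before: every per-zone attempt re-derives cold from gfp_mask. */
static int scan_zones_before(unsigned int gfp_mask, int nr_zones)
{
	int z, cold = 0;

	for (z = 0; z < nr_zones; z++)
		cold = is_cold(gfp_mask);	/* models buffered_rmqueue() */
	return cold;
}

/* After: the caller computes cold once and passes the value down. */
static int scan_zones_after(unsigned int gfp_mask, int nr_zones)
{
	int z, used = 0;
	int cold = is_cold(gfp_mask);	/* hoisted out of the zone loop */

	for (z = 0; z < nr_zones; z++)
		used = cold;		/* models passing cold as a parameter */
	return used;
}

/* Usage: four zones traversed, same answer, fewer mask tests. */
static void model_self_check(void)
{
	mask_evaluations = 0;
	assert(scan_zones_before(MODEL_GFP_COLD, 4) == 1);
	assert(mask_evaluations == 4);

	mask_evaluations = 0;
	assert(scan_zones_after(MODEL_GFP_COLD, 4) == 1);
	assert(mask_evaluations == 1);
}
```

The cost of the hoisted version is the extra parameter threaded through
every function between the entry point and the per-zone attempt.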

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-20 22:19 ` [PATCH 11/25] Calculate the cold parameter " Mel Gorman
  2009-04-21  7:43   ` Pekka Enberg
@ 2009-04-21  9:07   ` KOSAKI Motohiro
  2009-04-21 10:08     ` Mel Gorman
  2009-04-21 14:59     ` Christoph Lameter
  2009-04-21 14:58   ` Christoph Lameter
  2 siblings, 2 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  9:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> The GFP mask is checked for __GFP_COLD when deciding which end of the PCP
> lists to use. However, the check happens multiple times per allocation,
> at least once per zone traversed. Calculate it once.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |   35 ++++++++++++++++++-----------------
>  1 files changed, 18 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1506cd5..51e1ded 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1066,11 +1066,10 @@ void split_page(struct page *page, unsigned int order)
>   */
>  static struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, int order, gfp_t gfp_flags,
> -			int migratetype)
> +			int migratetype, int cold)
>  {
>  	unsigned long flags;
>  	struct page *page;
> -	int cold = !!(gfp_flags & __GFP_COLD);
>  	int cpu;

Honestly, I don't like this ;-)

The benefit seems too small. I don't think it outweighs the code ugliness.






* Re: [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-21  9:07   ` KOSAKI Motohiro
@ 2009-04-21 10:08     ` Mel Gorman
  2009-04-21 14:59     ` Christoph Lameter
  1 sibling, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 10:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 06:07:24PM +0900, KOSAKI Motohiro wrote:
> > The GFP mask is checked for __GFP_COLD when deciding which end of the PCP
> > lists to use. However, the check happens multiple times per allocation,
> > at least once per zone traversed. Calculate it once.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |   35 ++++++++++++++++++-----------------
> >  1 files changed, 18 insertions(+), 17 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1506cd5..51e1ded 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1066,11 +1066,10 @@ void split_page(struct page *page, unsigned int order)
> >   */
> >  static struct page *buffered_rmqueue(struct zone *preferred_zone,
> >  			struct zone *zone, int order, gfp_t gfp_flags,
> > -			int migratetype)
> > +			int migratetype, int cold)
> >  {
> >  	unsigned long flags;
> >  	struct page *page;
> > -	int cold = !!(gfp_flags & __GFP_COLD);
> >  	int cpu;
> 
> Honestly, I don't like this ;-)
> 
> > The benefit seems too small. I don't think it outweighs the code ugliness.
> 

Ok, I'll drop it for now and then generate figures for it at a later
time. The intention is to have this first set relatively
uncontroversial.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-21  9:07   ` KOSAKI Motohiro
  2009-04-21 10:08     ` Mel Gorman
@ 2009-04-21 14:59     ` Christoph Lameter
  1 sibling, 0 replies; 105+ messages in thread
From: Christoph Lameter @ 2009-04-21 14:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Linux Memory Management List, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, 21 Apr 2009, KOSAKI Motohiro wrote:

> The benefit seems too small. I don't think it outweighs the code ugliness.

Some of these functions are inlined by the compiler. And it helps the
compiler to optimize if the state is in a local variable.


* Re: [PATCH 11/25] Calculate the cold parameter for allocation only once
  2009-04-20 22:19 ` [PATCH 11/25] Calculate the cold parameter " Mel Gorman
  2009-04-21  7:43   ` Pekka Enberg
  2009-04-21  9:07   ` KOSAKI Motohiro
@ 2009-04-21 14:58   ` Christoph Lameter
  2 siblings, 0 replies; 105+ messages in thread
From: Christoph Lameter @ 2009-04-21 14:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton


My earlier ack is missing.


* [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (10 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 11/25] Calculate the cold parameter " Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  7:46   ` Pekka Enberg
  2009-04-21  9:08   ` KOSAKI Motohiro
  2009-04-20 22:19 ` [PATCH 13/25] Inline __rmqueue_smallest() Mel Gorman
                   ` (13 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
flags are equal to each other, we can eliminate a branch.

[akpm@linux-foundation.org: Suggested the hack]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 51e1ded..b13fc29 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
 	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
+	VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
+	alloc_flags |= (gfp_mask & __GFP_HIGH);
 
 	if (!wait) {
 		alloc_flags |= ALLOC_HARDER;
-- 
1.5.6.5


* Re: [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-20 22:19 ` [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Mel Gorman
@ 2009-04-21  7:46   ` Pekka Enberg
  2009-04-21  8:45     ` Mel Gorman
  2009-04-21  9:08   ` KOSAKI Motohiro
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
> flags are equal to each other, we can eliminate a branch.
> 
> [akpm@linux-foundation.org: Suggested the hack]

Yikes!

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 51e1ded..b13fc29 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
>  	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
>  	 */
> -	if (gfp_mask & __GFP_HIGH)
> -		alloc_flags |= ALLOC_HIGH;
> +	VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
> +	alloc_flags |= (gfp_mask & __GFP_HIGH);

Shouldn't you then also change ALLOC_HIGH to use __GFP_HIGH or at least
add a comment somewhere?

>  
>  	if (!wait) {
>  		alloc_flags |= ALLOC_HARDER;


* Re: [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-21  7:46   ` Pekka Enberg
@ 2009-04-21  8:45     ` Mel Gorman
  2009-04-21 10:25       ` Pekka Enberg
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:46:22AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> > Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
> > flags are equal to each other, we can eliminate a branch.
> > 
> > [akpm@linux-foundation.org: Suggested the hack]
> 
> Yikes!
> 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 51e1ded..b13fc29 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> >  	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> >  	 */
> > -	if (gfp_mask & __GFP_HIGH)
> > -		alloc_flags |= ALLOC_HIGH;
> > +	VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
> > +	alloc_flags |= (gfp_mask & __GFP_HIGH);
> 
> Shouldn't you then also change ALLOC_HIGH to use __GFP_HIGH or at least
> add a comment somewhere?
> 

That might break in weird ways if __GFP_HIGH changes in value then. I
can add a comment though

/*
 * __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch.
 * Check for DEBUG_VM that the assumption is still correct. It cannot be
 * checked at compile-time due to casting
 */

?
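
For illustration only, a minimal userspace sketch of the assumption behind
the hack. The EX_* names and the 0x20 values are stand-ins invented for this
example and are not taken from the patch. The sketch only shows that when the
two constants are equal, masking the bit across gives the same result as the
branch.

```c
#include <assert.h>

/* Illustrative values; the trick only needs the two constants to be equal. */
#define EX_GFP_HIGH   0x20u	/* stand-in for __GFP_HIGH */
#define EX_ALLOC_HIGH 0x20u	/* stand-in for ALLOC_HIGH */

/* Original form: test the bit, then set the matching flag. */
static unsigned int flags_with_branch(unsigned int gfp_mask)
{
	unsigned int alloc_flags = 0;

	if (gfp_mask & EX_GFP_HIGH)
		alloc_flags |= EX_ALLOC_HIGH;
	return alloc_flags;
}

/* Patched form: copy the bit straight across, no conditional needed. */
static unsigned int flags_branch_free(unsigned int gfp_mask)
{
	unsigned int alloc_flags = 0;

	/* Run-time check of the assumption, in the spirit of VM_BUG_ON. */
	assert(EX_GFP_HIGH == EX_ALLOC_HIGH);
	alloc_flags |= (gfp_mask & EX_GFP_HIGH);
	return alloc_flags;
}

/* Returns 1 if both forms agree for every mask in [0, limit). */
static int forms_agree(unsigned int limit)
{
	unsigned int m;

	for (m = 0; m < limit; m++)
		if (flags_with_branch(m) != flags_branch_free(m))
			return 0;
	return 1;
}
```

If the two constants ever diverged, the branch-free form would set the wrong
bit, which is why some check on the assumption is wanted.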

> >  
> >  	if (!wait) {
> >  		alloc_flags |= ALLOC_HARDER;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-21  8:45     ` Mel Gorman
@ 2009-04-21 10:25       ` Pekka Enberg
  0 siblings, 0 replies; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21 10:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

Hi Mel,

On Tue, Apr 21, 2009 at 11:45 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > @@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>> >      * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
>> >      * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
>> >      */
>> > -   if (gfp_mask & __GFP_HIGH)
>> > -           alloc_flags |= ALLOC_HIGH;
>> > +   VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
>> > +   alloc_flags |= (gfp_mask & __GFP_HIGH);
>>
>> Shouldn't you then also change ALLOC_HIGH to use __GFP_HIGH or at least
>> add a comment somewhere?
>
> That might break in weird ways if __GFP_HIGH changes in value then. I
> can add a comment though
>
> /*
>  * __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch.
>  * Check for DEBUG_VM that the assumption is still correct. It cannot be
>  * checked at compile-time due to casting
>  */
>
> ?

I'm perfectly fine with something like that.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

                                      Pekka


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-20 22:19 ` [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Mel Gorman
  2009-04-21  7:46   ` Pekka Enberg
@ 2009-04-21  9:08   ` KOSAKI Motohiro
  2009-04-21 10:31     ` KOSAKI Motohiro
  1 sibling, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  9:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
> flags are equal to each other, we can eliminate a branch.
> 
> [akpm@linux-foundation.org: Suggested the hack]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 51e1ded..b13fc29 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
>  	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
>  	 */
> -	if (gfp_mask & __GFP_HIGH)
> -		alloc_flags |= ALLOC_HIGH;
> +	VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
> +	alloc_flags |= (gfp_mask & __GFP_HIGH);
>  
>  	if (!wait) {
>  		alloc_flags |= ALLOC_HARDER;

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-21  9:08   ` KOSAKI Motohiro
@ 2009-04-21 10:31     ` KOSAKI Motohiro
  2009-04-21 10:43       ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> > @@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> >  	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> >  	 */
> > -	if (gfp_mask & __GFP_HIGH)
> > -		alloc_flags |= ALLOC_HIGH;
> > +	VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);

Oops, I forgot to make one comment.
Wouldn't BUILD_BUG_ON() be better?


> > +	alloc_flags |= (gfp_mask & __GFP_HIGH);
> >  
> >  	if (!wait) {
> >  		alloc_flags |= ALLOC_HARDER;
> 
> 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> 
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH
  2009-04-21 10:31     ` KOSAKI Motohiro
@ 2009-04-21 10:43       ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 10:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 07:31:23PM +0900, KOSAKI Motohiro wrote:
> > > @@ -1639,8 +1639,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> > >  	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> > >  	 */
> > > -	if (gfp_mask & __GFP_HIGH)
> > > -		alloc_flags |= ALLOC_HIGH;
> > > +	VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
> 
> Oops, I forgot to make one comment.
> Wouldn't BUILD_BUG_ON() be better?
> 

Much better. Thanks
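
As a hedged illustration of the difference, here is a userspace sketch of a
compile-time check. EX_BUILD_BUG_ON and the EX_* constants are hypothetical
names for this example; the macro uses the well-known negative-array-size
trick and is not claimed to match the kernel's definition. C11's
_Static_assert is shown alongside as an equivalent.

```c
#include <assert.h>

#define EX_GFP_HIGH   0x20u	/* stand-in for __GFP_HIGH */
#define EX_ALLOC_HIGH 0x20u	/* stand-in for ALLOC_HIGH */

/* Negative array size when cond is true, so the build fails. */
#define EX_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

/* C11 spelling of the same idea, usable at file scope. */
_Static_assert(EX_GFP_HIGH == EX_ALLOC_HIGH,
	       "flag values must match for the branch-free copy");

static unsigned int ex_gfp_to_alloc_flags(unsigned int gfp_mask)
{
	unsigned int alloc_flags = 0;

	/* No run-time cost; a mismatch is caught when compiling. */
	EX_BUILD_BUG_ON(EX_GFP_HIGH != EX_ALLOC_HIGH);
	alloc_flags |= (gfp_mask & EX_GFP_HIGH);
	return alloc_flags;
}
```

Unlike a run-time assertion that is only active in debug builds, a check of
this kind fires in every configuration.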

> 
> > > +	alloc_flags |= (gfp_mask & __GFP_HIGH);
> > >  
> > >  	if (!wait) {
> > >  		alloc_flags |= ALLOC_HARDER;
> > 
> > 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > 
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 13/25] Inline __rmqueue_smallest()
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (11 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Mel Gorman
@ 2009-04-20 22:19 ` Mel Gorman
  2009-04-21  7:58   ` Pekka Enberg
  2009-04-21  9:52   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 14/25] Inline buffered_rmqueue() Mel Gorman
                   ` (12 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:19 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Inline __rmqueue_smallest by altering flow very slightly so that there
is only one call site. This allows the function to be inlined without
additional text bloat.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |   23 ++++++++++++++++++-----
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b13fc29..91a2cdb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
  * Go through the free lists for the given migratetype and remove
  * the smallest available page from the freelists
  */
-static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	unsigned int current_order;
@@ -835,24 +836,36 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order,
 		}
 	}
 
-	/* Use MIGRATE_RESERVE rather than fail an allocation */
-	return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
+	return NULL;
 }
 
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue(struct zone *zone, unsigned int order,
 						int migratetype)
 {
 	struct page *page;
 
+retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
-	if (unlikely(!page))
+	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
 		page = __rmqueue_fallback(zone, order, migratetype);
 
+		/*
+		 * Use MIGRATE_RESERVE rather than fail an allocation. goto
+		 * is used because __rmqueue_smallest is an inline function
+		 * and we want just one call site
+		 */
+		if (!page) {
+			migratetype = MIGRATE_RESERVE;
+			goto retry_reserve;
+		}
+	}
+
 	return page;
 }
 
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/25] Inline __rmqueue_smallest()
  2009-04-20 22:19 ` [PATCH 13/25] Inline __rmqueue_smallest() Mel Gorman
@ 2009-04-21  7:58   ` Pekka Enberg
  2009-04-21  8:48     ` Mel Gorman
  2009-04-21  9:52   ` KOSAKI Motohiro
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> Inline __rmqueue_smallest by altering flow very slightly so that there
> is only one call site. This allows the function to be inlined without
> additional text bloat.

Quite frankly, I think these patch changelogs could use some before and
after numbers for "size mm/page_alloc.o" because it's usually the case
that kernel text shrinks when you _remove_ inlines.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/25] Inline __rmqueue_smallest()
  2009-04-21  7:58   ` Pekka Enberg
@ 2009-04-21  8:48     ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:48 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:58:15AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> > Inline __rmqueue_smallest by altering flow very slightly so that there
> > is only one call site. This allows the function to be inlined without
> > additional text bloat.
> 
> Quite frankly, I think these patch changelogs could use some before and
> after numbers for "size mm/page_alloc.o" because it's usually the case
> that kernel text shrinks when you _remove_ inlines.
> 

I can generate that although it'll be a bit misleading because stack
parameters are added earlier in the series that get eliminated later due
to inlines. Shuffling them around won't help a whole lot.

Inlining a function with only one call site does save text in this series,
though. For a non-inlined function, the calling convention has to be obeyed,
and for functions with a large number of parameters like these, that overhead
can be sizable.

I'll regenerate the figures though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/25] Inline __rmqueue_smallest()
  2009-04-20 22:19 ` [PATCH 13/25] Inline __rmqueue_smallest() Mel Gorman
  2009-04-21  7:58   ` Pekka Enberg
@ 2009-04-21  9:52   ` KOSAKI Motohiro
  2009-04-21 10:11     ` Mel Gorman
  1 sibling, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  9:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> Inline __rmqueue_smallest by altering flow very slightly so that there
> is only one call site. This allows the function to be inlined without
> additional text bloat.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  mm/page_alloc.c |   23 ++++++++++++++++++-----
>  1 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b13fc29..91a2cdb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>   * Go through the free lists for the given migratetype and remove
>   * the smallest available page from the freelists
>   */
> -static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> +static inline
> +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  						int migratetype)

"Only one caller" is one of the key points of this patch, I think,
so wouldn't a comment be better? It isn't a blocking issue at all, though.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/25] Inline __rmqueue_smallest()
  2009-04-21  9:52   ` KOSAKI Motohiro
@ 2009-04-21 10:11     ` Mel Gorman
  2009-04-21 10:22       ` KOSAKI Motohiro
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 10:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 06:52:28PM +0900, KOSAKI Motohiro wrote:
> > Inline __rmqueue_smallest by altering flow very slightly so that there
> > is only one call site. This allows the function to be inlined without
> > additional text bloat.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/page_alloc.c |   23 ++++++++++++++++++-----
> >  1 files changed, 18 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b13fc29..91a2cdb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> >   * Go through the free lists for the given migratetype and remove
> >   * the smallest available page from the freelists
> >   */
> > -static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> > +static inline
> > +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  						int migratetype)
> 
> "Only one caller" is one of the key points of this patch, I think,
> so wouldn't a comment be better? It isn't a blocking issue at all, though.
> 
> 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 

Is this better?

Inline __rmqueue_smallest by altering flow very slightly so that there
is only one call site. Because there is only one call-site, this
function can then be inlined without causing text bloat.

I don't see a need to add a comment into the function itself as I don't
think it would help any.
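
For readers following the control flow, a small userspace sketch of the
single-call-site pattern. The ex_* names, the per-type counters and the
call counter are invented for this example; the real functions walk buddy
free lists per order, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

enum { EX_UNMOVABLE, EX_MOVABLE, EX_RESERVE, EX_NR_TYPES };

struct ex_zone {
	int free[EX_NR_TYPES];	/* free pages per migratetype */
	int smallest_calls;	/* how often ex_smallest() ran */
};

/* Take one page of the requested type; returns 1 on success, 0 if empty. */
static inline int ex_smallest(struct ex_zone *z, int type)
{
	z->smallest_calls++;
	if (z->free[type] == 0)
		return 0;
	z->free[type]--;
	return 1;
}

/* Steal from another non-reserve type; the reserve is no longer tried here. */
static inline int ex_fallback(struct ex_zone *z, int type)
{
	int t;

	for (t = 0; t < EX_RESERVE; t++) {
		if (t != type && z->free[t] > 0) {
			z->free[t]--;
			return 1;
		}
	}
	return 0;
}

/* Returns 1 if a page was obtained, 0 otherwise. */
static int ex_rmqueue(struct ex_zone *z, int type)
{
	int got;

retry_reserve:
	got = ex_smallest(z, type);	/* the only call site in the source */

	if (!got && type != EX_RESERVE) {
		got = ex_fallback(z, type);
		if (!got) {
			/* Use the reserve rather than fail the request. */
			type = EX_RESERVE;
			goto retry_reserve;
		}
	}
	return got;
}
```

The goto re-enters the one existing call instead of adding a second call that
the compiler would have to expand separately when inlining.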

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/25] Inline __rmqueue_smallest()
  2009-04-21 10:11     ` Mel Gorman
@ 2009-04-21 10:22       ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> Is this better?
> 
> Inline __rmqueue_smallest by altering flow very slightly so that there
> is only one call site. Because there is only one call-site, this
> function can then be inlined without causing text bloat.

very nice.


> I don't see a need to add a comment into the function itself as I don't
> think it would help any.

OK, I withdraw my suggestion.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 14/25] Inline buffered_rmqueue()
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (12 preceding siblings ...)
  2009-04-20 22:19 ` [PATCH 13/25] Inline __rmqueue_smallest() Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21  9:56   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 15/25] Inline __rmqueue_fallback() Mel Gorman
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

buffered_rmqueue() is in the fast path so inline it. Because it only has
one call site, this should actually reduce text bloat rather than
increase it.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a2cdb..2dfe3aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1077,7 +1077,8 @@ void split_page(struct page *page, unsigned int order)
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static struct page *buffered_rmqueue(struct zone *preferred_zone,
+static inline
+struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags,
 			int migratetype, int cold)
 {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/25] Inline buffered_rmqueue()
  2009-04-20 22:20 ` [PATCH 14/25] Inline buffered_rmqueue() Mel Gorman
@ 2009-04-21  9:56   ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  9:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> buffered_rmqueue() is in the fast path so inline it. Because it only has
> one call site, this should actually reduce text bloat rather than
> increase it.

nit: please append the size command output.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91a2cdb..2dfe3aa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1077,7 +1077,8 @@ void split_page(struct page *page, unsigned int order)
>   * we cheat by calling it from here, in the order > 0 path.  Saves a branch
>   * or two.
>   */
> -static struct page *buffered_rmqueue(struct zone *preferred_zone,
> +static inline
> +struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, int order, gfp_t gfp_flags,
>  			int migratetype, int cold)
>  {
> -- 
> 1.5.6.5
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 15/25] Inline __rmqueue_fallback()
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (13 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 14/25] Inline buffered_rmqueue() Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21  9:56   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 16/25] Save text by reducing call sites of __rmqueue() Mel Gorman
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

__rmqueue_fallback() is in the slow path but has only one call site. It
actually reduces text if it's inlined.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2dfe3aa..83da463 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -775,8 +775,8 @@ static int move_freepages_block(struct zone *zone, struct page *page,
 }
 
 /* Remove an element from the buddy allocator from the fallback list */
-static struct page *__rmqueue_fallback(struct zone *zone, int order,
-						int start_migratetype)
+static inline struct page *
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 {
 	struct free_area * area;
 	int current_order;
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 15/25] Inline __rmqueue_fallback()
  2009-04-20 22:20 ` [PATCH 15/25] Inline __rmqueue_fallback() Mel Gorman
@ 2009-04-21  9:56   ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21  9:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> __rmqueue_fallback() is in the slow path but has only one call site. It
> actually reduces text if it's inlined.

Ditto. Please add the size command output.


> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2dfe3aa..83da463 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -775,8 +775,8 @@ static int move_freepages_block(struct zone *zone, struct page *page,
>  }
>  
>  /* Remove an element from the buddy allocator from the fallback list */
> -static struct page *__rmqueue_fallback(struct zone *zone, int order,
> -						int start_migratetype)
> +static inline struct page *
> +__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>  {
>  	struct free_area * area;
>  	int current_order;
> -- 
> 1.5.6.5
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 16/25] Save text by reducing call sites of __rmqueue()
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (14 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 15/25] Inline __rmqueue_fallback() Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21 10:47   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

__rmqueue is inlined in the fast path but it has two call sites, the
low-order and high-order paths. However, a slight modification to the
high-order path reduces the number of __rmqueue call sites. This reduces
text at the cost of slightly more complexity in the high-order allocation
path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 83da463..c57c602 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1121,11 +1121,14 @@ again:
 		list_del(&page->lru);
 		pcp->count--;
 	} else {
-		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order, migratetype);
-		spin_unlock(&zone->lock);
-		if (!page)
+		LIST_HEAD(list);
+		local_irq_save(flags);
+
+		/* Calling __rmqueue would bloat text, hence this */
+		if (!rmqueue_bulk(zone, order, 1, &list, migratetype))
 			goto failed;
+		page = list_entry(list.next, struct page, lru);
+		list_del(&page->lru);
 	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 16/25] Save text by reducing call sites of __rmqueue()
  2009-04-20 22:20 ` [PATCH 16/25] Save text by reducing call sites of __rmqueue() Mel Gorman
@ 2009-04-21 10:47   ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 10:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

>  	} else {
> -		spin_lock_irqsave(&zone->lock, flags);
> -		page = __rmqueue(zone, order, migratetype);
> -		spin_unlock(&zone->lock);
> -		if (!page)
> +		LIST_HEAD(list);
> +		local_irq_save(flags);
> +
> +		/* Calling __rmqueue would bloat text, hence this */
> +		if (!rmqueue_bulk(zone, order, 1, &list, migratetype))
>  			goto failed;
> +		page = list_entry(list.next, struct page, lru);
> +		list_del(&page->lru);
>  	}

looks good
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
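
For illustration, a userspace sketch of the "bulk request of one" idiom in
the hunk above. All ex_* names are hypothetical and the list helpers are
simplified (ex_list_del re-initialises the entry, which is an assumption of
this sketch). Locking and interrupt handling are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct ex_list { struct ex_list *next, *prev; };

struct ex_page {
	int id;
	struct ex_list lru;
};

/* Recover the containing structure from a pointer to its member. */
#define ex_list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void ex_list_init(struct ex_list *h)
{
	h->next = h->prev = h;
}

/* Insert n directly after head h. */
static void ex_list_add(struct ex_list *n, struct ex_list *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Unlink e and leave it pointing at itself. */
static void ex_list_del(struct ex_list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;
}

/* Move up to count pages from pool onto list; returns how many moved. */
static int ex_rmqueue_bulk(struct ex_list *pool, int count,
			   struct ex_list *list)
{
	int i;

	for (i = 0; i < count && pool->next != pool; i++) {
		struct ex_list *e = pool->next;

		ex_list_del(e);
		ex_list_add(e, list);
	}
	return i;
}

/* Single-page request expressed as a bulk request of one. */
static struct ex_page *ex_alloc_one(struct ex_list *pool)
{
	struct ex_list list;
	struct ex_page *page;

	ex_list_init(&list);
	if (!ex_rmqueue_bulk(pool, 1, &list))
		return NULL;
	page = ex_list_entry(list.next, struct ex_page, lru);
	ex_list_del(&page->lru);
	return page;
}
```

The caller pays for a temporary list head and one extra unlink, in exchange
for not expanding the inlined allocation helper a second time.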





^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (15 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 16/25] Save text by reducing call sites of __rmqueue() Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21 11:03   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 18/25] Do not disable interrupts in free_page_mlock() Mel Gorman
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

get_pageblock_migratetype() is potentially called twice for every page
free. Once, when being freed to the pcp lists and once when being freed
back to buddy. When freeing from the pcp lists, it is known what the
pageblock type was at the time of free so use it rather than rechecking.
In low memory situations under memory pressure, this might skew
anti-fragmentation slightly but the interference is minimal and
decisions that are fragmenting memory are being made anyway.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c57c602..a1ca038 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -456,16 +456,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
-		struct zone *zone, unsigned int order)
+		struct zone *zone, unsigned int order,
+		int migratetype)
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
-	int migratetype = get_pageblock_migratetype(page);
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
 			return;
 
+	VM_BUG_ON(migratetype == -1);
+
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
@@ -534,17 +536,18 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
-		__free_one_page(page, zone, order);
+		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order)
+static void free_one_page(struct zone *zone, struct page *page, int order,
+				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
-	__free_one_page(page, zone, order);
+	__free_one_page(page, zone, order, migratetype);
 	spin_unlock(&zone->lock);
 }
 
@@ -569,7 +572,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, order);
+	free_one_page(page_zone(page), page, order,
+					get_pageblock_migratetype(page));
 	local_irq_restore(flags);
 }
 
-- 
1.5.6.5
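
For illustration only, here is a minimal user-space sketch of the idea in this patch. It assumes the migratetype is stored in the page's private field when the page goes onto the pcp list, as the page_private(page) call in the diff suggests. The names mock_page, lookup_migratetype(), free_to_pcp(), free_to_buddy() and drain_from_pcp() are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; this is not the real struct page. */
enum { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE, MT_TYPES };

struct mock_page {
	unsigned long private;	/* caches the migratetype while on the pcp list */
	int block_type;		/* what a pageblock lookup would return right now */
};

static int lookups;	/* counts simulated get_pageblock_migratetype() calls */

static int lookup_migratetype(const struct mock_page *page)
{
	lookups++;
	return page->block_type;
}

/* Free to the per-cpu list: look the type up once and remember it. */
static void free_to_pcp(struct mock_page *page)
{
	page->private = (unsigned long)lookup_migratetype(page);
}

/* Free back to the buddy lists: trust the caller-supplied type. */
static int free_to_buddy(struct mock_page *page, int migratetype)
{
	/* Mirrors the VM_BUG_ON(migratetype == -1) sanity check in spirit. */
	assert(migratetype >= 0 && migratetype < MT_TYPES);
	(void)page;
	return migratetype;	/* index of the free list the page would join */
}

/* Drain the pcp list: reuse the cached type instead of a second lookup. */
static int drain_from_pcp(struct mock_page *page)
{
	return free_to_buddy(page, (int)page->private);
}
```

If the pageblock type changes between free_to_pcp() and drain_from_pcp(), the page is still filed under the type recorded at free time. That is the slight anti-fragmentation skew the changelog accepts in exchange for one lookup fewer per page.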

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary
  2009-04-20 22:20 ` [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
@ 2009-04-21 11:03   ` KOSAKI Motohiro
  2009-04-21 16:12     ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 11:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> get_pageblock_migratetype() is potentially called twice for every page
> free. Once, when being freed to the pcp lists and once when being freed
> back to buddy. When freeing from the pcp lists, it is known what the
> pageblock type was at the time of free so use it rather than rechecking.
> In low memory situations under memory pressure, this might skew
> anti-fragmentation slightly but the interference is minimal and
> decisions that are fragmenting memory are being made anyway.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  mm/page_alloc.c |   16 ++++++++++------
>  1 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c57c602..a1ca038 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -456,16 +456,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>   */
>  
>  static inline void __free_one_page(struct page *page,
> -		struct zone *zone, unsigned int order)
> +		struct zone *zone, unsigned int order,
> +		int migratetype)
>  {
>  	unsigned long page_idx;
>  	int order_size = 1 << order;
> -	int migratetype = get_pageblock_migratetype(page);
>  
>  	if (unlikely(PageCompound(page)))
>  		if (unlikely(destroy_compound_page(page, order)))
>  			return;
>  
> +	VM_BUG_ON(migratetype == -1);
> +
>  	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
>  
>  	VM_BUG_ON(page_idx & (order_size - 1));
> @@ -534,17 +536,18 @@ static void free_pages_bulk(struct zone *zone, int count,
>  		page = list_entry(list->prev, struct page, lru);
>  		/* have to delete it as __free_one_page list manipulates */
>  		list_del(&page->lru);
> -		__free_one_page(page, zone, order);
> +		__free_one_page(page, zone, order, page_private(page));
>  	}
>  	spin_unlock(&zone->lock);

looks good.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


btw, I can't review the rest of the patches today. I plan to do that tomorrow, sorry.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary
  2009-04-21 11:03   ` KOSAKI Motohiro
@ 2009-04-21 16:12     ` Mel Gorman
  2009-04-22  2:25       ` KOSAKI Motohiro
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 16:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Tue, Apr 21, 2009 at 08:03:10PM +0900, KOSAKI Motohiro wrote:
> > get_pageblock_migratetype() is potentially called twice for every page
> > free. Once, when being freed to the pcp lists and once when being freed
> > back to buddy. When freeing from the pcp lists, it is known what the
> > pageblock type was at the time of free so use it rather than rechecking.
> > In low memory situations under memory pressure, this might skew
> > anti-fragmentation slightly but the interference is minimal and
> > decisions that are fragmenting memory are being made anyway.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/page_alloc.c |   16 ++++++++++------
> >  1 files changed, 10 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c57c602..a1ca038 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -456,16 +456,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
> >   */
> >  
> >  static inline void __free_one_page(struct page *page,
> > -		struct zone *zone, unsigned int order)
> > +		struct zone *zone, unsigned int order,
> > +		int migratetype)
> >  {
> >  	unsigned long page_idx;
> >  	int order_size = 1 << order;
> > -	int migratetype = get_pageblock_migratetype(page);
> >  
> >  	if (unlikely(PageCompound(page)))
> >  		if (unlikely(destroy_compound_page(page, order)))
> >  			return;
> >  
> > +	VM_BUG_ON(migratetype == -1);
> > +
> >  	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
> >  
> >  	VM_BUG_ON(page_idx & (order_size - 1));
> > @@ -534,17 +536,18 @@ static void free_pages_bulk(struct zone *zone, int count,
> >  		page = list_entry(list->prev, struct page, lru);
> >  		/* have to delete it as __free_one_page list manipulates */
> >  		list_del(&page->lru);
> > -		__free_one_page(page, zone, order);
> > +		__free_one_page(page, zone, order, page_private(page));
> >  	}
> >  	spin_unlock(&zone->lock);
> 
> looks good.
> 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> 
> btw, I can't review the rest of the patches today. I plan to do that tomorrow, sorry.
> 

No problem. Thanks a million for the work you've done so far. It was a
big help and you caught a fair few problems in there.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary
  2009-04-21 16:12     ` Mel Gorman
@ 2009-04-22  2:25       ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22  2:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> > btw, I can't review the rest of the patches today. I plan to do that tomorrow, sorry.
> 
> No problem. Thanks a million for the work you've done so far. It was a
> big help and you caught a fair few problems in there.

Sure. Nobody thinks your patches have many problems :)



^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 18/25] Do not disable interrupts in free_page_mlock()
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (16 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21  7:55   ` Pekka Enberg
  2009-04-22  0:13   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 19/25] Do not setup zonelist cache when there is only one node Mel Gorman
                   ` (7 subsequent siblings)
  25 siblings, 2 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If set, it disables interrupts to update counters and this
happens on every page free even though interrupts are disabled very shortly
afterwards a second time.  This is wasteful.

This patch splits what free_page_mlock() does. The bit check is still
made. However, the update of counters is delayed until the interrupts are
disabled and the non-lock version for clearing the bit is used. One potential
weirdness with this split is that the counters do not get updated if the
bad_page() check is triggered but a system showing bad pages is getting
screwed already.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/internal.h   |   11 +++--------
 mm/page_alloc.c |    8 +++++++-
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 987bb03..58ec1bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
  */
 static inline void free_page_mlock(struct page *page)
 {
-	if (unlikely(TestClearPageMlocked(page))) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
-		__count_vm_event(UNEVICTABLE_MLOCKFREED);
-		local_irq_restore(flags);
-	}
+	__ClearPageMlocked(page);
+	__dec_zone_page_state(page, NR_MLOCK);
+	__count_vm_event(UNEVICTABLE_MLOCKFREED);
 }
 
 #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a1ca038..bf4b8d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
 
 static inline int free_pages_check(struct page *page)
 {
-	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
 		(page_count(page) != 0)  |
@@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	unsigned long flags;
 	int i;
 	int bad = 0;
+	int clearMlocked = PageMlocked(page);
 
 	for (i = 0 ; i < (1 << order) ; ++i)
 		bad += free_pages_check(page + i);
@@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	kernel_map_pages(page, 1 << order, 0);
 
 	local_irq_save(flags);
+	if (unlikely(clearMlocked))
+		free_page_mlock(page);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order,
 					get_pageblock_migratetype(page));
@@ -1018,6 +1020,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	int clearMlocked = PageMlocked(page);
 
 	if (PageAnon(page))
 		page->mapping = NULL;
@@ -1033,7 +1036,10 @@ static void free_hot_cold_page(struct page *page, int cold)
 
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
+	if (unlikely(clearMlocked))
+		free_page_mlock(page);
 	__count_vm_event(PGFREE);
+
 	if (cold)
 		list_add_tail(&page->lru, &pcp->list);
 	else
-- 
1.5.6.5
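
For illustration only, a minimal user-space sketch of the split described in the changelog: sample the mlocked bit early, then clear it and update the counter inside the interrupt-disabled section the free path already has. Everything here (mock_page, irq_save(), irq_restore(), nr_mlock, irq_toggles, free_page_old(), free_page_new()) is a hypothetical stand-in and not a kernel API. Real interrupt control is replaced by a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page with only the mlocked flag. */
struct mock_page { bool mlocked; };

static long nr_mlock;	/* stand-in for the NR_MLOCK zone counter */
static int irq_toggles;	/* counts simulated interrupt-disable sections */

static void irq_save(void)    { irq_toggles++; }
static void irq_restore(void) { }

/* Old shape: test-and-clear with its own irq-off section for the counter. */
static void free_page_old(struct mock_page *page)
{
	if (page->mlocked) {
		page->mlocked = false;
		irq_save();
		nr_mlock--;
		irq_restore();
	}
	irq_save();		/* the free path disables interrupts anyway */
	/* ... the page would be placed on a free list here ... */
	irq_restore();
}

/* New shape: sample the flag early, do the update in the existing section. */
static void free_page_new(struct mock_page *page)
{
	bool clear_mlocked = page->mlocked;

	/* ... sanity checks would run here with interrupts still enabled ... */
	irq_save();
	if (clear_mlocked) {
		page->mlocked = false;	/* plain store; the page is being freed */
		nr_mlock--;
	}
	/* ... the page would be placed on a free list here ... */
	irq_restore();
}
```

For an mlocked page, the old shape enters two interrupt-disabled sections and the new shape enters one. If the sanity checks bail out before irq_save() in the new shape, the counter is never updated, which is the "potential weirdness" the changelog mentions.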

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 18/25] Do not disable interrupts in free_page_mlock()
  2009-04-20 22:20 ` [PATCH 18/25] Do not disable interrupts in free_page_mlock() Mel Gorman
@ 2009-04-21  7:55   ` Pekka Enberg
  2009-04-21  8:50     ` Mel Gorman
  2009-04-22  0:13   ` KOSAKI Motohiro
  1 sibling, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  7:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> free_page_mlock() tests and clears PG_mlocked using locked versions of the
> bit operations. If set, it disables interrupts to update counters and this
> happens on every page free even though interrupts are disabled very shortly
> afterwards a second time.  This is wasteful.
> 
> This patch splits what free_page_mlock() does. The bit check is still
> made. However, the update of counters is delayed until the interrupts are
> disabled and the non-lock version for clearing the bit is used. One potential
> weirdness with this split is that the counters do not get updated if the
> bad_page() check is triggered but a system showing bad pages is getting
> screwed already.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

> ---
>  mm/internal.h   |   11 +++--------
>  mm/page_alloc.c |    8 +++++++-
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 987bb03..58ec1bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>   */
>  static inline void free_page_mlock(struct page *page)
>  {
> -	if (unlikely(TestClearPageMlocked(page))) {
> -		unsigned long flags;
> -
> -		local_irq_save(flags);
> -		__dec_zone_page_state(page, NR_MLOCK);
> -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> -		local_irq_restore(flags);
> -	}

Maybe add a VM_BUG_ON(!PageMlocked(page))?

> +	__ClearPageMlocked(page);
> +	__dec_zone_page_state(page, NR_MLOCK);
> +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
>  }
>  
>  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 18/25] Do not disable interrupts in free_page_mlock()
  2009-04-21  7:55   ` Pekka Enberg
@ 2009-04-21  8:50     ` Mel Gorman
  2009-04-21 15:05       ` Christoph Lameter
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:50 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 10:55:07AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> > free_page_mlock() tests and clears PG_mlocked using locked versions of the
> > bit operations. If set, it disables interrupts to update counters and this
> > happens on every page free even though interrupts are disabled very shortly
> > afterwards a second time.  This is wasteful.
> > 
> > This patch splits what free_page_mlock() does. The bit check is still
> > made. However, the update of counters is delayed until the interrupts are
> > disabled and the non-lock version for clearing the bit is used. One potential
> > weirdness with this split is that the counters do not get updated if the
> > bad_page() check is triggered but a system showing bad pages is getting
> > screwed already.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> 
> > ---
> >  mm/internal.h   |   11 +++--------
> >  mm/page_alloc.c |    8 +++++++-
> >  2 files changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 987bb03..58ec1bc 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> >   */
> >  static inline void free_page_mlock(struct page *page)
> >  {
> > -	if (unlikely(TestClearPageMlocked(page))) {
> > -		unsigned long flags;
> > -
> > -		local_irq_save(flags);
> > -		__dec_zone_page_state(page, NR_MLOCK);
> > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > -		local_irq_restore(flags);
> > -	}
> 
> Maybe add a VM_BUG_ON(!PageMlocked(page))?
> 

We always check in the caller and I don't see callers to this function
expanding. I can add it if you insist but I don't think it'll catch
anything in this case.

> > +	__ClearPageMlocked(page);
> > +	__dec_zone_page_state(page, NR_MLOCK);
> > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> >  }
> >  
> >  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 18/25] Do not disable interrupts in free_page_mlock()
  2009-04-21  8:50     ` Mel Gorman
@ 2009-04-21 15:05       ` Christoph Lameter
  0 siblings, 0 replies; 105+ messages in thread
From: Christoph Lameter @ 2009-04-21 15:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, 21 Apr 2009, Mel Gorman wrote:

> > Maybe add a VM_BUG_ON(!PageMlocked(page))?
> >
>
> We always check in the caller and I don't see callers to this function
> expanding. I can add it if you insist but I don't think it'll catch
> anything in this case.

Don't add it. Pekka sometimes has this checkeritis that manifests itself
in VM_BUG_ONs and BUG_ONs all over the place.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 18/25] Do not disable interrupts in free_page_mlock()
  2009-04-20 22:20 ` [PATCH 18/25] Do not disable interrupts in free_page_mlock() Mel Gorman
  2009-04-21  7:55   ` Pekka Enberg
@ 2009-04-22  0:13   ` KOSAKI Motohiro
  2009-04-22 14:43     ` Lee Schermerhorn
  1 sibling, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22  0:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton, Lee Schermerhorn

(cc to Lee)

> free_page_mlock() tests and clears PG_mlocked using locked versions of the
> bit operations. If set, it disables interrupts to update counters and this
> happens on every page free even though interrupts are disabled very shortly
> afterwards a second time.  This is wasteful.
> 
> This patch splits what free_page_mlock() does. The bit check is still
> made. However, the update of counters is delayed until the interrupts are
> disabled and the non-lock version for clearing the bit is used. One potential
> weirdness with this split is that the counters do not get updated if the
> bad_page() check is triggered but a system showing bad pages is getting
> screwed already.

Looks good. Thanks, this is a good improvement.
I hope Lee will also ack this.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


  -kosaki


> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  mm/internal.h   |   11 +++--------
>  mm/page_alloc.c |    8 +++++++-
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 987bb03..58ec1bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>   */
>  static inline void free_page_mlock(struct page *page)
>  {
> -	if (unlikely(TestClearPageMlocked(page))) {
> -		unsigned long flags;
> -
> -		local_irq_save(flags);
> -		__dec_zone_page_state(page, NR_MLOCK);
> -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> -		local_irq_restore(flags);
> -	}
> +	__ClearPageMlocked(page);
> +	__dec_zone_page_state(page, NR_MLOCK);
> +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
>  }
>  
>  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a1ca038..bf4b8d9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
>  
>  static inline int free_pages_check(struct page *page)
>  {
> -	free_page_mlock(page);
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
>  		(page_count(page) != 0)  |
> @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	unsigned long flags;
>  	int i;
>  	int bad = 0;
> +	int clearMlocked = PageMlocked(page);
>  
>  	for (i = 0 ; i < (1 << order) ; ++i)
>  		bad += free_pages_check(page + i);
> @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	kernel_map_pages(page, 1 << order, 0);
>  
>  	local_irq_save(flags);
> +	if (unlikely(clearMlocked))
> +		free_page_mlock(page);
>  	__count_vm_events(PGFREE, 1 << order);
>  	free_one_page(page_zone(page), page, order,
>  					get_pageblock_migratetype(page));
> @@ -1018,6 +1020,7 @@ static void free_hot_cold_page(struct page *page, int cold)
>  	struct zone *zone = page_zone(page);
>  	struct per_cpu_pages *pcp;
>  	unsigned long flags;
> +	int clearMlocked = PageMlocked(page);
>  
>  	if (PageAnon(page))
>  		page->mapping = NULL;
> @@ -1033,7 +1036,10 @@ static void free_hot_cold_page(struct page *page, int cold)
>  
>  	pcp = &zone_pcp(zone, get_cpu())->pcp;
>  	local_irq_save(flags);
> +	if (unlikely(clearMlocked))
> +		free_page_mlock(page);
>  	__count_vm_event(PGFREE);
> +
>  	if (cold)
>  		list_add_tail(&page->lru, &pcp->list);
>  	else
> -- 
> 1.5.6.5
> 



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 18/25] Do not disable interrupts in free_page_mlock()
  2009-04-22  0:13   ` KOSAKI Motohiro
@ 2009-04-22 14:43     ` Lee Schermerhorn
  0 siblings, 0 replies; 105+ messages in thread
From: Lee Schermerhorn @ 2009-04-22 14:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 2009-04-22 at 09:13 +0900, KOSAKI Motohiro wrote:
> (cc to Lee)
> 
> > free_page_mlock() tests and clears PG_mlocked using locked versions of the
> > bit operations. If set, it disables interrupts to update counters and this
> > happens on every page free even though interrupts are disabled very shortly
> > afterwards a second time.  This is wasteful.
> > 
> > This patch splits what free_page_mlock() does. The bit check is still
> > made. However, the update of counters is delayed until the interrupts are
> > disabled and the non-lock version for clearing the bit is used. One potential
> > weirdness with this split is that the counters do not get updated if the
> > bad_page() check is triggered but a system showing bad pages is getting
> > screwed already.
> 
> Looks good. Thanks, this is a good improvement.
> I hope Lee will also ack this.
> 
> 	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> 
>   -kosaki
> 

Yes, this looks OK.  If we're racing on a page that is being freed or,
as Mel notes, some other condition triggers bad_page(), we've got bigger
problems than inaccurate statistics...

	Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

> 
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/internal.h   |   11 +++--------
> >  mm/page_alloc.c |    8 +++++++-
> >  2 files changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 987bb03..58ec1bc 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> >   */
> >  static inline void free_page_mlock(struct page *page)
> >  {
> > -	if (unlikely(TestClearPageMlocked(page))) {
> > -		unsigned long flags;
> > -
> > -		local_irq_save(flags);
> > -		__dec_zone_page_state(page, NR_MLOCK);
> > -		__count_vm_event(UNEVICTABLE_MLOCKFREED);
> > -		local_irq_restore(flags);
> > -	}
> > +	__ClearPageMlocked(page);
> > +	__dec_zone_page_state(page, NR_MLOCK);
> > +	__count_vm_event(UNEVICTABLE_MLOCKFREED);
> >  }
> >  
> >  #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a1ca038..bf4b8d9 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
> >  
> >  static inline int free_pages_check(struct page *page)
> >  {
> > -	free_page_mlock(page);
> >  	if (unlikely(page_mapcount(page) |
> >  		(page->mapping != NULL)  |
> >  		(page_count(page) != 0)  |
> > @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	unsigned long flags;
> >  	int i;
> >  	int bad = 0;
> > +	int clearMlocked = PageMlocked(page);
> >  
> >  	for (i = 0 ; i < (1 << order) ; ++i)
> >  		bad += free_pages_check(page + i);
> > @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	kernel_map_pages(page, 1 << order, 0);
> >  
> >  	local_irq_save(flags);
> > +	if (unlikely(clearMlocked))
> > +		free_page_mlock(page);
> >  	__count_vm_events(PGFREE, 1 << order);
> >  	free_one_page(page_zone(page), page, order,
> >  					get_pageblock_migratetype(page));
> > @@ -1018,6 +1020,7 @@ static void free_hot_cold_page(struct page *page, int cold)
> >  	struct zone *zone = page_zone(page);
> >  	struct per_cpu_pages *pcp;
> >  	unsigned long flags;
> > +	int clearMlocked = PageMlocked(page);
> >  
> >  	if (PageAnon(page))
> >  		page->mapping = NULL;
> > @@ -1033,7 +1036,10 @@ static void free_hot_cold_page(struct page *page, int cold)
> >  
> >  	pcp = &zone_pcp(zone, get_cpu())->pcp;
> >  	local_irq_save(flags);
> > +	if (unlikely(clearMlocked))
> > +		free_page_mlock(page);
> >  	__count_vm_event(PGFREE);
> > +
> >  	if (cold)
> >  		list_add_tail(&page->lru, &pcp->list);
> >  	else
> > -- 
> > 1.5.6.5
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 19/25] Do not setup zonelist cache when there is only one node
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (17 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 18/25] Do not disable interrupts in free_page_mlock() Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-20 22:20 ` [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks Mel Gorman
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

There is a zonelist cache which is used to track zones that are not in
the allowed cpuset or found to be recently full. This is to reduce cache
footprint on large machines. On smaller machines, it just incurs cost
for no gain. This patch only sets up the zonelist cache when more than
one NUMA node is online.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf4b8d9..ec01d8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1440,6 +1440,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	/* Determine in advance if the zonelist needs filtering */
 	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
 		zonelist_filter = 1;
+	if (num_online_nodes() > 1)
+		zonelist_filter = 1;
 
 zonelist_scan:
 	/*
@@ -1484,8 +1486,12 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
 		if (NUMA_BUILD && zonelist_filter) {
-			if (!did_zlc_setup) {
-				/* do zlc_setup after the first zone is tried */
+			if (!did_zlc_setup && num_online_nodes() > 1) {
+				/*
+				 * do zlc_setup after the first zone is tried
+				 * but only if there are multiple nodes to make
+				 * it worthwhile
+				 */
 				allowednodes = zlc_setup(zonelist, alloc_flags);
 				zlc_active = 1;
 			}
-- 
1.5.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (18 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 19/25] Do not setup zonelist cache when there is only one node Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-22  0:20   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 21/25] Use allocation flags as an index to the zone watermark Mel Gorman
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

A number of sanity checks are made on each page allocation and free
including that the page count is zero. page_count() checks for
compound pages and checks the count of the head page if true. However,
in these paths, we do not care if the page is compound or not as the
count of each tail page should also be zero.

This patch makes two changes to the use of page_count() in the free path. It
converts one check of page_count() to a VM_BUG_ON() as the count should
have been unconditionally checked earlier in the free path. It also avoids
checking for compound pages.

[mel@csn.ul.ie: Wrote changelog]
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec01d8f..376d848 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -425,7 +425,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 		return 0;
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
-		BUG_ON(page_count(buddy) != 0);
+		VM_BUG_ON(page_count(buddy) != 0);
 		return 1;
 	}
 	return 0;
@@ -501,7 +501,7 @@ static inline int free_pages_check(struct page *page)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0) |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
 		bad_page(page);
 		return 1;
@@ -646,7 +646,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
+		(atomic_read(&page->_count) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
 		bad_page(page);
 		return 1;
-- 
1.5.6.5
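
As an aside for readers of the archive, the difference between page_count()
and a raw read of the counter can be modelled in userspace roughly as below.
This is only a sketch: struct model_page and both helpers are hypothetical
stand-ins, not the kernel's definitions, and the plain int replaces the
atomic counter.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model; all names here are illustrative only. */
struct model_page {
	int count;                      /* stands in for page->_count */
	int is_tail;                    /* stands in for the PG_tail flag */
	struct model_page *first_page;  /* head page, used for tail pages */
};

/* Models the idea of page_count(): a tail page is redirected to its head. */
static inline int model_page_count(const struct model_page *page)
{
	if (page->is_tail)
		page = page->first_page;
	return page->count;
}

/* Models atomic_read(&page->_count): read this page's own counter only. */
static inline int model_raw_count(const struct model_page *page)
{
	return page->count;
}
```

In this model the raw read skips the tail-page test and the extra pointer
dereference, which is the saving the changelog describes. It also looks at
the counter of the page actually being checked rather than its head.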


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks
  2009-04-20 22:20 ` [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks Mel Gorman
@ 2009-04-22  0:20   ` KOSAKI Motohiro
  2009-04-22 10:09     ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22  0:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> A number of sanity checks are made on each page allocation and free
> including that the page count is zero. page_count() checks for
> compound pages and checks the count of the head page if true. However,
> in these paths, we do not care if the page is compound or not as the
> count of each tail page should also be zero.
> 
> This patch makes two changes to the use of page_count() in the free path. It
> converts one check of page_count() to a VM_BUG_ON() as the count should
> have been unconditionally checked earlier in the free path. It also avoids
> checking for compound pages.
> 
> [mel@csn.ul.ie: Wrote changelog]
> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  mm/page_alloc.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ec01d8f..376d848 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -425,7 +425,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>  		return 0;
>  
>  	if (PageBuddy(buddy) && page_order(buddy) == order) {
> -		BUG_ON(page_count(buddy) != 0);
> +		VM_BUG_ON(page_count(buddy) != 0);
>  		return 1;
>  	}
>  	return 0;
>

Looks good.


> @@ -501,7 +501,7 @@ static inline int free_pages_check(struct page *page)
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(atomic_read(&page->_count) != 0) |
>  		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
>  		bad_page(page);
>  		return 1;
> @@ -646,7 +646,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>  {
>  	if (unlikely(page_mapcount(page) |
>  		(page->mapping != NULL)  |
> -		(page_count(page) != 0)  |
> +		(atomic_read(&page->_count) != 0)  |
>  		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
>  		bad_page(page);
>  		return 1;


Would inserting VM_BUG_ON(PageTail(page)) be better?






^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks
  2009-04-22  0:20   ` KOSAKI Motohiro
@ 2009-04-22 10:09     ` Mel Gorman
  2009-04-22 10:41       ` KOSAKI Motohiro
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-22 10:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 09:20:40AM +0900, KOSAKI Motohiro wrote:
> > A number of sanity checks are made on each page allocation and free
> > including that the page count is zero. page_count() checks for
> > compound pages and checks the count of the head page if true. However,
> > in these paths, we do not care if the page is compound or not as the
> > count of each tail page should also be zero.
> > 
> > This patch makes two changes to the use of page_count() in the free path. It
> > converts one check of page_count() to a VM_BUG_ON() as the count should
> > have been unconditionally checked earlier in the free path. It also avoids
> > checking for compound pages.
> > 
> > [mel@csn.ul.ie: Wrote changelog]
> > Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  mm/page_alloc.c |    6 +++---
> >  1 files changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index ec01d8f..376d848 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -425,7 +425,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
> >  		return 0;
> >  
> >  	if (PageBuddy(buddy) && page_order(buddy) == order) {
> > -		BUG_ON(page_count(buddy) != 0);
> > +		VM_BUG_ON(page_count(buddy) != 0);
> >  		return 1;
> >  	}
> >  	return 0;
> >
> 
> Looks good.
> 
> 
> > @@ -501,7 +501,7 @@ static inline int free_pages_check(struct page *page)
> >  {
> >  	if (unlikely(page_mapcount(page) |
> >  		(page->mapping != NULL)  |
> > -		(page_count(page) != 0)  |
> > +		(atomic_read(&page->_count) != 0) |
> >  		(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
> >  		bad_page(page);
> >  		return 1;
> > @@ -646,7 +646,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> >  {
> >  	if (unlikely(page_mapcount(page) |
> >  		(page->mapping != NULL)  |
> > -		(page_count(page) != 0)  |
> > +		(atomic_read(&page->_count) != 0)  |
> >  		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
> >  		bad_page(page);
> >  		return 1;
> 
> 
> inserting VM_BUG_ON(PageTail(page)) is better?
> 

We already go one further with

#define PAGE_FLAGS_CHECK_AT_PREP        ((1 << NR_PAGEFLAGS) - 1)

...

if (.... | (page->flags & PAGE_FLAGS_CHECK_AT_PREP))
	bad_page(page);

PG_tail is in PAGE_FLAGS_CHECK_AT_PREP so we're already checking for it
and calling bad_page() if set.
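
A small userspace sketch of why an all-flags mask already covers the tail
bit. The enum below is a hypothetical, shortened flag list (the real
enum pageflags is longer and depends on the kernel configuration); only the
shape of the mask follows the define quoted above.

```c
#include <assert.h>

/* Hypothetical three-flag list, for illustration only. */
enum model_pageflags {
	MODEL_PG_locked,
	MODEL_PG_dirty,
	MODEL_PG_tail,
	MODEL_NR_PAGEFLAGS,	/* number of flags, not a flag itself */
};

/* Same construction as the quoted define: one bit set per known flag. */
#define MODEL_FLAGS_CHECK_AT_PREP	((1UL << MODEL_NR_PAGEFLAGS) - 1)

/* Returns non-zero if any checked flag is set, i.e. the page looks bad. */
static inline int model_flags_bad_at_prep(unsigned long flags)
{
	return (flags & MODEL_FLAGS_CHECK_AT_PREP) != 0;
}
```

Because every flag below MODEL_NR_PAGEFLAGS is in the mask, a page with the
tail bit set fails the check without a separate test for it.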

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks
  2009-04-22 10:09     ` Mel Gorman
@ 2009-04-22 10:41       ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22 10:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

>> inserting VM_BUG_ON(PageTail(page)) is better?
>>
>
> We already go one further with
>
> #define PAGE_FLAGS_CHECK_AT_PREP        ((1 << NR_PAGEFLAGS) - 1)
>
> ...
>
> if (.... | (page->flags & PAGE_FLAGS_CHECK_AT_PREP))
>        bad_page(page);
>
> PG_tail is in PAGE_FLAGS_CHECK_AT_PREP so we're already checking for it
> and calling bad_page() if set.

ok, good.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 21/25] Use allocation flags as an index to the zone watermark
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (19 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-22  0:26   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 22/25] Update NR_FREE_PAGES only as necessary Mel Gorman
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine which
watermark to use. This patch places the three watermarks in a union with an
array and renumbers the ALLOC_WMARK_* flags so that they can be used directly
as an index into that array, which removes the branches.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/mmzone.h |    8 +++++++-
 mm/page_alloc.c        |   18 ++++++++----------
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f82bdba..c1fa208 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,7 +275,13 @@ struct zone_reclaim_stat {
 
 struct zone {
 	/* Fields commonly accessed by the page allocator */
-	unsigned long		pages_min, pages_low, pages_high;
+	union {
+		struct {
+			unsigned long	pages_min, pages_low, pages_high;
+		};
+		unsigned long pages_mark[3];
+	};
+
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 376d848..e61867e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,10 +1157,13 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
+/* The WMARK bits are used as an index zone->pages_mark */
+#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x08 /* don't check watermarks at all */
+#define ALLOC_WMARK_MASK	0x07 /* Mask to get the watermark bits */
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #ifdef CONFIG_CPUSETS
@@ -1463,12 +1466,7 @@ zonelist_scan:
 
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			if (alloc_flags & ALLOC_WMARK_MIN)
-				mark = zone->pages_min;
-			else if (alloc_flags & ALLOC_WMARK_LOW)
-				mark = zone->pages_low;
-			else
-				mark = zone->pages_high;
+			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
 			if (!zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
-- 
1.5.6.5
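
For readers of the archive, the union trick in this patch can be tried out in
userspace with the sketch below. struct model_zone and model_pick_mark() are
hypothetical names; the flag values are the ones from the diff above. It
assumes a compiler that accepts anonymous structs and unions (C11 or the GNU
extension) and that the three same-typed members are laid out without padding,
which is what lets them alias the array.

```c
#include <assert.h>

/* Flag values as posted in this version of the patch. */
#define ALLOC_WMARK_MIN		0x00
#define ALLOC_WMARK_LOW		0x01
#define ALLOC_WMARK_HIGH	0x02
#define ALLOC_NO_WATERMARKS	0x08
#define ALLOC_WMARK_MASK	0x07

/* Hypothetical cut-down zone holding only the watermark fields. */
struct model_zone {
	union {
		struct {
			unsigned long pages_min, pages_low, pages_high;
		};
		unsigned long pages_mark[3];	/* aliases the three fields */
	};
};

/* Branch-free selection: the low bits of alloc_flags are the index. */
static inline unsigned long model_pick_mark(const struct model_zone *zone,
					    int alloc_flags)
{
	return zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
}
```

Existing code can keep using the named fields while the hot path indexes the
array, which is the containment argument made for the union.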


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH 21/25] Use allocation flags as an index to the zone watermark
  2009-04-20 22:20 ` [PATCH 21/25] Use allocation flags as an index to the zone watermark Mel Gorman
@ 2009-04-22  0:26   ` KOSAKI Motohiro
  2009-04-22  0:41     ` David Rientjes
  2009-04-22 10:21     ` Mel Gorman
  0 siblings, 2 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22  0:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
> pages_min, pages_low or pages_high is used as the zone watermark when
> allocating the pages. Two branches in the allocator hotpath determine which
> watermark to use. This patch uses the flags as an array index and places
> the three watermarks in a union with an array so it can be offset. This
> means the flags can be used as an array index and reduces the branches
> taken.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> ---
>  include/linux/mmzone.h |    8 +++++++-
>  mm/page_alloc.c        |   18 ++++++++----------
>  2 files changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f82bdba..c1fa208 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -275,7 +275,13 @@ struct zone_reclaim_stat {
>  
>  struct zone {
>  	/* Fields commonly accessed by the page allocator */
> -	unsigned long		pages_min, pages_low, pages_high;
> +	union {
> +		struct {
> +			unsigned long	pages_min, pages_low, pages_high;
> +		};
> +		unsigned long pages_mark[3];
> +	};
> +

Hmmm... I don't like the union hack.
Why can't we change all callers to use pages_mark?




>  	/*
>  	 * We don't know if the memory that we're going to allocate will be freeable
>  	 * or/and it will be released eventually, so to avoid totally wasting several
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 376d848..e61867e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1157,10 +1157,13 @@ failed:
>  	return NULL;
>  }
>  
> -#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
> -#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
> -#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
> -#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> +/* The WMARK bits are used as an index zone->pages_mark */
> +#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
> +#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
> +#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
> +#define ALLOC_NO_WATERMARKS	0x08 /* don't check watermarks at all */
> +#define ALLOC_WMARK_MASK	0x07 /* Mask to get the watermark bits */

The watermark values only use two bits, but the mask is defined with three bits (0x07). Why?


> +
>  #define ALLOC_HARDER		0x10 /* try to alloc harder */
>  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
>  #ifdef CONFIG_CPUSETS
> @@ -1463,12 +1466,7 @@ zonelist_scan:
>  
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;
> -			if (alloc_flags & ALLOC_WMARK_MIN)
> -				mark = zone->pages_min;
> -			else if (alloc_flags & ALLOC_WMARK_LOW)
> -				mark = zone->pages_low;
> -			else
> -				mark = zone->pages_high;
> +			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
>  			if (!zone_watermark_ok(zone, order, mark,
>  				    classzone_idx, alloc_flags)) {
>  				if (!zone_reclaim_mode ||
> -- 
> 1.5.6.5
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 21/25] Use allocation flags as an index to the zone watermark
  2009-04-22  0:26   ` KOSAKI Motohiro
@ 2009-04-22  0:41     ` David Rientjes
  2009-04-22 10:21     ` Mel Gorman
  1 sibling, 0 replies; 105+ messages in thread
From: David Rientjes @ 2009-04-22  0:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

On Wed, 22 Apr 2009, KOSAKI Motohiro wrote:

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 376d848..e61867e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1157,10 +1157,13 @@ failed:
> >  	return NULL;
> >  }
> >  
> > -#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
> > -#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
> > -#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
> > -#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> > +/* The WMARK bits are used as an index zone->pages_mark */
> > +#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
> > +#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
> > +#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
> > +#define ALLOC_NO_WATERMARKS	0x08 /* don't check watermarks at all */
> > +#define ALLOC_WMARK_MASK	0x07 /* Mask to get the watermark bits */
> 
> the mask only use two bit. but mask definition is three bit (0x07), why?
> 

I think it would probably be better to simply use

	#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)

here and define ALLOC_NO_WATERMARKS to be 0x04.
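
A quick userspace check of that arrangement, as a sketch only:
model_wmark_index() is a hypothetical helper, and the ALLOC_HARDER value is
copied from the quoted diff.

```c
#include <assert.h>

#define ALLOC_WMARK_MIN		0x00
#define ALLOC_WMARK_LOW		0x01
#define ALLOC_WMARK_HIGH	0x02
#define ALLOC_NO_WATERMARKS	0x04	/* first bit above the index values */
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)	/* 0x03 */
#define ALLOC_HARDER		0x10

/* Extracts the watermark index and ignores all higher flag bits. */
static inline int model_wmark_index(int alloc_flags)
{
	return alloc_flags & ALLOC_WMARK_MASK;
}
```

Deriving the mask from ALLOC_NO_WATERMARKS keeps the two definitions in step
if the flag is ever moved. Note that MIN, LOW and HIGH are enumerated values
rather than independent bits, so callers are expected to pass exactly one of
them.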


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 21/25] Use allocation flags as an index to the zone watermark
  2009-04-22  0:26   ` KOSAKI Motohiro
  2009-04-22  0:41     ` David Rientjes
@ 2009-04-22 10:21     ` Mel Gorman
  2009-04-22 10:23       ` Mel Gorman
  1 sibling, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-22 10:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 09:26:10AM +0900, KOSAKI Motohiro wrote:
> > ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
> > pages_min, pages_low or pages_high is used as the zone watermark when
> > allocating the pages. Two branches in the allocator hotpath determine which
> > watermark to use. This patch uses the flags as an array index and places
> > the three watermarks in a union with an array so it can be offset. This
> > means the flags can be used as an array index and reduces the branches
> > taken.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > ---
> >  include/linux/mmzone.h |    8 +++++++-
> >  mm/page_alloc.c        |   18 ++++++++----------
> >  2 files changed, 15 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index f82bdba..c1fa208 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -275,7 +275,13 @@ struct zone_reclaim_stat {
> >  
> >  struct zone {
> >  	/* Fields commonly accessed by the page allocator */
> > -	unsigned long		pages_min, pages_low, pages_high;
> > +	union {
> > +		struct {
> > +			unsigned long	pages_min, pages_low, pages_high;
> > +		};
> > +		unsigned long pages_mark[3];
> > +	};
> > +
> 
> hmmm... I don't like union hack. 
> Why can't we change all caller to use page_mark?
> 

Because pages_min, pages_low and pages_high are such well understood concepts
and their current use is easy to understand. It could all be changed to getters
and setters with a patch to update all call sites, or to symbolic names
for the index with forced use of the array, but when I started doing that,
the code looked worse to my eye, not better.  The union is a relatively
small hack but well contained within the one place that cares.

> >  	/*
> >  	 * We don't know if the memory that we're going to allocate will be freeable
> >  	 * or/and it will be released eventually, so to avoid totally wasting several
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 376d848..e61867e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1157,10 +1157,13 @@ failed:
> >  	return NULL;
> >  }
> >  
> > -#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
> > -#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
> > -#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
> > -#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> > +/* The WMARK bits are used as an index zone->pages_mark */
> > +#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
> > +#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
> > +#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
> > +#define ALLOC_NO_WATERMARKS	0x08 /* don't check watermarks at all */
> > +#define ALLOC_WMARK_MASK	0x07 /* Mask to get the watermark bits */
> 
> the mask only use two bit. but mask definition is three bit (0x07), why?
> 

I was thinking of ALLOC_NO_WATERMARKS-1 and that all the lower bits must be
cleared, so I left the value at 0x08 to occupy the same number of bits even though
that wasn't necessary. As suggested I'll change this to

ALLOC_NO_WATERMARKS	0x04
WMARK_MASK		(1-ALLOC_NO_WATERMARKS)

> 
> > +
> >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> >  #ifdef CONFIG_CPUSETS
> > @@ -1463,12 +1466,7 @@ zonelist_scan:
> >  
> >  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> >  			unsigned long mark;
> > -			if (alloc_flags & ALLOC_WMARK_MIN)
> > -				mark = zone->pages_min;
> > -			else if (alloc_flags & ALLOC_WMARK_LOW)
> > -				mark = zone->pages_low;
> > -			else
> > -				mark = zone->pages_high;
> > +			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
> >  			if (!zone_watermark_ok(zone, order, mark,
> >  				    classzone_idx, alloc_flags)) {
> >  				if (!zone_reclaim_mode ||
> > -- 
> > 1.5.6.5
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 21/25] Use allocation flags as an index to the zone watermark
  2009-04-22 10:21     ` Mel Gorman
@ 2009-04-22 10:23       ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-22 10:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Memory Management List, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

On Wed, Apr 22, 2009 at 11:21:17AM +0100, Mel Gorman wrote:
> On Wed, Apr 22, 2009 at 09:26:10AM +0900, KOSAKI Motohiro wrote:
> > > ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
> > > pages_min, pages_low or pages_high is used as the zone watermark when
> > > allocating the pages. Two branches in the allocator hotpath determine which
> > > watermark to use. This patch uses the flags as an array index and places
> > > the three watermarks in a union with an array so it can be offset. This
> > > means the flags can be used as an array index and reduces the branches
> > > taken.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > > ---
> > >  include/linux/mmzone.h |    8 +++++++-
> > >  mm/page_alloc.c        |   18 ++++++++----------
> > >  2 files changed, 15 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index f82bdba..c1fa208 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -275,7 +275,13 @@ struct zone_reclaim_stat {
> > >  
> > >  struct zone {
> > >  	/* Fields commonly accessed by the page allocator */
> > > -	unsigned long		pages_min, pages_low, pages_high;
> > > +	union {
> > > +		struct {
> > > +			unsigned long	pages_min, pages_low, pages_high;
> > > +		};
> > > +		unsigned long pages_mark[3];
> > > +	};
> > > +
> > 
> > hmmm... I don't like union hack. 
> > Why can't we change all caller to use page_mark?
> > 
> 
> Because pages_min, pages_low and pages_high are such well understood concepts
> and their current use is easy to understand. It could all be changed to getters
> and setters with a patch to update all call sites or having symbolic names
> for the index and forcing the use of the array but when I started doing that,
> the code looked worse to my eye, not better.  The union is a relatively
> small hack but well contained within the one place that cares.
> 
> > >  	/*
> > >  	 * We don't know if the memory that we're going to allocate will be freeable
> > >  	 * or/and it will be released eventually, so to avoid totally wasting several
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 376d848..e61867e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1157,10 +1157,13 @@ failed:
> > >  	return NULL;
> > >  }
> > >  
> > > -#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
> > > -#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
> > > -#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
> > > -#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
> > > +/* The WMARK bits are used as an index zone->pages_mark */
> > > +#define ALLOC_WMARK_MIN		0x00 /* use pages_min watermark */
> > > +#define ALLOC_WMARK_LOW		0x01 /* use pages_low watermark */
> > > +#define ALLOC_WMARK_HIGH	0x02 /* use pages_high watermark */
> > > +#define ALLOC_NO_WATERMARKS	0x08 /* don't check watermarks at all */
> > > +#define ALLOC_WMARK_MASK	0x07 /* Mask to get the watermark bits */
> > 
> > the mask only use two bit. but mask definition is three bit (0x07), why?
> > 
> 
> I was thinking ALLOC_NO_WATERMARKS-1 and that all the lower bits must be
> cleared and left the value of 0x08 to occupy same number of bits even though
> that wasn't necessary. As suggested I'll change this to
> 
> ALLOC_NO_WATERMARKS	0x04
> WMARK_MASK		(1-ALLOC_NO_WATERMARKS)
> 

ALLOC_NO_WATERMARKS-1 obviously is what I meant
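
To make the difference concrete, here is what the two spellings evaluate to,
as a userspace sketch. The macro and helper names are hypothetical, and the
bitwise result for the negative value assumes the usual two's complement
representation.

```c
#include <assert.h>

#define ALLOC_NO_WATERMARKS	0x04

/* As typed in the previous mail: 1 - 4 = -3, i.e. every bit except bit 1. */
#define WMARK_MASK_TYPO		(1 - ALLOC_NO_WATERMARKS)
/* As intended: 4 - 1 = 3, i.e. only the two low bits. */
#define WMARK_MASK_FIXED	(ALLOC_NO_WATERMARKS - 1)

/* Applies a mask to the allocation flags. */
static inline int model_apply_mask(int alloc_flags, int mask)
{
	return alloc_flags & mask;
}
```

With flags 0x12 (ALLOC_HARDER plus the HIGH index), the mistyped mask yields
0x10, which would index far outside a three-entry array, while the intended
mask yields 2.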

> > 
> > > +
> > >  #define ALLOC_HARDER		0x10 /* try to alloc harder */
> > >  #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
> > >  #ifdef CONFIG_CPUSETS
> > > @@ -1463,12 +1466,7 @@ zonelist_scan:
> > >  
> > >  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> > >  			unsigned long mark;
> > > -			if (alloc_flags & ALLOC_WMARK_MIN)
> > > -				mark = zone->pages_min;
> > > -			else if (alloc_flags & ALLOC_WMARK_LOW)
> > > -				mark = zone->pages_low;
> > > -			else
> > > -				mark = zone->pages_high;
> > > +			mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
> > >  			if (!zone_watermark_ok(zone, order, mark,
> > >  				    classzone_idx, alloc_flags)) {
> > >  				if (!zone_reclaim_mode ||
> > > -- 
> > > 1.5.6.5
> > > 
> > 
> > 
> > 
> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 22/25] Update NR_FREE_PAGES only as necessary
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (20 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 21/25] Use allocation flags as an index to the zone watermark Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-22  0:35   ` KOSAKI Motohiro
  2009-04-20 22:20 ` [PATCH 23/25] Get the pageblock migratetype without disabling interrupts Mel Gorman
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

When pages are being freed to the buddy allocator, the zone
NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
freeing, it's updated once per page. This retouches cache lines more
than necessary. Update the counter once per per-cpu bulk free instead.
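
The effect can be illustrated with a userspace sketch. struct model_zone_stat
and the helpers below are hypothetical; counter_updates simply counts how
often the shared counter is written, as a rough proxy for how often its cache
line is touched.

```c
#include <assert.h>

/* Hypothetical stand-in for a zone's free-page accounting. */
struct model_zone_stat {
	long nr_free_pages;		/* models the NR_FREE_PAGES counter */
	unsigned long counter_updates;	/* writes made to that counter */
};

/* Models __mod_zone_page_state(): one write to the shared counter. */
static inline void model_mod_state(struct model_zone_stat *z, long delta)
{
	z->nr_free_pages += delta;
	z->counter_updates++;
}

/* Old behaviour: the counter is adjusted once for every page freed. */
static inline void model_free_bulk_per_page(struct model_zone_stat *z, int count)
{
	while (count--)
		model_mod_state(z, 1);
}

/* New behaviour: one adjustment for the whole batch of order-0 pages. */
static inline void model_free_bulk_batched(struct model_zone_stat *z, int count)
{
	model_mod_state(z, count);
	/* the per-page freeing work itself would follow here */
}
```

Both variants end with the same counter value; only the number of writes
differs.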

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e61867e..6bcaf08 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
 		int migratetype)
 {
 	unsigned long page_idx;
-	int order_size = 1 << order;
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
@@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
-	VM_BUG_ON(page_idx & (order_size - 1));
+	VM_BUG_ON(page_idx & ((1 << order) - 1));
 	VM_BUG_ON(bad_range(zone, page));
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
 		struct page *buddy;
@@ -528,6 +526,8 @@ static void free_pages_bulk(struct zone *zone, int count,
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
+
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count--) {
 		struct page *page;
 
@@ -546,6 +546,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	spin_lock(&zone->lock);
 	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
 	zone->pages_scanned = 0;
+
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
 	spin_unlock(&zone->lock);
 }
@@ -690,7 +692,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		list_del(&page->lru);
 		rmv_page_order(page);
 		area->nr_free--;
-		__mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
 		expand(zone, page, order, current_order, area, migratetype);
 		return page;
 	}
@@ -830,8 +831,6 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
-			__mod_zone_page_state(zone, NR_FREE_PAGES,
-							-(1UL << order));
 
 			if (current_order == pageblock_order)
 				set_pageblock_migratetype(page,
@@ -905,6 +904,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		set_page_private(page, migratetype);
 		list = &page->lru;
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock(&zone->lock);
 	return i;
 }
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 105+ messages in thread
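
The batching idea in this patch can be sketched outside the kernel. The following is only a toy model, not the kernel code: `toy_zone`, `toy_mod_free_pages()` and the other `toy_*` names are hypothetical stand-ins, and `counter_writes` exists purely to make the number of counter updates visible.

```c
#include <assert.h>

/* Hypothetical stand-in for a zone: only the counter and a write tally. */
struct toy_zone {
	long nr_free_pages;   /* plays the role of NR_FREE_PAGES */
	long counter_writes;  /* how many times the counter was touched */
};

/* Stand-in for __mod_zone_page_state(); each call is one counter write. */
static void toy_mod_free_pages(struct toy_zone *zone, long delta)
{
	zone->nr_free_pages += delta;
	zone->counter_writes++;
}

/* Before the patch: the counter is adjusted once per order-0 page freed. */
static void toy_free_bulk_per_page(struct toy_zone *zone, int count)
{
	assert(count >= 0);
	while (count--)
		toy_mod_free_pages(zone, 1);
}

/* After the patch: a single adjustment covers the whole batch. */
static void toy_free_bulk_batched(struct toy_zone *zone, int count)
{
	assert(count >= 0);
	toy_mod_free_pages(zone, count);
	/* the per-page freelist work would follow here */
}

/*
 * Allocation side: after taking nr_blocks blocks of 2^order pages each,
 * subtract them in one go, mirroring -(i << order) in rmqueue_bulk().
 */
static void toy_rmqueue_bulk(struct toy_zone *zone, unsigned int order,
			     long nr_blocks)
{
	assert(nr_blocks >= 0);
	toy_mod_free_pages(zone, -(nr_blocks << order));
}
```

Both free variants leave the counter at the same value; the batched one simply writes to it once per batch instead of once per page.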

* Re: [PATCH 22/25] Update NR_FREE_PAGES only as necessary
  2009-04-20 22:20 ` [PATCH 22/25] Update NR_FREE_PAGES only as necessary Mel Gorman
@ 2009-04-22  0:35   ` KOSAKI Motohiro
  0 siblings, 0 replies; 105+ messages in thread
From: KOSAKI Motohiro @ 2009-04-22  0:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Linux Memory Management List, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Pekka Enberg, Andrew Morton

> When pages are being freed to the buddy allocator, the zone
> NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> freeing, it's updated once per page. This retouches cache lines more
> than necessary. Update the counters once per per-cpu bulk free.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 23/25] Get the pageblock migratetype without disabling interrupts
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (21 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 22/25] Update NR_FREE_PAGES only as necessary Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-20 22:20 ` [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading Mel Gorman
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Local interrupts are disabled when freeing pages to the PCP list. Part
of that free looks up the migratetype of the pageblock the page is in,
and currently does so with interrupts disabled. This patch does the
lookup with interrupts still enabled. The impact is that a page may be
freed to the wrong list if its pageblock changes type in between, but as
that block is then already considered mixed from an anti-fragmentation
perspective, it's not of vital importance.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6bcaf08..acb0fac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1035,6 +1035,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	kernel_map_pages(page, 1, 0);
 
 	pcp = &zone_pcp(zone, get_cpu())->pcp;
+	set_page_private(page, get_pageblock_migratetype(page));
 	local_irq_save(flags);
 	if (unlikely(clearMlocked))
 		free_page_mlock(page);
@@ -1044,7 +1045,6 @@ static void free_hot_cold_page(struct page *page, int cold)
 		list_add_tail(&page->lru, &pcp->list);
 	else
 		list_add(&page->lru, &pcp->list);
-	set_page_private(page, get_pageblock_migratetype(page));
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 105+ messages in thread
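
The reordering in this patch can be illustrated with a toy model. Nothing below is kernel code: `toy_page`, `toy_cpu` and the `toy_*` functions are hypothetical, and `irqs_off` is just a flag standing in for the window between local_irq_save() and the matching restore.

```c
#include <assert.h>

struct toy_page {
	int block_type;    /* migratetype of the containing pageblock */
	int private_type;  /* cached copy, in the role of page_private() */
};

struct toy_cpu {
	int irqs_off;          /* 1 while "interrupts" are disabled */
	int lookups_irqs_off;  /* lookups done inside the critical section */
};

/* Stand-in for get_pageblock_migratetype(); records where it was called. */
static int toy_get_block_type(struct toy_cpu *cpu, const struct toy_page *page)
{
	if (cpu->irqs_off)
		cpu->lookups_irqs_off++;
	return page->block_type;
}

/* Old ordering: the lookup happens inside the irq-disabled section. */
static void toy_free_old(struct toy_cpu *cpu, struct toy_page *page)
{
	assert(!cpu->irqs_off);
	cpu->irqs_off = 1;
	/* the PCP list manipulation would happen here */
	page->private_type = toy_get_block_type(cpu, page);
	cpu->irqs_off = 0;
}

/* New ordering: the lookup is done before interrupts are disabled. */
static void toy_free_new(struct toy_cpu *cpu, struct toy_page *page)
{
	assert(!cpu->irqs_off);
	page->private_type = toy_get_block_type(cpu, page);
	cpu->irqs_off = 1;
	/* the PCP list manipulation would happen here */
	cpu->irqs_off = 0;
}
```

The cached type is the same either way in this single-threaded model. What the model cannot show is the race the changelog accepts: with the lookup outside the critical section, the pageblock type may change before the page reaches the list.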

* [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading.
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (22 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 23/25] Get the pageblock migratetype without disabling interrupts Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21  8:04   ` Pekka Enberg
  2009-04-20 22:20 ` [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths Mel Gorman
  2009-04-21  8:13 ` [PATCH 00/25] Cleanup and optimise the page allocator V6 Pekka Enberg
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

Re-sort the GFP flags after __GFP_MOVABLE got redefined so how the bits
are used is a bit clearer.

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/gfp.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c7429b8..cfc1dd3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -47,11 +47,11 @@ struct vm_area_struct;
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
-#define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
-#define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_NOMEMALLOC  ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
+#define __GFP_HARDWALL    ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_THISNODE	  ((__force gfp_t)0x40000u) /* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE	((__force gfp_t)0x100000u)  /* Page is movable */
+#define __GFP_MOVABLE	  ((__force gfp_t)0x100000u)/* Page is movable */
 
 #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 105+ messages in thread
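
As a side note on the definitions this hunk touches: each flag occupies one bit, and __GFP_BITS_SHIFT of 21 leaves room for all of them under the mask. A small userspace sketch, using the values from the diff but hypothetical `TOY_*` names and a plain `unsigned int` in place of the `__force gfp_t` casts, can check both properties.

```c
#include <assert.h>

typedef unsigned int toy_gfp_t;

/* Values copied from the hunk above; the TOY_ names are illustrative. */
#define TOY_GFP_NOMEMALLOC   0x10000u
#define TOY_GFP_HARDWALL     0x20000u
#define TOY_GFP_THISNODE     0x40000u
#define TOY_GFP_RECLAIMABLE  0x80000u
#define TOY_GFP_MOVABLE      0x100000u

#define TOY_GFP_BITS_SHIFT   21
#define TOY_GFP_BITS_MASK    ((toy_gfp_t)((1u << TOY_GFP_BITS_SHIFT) - 1))

/* True if exactly one bit is set. */
static int toy_is_single_bit(toy_gfp_t flag)
{
	return flag != 0 && (flag & (flag - 1)) == 0;
}

/* True if the flag lies entirely within the mask. */
static int toy_fits_mask(toy_gfp_t flag)
{
	return (flag & ~TOY_GFP_BITS_MASK) == 0;
}

/* Check both properties for one flag. */
static void toy_check_flag(toy_gfp_t flag)
{
	assert(toy_is_single_bit(flag));
	assert(toy_fits_mask(flag));
}
```

The highest flag in the hunk, 0x100000, is bit 20, so a shift of 21 is the smallest value whose mask still covers it.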

* Re: [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading.
  2009-04-20 22:20 ` [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading Mel Gorman
@ 2009-04-21  8:04   ` Pekka Enberg
  2009-04-21  8:52     ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  8:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> Re-sort the GFP flags after __GFP_MOVABLE got redefined so how the bits
> are used is a bit clearer.

I'm confused. AFAICT, this patch just fixes up some whitespace issues
but doesn't actually "sort" anything?

> 
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>

The "From" tag should be the first line of the patch.

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/gfp.h |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c7429b8..cfc1dd3 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -47,11 +47,11 @@ struct vm_area_struct;
>  #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
>  #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
>  #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
> -#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> -#define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> -#define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
> +#define __GFP_NOMEMALLOC  ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> +#define __GFP_HARDWALL    ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> +#define __GFP_THISNODE	  ((__force gfp_t)0x40000u) /* No fallback, no policies */
>  #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
> -#define __GFP_MOVABLE	((__force gfp_t)0x100000u)  /* Page is movable */
> +#define __GFP_MOVABLE	  ((__force gfp_t)0x100000u)/* Page is movable */
>  
>  #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading.
  2009-04-21  8:04   ` Pekka Enberg
@ 2009-04-21  8:52     ` Mel Gorman
  2009-04-21 15:08       ` Christoph Lameter
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  8:52 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 11:04:03AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> > Re-sort the GFP flags after __GFP_MOVABLE got redefined so how the bits
> > are used is a bit clearer.
> 
> I'm confused. AFAICT, this patch just fixes up some whitespace issues
> but doesn't actually "sort" anything?
> 

Hmm, doh. This resorted when another patch existed that no longer exists
due to difficulties. This patch only fixes whitespace now but I didn't fix
the changelog.  I can either move it to the next set altogether where it
does resort things or drop it on the grounds whitespace patches just muck
with changelogs. I'm leaning towards the latter.

> > 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> The "From" tag should be the first line of the patch.
> 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/gfp.h |    8 ++++----
> >  1 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index c7429b8..cfc1dd3 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -47,11 +47,11 @@ struct vm_area_struct;
> >  #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* See above */
> >  #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
> >  #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
> > -#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> > -#define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> > -#define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
> > +#define __GFP_NOMEMALLOC  ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> > +#define __GFP_HARDWALL    ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> > +#define __GFP_THISNODE	  ((__force gfp_t)0x40000u) /* No fallback, no policies */
> >  #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
> > -#define __GFP_MOVABLE	((__force gfp_t)0x100000u)  /* Page is movable */
> > +#define __GFP_MOVABLE	  ((__force gfp_t)0x100000u)/* Page is movable */
> >  
> >  #define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
> >  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading.
  2009-04-21  8:52     ` Mel Gorman
@ 2009-04-21 15:08       ` Christoph Lameter
  2009-04-21 15:24         ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: Christoph Lameter @ 2009-04-21 15:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, 21 Apr 2009, Mel Gorman wrote:

> Hmm, doh. This resorted when another patch existed that no longer exists
> due to difficulties. This patch only fixes whitespace now but I didn't fix
> the changelog.  I can either move it to the next set altogether where it
> does resort things or drop it on the grounds whitespace patches just muck
> with changelogs. I'm leaning towards the latter.

Where were we with that other patch? I vaguely recall reworking the
other patch (gfp_zone I believe) to be calculated at compile time. Did I
drop this?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading.
  2009-04-21 15:08       ` Christoph Lameter
@ 2009-04-21 15:24         ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-21 15:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 11:08:46AM -0400, Christoph Lameter wrote:
> On Tue, 21 Apr 2009, Mel Gorman wrote:
> 
> > Hmm, doh. This resorted when another patch existed that no longer exists
> > due to difficulties. This patch only fixes whitespace now but I didn't fix
> > the changelog.  I can either move it to the next set altogether where it
> > does resort things or drop it on the grounds whitespace patches just muck
> > with changelogs. I'm leaning towards the latter.
> 
> Where were we with that other patch? I vaguely recall reworking the
> other patch (gfp_zone I believe) to be calculated at compile time. Did I
> drop this?
> 

No, you didn't. I still have a version that's promising but it doesn't
currently work, nor has it been tested to show it really helps. It's on the
long finger till pass 2 where it'll be high on the list to sort out.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (23 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading Mel Gorman
@ 2009-04-20 22:20 ` Mel Gorman
  2009-04-21  8:08   ` Pekka Enberg
  2009-04-21  8:13 ` [PATCH 00/25] Cleanup and optimise the page allocator V6 Pekka Enberg
  25 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-20 22:20 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Pekka Enberg, Andrew Morton

From: Christoph Lameter <cl@linux-foundation.org>

num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered based
on cpusets or the zonelist cache. This is actually a heavy function and
touches a number of cache lines.

This patch stores the number of online nodes at boot time and updates the
value when nodes get onlined and offlined. The value is then used in a
number of important paths in place of num_online_nodes().

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/nodemask.h |   15 ++++++++++++++-
 mm/hugetlb.c             |    4 ++--
 mm/page_alloc.c          |   12 +++++++-----
 mm/slab.c                |    2 +-
 mm/slub.c                |    2 +-
 net/sunrpc/svc.c         |    2 +-
 6 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..474e73e 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
 #define next_online_node(nid)	next_node((nid), node_states[N_ONLINE])
 
 extern int nr_node_ids;
+extern int nr_online_nodes;
+
+static inline void node_set_online(int nid)
+{
+	node_set_state(nid, N_ONLINE);
+	nr_online_nodes = num_node_state(N_ONLINE);
+}
+
+static inline void node_set_offline(int nid)
+{
+	node_clear_state(nid, N_ONLINE);
+	nr_online_nodes = num_node_state(N_ONLINE);
+}
 #else
 
 static inline int node_state(int node, enum node_states state)
@@ -434,7 +447,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1
-
+#define nr_online_nodes		1
 #endif
 
 #define node_online_map 	node_states[N_ONLINE]
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1234486..c9a404c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -875,7 +875,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	 * can no longer free unreserved surplus pages. This occurs when
 	 * the nodes with surplus pages have no free pages.
 	 */
-	unsigned long remaining_iterations = num_online_nodes();
+	unsigned long remaining_iterations = nr_online_nodes;
 
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
@@ -904,7 +904,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 			h->surplus_huge_pages--;
 			h->surplus_huge_pages_node[nid]--;
 			nr_pages--;
-			remaining_iterations = num_online_nodes();
+			remaining_iterations = nr_online_nodes;
 		}
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0fac..c1571e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,7 +165,9 @@ static unsigned long __meminitdata dma_reserve;
 
 #if MAX_NUMNODES > 1
 int nr_node_ids __read_mostly = MAX_NUMNODES;
+int nr_online_nodes __read_mostly = 1;
 EXPORT_SYMBOL(nr_node_ids);
+EXPORT_SYMBOL(nr_online_nodes);
 #endif
 
 int page_group_by_mobility_disabled __read_mostly;
@@ -1443,7 +1445,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	/* Determine in advance if the zonelist needs filtering */
 	if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
 		zonelist_filter = 1;
-	if (num_online_nodes() > 1)
+	if (nr_online_nodes > 1)
 		zonelist_filter = 1;
 
 zonelist_scan:
@@ -1484,7 +1486,7 @@ this_zone_full:
 			zlc_mark_zone_full(zonelist, z);
 try_next_zone:
 		if (NUMA_BUILD && zonelist_filter) {
-			if (!did_zlc_setup && num_online_nodes() > 1) {
+			if (!did_zlc_setup && nr_online_nodes > 1) {
 				/*
 				 * do zlc_setup after the first zone is tried
 				 * but only if there are multiple nodes to make
@@ -2277,7 +2279,7 @@ int numa_zonelist_order_handler(ctl_table *table, int write,
 }
 
 
-#define MAX_NODE_LOAD (num_online_nodes())
+#define MAX_NODE_LOAD (nr_online_nodes)
 static int node_load[MAX_NUMNODES];
 
 /**
@@ -2486,7 +2488,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
-	load = num_online_nodes();
+	load = nr_online_nodes;
 	prev_node = local_node;
 	nodes_clear(used_mask);
 
@@ -2637,7 +2639,7 @@ void build_all_zonelists(void)
 
 	printk("Built %i zonelists in %s order, mobility grouping %s.  "
 		"Total pages: %ld\n",
-			num_online_nodes(),
+			nr_online_nodes,
 			zonelist_order_name[current_zonelist_order],
 			page_group_by_mobility_disabled ? "off" : "on",
 			vm_total_pages);
diff --git a/mm/slab.c b/mm/slab.c
index 1c680e8..41d1343 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3579,7 +3579,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
 	 * variable to skip the call, which is mostly likely to be present in
 	 * the cache.
 	 */
-	if (numa_platform && cache_free_alien(cachep, objp))
+	if (numa_platform > 1 && cache_free_alien(cachep, objp))
 		return;
 
 	if (likely(ac->avail < ac->limit)) {
diff --git a/mm/slub.c b/mm/slub.c
index 93f5fb0..8e181e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3714,7 +3714,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
 						 to_cpumask(l->cpus));
 		}
 
-		if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+		if (nr_online_nodes > 1 && !nodes_empty(l->nodes) &&
 				len < PAGE_SIZE - 60) {
 			len += sprintf(buf + len, " nodes=");
 			len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 8847add..5ed8931 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -124,7 +124,7 @@ svc_pool_map_choose_mode(void)
 {
 	unsigned int node;
 
-	if (num_online_nodes() > 1) {
+	if (nr_online_nodes > 1) {
 		/*
 		 * Actually have multiple NUMA nodes,
 		 * so split pools on NUMA node boundaries
-- 
1.5.6.5

^ permalink raw reply related	[flat|nested] 105+ messages in thread
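
The caching scheme described in this changelog can be sketched in userspace. This is a simplified model under stated assumptions, not the kernel implementation: the node map is a single 64-bit word, all `toy_*` names are hypothetical, and the cached count starts at 0 here whereas the patch initialises nr_online_nodes to 1.

```c
#include <assert.h>

#define TOY_MAX_NODES 64

static unsigned long long toy_online_map;  /* one bit per online node */
static int toy_nr_online_nodes;            /* cached count of set bits */

/* Slow path: count the set bits by walking the whole map. */
static int toy_count_online(void)
{
	int nid, n = 0;

	for (nid = 0; nid < TOY_MAX_NODES; nid++)
		if (toy_online_map & (1ULL << nid))
			n++;
	return n;
}

/* Recompute the cached count only when the map changes. */
static void toy_node_set_online(int nid)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	toy_online_map |= 1ULL << nid;
	toy_nr_online_nodes = toy_count_online();
}

static void toy_node_set_offline(int nid)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	toy_online_map &= ~(1ULL << nid);
	toy_nr_online_nodes = toy_count_online();
}

/* Fast path: a single integer compare, no walk over the map. */
static int toy_needs_zonelist_filter(void)
{
	return toy_nr_online_nodes > 1;
}
```

The trade is the usual one for a read-mostly value: onlining and offlining nodes is rare and pays for the recount, while the allocator fast path reads one integer.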

* Re: [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-20 22:20 ` [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths Mel Gorman
@ 2009-04-21  8:08   ` Pekka Enberg
  2009-04-21  9:01     ` Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  8:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> diff --git a/mm/slab.c b/mm/slab.c
> index 1c680e8..41d1343 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3579,7 +3579,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
>  	 * variable to skip the call, which is mostly likely to be present in
>  	 * the cache.
>  	 */
> -	if (numa_platform && cache_free_alien(cachep, objp))
> +	if (numa_platform > 1 && cache_free_alien(cachep, objp))
>  		return;

This doesn't look right. I assume you meant "nr_online_nodes > 1" here?
If so, please go ahead and remove "numa_platform" completely.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-21  8:08   ` Pekka Enberg
@ 2009-04-21  9:01     ` Mel Gorman
  2009-04-21 15:09       ` Christoph Lameter
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-04-21  9:01 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 11:08:20AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 1c680e8..41d1343 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3579,7 +3579,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
> >  	 * variable to skip the call, which is mostly likely to be present in
> >  	 * the cache.
> >  	 */
> > -	if (numa_platform && cache_free_alien(cachep, objp))
> > +	if (numa_platform > 1 && cache_free_alien(cachep, objp))
> >  		return;
> 
> This doesn't look right. I assume you meant "nr_online_nodes > 1" here?
> If so, please go ahead and remove "numa_platform" completely.
> 

It would need to be nr_possible_nodes which would be a separate patch to add
the definition and then drop numa_platform. This change is wrong as part of
this patch. I'll drop it. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths
  2009-04-21  9:01     ` Mel Gorman
@ 2009-04-21 15:09       ` Christoph Lameter
  0 siblings, 0 replies; 105+ messages in thread
From: Christoph Lameter @ 2009-04-21 15:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Linux Memory Management List, KOSAKI Motohiro,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, 21 Apr 2009, Mel Gorman wrote:

> On Tue, Apr 21, 2009 at 11:08:20AM +0300, Pekka Enberg wrote:
> > On Mon, 2009-04-20 at 23:20 +0100, Mel Gorman wrote:
> > > diff --git a/mm/slab.c b/mm/slab.c
> > > index 1c680e8..41d1343 100644
> > > --- a/mm/slab.c
> > > +++ b/mm/slab.c
> > > @@ -3579,7 +3579,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
> > >  	 * variable to skip the call, which is mostly likely to be present in
> > >  	 * the cache.
> > >  	 */
> > > -	if (numa_platform && cache_free_alien(cachep, objp))
> > > +	if (numa_platform > 1 && cache_free_alien(cachep, objp))
> > >  		return;
> >
> > This doesn't look right. I assume you meant "nr_online_nodes > 1" here?
> > If so, please go ahead and remove "numa_platform" completely.
> >
>
> It would need to be nr_possible_nodes which would be a separate patch to add
> the definition and then drop numa_platform. This change is wrong as part of
> this patch. I'll drop it. Thanks

nr_online_nodes would be okay as Pekka suggested.

^ permalink raw reply	[flat|nested] 105+ messages in thread
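
Why the `numa_platform > 1` hunk drew an objection can be shown with a small sketch. It rests on an assumption the thread does not spell out, namely that numa_platform is a 0/1 flag; the `toy_*` function names are hypothetical.

```c
#include <assert.h>

/* Original test: take the alien-cache path whenever the flag is set. */
static int toy_alien_path_flag(int numa_platform)
{
	assert(numa_platform == 0 || numa_platform == 1);  /* assumed 0/1 */
	return numa_platform != 0;
}

/* The hunk as posted: a 0/1 flag compared against 1 is never true. */
static int toy_alien_path_flag_gt1(int numa_platform)
{
	assert(numa_platform == 0 || numa_platform == 1);  /* assumed 0/1 */
	return numa_platform > 1;
}

/* The suggested form: compare a node count, not a flag. */
static int toy_alien_path_count(int nr_online_nodes)
{
	return nr_online_nodes > 1;
}
```

Under that assumption the posted hunk would disable the alien-cache path on every machine, which is presumably why the `> 1` comparison only makes sense once the operand is a count such as nr_online_nodes.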

* Re: [PATCH 00/25] Cleanup and optimise the page allocator V6
  2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
                   ` (24 preceding siblings ...)
  2009-04-20 22:20 ` [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths Mel Gorman
@ 2009-04-21  8:13 ` Pekka Enberg
  2009-04-22 14:13   ` Mel Gorman
  25 siblings, 1 reply; 105+ messages in thread
From: Pekka Enberg @ 2009-04-21  8:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> Here is V6 of the cleanup and optimisation of the page allocator and it
> should be ready for wider testing. Please consider a possibility for
> merging as a Pass 1 at making the page allocator faster.

The patch series is quite big. Can we fast-track some of the less
controversial patches to make it more manageable? AFAICT, 1-4 are ready
to go in to -mm as-is.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 00/25] Cleanup and optimise the page allocator V6
  2009-04-21  8:13 ` [PATCH 00/25] Cleanup and optimise the page allocator V6 Pekka Enberg
@ 2009-04-22 14:13   ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-04-22 14:13 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linux Memory Management List, KOSAKI Motohiro, Christoph Lameter,
	Nick Piggin, Linux Kernel Mailing List, Lin Ming, Zhang Yanmin,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 21, 2009 at 11:13:54AM +0300, Pekka Enberg wrote:
> On Mon, 2009-04-20 at 23:19 +0100, Mel Gorman wrote:
> > Here is V6 of the cleanup and optimisation of the page allocator and it
> > should be ready for wider testing. Please consider a possibility for
> > merging as a Pass 1 at making the page allocator faster.
> 
> The patch series is quite big. Can we fast-track some of the less
> controversial patches to make it more manageable? AFAICT, 1-4 are ready
> to go in to -mm as-is.
> 

I made one more attempt with V7 to get a full set that doesn't raise eyebrows
and passes a full review. If it's still running into hassle, we'll break it
up more. Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* [PATCH 00/25] Cleanup and optimise the page allocator V5
@ 2009-03-20 10:02 Mel Gorman
  2009-03-20 10:02 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
  0 siblings, 1 reply; 105+ messages in thread
From: Mel Gorman @ 2009-03-20 10:02 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Andrew Morton

Here is V5 of the cleanup and optimisation of the page allocator and it should
be ready for wider testing. Please consider a possibility for merging as a
Pass 1 at making the page allocator faster. Other passes will occur later
when this one has had a bit of exercise. The patchset completed a series
of tests based on the latest MMOTM.

Performance is improved in a variety of cases but note it's not universal due
to lock contention which I'll explain later. Text is reduced by 497 bytes on
the x86-64 config I checked. 18.78% fewer clock cycles were sampled in the page
allocator paths excluding zeroing, which is roughly the same in either kernel.
L1 cache misses are reduced by about 7.36% and L2 cache misses by 17.91%, so
cache misses incurred within the allocator itself are reduced.

The lock contention on some machines goes up for the zone->lru_lock
and zone->lock locks, which can regress some workloads even though others on
the same machine still go faster. For netperf, a lock called slock-AF_INET
seemed very important although I didn't look too closely other than noting
contention went up. The zone->lock gets hammered a lot by high-order allocs
and frees coming from SLUB, which are not covered by the PCP allocator in
this patchset. Why zone->lru_lock contention goes up is less clear as it's
taken for page cache releases, but overall contention may be up because CPUs
are spending less time with interrupts disabled and more time trying to do
real work but contending on the locks.

Changes since V4
  o Drop the more controversial patches for now and focus on the "obvious win"
    material.
  o Add reviewed-by notes
  o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
  o Add unlikely() for the clearMlocked check
  o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
    with __free_pages_ok()
  o Convert num_online_nodes() to use a static value so that callers do
    not have to be individually updated
  o Rebase to mmotm-2009-03-13
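
As a rough illustration of the num_online_nodes() item above, here is a
userspace sketch of replacing a count over the node bitmap with a value that
is refreshed only when a node changes state. The names node_online_map,
nr_online_nodes_cached, set_node_online() and set_node_offline() are made up
for this sketch and are not taken from the patch; only the idea (callers keep
calling num_online_nodes() but it becomes a plain load) comes from the
changelog entry.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NUMNODES 64			/* assumption: small fixed node limit */

static uint64_t node_online_map;	/* one bit per online node (illustrative) */
static int nr_online_nodes_cached;	/* hypothetical pre-calculated count */

/* Slow way: walk the bitmap every time the count is needed. */
static int count_online_nodes(void)
{
	int n = 0;

	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (node_online_map & (UINT64_C(1) << nid))
			n++;
	return n;
}

/* Node state changes are rare, so the cached count is refreshed here. */
static void set_node_online(int nid)
{
	node_online_map |= UINT64_C(1) << nid;
	nr_online_nodes_cached = count_online_nodes();
}

static void set_node_offline(int nid)
{
	node_online_map &= ~(UINT64_C(1) << nid);
	nr_online_nodes_cached = count_online_nodes();
}

/* Same interface for callers, but the fast path is now a single load. */
static inline int num_online_nodes(void)
{
	return nr_online_nodes_cached;
}
```

Because the interface is unchanged, existing callers would not need to be
touched, which appears to be the point of the changelog entry.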

Changes since V3
  o Drop the more controversial patches for now and focus on the "obvious win"
    material
  o Add reviewed-by notes
  o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
  o Add unlikely() for the clearMlocked check
  o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
    with __free_pages_ok()

Changes since V2
  o Remove branches by treating watermark flags as array indices
  o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
  o Do not check for compound on every page free
  o Remove branch by always ensuring the migratetype is known on free
  o Simplify buffered_rmqueue further
  o Reintroduce improved version of batched bulk free of pcp pages
  o Use allocation flags as an index to zone watermarks
  o Work out __GFP_COLD only once
  o Reduce the number of times zone stats are updated
  o Do not dump reserve pages back into the allocator. Instead treat them
    as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
    boundaries without causing trouble
  o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
    order-1 allocations in particular are frequent enough to justify this
  o Rearrange inlining such that the hot-path is inlined but not in a way
    that increases the text size of the page allocator
  o Make the check for needing additional zonelist filtering due to NUMA
    or cpusets as light as possible
  o Do not destroy compound pages going to the PCP lists
  o Delay the merging of buddies until a high-order allocation needs them
    or anti-fragmentation is being forced to fall back
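
Two of the items above (watermark flags as array indices, and assuming
__GFP_HIGH == ALLOC_HIGH) are easier to see in code. The following is a
userspace sketch only: struct zone_sketch, the OLD_ALLOC_* bits, GFP_HIGH and
all numeric values are assumptions chosen for illustration, not copied from
the patches. It shows a chain of flag tests being replaced by a masked array
lookup, and a flag translation being replaced by a mask when the two flags
share a bit value.

```c
#include <assert.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Old style (illustrative): one bit per watermark, tested in turn. */
#define OLD_ALLOC_WMARK_MIN	0x02
#define OLD_ALLOC_WMARK_LOW	0x04
#define OLD_ALLOC_WMARK_HIGH	0x08

/* New style (illustrative): the low bits are the array index itself. */
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)

#define ALLOC_HIGH		0x20
#define GFP_HIGH		0x20	/* stand-in for __GFP_HIGH */

/* The mask trick below is only valid while the two values stay equal. */
_Static_assert(GFP_HIGH == ALLOC_HIGH, "flag values must match");

struct zone_sketch {
	unsigned long watermark[NR_WMARK];
};

/* Branchy lookup: up to two conditional jumps per call. */
static unsigned long wmark_old(const struct zone_sketch *z, int old_flags)
{
	if (old_flags & OLD_ALLOC_WMARK_MIN)
		return z->watermark[WMARK_MIN];
	else if (old_flags & OLD_ALLOC_WMARK_LOW)
		return z->watermark[WMARK_LOW];
	return z->watermark[WMARK_HIGH];
}

/* Branch-free lookup: mask off the index bits and load. */
static unsigned long wmark_new(const struct zone_sketch *z, int alloc_flags)
{
	return z->watermark[alloc_flags & ALLOC_WMARK_MASK];
}

/* No "if (gfp_mask & GFP_HIGH) flags |= ALLOC_HIGH": just carry the bit. */
static int gfp_to_alloc_high(unsigned int gfp_mask)
{
	return (int)(gfp_mask & GFP_HIGH);
}
```

The static assertion is there because the second trick silently breaks if
either flag value is ever renumbered.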

Changes since V1
  o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
  o Use non-lock bit operations for clearing the mlock flag
  o Factor out alloc_flags calculation so it is only done once (Peter)
  o Make gfp.h a bit prettier and clear-cut (Peter)
  o Instead of deleting a debugging check, replace page_count() in the
    free path with a version that does not check for compound pages (Nick)
  o Drop the alteration for hot/cold page freeing until we know if it
    helps or not
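
For the "non-lock bit operations" item above, a small userspace sketch of the
difference between an atomic test-and-clear and a plain one follows. It uses
C11 atomics rather than the kernel's bitops, and struct page_sketch,
PG_MLOCKED_BIT and both helper names are hypothetical. The reasoning that the
plain version is acceptable because a page being freed has no other users is
my assumption; the changelog entry does not spell it out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_MLOCKED_BIT	3	/* hypothetical bit position */

struct page_sketch {
	atomic_ulong flags;
};

/* Atomic read-modify-write: typically a locked instruction on x86. */
static bool test_clear_atomic(struct page_sketch *p)
{
	unsigned long mask = 1UL << PG_MLOCKED_BIT;
	unsigned long old = atomic_fetch_and(&p->flags, ~mask);

	return (old & mask) != 0;
}

/*
 * Separate load and store with no ordering: cheaper, but only correct
 * when nothing else can modify the flags word concurrently.
 */
static bool test_clear_nonatomic(struct page_sketch *p)
{
	unsigned long mask = 1UL << PG_MLOCKED_BIT;
	unsigned long old = atomic_load_explicit(&p->flags,
						 memory_order_relaxed);

	atomic_store_explicit(&p->flags, old & ~mask, memory_order_relaxed);
	return (old & mask) != 0;
}
```

Both helpers return the same result when there is a single user of the page;
they differ only in cost and in what happens under concurrent modification.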

* [PATCH 02/25] Do not sanity check order in the fast path
  2009-03-20 10:02 [PATCH 00/25] Cleanup and optimise the page allocator V5 Mel Gorman
@ 2009-03-20 10:02 ` Mel Gorman
  0 siblings, 0 replies; 105+ messages in thread
From: Mel Gorman @ 2009-03-20 10:02 UTC (permalink / raw)
  To: Mel Gorman, Linux Memory Management List
  Cc: KOSAKI Motohiro, Christoph Lameter, Nick Piggin,
	Linux Kernel Mailing List, Lin Ming, Zhang Yanmin, Peter Zijlstra,
	Andrew Morton

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/gfp.h |    6 ------
 mm/page_alloc.c     |    2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dcf0ab8..8736047 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -181,9 +181,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	/* Unknown node is current node */
 	if (nid < 0)
 		nid = numa_node_id();
@@ -197,9 +194,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
 static inline struct page *
 alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
-	if (unlikely(order >= MAX_ORDER))
-		return NULL;
-
 	return alloc_pages_current(gfp_mask, order);
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0671b3f..dd87dad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1407,6 +1407,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 
 	classzone_idx = zone_idx(preferred_zone);
 
+	VM_BUG_ON(order >= MAX_ORDER);
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
-- 
1.5.6.5
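
To make the effect of the patch above concrete, here is a userspace sketch of
the same pattern: an always-on range check in an inline wrapper is dropped,
and a debug-only assertion is placed further down the call path instead.
MAX_ORDER is set to 11 as an assumed value, SKETCH_DEBUG_VM stands in for a
debug config option, and core_alloc(), alloc_checked() and alloc_unchecked()
are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 11	/* assumed value for this sketch */

/* Debug-only check: compiles to nothing unless SKETCH_DEBUG_VM is set. */
#ifdef SKETCH_DEBUG_VM
#define VM_BUG_ON(cond) assert(!(cond))
#else
#define VM_BUG_ON(cond) do { } while (0)
#endif

static int backing_store;	/* stands in for a real page */

/* Common path reached by every wrapper; the check now lives here. */
static void *core_alloc(unsigned int order)
{
	(void)order;		/* only used when the debug check is on */
	VM_BUG_ON(order >= MAX_ORDER);
	return &backing_store;	/* pretend the allocation succeeded */
}

/* Before: every call pays for a compare and branch in the wrapper. */
static inline void *alloc_checked(unsigned int order)
{
	if (order >= MAX_ORDER)
		return NULL;
	return core_alloc(order);
}

/* After: the wrapper trusts its caller and does no check of its own. */
static inline void *alloc_unchecked(unsigned int order)
{
	return core_alloc(order);
}
```

Note the behavioural difference in the sketch: with the debug option off, an
out-of-range order is no longer rejected with NULL, so this only works if, as
the changelog says, no user of the API passes such an order.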

end of thread, other threads:[~2009-04-22 14:42 UTC | newest]

Thread overview: 105+ messages
2009-04-20 22:19 [PATCH 00/25] Cleanup and optimise the page allocator V6 Mel Gorman
2009-04-20 22:19 ` [PATCH 01/25] Replace __alloc_pages_internal() with __alloc_pages_nodemask() Mel Gorman
2009-04-21  1:44   ` KOSAKI Motohiro
2009-04-21  5:55   ` Pekka Enberg
2009-04-20 22:19 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
2009-04-21  1:45   ` KOSAKI Motohiro
2009-04-21  5:55   ` Pekka Enberg
2009-04-20 22:19 ` [PATCH 03/25] Do not check NUMA node ID when the caller knows the node is valid Mel Gorman
2009-04-21  2:44   ` KOSAKI Motohiro
2009-04-21  6:00   ` Pekka Enberg
2009-04-21  6:33   ` Paul Mundt
2009-04-20 22:19 ` [PATCH 04/25] Check only once if the zonelist is suitable for the allocation Mel Gorman
2009-04-21  3:03   ` KOSAKI Motohiro
2009-04-21  7:09   ` Pekka Enberg
2009-04-20 22:19 ` [PATCH 05/25] Break up the allocator entry point into fast and slow paths Mel Gorman
2009-04-21  6:35   ` KOSAKI Motohiro
2009-04-21  7:13     ` Pekka Enberg
2009-04-21  9:30       ` Mel Gorman
2009-04-21  9:29     ` Mel Gorman
2009-04-21 10:44       ` KOSAKI Motohiro
2009-04-20 22:19 ` [PATCH 06/25] Move check for disabled anti-fragmentation out of fastpath Mel Gorman
2009-04-21  6:37   ` KOSAKI Motohiro
2009-04-20 22:19 ` [PATCH 07/25] Check in advance if the zonelist needs additional filtering Mel Gorman
2009-04-21  6:52   ` KOSAKI Motohiro
2009-04-21  9:47     ` Mel Gorman
2009-04-21  7:21   ` Pekka Enberg
2009-04-21  9:49     ` Mel Gorman
2009-04-20 22:19 ` [PATCH 08/25] Calculate the preferred zone for allocation only once Mel Gorman
2009-04-21  7:03   ` KOSAKI Motohiro
2009-04-21  8:23     ` Mel Gorman
2009-04-21  7:37   ` Pekka Enberg
2009-04-21  8:27     ` Mel Gorman
2009-04-21  8:29       ` Pekka Enberg
2009-04-20 22:19 ` [PATCH 09/25] Calculate the migratetype " Mel Gorman
2009-04-21  7:37   ` KOSAKI Motohiro
2009-04-21  8:35     ` Mel Gorman
2009-04-21 10:19       ` KOSAKI Motohiro
2009-04-21 10:30         ` Mel Gorman
2009-04-20 22:19 ` [PATCH 10/25] Calculate the alloc_flags " Mel Gorman
2009-04-21  9:03   ` KOSAKI Motohiro
2009-04-21 10:05     ` Mel Gorman
2009-04-21 10:12       ` KOSAKI Motohiro
2009-04-21 10:37         ` Mel Gorman
2009-04-21 10:40           ` KOSAKI Motohiro
2009-04-20 22:19 ` [PATCH 11/25] Calculate the cold parameter " Mel Gorman
2009-04-21  7:43   ` Pekka Enberg
2009-04-21  8:41     ` Mel Gorman
2009-04-21  9:07   ` KOSAKI Motohiro
2009-04-21 10:08     ` Mel Gorman
2009-04-21 14:59     ` Christoph Lameter
2009-04-21 14:58   ` Christoph Lameter
2009-04-20 22:19 ` [PATCH 12/25] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH Mel Gorman
2009-04-21  7:46   ` Pekka Enberg
2009-04-21  8:45     ` Mel Gorman
2009-04-21 10:25       ` Pekka Enberg
2009-04-21  9:08   ` KOSAKI Motohiro
2009-04-21 10:31     ` KOSAKI Motohiro
2009-04-21 10:43       ` Mel Gorman
2009-04-20 22:19 ` [PATCH 13/25] Inline __rmqueue_smallest() Mel Gorman
2009-04-21  7:58   ` Pekka Enberg
2009-04-21  8:48     ` Mel Gorman
2009-04-21  9:52   ` KOSAKI Motohiro
2009-04-21 10:11     ` Mel Gorman
2009-04-21 10:22       ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 14/25] Inline buffered_rmqueue() Mel Gorman
2009-04-21  9:56   ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 15/25] Inline __rmqueue_fallback() Mel Gorman
2009-04-21  9:56   ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 16/25] Save text by reducing call sites of __rmqueue() Mel Gorman
2009-04-21 10:47   ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 17/25] Do not call get_pageblock_migratetype() more than necessary Mel Gorman
2009-04-21 11:03   ` KOSAKI Motohiro
2009-04-21 16:12     ` Mel Gorman
2009-04-22  2:25       ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 18/25] Do not disable interrupts in free_page_mlock() Mel Gorman
2009-04-21  7:55   ` Pekka Enberg
2009-04-21  8:50     ` Mel Gorman
2009-04-21 15:05       ` Christoph Lameter
2009-04-22  0:13   ` KOSAKI Motohiro
2009-04-22 14:43     ` Lee Schermerhorn
2009-04-20 22:20 ` [PATCH 19/25] Do not setup zonelist cache when there is only one node Mel Gorman
2009-04-20 22:20 ` [PATCH 20/25] Do not check for compound pages during the page allocator sanity checks Mel Gorman
2009-04-22  0:20   ` KOSAKI Motohiro
2009-04-22 10:09     ` Mel Gorman
2009-04-22 10:41       ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 21/25] Use allocation flags as an index to the zone watermark Mel Gorman
2009-04-22  0:26   ` KOSAKI Motohiro
2009-04-22  0:41     ` David Rientjes
2009-04-22 10:21     ` Mel Gorman
2009-04-22 10:23       ` Mel Gorman
2009-04-20 22:20 ` [PATCH 22/25] Update NR_FREE_PAGES only as necessary Mel Gorman
2009-04-22  0:35   ` KOSAKI Motohiro
2009-04-20 22:20 ` [PATCH 23/25] Get the pageblock migratetype without disabling interrupts Mel Gorman
2009-04-20 22:20 ` [PATCH 24/25] Re-sort GFP flags and fix whitespace alignment for easier reading Mel Gorman
2009-04-21  8:04   ` Pekka Enberg
2009-04-21  8:52     ` Mel Gorman
2009-04-21 15:08       ` Christoph Lameter
2009-04-21 15:24         ` Mel Gorman
2009-04-20 22:20 ` [PATCH 25/25] Use a pre-calculated value instead of num_online_nodes() in fast paths Mel Gorman
2009-04-21  8:08   ` Pekka Enberg
2009-04-21  9:01     ` Mel Gorman
2009-04-21 15:09       ` Christoph Lameter
2009-04-21  8:13 ` [PATCH 00/25] Cleanup and optimise the page allocator V6 Pekka Enberg
2009-04-22 14:13   ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2009-03-20 10:02 [PATCH 00/25] Cleanup and optimise the page allocator V5 Mel Gorman
2009-03-20 10:02 ` [PATCH 02/25] Do not sanity check order in the fast path Mel Gorman
