[PATCH -mm 0/2] kswapd vs compaction improvements

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH -mm 0/2] kswapd vs compaction improvements
@ 2012-01-10  2:31 ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-10  2:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mel Gorman, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd, aarcange

Running a kernel with compaction enabled with some memory pressure has
caused my system to run into all kinds of trouble.

One of the more obvious problems is that while kswapd does not try to 
free contiguous pages when CONFIG_COMPACTION is enabled, it does
continue reclaiming until enough contiguous pages have become available.
This can lead to enormous swap storms, where lots of memory is freed
and a fair amount of the working set can end up in swap.

A second problem is that memory compaction currently does nothing for
network allocations in the receive path, for eg. jumbo frames, because
those are done in interrupt context.  In the past we have tried to
have kswapd invoke memory compaction, but it used too much CPU time.
The second patch in this series has kswapd invoke compaction very
carefully, taking in account the desired page order, as well as
zone->compaction_deferred.

I have tested these patches on my system, and things seem to behave
well. Any tests and reviews would be appreciated.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH -mm 0/2] kswapd vs compaction improvements
@ 2012-01-10  2:31 ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-10  2:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mel Gorman, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd, aarcange

Running a kernel with compaction enabled with some memory pressure has
caused my system to run into all kinds of trouble.

One of the more obvious problems is that while kswapd does not try to 
free contiguous pages when CONFIG_COMPACTION is enabled, it does
continue reclaiming until enough contiguous pages have become available.
This can lead to enormous swap storms, where lots of memory is freed
and a fair amount of the working set can end up in swap.

A second problem is that memory compaction currently does nothing for
network allocations in the receive path, for eg. jumbo frames, because
those are done in interrupt context.  In the past we have tried to
have kswapd invoke memory compaction, but it used too much CPU time.
The second patch in this series has kswapd invoke compaction very
carefully, taking in account the desired page order, as well as
zone->compaction_deferred.

I have tested these patches on my system, and things seem to behave
well. Any tests and reviews would be appreciated.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH -mm 1/2] mm: kswapd test order 0 watermarks when compaction is enabled
  2012-01-10  2:31 ` Rik van Riel
@ 2012-01-10  2:33   ` Rik van Riel
  -1 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-10  2:33 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mel Gorman, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd, aarcange

When built with CONFIG_COMPACTION, kswapd does not try to free
contiguous pages.  Because it is not trying, it should also not
test whether it succeeded, because that can result in continuous
page reclaim, until a large fraction of memory is free and large
fractions of the working set have been evicted.

Also remove a line of code that increments balanced right before
exiting the function.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   22 +++++++++++++++++-----
 1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f54a05b..c3eec6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2608,7 +2608,7 @@ loop_again:
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
+			int nr_slab, testorder;
 			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
@@ -2637,11 +2637,25 @@ loop_again:
 			 * gap is either the low watermark or 1%
 			 * of the zone, whichever is smaller.
 			 */
+			testorder = order;
 			balance_gap = min(low_wmark_pages(zone),
 				(zone->present_pages +
 					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
 				KSWAPD_ZONE_BALANCE_GAP_RATIO);
-			if (!zone_watermark_ok_safe(zone, order,
+			/*
+			 * Kswapd reclaims only single pages when
+			 * COMPACTION_BUILD. Trying too hard to get
+			 * contiguous free pages can result in excessive
+			 * amounts of free memory, and useful things
+			 * getting kicked out of memory.
+			 * Limit the amount of reclaim to something sane,
+			 * plus space for compaction to do its thing.
+			 */
+			if (COMPACTION_BUILD) {
+				testorder = 0;
+				balance_gap += 2<<order;
+			}
+			if (!zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0)) {
 				shrink_zone(priority, zone, &sc);
@@ -2670,7 +2684,7 @@ loop_again:
 				continue;
 			}
 
-			if (!zone_watermark_ok_safe(zone, order,
+			if (!zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2776,8 +2790,6 @@ out:
 
 			/* If balanced, clear the congested flag */
 			zone_clear_flag(zone, ZONE_CONGESTED);
-			if (i <= *classzone_idx)
-				balanced += zone->present_pages;
 		}
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH -mm 1/2] mm: kswapd test order 0 watermarks when compaction is enabled
@ 2012-01-10  2:33   ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-10  2:33 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Mel Gorman, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd, aarcange

When built with CONFIG_COMPACTION, kswapd does not try to free
contiguous pages.  Because it is not trying, it should also not
test whether it succeeded, because that can result in continuous
page reclaim, until a large fraction of memory is free and large
fractions of the working set have been evicted.

Also remove a line of code that increments balanced right before
exiting the function.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   22 +++++++++++++++++-----
 1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f54a05b..c3eec6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2608,7 +2608,7 @@ loop_again:
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
+			int nr_slab, testorder;
 			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
@@ -2637,11 +2637,25 @@ loop_again:
 			 * gap is either the low watermark or 1%
 			 * of the zone, whichever is smaller.
 			 */
+			testorder = order;
 			balance_gap = min(low_wmark_pages(zone),
 				(zone->present_pages +
 					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
 				KSWAPD_ZONE_BALANCE_GAP_RATIO);
-			if (!zone_watermark_ok_safe(zone, order,
+			/*
+			 * Kswapd reclaims only single pages when
+			 * COMPACTION_BUILD. Trying too hard to get
+			 * contiguous free pages can result in excessive
+			 * amounts of free memory, and useful things
+			 * getting kicked out of memory.
+			 * Limit the amount of reclaim to something sane,
+			 * plus space for compaction to do its thing.
+			 */
+			if (COMPACTION_BUILD) {
+				testorder = 0;
+				balance_gap += 2<<order;
+			}
+			if (!zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0)) {
 				shrink_zone(priority, zone, &sc);
@@ -2670,7 +2684,7 @@ loop_again:
 				continue;
 			}
 
-			if (!zone_watermark_ok_safe(zone, order,
+			if (!zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2776,8 +2790,6 @@ out:
 
 			/* If balanced, clear the congested flag */
 			zone_clear_flag(zone, ZONE_CONGESTED);
-			if (i <= *classzone_idx)
-				balanced += zone->present_pages;
 		}
 	}
 


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 1/2] mm: kswapd test order 0 watermarks when compaction is enabled
  2012-01-10  2:33   ` Rik van Riel
@ 2012-01-12 15:46     ` Mel Gorman
  -1 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-01-12 15:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd, aarcange

On Mon, Jan 09, 2012 at 09:33:13PM -0500, Rik van Riel wrote:
> When built with CONFIG_COMPACTION, kswapd does not try to free
> contiguous pages.  Because it is not trying, it should also not
> test whether it succeeded, because that can result in continuous
> page reclaim, until a large fraction of memory is free and large
> fractions of the working set have been evicted.
> 

hmm, I'm missing something about your explanation.

1. wakeup_kswapd passes requested order to kswapd_max_order. Bear in
   mind that this does *not* happen for THP.
2. kswapd reads this and passes it to balance_pgdat
3. balance_pgdat puts that in scan_control
4. shrink_zone gets that scan_control and so on

kswapd does try to free contiguous pages.

What is the source of the contiguous allocations of concern? The
mm_vmscan_wakeup_kswapd tracepoint should be able to get you a stack
trace to identify the source of high-order allocations.

To confirm I was not on crazy pills I fired up a systemtap script
that did a burst of order-8 allocations (ok, some crazy pills) and
observed this with tracepoints

          <idle>-0     [002] 236009.284803: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=8
         kswapd0-53    [002] 236009.285028: mm_vmscan_kswapd_wake: nid=0 order=8
         kswapd0-53    [002] 236009.285034: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=9 nr_scanned=0 nr_taken=0 contig_taken=0 contig_dirty=0 contig_failed=0
         kswapd0-53    [002] 236009.285035: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=22 nr_scanned=0 nr_taken=0 contig_taken=0 contig_dirty=0 contig_failed=0
         kswapd0-53    [002] 236009.285038: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=1 nr_scanned=1 nr_taken=1 contig_taken=0 contig_dirty=0 contig_failed=1
         kswapd0-53    [002] 236009.285049: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=12 contig_dirty=0 contig_failed=20
         kswapd0-53    [002] 236009.285080: mm_vmscan_lru_isolate: isolate_mode=2 order=8 nr_requested=32 nr_scanned=38 nr_taken=38 contig_taken=24 contig_dirty=0 contig_failed=14
         kswapd0-53    [002] 236009.285090: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=23 nr_scanned=24 nr_taken=24 contig_taken=4 contig_dirty=0 contig_failed=20

This is with CONFIG_COMPACTION.

You're still in the right area though. kswapd does contiguous-aware
reclaim it does not do any compaction and so potentially it is doing
excessive reclaim while depending on another process to do the
compaction for it. That is a problem.

> Also remove a line of code that increments balanced right before
> exiting the function.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   22 +++++++++++++++++-----
>  1 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f54a05b..c3eec6b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2608,7 +2608,7 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab;
> +			int nr_slab, testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
> @@ -2637,11 +2637,25 @@ loop_again:
>  			 * gap is either the low watermark or 1%
>  			 * of the zone, whichever is smaller.
>  			 */
> +			testorder = order;
>  			balance_gap = min(low_wmark_pages(zone),
>  				(zone->present_pages +
>  					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>  				KSWAPD_ZONE_BALANCE_GAP_RATIO);
> -			if (!zone_watermark_ok_safe(zone, order,
> +			/*
> +			 * Kswapd reclaims only single pages when
> +			 * COMPACTION_BUILD. Trying too hard to get
> +			 * contiguous free pages can result in excessive
> +			 * amounts of free memory, and useful things
> +			 * getting kicked out of memory.
> +			 * Limit the amount of reclaim to something sane,
> +			 * plus space for compaction to do its thing.
> +			 */
> +			if (COMPACTION_BUILD) {
> +				testorder = 0;
> +				balance_gap += 2<<order;
> +			}
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone) + balance_gap,
>  					end_zone, 0)) {

kswapd does reclaim high-order pages so this comment is misleading.
However I see the type of problem you are talking about.

Direct reclaim in shrink_zones() does a check for compaction_suitable()
when deciding whether to abort reclaim or not. How about doing the same
for kswapd and if compaction can go ahead, goto out?

>  				shrink_zone(priority, zone, &sc);
> @@ -2670,7 +2684,7 @@ loop_again:
>  				continue;
>  			}
>  
> -			if (!zone_watermark_ok_safe(zone, order,
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone), end_zone, 0)) {
>  				all_zones_ok = 0;
>  				/*
> @@ -2776,8 +2790,6 @@ out:
>  
>  			/* If balanced, clear the congested flag */
>  			zone_clear_flag(zone, ZONE_CONGESTED);
> -			if (i <= *classzone_idx)
> -				balanced += zone->present_pages;
>  		}
>  	}
>  
> 

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 1/2] mm: kswapd test order 0 watermarks when compaction is enabled
@ 2012-01-12 15:46     ` Mel Gorman
  0 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-01-12 15:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd, aarcange

On Mon, Jan 09, 2012 at 09:33:13PM -0500, Rik van Riel wrote:
> When built with CONFIG_COMPACTION, kswapd does not try to free
> contiguous pages.  Because it is not trying, it should also not
> test whether it succeeded, because that can result in continuous
> page reclaim, until a large fraction of memory is free and large
> fractions of the working set have been evicted.
> 

hmm, I'm missing something about your explanation.

1. wakeup_kswapd passes requested order to kswapd_max_order. Bear in
   mind that this does *not* happen for THP.
2. kswapd reads this and passes it to balance_pgdat
3. balance_pgdat puts that in scan_control
4. shrink_zone gets that scan_control and so on

kswapd does try to free contiguous pages.

What is the source of the contiguous allocations of concern? The
mm_vmscan_wakeup_kswapd tracepoint should be able to get you a stack
trace to identify the source of high-order allocations.

To confirm I was not on crazy pills I fired up a systemtap script
that did a burst of order-8 allocations (ok, some crazy pills) and
observed this with tracepoints

          <idle>-0     [002] 236009.284803: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=8
         kswapd0-53    [002] 236009.285028: mm_vmscan_kswapd_wake: nid=0 order=8
         kswapd0-53    [002] 236009.285034: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=9 nr_scanned=0 nr_taken=0 contig_taken=0 contig_dirty=0 contig_failed=0
         kswapd0-53    [002] 236009.285035: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=22 nr_scanned=0 nr_taken=0 contig_taken=0 contig_dirty=0 contig_failed=0
         kswapd0-53    [002] 236009.285038: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=1 nr_scanned=1 nr_taken=1 contig_taken=0 contig_dirty=0 contig_failed=1
         kswapd0-53    [002] 236009.285049: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=12 contig_dirty=0 contig_failed=20
         kswapd0-53    [002] 236009.285080: mm_vmscan_lru_isolate: isolate_mode=2 order=8 nr_requested=32 nr_scanned=38 nr_taken=38 contig_taken=24 contig_dirty=0 contig_failed=14
         kswapd0-53    [002] 236009.285090: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=23 nr_scanned=24 nr_taken=24 contig_taken=4 contig_dirty=0 contig_failed=20

This is with CONFIG_COMPACTION.

You're still in the right area though. kswapd does contiguous-aware
reclaim it does not do any compaction and so potentially it is doing
excessive reclaim while depending on another process to do the
compaction for it. That is a problem.

> Also remove a line of code that increments balanced right before
> exiting the function.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   22 +++++++++++++++++-----
>  1 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f54a05b..c3eec6b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2608,7 +2608,7 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab;
> +			int nr_slab, testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
> @@ -2637,11 +2637,25 @@ loop_again:
>  			 * gap is either the low watermark or 1%
>  			 * of the zone, whichever is smaller.
>  			 */
> +			testorder = order;
>  			balance_gap = min(low_wmark_pages(zone),
>  				(zone->present_pages +
>  					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>  				KSWAPD_ZONE_BALANCE_GAP_RATIO);
> -			if (!zone_watermark_ok_safe(zone, order,
> +			/*
> +			 * Kswapd reclaims only single pages when
> +			 * COMPACTION_BUILD. Trying too hard to get
> +			 * contiguous free pages can result in excessive
> +			 * amounts of free memory, and useful things
> +			 * getting kicked out of memory.
> +			 * Limit the amount of reclaim to something sane,
> +			 * plus space for compaction to do its thing.
> +			 */
> +			if (COMPACTION_BUILD) {
> +				testorder = 0;
> +				balance_gap += 2<<order;
> +			}
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone) + balance_gap,
>  					end_zone, 0)) {

kswapd does reclaim high-order pages so this comment is misleading.
However I see the type of problem you are talking about.

Direct reclaim in shrink_zones() does a check for compaction_suitable()
when deciding whether to abort reclaim or not. How about doing the same
for kswapd and if compaction can go ahead, goto out?

>  				shrink_zone(priority, zone, &sc);
> @@ -2670,7 +2684,7 @@ loop_again:
>  				continue;
>  			}
>  
> -			if (!zone_watermark_ok_safe(zone, order,
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone), end_zone, 0)) {
>  				all_zones_ok = 0;
>  				/*
> @@ -2776,8 +2790,6 @@ out:
>  
>  			/* If balanced, clear the congested flag */
>  			zone_clear_flag(zone, ZONE_CONGESTED);
> -			if (i <= *classzone_idx)
> -				balanced += zone->present_pages;
>  		}
>  	}
>  
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
  2012-01-10  2:31 ` Rik van Riel
@ 2012-01-10  2:33   ` Rik van Riel
  -1 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-10  2:33 UTC (permalink / raw)
  To: linux-mm, aarcange
  Cc: linux-kernel, Mel Gorman, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd

With CONFIG_COMPACTION enabled, kswapd does not try to free
contiguous free pages, even when it is woken for a higher order
request.

This could be bad for eg. jumbo frame network allocations, which
are done from interrupt context and cannot compact memory themselves.
Higher than before allocation failure rates in the network receive
path have been observed in kernels with compaction enabled.

Teach kswapd to defragment the memory zones in a node, but only
if required and compaction is not deferred in a zone.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |    6 ++++++
 mm/compaction.c            |   22 ++++++++++++++--------
 mm/vmscan.c                |    9 +++++++++
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index bb2bbdb..df31dab 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync);
+extern int compact_pgdat(pg_data_t *pgdat, int order, bool force);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_CONTINUE;
 }
 
+static inline int compact_pgdat(pg_data_t *pgdat, int order, bool force)
+{
+	return COMPACT_CONTINUE;
+}
+
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 1253d7a..1962a0e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -651,16 +651,11 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 
 
 /* Compact all zones within a node */
-static int compact_node(int nid)
+int compact_pgdat(pg_data_t *pgdat, int order, bool force)
 {
 	int zoneid;
-	pg_data_t *pgdat;
 	struct zone *zone;
 
-	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
-		return -EINVAL;
-	pgdat = NODE_DATA(nid);
-
 	/* Flush pending updates to the LRU lists */
 	lru_add_drain_all();
 
@@ -668,7 +663,7 @@ static int compact_node(int nid)
 		struct compact_control cc = {
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
-			.order = -1,
+			.order = order,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
@@ -679,7 +674,8 @@ static int compact_node(int nid)
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
-		compact_zone(zone, &cc);
+		if (force || !compaction_deferred(zone))
+			compact_zone(zone, &cc);
 
 		VM_BUG_ON(!list_empty(&cc.freepages));
 		VM_BUG_ON(!list_empty(&cc.migratepages));
@@ -688,6 +684,16 @@ static int compact_node(int nid)
 	return 0;
 }
 
+static int compact_node(int nid)
+{
+	pg_data_t *pgdat;
+	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	return compact_pgdat(pgdat, -1, true);
+}
+
 /* Compact all nodes in the system */
 static int compact_nodes(void)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fb469a..65bf21db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2522,6 +2522,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int priority;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	int zones_need_compaction = 1;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2788,9 +2789,17 @@ out:
 				goto loop_again;
 			}
 
+			/* Check if the memory needs to be defragmented. */
+			if (zone_watermark_ok(zone, order,
+				    low_wmark_pages(zone), *classzone_idx, 0))
+				zones_need_compaction = 0;
+
 			/* If balanced, clear the congested flag */
 			zone_clear_flag(zone, ZONE_CONGESTED);
 		}
+
+		if (zones_need_compaction)
+			compact_pgdat(pgdat, order, false);
 	}
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
@ 2012-01-10  2:33   ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-10  2:33 UTC (permalink / raw)
  To: linux-mm, aarcange
  Cc: linux-kernel, Mel Gorman, akpm, KOSAKI Motohiro, Johannes Weiner,
	hughd

With CONFIG_COMPACTION enabled, kswapd does not try to free
contiguous free pages, even when it is woken for a higher order
request.

This could be bad for eg. jumbo frame network allocations, which
are done from interrupt context and cannot compact memory themselves.
Higher than before allocation failure rates in the network receive
path have been observed in kernels with compaction enabled.

Teach kswapd to defragment the memory zones in a node, but only
if required and compaction is not deferred in a zone.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |    6 ++++++
 mm/compaction.c            |   22 ++++++++++++++--------
 mm/vmscan.c                |    9 +++++++++
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index bb2bbdb..df31dab 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync);
+extern int compact_pgdat(pg_data_t *pgdat, int order, bool force);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_CONTINUE;
 }
 
+static inline int compact_pgdat(pg_data_t *pgdat, int order, bool force)
+{
+	return COMPACT_CONTINUE;
+}
+
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 1253d7a..1962a0e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -651,16 +651,11 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 
 
 /* Compact all zones within a node */
-static int compact_node(int nid)
+int compact_pgdat(pg_data_t *pgdat, int order, bool force)
 {
 	int zoneid;
-	pg_data_t *pgdat;
 	struct zone *zone;
 
-	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
-		return -EINVAL;
-	pgdat = NODE_DATA(nid);
-
 	/* Flush pending updates to the LRU lists */
 	lru_add_drain_all();
 
@@ -668,7 +663,7 @@ static int compact_node(int nid)
 		struct compact_control cc = {
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
-			.order = -1,
+			.order = order,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
@@ -679,7 +674,8 @@ static int compact_node(int nid)
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
-		compact_zone(zone, &cc);
+		if (force || !compaction_deferred(zone))
+			compact_zone(zone, &cc);
 
 		VM_BUG_ON(!list_empty(&cc.freepages));
 		VM_BUG_ON(!list_empty(&cc.migratepages));
@@ -688,6 +684,16 @@ static int compact_node(int nid)
 	return 0;
 }
 
+static int compact_node(int nid)
+{
+	pg_data_t *pgdat;
+	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	return compact_pgdat(pgdat, -1, true);
+}
+
 /* Compact all nodes in the system */
 static int compact_nodes(void)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fb469a..65bf21db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2522,6 +2522,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int priority;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	int zones_need_compaction = 1;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2788,9 +2789,17 @@ out:
 				goto loop_again;
 			}
 
+			/* Check if the memory needs to be defragmented. */
+			if (zone_watermark_ok(zone, order,
+				    low_wmark_pages(zone), *classzone_idx, 0))
+				zones_need_compaction = 0;
+
 			/* If balanced, clear the congested flag */
 			zone_clear_flag(zone, ZONE_CONGESTED);
 		}
+
+		if (zones_need_compaction)
+			compact_pgdat(pgdat, order, false);
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
  2012-01-10  2:33   ` Rik van Riel
@ 2012-01-11  7:25     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11  7:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, aarcange, linux-kernel, Mel Gorman, akpm,
	Johannes Weiner, hughd

> With CONFIG_COMPACTION enabled, kswapd does not try to free
> contiguous free pages, even when it is woken for a higher order
> request.
>
> This could be bad for eg. jumbo frame network allocations, which
> are done from interrupt context and cannot compact memory themselves.
> Higher than before allocation failure rates in the network receive
> path have been observed in kernels with compaction enabled.
>
> Teach kswapd to defragment the memory zones in a node, but only
> if required and compaction is not deferred in a zone.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

I agree with we need asynchronous defragmentations feature. But, do we
really need to use kswapd for compaction? While kswapd take a
compaction work, it can't work to make
free memory.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
@ 2012-01-11  7:25     ` KOSAKI Motohiro
  0 siblings, 0 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11  7:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, aarcange, linux-kernel, Mel Gorman, akpm,
	Johannes Weiner, hughd

> With CONFIG_COMPACTION enabled, kswapd does not try to free
> contiguous free pages, even when it is woken for a higher order
> request.
>
> This could be bad for eg. jumbo frame network allocations, which
> are done from interrupt context and cannot compact memory themselves.
> Higher than before allocation failure rates in the network receive
> path have been observed in kernels with compaction enabled.
>
> Teach kswapd to defragment the memory zones in a node, but only
> if required and compaction is not deferred in a zone.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

I agree with we need asynchronous defragmentations feature. But, do we
really need to use kswapd for compaction? While kswapd take a
compaction work, it can't work to make
free memory.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
  2012-01-11  7:25     ` KOSAKI Motohiro
@ 2012-01-11 13:34       ` Rik van Riel
  -1 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-11 13:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, aarcange, linux-kernel, Mel Gorman, akpm,
	Johannes Weiner, hughd

On 01/11/2012 02:25 AM, KOSAKI Motohiro wrote:
>> With CONFIG_COMPACTION enabled, kswapd does not try to free
>> contiguous free pages, even when it is woken for a higher order
>> request.
>>
>> This could be bad for eg. jumbo frame network allocations, which
>> are done from interrupt context and cannot compact memory themselves.
>> Higher than before allocation failure rates in the network receive
>> path have been observed in kernels with compaction enabled.
>>
>> Teach kswapd to defragment the memory zones in a node, but only
>> if required and compaction is not deferred in a zone.
>>
>> Signed-off-by: Rik van Riel<riel@redhat.com>
>
> I agree with we need asynchronous defragmentations feature. But, do we
> really need to use kswapd for compaction? While kswapd take a
> compaction work, it can't work to make free memory.

I believe we do need some background compaction, especially
to help allocations from network interrupts.

If you believe the compaction is better done from some
other thread, I guess we could do that, but truthfully, if
kswapd spends a lot of time doing compaction, I made a
mistake somewhere :)

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
@ 2012-01-11 13:34       ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2012-01-11 13:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, aarcange, linux-kernel, Mel Gorman, akpm,
	Johannes Weiner, hughd

On 01/11/2012 02:25 AM, KOSAKI Motohiro wrote:
>> With CONFIG_COMPACTION enabled, kswapd does not try to free
>> contiguous free pages, even when it is woken for a higher order
>> request.
>>
>> This could be bad for eg. jumbo frame network allocations, which
>> are done from interrupt context and cannot compact memory themselves.
>> Higher than before allocation failure rates in the network receive
>> path have been observed in kernels with compaction enabled.
>>
>> Teach kswapd to defragment the memory zones in a node, but only
>> if required and compaction is not deferred in a zone.
>>
>> Signed-off-by: Rik van Riel<riel@redhat.com>
>
> I agree with we need asynchronous defragmentations feature. But, do we
> really need to use kswapd for compaction? While kswapd take a
> compaction work, it can't work to make free memory.

I believe we do need some background compaction, especially
to help allocations from network interrupts.

If you believe the compaction is better done from some
other thread, I guess we could do that, but truthfully, if
kswapd spends a lot of time doing compaction, I made a
mistake somewhere :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
  2012-01-11 13:34       ` Rik van Riel
@ 2012-01-11 19:57         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 19:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, aarcange, linux-kernel, Mel Gorman, akpm,
	Johannes Weiner, hughd

> I believe we do need some background compaction, especially
> to help allocations from network interrupts.

I completely agree.


> If you believe the compaction is better done from some
> other thread, I guess we could do that, but truthfully, if
> kswapd spends a lot of time doing compaction, I made a
> mistake somewhere :)

I don't have much experience of compaction on real production systems.
but I have a few bad experience of background lumpy reclaim. If much
network allocation is happen when kswapd get stucked large order lumpy
reclaim, kswapd can't work for making order-0 job.it was bad. I'm only
worry about similar issue will occur.

But, ok, we can fix it when we actually observed such thing. So,
please go ahead.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
@ 2012-01-11 19:57         ` KOSAKI Motohiro
  0 siblings, 0 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 19:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, aarcange, linux-kernel, Mel Gorman, akpm,
	Johannes Weiner, hughd

> I believe we do need some background compaction, especially
> to help allocations from network interrupts.

I completely agree.

> If you believe the compaction is better done from some
> other thread, I guess we could do that, but truthfully, if
> kswapd spends a lot of time doing compaction, I made a
> mistake somewhere :)

I don't have much experience of compaction on real production systems.
but I have a few bad experience of background lumpy reclaim. If much
network allocation is happen when kswapd get stucked large order lumpy
reclaim, kswapd can't work for making order-0 job.it was bad. I'm only
worry about similar issue will occur.

But, ok, we can fix it when we actually observed such thing. So,
please go ahead.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
  2012-01-10  2:33   ` Rik van Riel
@ 2012-01-12 16:12     ` Mel Gorman
  -1 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-01-12 16:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, aarcange, linux-kernel, akpm, KOSAKI Motohiro,
	Johannes Weiner, hughd

On Mon, Jan 09, 2012 at 09:33:57PM -0500, Rik van Riel wrote:
> With CONFIG_COMPACTION enabled, kswapd does not try to free
> contiguous free pages, even when it is woken for a higher order
> request.
> 
> This could be bad for eg. jumbo frame network allocations, which
> are done from interrupt context and cannot compact memory themselves.
> Higher than before allocation failure rates in the network receive
> path have been observed in kernels with compaction enabled.
> 
> Teach kswapd to defragment the memory zones in a node, but only
> if required and compaction is not deferred in a zone.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

heh, this is not the first time we had kswapd doing compaction. It was a
bit of a disaster the last time around but a lot has changed since.

Care is needed here though. There are patches in-bound that implement
sync-light. Part of that series includes a fix that makes compact_node
use sync compaction. This is fine for a process writing to /proc, not to
fine for kswapd. Watch for that.

> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index bb2bbdb..df31dab 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask,
>  			bool sync);
> +extern int compact_pgdat(pg_data_t *pgdat, int order, bool force);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
>  
>  /* Do not skip compaction more than 64 times */
> @@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  	return COMPACT_CONTINUE;
>  }
>  
> +static inline int compact_pgdat(pg_data_t *pgdat, int order, bool force)
> +{
> +	return COMPACT_CONTINUE;
> +}
> +
>  static inline unsigned long compaction_suitable(struct zone *zone, int order)
>  {
>  	return COMPACT_SKIPPED;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 1253d7a..1962a0e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -651,16 +651,11 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  
>  
>  /* Compact all zones within a node */
> -static int compact_node(int nid)
> +int compact_pgdat(pg_data_t *pgdat, int order, bool force)

force is a bad parameter to have as a VM-internal API to kswapd. By
and large I'm trying to move away from these sort of bools towards
compact_mode of sync, sync-light or async.

I would much prefer if what was exposed to kswapd was a compact_pgdat()
that created a suitable compact_control (async compaction I would
expect) and pass that to a static __compact_pgdat() that is shared by
both kswapd and compact_nodes.

That would stop spreading too much knowledge of compaction to kswapd.

>  {
>  	int zoneid;
> -	pg_data_t *pgdat;
>  	struct zone *zone;
>  
> -	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> -		return -EINVAL;
> -	pgdat = NODE_DATA(nid);
> -
>  	/* Flush pending updates to the LRU lists */
>  	lru_add_drain_all();
>  
> @@ -668,7 +663,7 @@ static int compact_node(int nid)
>  		struct compact_control cc = {
>  			.nr_freepages = 0,
>  			.nr_migratepages = 0,
> -			.order = -1,
> +			.order = order,
>  		};
>  
>  		zone = &pgdat->node_zones[zoneid];

Just to be clear, there is a fix that applies to here that initialises
sync and that would have caused problems for this patch.

> @@ -679,7 +674,8 @@ static int compact_node(int nid)
>  		INIT_LIST_HEAD(&cc.freepages);
>  		INIT_LIST_HEAD(&cc.migratepages);
>  
> -		compact_zone(zone, &cc);
> +		if (force || !compaction_deferred(zone))
> +			compact_zone(zone, &cc);
>  
>  		VM_BUG_ON(!list_empty(&cc.freepages));
>  		VM_BUG_ON(!list_empty(&cc.migratepages));
> @@ -688,6 +684,16 @@ static int compact_node(int nid)
>  	return 0;
>  }
>  
> +static int compact_node(int nid)
> +{
> +	pg_data_t *pgdat;
> +	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> +		return -EINVAL;
> +	pgdat = NODE_DATA(nid);
> +
> +	return compact_pgdat(pgdat, -1, true);
> +}
> +
>  /* Compact all nodes in the system */
>  static int compact_nodes(void)
>  {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0fb469a..65bf21db 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2522,6 +2522,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	int priority;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> +	int zones_need_compaction = 1;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_soft_reclaimed;
> @@ -2788,9 +2789,17 @@ out:
>  				goto loop_again;
>  			}
>  
> +			/* Check if the memory needs to be defragmented. */
> +			if (zone_watermark_ok(zone, order,
> +				    low_wmark_pages(zone), *classzone_idx, 0))
> +				zones_need_compaction = 0;
> +
>  			/* If balanced, clear the congested flag */
>  			zone_clear_flag(zone, ZONE_CONGESTED);
>  		}
> +
> +		if (zones_need_compaction)
> +			compact_pgdat(pgdat, order, false);

hmm, not fully convinced this is the right way of deciding if compaction
should be used. If patch 1 used compaction_suitable() to decide when to
stop reclaiming instead of a watermark check, we could then call
compact_pgdat() in the exit path to balance_pgdat(). That would minimise
the risk that kswapd gets stuck in compaction when it should be
reclaiming pages. What do you think?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction
@ 2012-01-12 16:12     ` Mel Gorman
  0 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-01-12 16:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, aarcange, linux-kernel, akpm, KOSAKI Motohiro,
	Johannes Weiner, hughd

On Mon, Jan 09, 2012 at 09:33:57PM -0500, Rik van Riel wrote:
> With CONFIG_COMPACTION enabled, kswapd does not try to free
> contiguous free pages, even when it is woken for a higher order
> request.
> 
> This could be bad for eg. jumbo frame network allocations, which
> are done from interrupt context and cannot compact memory themselves.
> Higher than before allocation failure rates in the network receive
> path have been observed in kernels with compaction enabled.
> 
> Teach kswapd to defragment the memory zones in a node, but only
> if required and compaction is not deferred in a zone.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

heh, this is not the first time we had kswapd doing compaction. It was a
bit of a disaster the last time around but a lot has changed since.

Care is needed here though. There are patches in-bound that implement
sync-light. Part of that series includes a fix that makes compact_node
use sync compaction. This is fine for a process writing to /proc, not to
fine for kswapd. Watch for that.

> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index bb2bbdb..df31dab 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask,
>  			bool sync);
> +extern int compact_pgdat(pg_data_t *pgdat, int order, bool force);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
>  
>  /* Do not skip compaction more than 64 times */
> @@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  	return COMPACT_CONTINUE;
>  }
>  
> +static inline int compact_pgdat(pg_data_t *pgdat, int order, bool force)
> +{
> +	return COMPACT_CONTINUE;
> +}
> +
>  static inline unsigned long compaction_suitable(struct zone *zone, int order)
>  {
>  	return COMPACT_SKIPPED;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 1253d7a..1962a0e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -651,16 +651,11 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  
>  
>  /* Compact all zones within a node */
> -static int compact_node(int nid)
> +int compact_pgdat(pg_data_t *pgdat, int order, bool force)

force is a bad parameter to have as a VM-internal API to kswapd. By
and large I'm trying to move away from these sort of bools towards
compact_mode of sync, sync-light or async.

I would much prefer if what was exposed to kswapd was a compact_pgdat()
that created a suitable compact_control (async compaction I would
expect) and pass that to a static __compact_pgdat() that is shared by
both kswapd and compact_nodes.

That would stop spreading too much knowledge of compaction to kswapd.

>  {
>  	int zoneid;
> -	pg_data_t *pgdat;
>  	struct zone *zone;
>  
> -	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> -		return -EINVAL;
> -	pgdat = NODE_DATA(nid);
> -
>  	/* Flush pending updates to the LRU lists */
>  	lru_add_drain_all();
>  
> @@ -668,7 +663,7 @@ static int compact_node(int nid)
>  		struct compact_control cc = {
>  			.nr_freepages = 0,
>  			.nr_migratepages = 0,
> -			.order = -1,
> +			.order = order,
>  		};
>  
>  		zone = &pgdat->node_zones[zoneid];

Just to be clear, there is a fix that applies to here that initialises
sync and that would have caused problems for this patch.

> @@ -679,7 +674,8 @@ static int compact_node(int nid)
>  		INIT_LIST_HEAD(&cc.freepages);
>  		INIT_LIST_HEAD(&cc.migratepages);
>  
> -		compact_zone(zone, &cc);
> +		if (force || !compaction_deferred(zone))
> +			compact_zone(zone, &cc);
>  
>  		VM_BUG_ON(!list_empty(&cc.freepages));
>  		VM_BUG_ON(!list_empty(&cc.migratepages));
> @@ -688,6 +684,16 @@ static int compact_node(int nid)
>  	return 0;
>  }
>  
> +static int compact_node(int nid)
> +{
> +	pg_data_t *pgdat;
> +	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> +		return -EINVAL;
> +	pgdat = NODE_DATA(nid);
> +
> +	return compact_pgdat(pgdat, -1, true);
> +}
> +
>  /* Compact all nodes in the system */
>  static int compact_nodes(void)
>  {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0fb469a..65bf21db 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2522,6 +2522,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	int priority;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> +	int zones_need_compaction = 1;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_soft_reclaimed;
> @@ -2788,9 +2789,17 @@ out:
>  				goto loop_again;
>  			}
>  
> +			/* Check if the memory needs to be defragmented. */
> +			if (zone_watermark_ok(zone, order,
> +				    low_wmark_pages(zone), *classzone_idx, 0))
> +				zones_need_compaction = 0;
> +
>  			/* If balanced, clear the congested flag */
>  			zone_clear_flag(zone, ZONE_CONGESTED);
>  		}
> +
> +		if (zones_need_compaction)
> +			compact_pgdat(pgdat, order, false);

hmm, not fully convinced this is the right way of deciding if compaction
should be used. If patch 1 used compaction_suitable() to decide when to
stop reclaiming instead of a watermark check, we could then call
compact_pgdat() in the exit path to balance_pgdat(). That would minimise
the risk that kswapd gets stuck in compaction when it should be
reclaiming pages. What do you think?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-01-12 16:12 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-10  2:31 [PATCH -mm 0/2] kswapd vs compaction improvements Rik van Riel
2012-01-10  2:31 ` Rik van Riel
2012-01-10  2:33 ` [PATCH -mm 1/2] mm: kswapd test order 0 watermarks when compaction is enabled Rik van Riel
2012-01-10  2:33   ` Rik van Riel
2012-01-12 15:46   ` Mel Gorman
2012-01-12 15:46     ` Mel Gorman
2012-01-10  2:33 ` [PATCH -mm 2/2] mm: kswapd carefully invoke compaction Rik van Riel
2012-01-10  2:33   ` Rik van Riel
2012-01-11  7:25   ` KOSAKI Motohiro
2012-01-11  7:25     ` KOSAKI Motohiro
2012-01-11 13:34     ` Rik van Riel
2012-01-11 13:34       ` Rik van Riel
2012-01-11 19:57       ` KOSAKI Motohiro
2012-01-11 19:57         ` KOSAKI Motohiro
2012-01-12 16:12   ` Mel Gorman
2012-01-12 16:12     ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.