* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
       [not found] <g0ia38-jj6.ln1@ppp121-45-136-118.lns11.adl6.internode.on.net>
@ 2011-02-22  7:37 ` Clemens Ladisch
  2011-02-22  7:46   ` Arthur Marsh
  2011-02-22 13:40   ` Andrea Arcangeli
  0 siblings, 2 replies; 31+ messages in thread
From: Clemens Ladisch @ 2011-02-22  7:37 UTC (permalink / raw)
  To: Arthur Marsh; +Cc: Andrea Arcangeli, alsa-user, linux-kernel

Arthur Marsh wrote:
> I'm experiencing MIDI playback slow-downs when I'm observing kswapd0
> active (a few percent of cpu in top output) in recent kernels.
>
> I git-bisected the problem down to:
>
> commit 5a03b051ed87e72b959f32a86054e1142ac4cf55
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Thu Jan 13 15:47:11 2011 -0800
>
>     thp: use compaction in kswapd for GFP_ATOMIC order > 0
>
>     This takes advantage of memory compaction to properly generate pages of
>     order > 0 if regular page reclaim fails and priority level becomes more
>     severe and we don't reach the proper watermarks.
>
>     Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> I ran git-bisect over the weekend, building and installing ALSA 1.0.24
> with each kernel. After identifying the above commit, I rebuilt the 2.6
> head with that commit reverted and verified that the problem was no
> longer present.

Apparently, huge page compaction disables interrupts for much too long.

> MIDI playback was via aplaymidi -p 17:0 to a Soundblaster Audigy 2 ZS
> (SB0350) wavetable.

The ALSA sequencer uses either the system timer or an HR timer at 1 kHz
to deliver MIDI commands (notes); the wavetable driver requires its own
interrupts in regular 5.3 ms intervals.


Regards,
Clemens

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22  7:37 ` [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 Clemens Ladisch
@ 2011-02-22  7:46   ` Arthur Marsh
  2011-02-22 13:40   ` Andrea Arcangeli
  1 sibling, 0 replies; 31+ messages in thread
From: Arthur Marsh @ 2011-02-22  7:46 UTC (permalink / raw)
  To: Clemens Ladisch; +Cc: Andrea Arcangeli, alsa-user, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1646 bytes --]

Clemens Ladisch wrote, on 22/02/11 18:07:
> Arthur Marsh wrote:
>> I'm experiencing MIDI playback slow-downs when I'm observing kswapd0
>> active (a few percent of cpu in top output) in recent kernels.
>>
>> I git-bisected the problem down to:
>>
>> commit 5a03b051ed87e72b959f32a86054e1142ac4cf55
>> Author: Andrea Arcangeli <aarcange@redhat.com>
>> Date:   Thu Jan 13 15:47:11 2011 -0800
>>
>>     thp: use compaction in kswapd for GFP_ATOMIC order > 0
>>
>>     This takes advantage of memory compaction to properly generate pages of
>>     order > 0 if regular page reclaim fails and priority level becomes more
>>     severe and we don't reach the proper watermarks.
>>
>>     Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>
>> I ran git-bisect over the weekend, building and installing ALSA 1.0.24
>> with each kernel. After identifying the above commit, I rebuilt the 2.6
>> head with that commit reverted and verified that the problem was no
>> longer present.
>
> Apparently, huge page compaction disables interrupts for much too long.
>
>> MIDI playback was via aplaymidi -p 17:0 to a Soundblaster Audigy 2 ZS
>> (SB0350) wavetable.
>
> The ALSA sequencer uses either the system timer or an HR timer at 1 kHz
> to deliver MIDI commands (notes); the wavetable driver requires its own
> interrupts in regular 5.3 ms intervals.
>
> Regards,
> Clemens

Hi, Andrea Arcangeli's "z1" patch (attached) solved the problem for me,
even with significant swap activity.

Regards,

Arthur.

[-- Attachment #2: z1 --]
[-- Type: text/plain, Size: 1412 bytes --]

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17497d0..5718bca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2385,7 +2385,6 @@ loop_again:
 		 * cause too much scanning of the lower zones.
 		 */
 		for (i = 0; i <= end_zone; i++) {
-			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
 
@@ -2408,7 +2407,7 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2416,24 +2415,23 @@ loop_again:
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 
-			compaction = 0;
+#if 0
 			if (order &&
 			    zone_watermark_ok(zone, 0,
 					      high_wmark_pages(zone),
 					      end_zone, 0) &&
 			    !zone_watermark_ok(zone, order,
 					       high_wmark_pages(zone),
-					       end_zone, 0)) {
+					       end_zone, 0))
 				compact_zone_order(zone,
 						   order,
 						   sc.gfp_mask, false,
 						   COMPACT_MODE_KSWAPD);
-				compaction = 1;
-			}
+#endif
 
 			if (zone->all_unreclaimable)
 				continue;
-			if (!compaction && nr_slab == 0 &&
+			if (nr_slab == 0 &&
 			    !zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22  7:37 ` [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 Clemens Ladisch
  2011-02-22  7:46   ` Arthur Marsh
@ 2011-02-22 13:40   ` Andrea Arcangeli
  2011-02-22 16:15     ` Andrea Arcangeli
  1 sibling, 1 reply; 31+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 13:40 UTC (permalink / raw)
  To: Clemens Ladisch; +Cc: Arthur Marsh, alsa-user, linux-kernel, Mel Gorman

On Tue, Feb 22, 2011 at 08:37:23AM +0100, Clemens Ladisch wrote:
> Arthur Marsh wrote:
> > I'm experiencing MIDI playback slow-downs when I'm observing kswapd0
> > active (a few percent of cpu in top output) in recent kernels.
> >
> > I git-bisected the problem down to:
> >
> > commit 5a03b051ed87e72b959f32a86054e1142ac4cf55
> > Author: Andrea Arcangeli <aarcange@redhat.com>
> > Date:   Thu Jan 13 15:47:11 2011 -0800
> >
> >     thp: use compaction in kswapd for GFP_ATOMIC order > 0
> >
> >     This takes advantage of memory compaction to properly generate pages of
> >     order > 0 if regular page reclaim fails and priority level becomes more
> >     severe and we don't reach the proper watermarks.
> >
> >     Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> >
> > I ran git-bisect over the weekend, building and installing ALSA 1.0.24
> > with each kernel. After identifying the above commit, I rebuilt the 2.6
> > head with that commit reverted and verified that the problem was no
> > longer present.
>
> Apparently, huge page compaction disables interrupts for much too long.
>
> > MIDI playback was via aplaymidi -p 17:0 to a Soundblaster Audigy 2 ZS
> > (SB0350) wavetable.
>
> The ALSA sequencer uses either the system timer or an HR timer at 1 kHz
> to deliver MIDI commands (notes); the wavetable driver requires its own
> interrupts in regular 5.3 ms intervals.

I asked Arthur to test this fix. We already know the attached z1 patch
fixed the problem 100%, but that was a debugging patch not meant for
merging; if the below works, I think we're done.

This is the same approach I'm also going to test in another benchmark
that also showed increased latencies, one that isn't related to
multimedia or midi but to pure nfs load with jumbo frames. The problem
is the same!

The below is untested, please let me know if it helps, because it may
not be enough.

=========
Subject: compaction: fix high latencies created by kswapd

From: Andrea Arcangeli <aarcange@redhat.com>

We need to cond_resched in the compaction loop to avoid hurting
latencies, and to stop being too aggressive in kswapd and let it go
into all_unreclaimable if needed (this is not order-aware logic, but
it's not worth being too aggressive with expensive compaction).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/compaction.c b/mm/compaction.c
index 8be430b..fef06c4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -274,6 +274,9 @@ static unsigned long isolate_migratepages(struct zone *zone,
 	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
+
+		cond_resched();
+
 		if (!pfn_valid_within(low_pfn))
 			continue;
 		nr_scanned++;
@@ -413,15 +416,6 @@ static int compact_finished(struct zone *zone,
 	if (cc->order == -1)
 		return COMPACT_CONTINUE;
 
-	/*
-	 * Generating only one page of the right order is not enough
-	 * for kswapd, we must continue until we're above the high
-	 * watermark as a pool for high order GFP_ATOMIC allocations
-	 * too.
-	 */
-	if (cc->compact_mode == COMPACT_MODE_KSWAPD)
-		return COMPACT_CONTINUE;
-
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		/* Job done if page is free of the right migratetype */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17497d0..564771c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2385,7 +2385,6 @@ loop_again:
 		 * cause too much scanning of the lower zones.
 		 */
 		for (i = 0; i <= end_zone; i++) {
-			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
 
@@ -2408,7 +2407,7 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2416,24 +2415,21 @@ loop_again:
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 
-			compaction = 0;
 			if (order &&
 			    zone_watermark_ok(zone, 0,
 					      high_wmark_pages(zone),
 					      end_zone, 0) &&
 			    !zone_watermark_ok(zone, order,
 					       high_wmark_pages(zone),
-					       end_zone, 0)) {
+					       end_zone, 0))
 				compact_zone_order(zone,
 						   order,
 						   sc.gfp_mask, false,
 						   COMPACT_MODE_KSWAPD);
-				compaction = 1;
-			}
 
 			if (zone->all_unreclaimable)
 				continue;
-			if (!compaction && nr_slab == 0 &&
+			if (nr_slab == 0 &&
 			    !zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22 13:40 ` Andrea Arcangeli
@ 2011-02-22 16:15   ` Andrea Arcangeli
  2011-02-22 16:59     ` Mel Gorman
  2011-02-22 17:47     ` Arthur Marsh
  0 siblings, 2 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 16:15 UTC (permalink / raw)
  To: Clemens Ladisch; +Cc: Arthur Marsh, alsa-user, linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 408 bytes --]

On Tue, Feb 22, 2011 at 02:40:47PM +0100, Andrea Arcangeli wrote:
>  	spin_lock_irq(&zone->lru_lock);
>  	for (; low_pfn < end_pfn; low_pfn++) {
>  		struct page *page;
> +
> +		cond_resched();
> +

My bad: see the above cond_resched() inside spin_lock_irq... oops.

I attached two replacement patches to apply in order (both of them
should be applied at the same time on top of git upstream, and they
shouldn't lock up this time).

[-- Attachment #2: kswapd-high_wmark --]
[-- Type: text/plain, Size: 1853 bytes --]

Subject: vmscan: kswapd must not free more than high_wmark pages

From: Andrea Arcangeli <aarcange@redhat.com>

When min_free_kbytes is set with "hugeadm --set-recommended-min_free_kbytes",
or with THP enabled (which runs the equivalent of "hugeadm
--set-recommended-min_free_kbytes" at boot to activate anti-frag at full
effectiveness automatically), the high wmark of some zone is as high as
~88M. 88M free on a 4G system isn't horrible, but 88M*8 = 704M free on a
4G system is definitely unbearable. This only tends to be visible on 4G
systems with a tiny over-4g zone, where kswapd insists on reaching the
high wmark on the over-4g zone, but in doing so it shrinks up to 704M
from the normal zone by mistake.

Even for the trivial case, where kswapd isn't woken until all zones hit
the low wmark and there is no concurrency between the allocator and
kswapd freeing, rotating the tiny above-4g lru more than "high-low"
pages, despite only "high-low" cache having been allocated into it,
doesn't sound obviously right either. A bigger gap looks to me like it
will do more harm than good, and if we need a real guarantee of
balancing we should rotate the allocations across the zones (a bigger
lru in a zone will require that zone to be hit more frequently because
it'll rotate slower than the other zones; the bias should not even
depend on the zone size but on the lru size).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2408,7 +2408,7 @@ loop_again:
 			 * zone has way too many pages free already.
 			 */
 			if (!zone_watermark_ok_safe(zone, order,
-					8*high_wmark_pages(zone), end_zone, 0))
+					high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,

[-- Attachment #3: compaction-kswapd --]
[-- Type: text/plain, Size: 2230 bytes --]

---
 mm/compaction.c |   19 ++++++++++---------
 mm/vmscan.c     |    8 ++------
 2 files changed, 12 insertions(+), 15 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -271,9 +271,19 @@ static unsigned long isolate_migratepage
 	}
 
 	/* Time to isolate some pages for migration */
+	cond_resched();
 	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
+
+		if (need_resched() || spin_is_contended(&zone->lru_lock)) {
+			if (fatal_signal_pending(current))
+				break;
+			spin_unlock_irq(&zone->lru_lock);
+			cond_resched();
+			spin_lock_irq(&zone->lru_lock);
+		}
+
 		if (!pfn_valid_within(low_pfn))
 			continue;
 		nr_scanned++;
@@ -413,15 +423,6 @@ static int compact_finished(struct zone
 	if (cc->order == -1)
 		return COMPACT_CONTINUE;
 
-	/*
-	 * Generating only one page of the right order is not enough
-	 * for kswapd, we must continue until we're above the high
-	 * watermark as a pool for high order GFP_ATOMIC allocations
-	 * too.
-	 */
-	if (cc->compact_mode == COMPACT_MODE_KSWAPD)
-		return COMPACT_CONTINUE;
-
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		/* Job done if page is free of the right migratetype */
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2385,7 +2385,6 @@ loop_again:
 		 * cause too much scanning of the lower zones.
 		 */
 		for (i = 0; i <= end_zone; i++) {
-			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
 
@@ -2416,24 +2415,21 @@ loop_again:
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 
-			compaction = 0;
 			if (order &&
 			    zone_watermark_ok(zone, 0,
 					      high_wmark_pages(zone),
 					      end_zone, 0) &&
 			    !zone_watermark_ok(zone, order,
 					       high_wmark_pages(zone),
-					       end_zone, 0)) {
+					       end_zone, 0))
 				compact_zone_order(zone,
 						   order,
 						   sc.gfp_mask, false,
 						   COMPACT_MODE_KSWAPD);
-				compaction = 1;
-			}
 
 			if (zone->all_unreclaimable)
 				continue;
-			if (!compaction && nr_slab == 0 &&
+			if (nr_slab == 0 &&
 			    !zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22 16:15 ` Andrea Arcangeli
@ 2011-02-22 16:59   ` Mel Gorman
  2011-02-22 17:08     ` Andrea Arcangeli
  1 sibling, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2011-02-22 16:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Clemens Ladisch, Arthur Marsh, alsa-user, linux-kernel

On Tue, Feb 22, 2011 at 05:15:13PM +0100, Andrea Arcangeli wrote:
> ---
>  mm/compaction.c |   19 ++++++++++---------
>  mm/vmscan.c     |    8 ++------
>  2 files changed, 12 insertions(+), 15 deletions(-)
>
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -271,9 +271,19 @@ static unsigned long isolate_migratepage
>  	}
>  
>  	/* Time to isolate some pages for migration */
> +	cond_resched();
>  	spin_lock_irq(&zone->lru_lock);
>  	for (; low_pfn < end_pfn; low_pfn++) {
>  		struct page *page;
> +
> +		if (need_resched() || spin_is_contended(&zone->lru_lock)) {
> +			if (fatal_signal_pending(current))
> +				break;
> +			spin_unlock_irq(&zone->lru_lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lru_lock);
> +		}
> +

There is a small chance that if the lock is contended, the current CPU
will simply reacquire the lock. Any idea how likely that is? The
need_resched() check itself seems reasonable and should reduce the
length of time interrupts are disabled.

>  		if (!pfn_valid_within(low_pfn))
>  			continue;
>  		nr_scanned++;
> @@ -413,15 +423,6 @@ static int compact_finished(struct zone
>  	if (cc->order == -1)
>  		return COMPACT_CONTINUE;
>  
> -	/*
> -	 * Generating only one page of the right order is not enough
> -	 * for kswapd, we must continue until we're above the high
> -	 * watermark as a pool for high order GFP_ATOMIC allocations
> -	 * too.
> -	 */
> -	if (cc->compact_mode == COMPACT_MODE_KSWAPD)
> -		return COMPACT_CONTINUE;
> -

Why is this change necessary? kswapd may go to sleep sooner as a result
of this change but it doesn't affect the length of time interrupts are
disabled. Some other latency problem you've found?

>  	/* Direct compactor: Is a suitable page free? */
>  	for (order = cc->order; order < MAX_ORDER; order++) {
>  		/* Job done if page is free of the right migratetype */
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2385,7 +2385,6 @@ loop_again:
>  		 * cause too much scanning of the lower zones.
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
> -			int compaction;
>  			struct zone *zone = pgdat->node_zones + i;
>  			int nr_slab;
>  
> @@ -2416,24 +2415,21 @@ loop_again:
>  			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  			total_scanned += sc.nr_scanned;
>  
> -			compaction = 0;
>  			if (order &&
>  			    zone_watermark_ok(zone, 0,
>  					      high_wmark_pages(zone),
>  					      end_zone, 0) &&
>  			    !zone_watermark_ok(zone, order,
>  					       high_wmark_pages(zone),
> -					       end_zone, 0)) {
> +					       end_zone, 0))
>  				compact_zone_order(zone,
>  						   order,
>  						   sc.gfp_mask, false,
>  						   COMPACT_MODE_KSWAPD);
> -				compaction = 1;
> -			}
>  
>  			if (zone->all_unreclaimable)
>  				continue;
> -			if (!compaction && nr_slab == 0 &&
> +			if (nr_slab == 0 &&
>  			    !zone_reclaimable(zone))
>  				zone->all_unreclaimable = 1;

I'm not seeing how this change is related to interrupts either. The
intention of the current code is that, after compaction, a zone should
not be considered all_unreclaimable. The reason is that there was
enough free memory before compaction started, but compaction takes some
time, during which kswapd is not reclaiming pages at all. The view of
the zone before and after compaction is not directly related to
all_unreclaimable, so all_unreclaimable should only be set after
shrinking a zone when there is insufficient free memory to meet
watermarks.

-- 
Mel Gorman
SUSE Labs
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22 16:59 ` Mel Gorman
@ 2011-02-22 17:08   ` Andrea Arcangeli
  2011-02-22 17:37     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 17:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Clemens Ladisch, Arthur Marsh, alsa-user, linux-kernel

On Tue, Feb 22, 2011 at 04:59:45PM +0000, Mel Gorman wrote:
> There is a small chance that if the lock is contended, the current CPU
> will simply reacquire the lock. Any idea how likely that is? The
> need_resched() check itself seems reasonable and should reduce the
> length of time interrupts are disabled.

If the loop is short, the contention probability should be small. I
mostly added it because that's the way cond_resched_lock() does it; I
thought it was better anyway.

> Why is this change necessary? kswapd may go to sleep sooner as a result
> of this change but it doesn't affect the length of time interrupts are
> disabled. Some other latency problem you've found?

It's not. But I don't want to run more than one loop; otherwise I'm
afraid that kswapd will generate too high a load.

> I'm not seeing how this change is related to interrupts either. The intention
> of the current code is that after compaction, a zone should not be considered
> all_unreclaimable. The reason is that there was enough free memory
> before compaction started but compaction takes some time during which
> kswapd is not reclaiming pages at all. The view of the zone before and
> after compaction is not directly related to all_unreclaimable so
> all_unreclaimable should only be set after shrinking a zone and there is
> insufficient free memory to meet watermarks.

There is not just the interrupt issue. There's also the problem that
kswapd is generating too high a load. And I'm afraid what can happen is
that kswapd should go into the all_unreclaimable state and it doesn't,
because there was also a high order allocation in the mix. So I prefer
to obey the order=0 all_unreclaimable logic with higher priority. The
free-at-most-one-page change above is also there to run at most one
scan over all pfns before putting kswapd into the all_unreclaimable
state. The probability that a GFP_ATOMIC allocation improves
performance, thanks to being "jumbo", by more than one entire scan of
the pfns in the system sounds quite small. If all goes well, kswapd
will generate more than one atomic page. Also, it's good to keep the
COMPACT_MODE_KSWAPD mode to differentiate the low/high wmark (with
kswapd checking the high one if not even a page of the right order is
available).
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22 17:08 ` Andrea Arcangeli
@ 2011-02-22 17:37   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2011-02-22 17:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Clemens Ladisch, Arthur Marsh, alsa-user, linux-kernel

On Tue, Feb 22, 2011 at 06:08:50PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 22, 2011 at 04:59:45PM +0000, Mel Gorman wrote:
> > There is a small chance that if the lock is contended, the current CPU
> > will simply reacquire the lock. Any idea how likely that is? The
> > need_resched() check itself seems reasonable and should reduce the
> > length of time interrupts are disabled.
>
> If the loop is short the contention probability should be small. I
> mostly added it because that's the way cond_resched_lock does it. I
> thought it was better anyway.

Ok.

> > Why is this change necessary? kswapd may go to sleep sooner as a result
> > of this change but it doesn't affect the length of time interrupts are
> > disabled. Some other latency problem you've found?
>
> It's not. But I don't want to run more than 1 loop. Otherwise I'm
> afraid that kswapd will generate a too high load.

It's a possibility. The intention was to keep compacting for high-order
GFP_ATOMIC allocations but granted, this is not a strong justification.
It occurred to me as well that while kswapd is doing this, no pages are
being reclaimed. This could result in direct reclaimers being more
frequent. I don't have data on how much this helps GFP_ATOMIC
allocations, but it's easier to imagine how it could increase latencies
due to increased direct reclaim.

> > I'm not seeing how this change is related to interrupts either. The intention
> > of the current code is that after compaction, a zone should not be considered
> > all_unreclaimable. The reason is that there was enough free memory
> > before compaction started but compaction takes some time during which
> > kswapd is not reclaiming pages at all. The view of the zone before and
> > after compaction is not directly related to all_unreclaimable so
> > all_unreclaimable should only be set after shrinking a zone and there is
> > insufficient free memory to meet watermarks.
>
> There is not just the interrupt issue. There's also a problem that
> kswapd is generating a too high load. And I'm afraid what can happen
> is that kswapd should go in all unreclaimable state and it doesn't
> because there was also a high order allocation in the mix.

Why should it go into an all_unreclaimable state after compaction when
it hasn't been reclaiming pages, though? A side-effect of
all_unreclaimable is that the zone is considered balanced, and so
kswapd will consume less CPU by going to sleep because "all zones are
balanced", but it feels like accidental behaviour.

> So I prefer to obey the order=0 all unreclaimable logic with higher
> priority. The freeing-max-one-page change above is also to run max 1
> scan over all pfns before putting kswapd in all unreclaimable state.
> The probability that a GFP_ATOMIC allocation improves performance
> thanks to being "jumbo" more than one entire scan of the pfns in the
> system sounds quite small. If all goes well kswapd will generate more
> than one atomic page. Also it's good to keep the COMPACT_MODE_KSWAPD
> mode to differentiate the low/high wmark (with kswapd checking the
> high one if not even a page of the right order is available).

Making kswapd more aggressive in compaction was intended to help
high-order GFP_ATOMIC allocations. If their success is no longer a big
issue and failures are infrequent and tolerated, then it's ok to allow
kswapd to sleep earlier. Unfortunately, I don't have any testcases that
exercise these types of allocations, but it'd be nice if those tests
could be rerun.

So, of the three changes in the patch (which hopefully will be three
patches eventually):

Change 1 reduces the time interrupts are disabled. Hard to argue with
that - the new behaviour is reasonable.

Change 2 makes kswapd give up compaction earlier and go back to
reclaiming pages. Potentially kswapd will go to sleep sooner and
consume less CPU. At worst, high-order GFP_ATOMIC allocations may fail
more frequently. It'd be nice to test the relevant workloads again to
make sure they are not impaired. If they are not, then kswapd going
back to sleep sooner is desirable and the change makes sense.

Change 3 potentially puts kswapd to sleep sooner, but it's marking a
zone all_unreclaimable when it's not necessarily in that state.
Potentially, kswapd for order-0 will later skip over that zone and
reclaim no pages from it until a page is freed in that zone, resetting
the flag. Doesn't seem right :(

-- 
Mel Gorman
SUSE Labs
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22 16:15 ` Andrea Arcangeli
  2011-02-22 16:59   ` Mel Gorman
@ 2011-02-22 17:47   ` Arthur Marsh
  2011-02-22 19:43     ` Andrea Arcangeli
  2011-02-23 16:24     ` Andrea Arcangeli
  1 sibling, 2 replies; 31+ messages in thread
From: Arthur Marsh @ 2011-02-22 17:47 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman

Andrea Arcangeli wrote, on 23/02/11 02:45:
> On Tue, Feb 22, 2011 at 02:40:47PM +0100, Andrea Arcangeli wrote:
>>  	spin_lock_irq(&zone->lru_lock);
>>  	for (; low_pfn < end_pfn; low_pfn++) {
>>  		struct page *page;
>> +
>> +		cond_resched();
>> +
>
> my bad, see the above spin_lock_irq oops...
>
> I attached two replacement patches to apply in order (both of them
> should be applied at the same time on top of git upstream, and they
> shouldn't lockup this time).

OK, these patches applied together against upstream didn't cause a
crash, but I did observe:

- significant slowdowns of MIDI playback (more so than in previous
  cases, and with less than 20 Meg of swap file in use);

- kswapd0 sharing equal top place in CPU usage at times (e.g. 20
  percent).

If I should try only one of the patches or something else entirely,
please let me know.

Regards,

Arthur.
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-22 17:47 ` Arthur Marsh
@ 2011-02-22 19:43   ` Andrea Arcangeli
  2011-02-23  9:15     ` Mel Gorman
  1 sibling, 1 reply; 31+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 19:43 UTC (permalink / raw)
  To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman

On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote:
> OK, these patches applied together against upstream didn't cause a crash
> but I did observe:
>
> significant slowdowns of MIDI playback (moreso than in previous cases,
> and with less than 20 Meg of swap file in use);
>
> kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent).
>
> If I should try only one of the patches or something else entirely,
> please let me know.

For Mel: with z1, kswapd used only 0.1-3.9 percent of CPU while he
loaded other applications.

We may need a way to put kswapd into an all-uncompactable mode to solve
this; logic 3, which merely tries not to disable the all_unreclaimable
logic, seems not enough. I.e. if compact_zone_order() doesn't return
COMPACT_COMPLETE, stop the compaction loop in kswapd. Then we can put
back the "COMPACT_MODE_KSWAPD returns COMPACT_CONTINUE" check in
compact_finished(), as the caller will throttle it (and it won't run
more than one scan before putting kswapd to sleep in all-uncompactable
mode).
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-22 19:43 ` Andrea Arcangeli @ 2011-02-23 9:15 ` Mel Gorman 2011-02-23 11:41 ` Arthur Marsh 0 siblings, 1 reply; 31+ messages in thread From: Mel Gorman @ 2011-02-23 9:15 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Arthur Marsh, Clemens Ladisch, alsa-user, linux-kernel On Tue, Feb 22, 2011 at 08:43:25PM +0100, Andrea Arcangeli wrote: > On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: > > OK, these patches applied together against upstream didn't cause a crash > > but I did observe: > > > > significant slowdowns of MIDI playback (moreso than in previous cases, > > and with less than 20 Meg of swap file in use); > > Can this be replicated with on-board audio hardware or is there something specific about the card? e.g. those typically driven by snd-intel8x0 or snd_hda_intel Unfortunately the original mail is a bit light on details on how this was reproduced and I didn't find a thread with more details. It looks like it's simply playing a midi file while the system is under load but less clear on what the symptoms are (audio skipping maybe?). I'll start with using irqs-off tracer to see can I replicate a similar style of issue without depending on sound. > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). > > > > If I should try only one of the patches or something else entirely, > > please let me know. > > For Mel: with z1, kswapd used only 0.1-3.9 percent of CPU while he > loaded other applications. > What's the usage otherwise? As that patch has been NAKd by Rik on the grounds it eliminates the "balance gap" entirely, can it be checked if the patch below keeps the CPU usage low as well please? > We may need a way to put kswapd in all uncompactable mode to solve > this, logic 3 just trying not to disable the all unreclaimable logic > seems not enough. I.e. 
if compact_zone_order doesn't return > COMPACT_COMPLETE, stop the compaction loop in kswapd. Then we can put > back in the COMPACT_MODE_KSWAPD return COMPACT_CONTINUE in > compact_finished as the caller will throttle it (and it won't run more > than one scan before putting kswapd to sleep in all uncompactable mode). > Should be possible. I'm not going to develop such a patch now though and instead have a stab at replicating some of the exhibited problems (high kswapd CPU usage, long interrupt disabled times etc). Can the following patch be tested as a potential replacement for z1 please? ==== CUT HERE ==== vmscan: kswapd should not free an excessive number of pages when balancing small zones When reclaiming for order-0 pages, kswapd requires that all zones be balanced. Each cycle through balance_pgdat() does background ageing on all zones if necessary and applies equal pressure on the inactive zone unless a lot of pages are free already. A "lot of free pages" is defined as a "balance gap" above the high watermark which is currently 7*high_watermark. Historically this was reasonable as min_free_kbytes was small. However, on systems using huge pages, it is recommended that min_free_kbytes is higher and it is tuned with hugeadm --set-recommended-min_free_kbytes. With the introduction of transparent huge page support, this recommended value is also applied. On X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would expect around 68M of memory to be free. The Normal zone is approximately 35000 pages so under even normal memory pressure such as copying a large file, it gets exhausted quickly. As it is getting exhausted, kswapd applies pressure equally to all zones, including the DMA32 zone. DMA32 is approximately 700,000 pages with a high watermark of around 23,000 pages. In this situation, kswapd will reclaim around (23000*8 where 8 is the high watermark + balance gap of 7 * high watermark) pages or 718M of pages before the zone is ignored. 
What the user sees is that free memory is far higher than it should be. To avoid an excessive number of pages being reclaimed from the larger zones, this patch explicitly defines the "balance gap" to be either 1% of the zone or the low watermark for the zone, whichever is smaller. While kswapd will check all zones to apply pressure, it'll ignore zones that meet the (high_wmark + balance_gap) watermark. To test this, 80G were copied from a partition and the amount of memory being used was recorded. A comparison of a patched and unpatched kernel can be seen at http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps and shows that kswapd is not reclaiming as much memory with the patch applied. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/swap.h | 9 +++++++++ mm/vmscan.c | 16 +++++++++++++--- 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4d55932..a57c6e7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -155,6 +155,15 @@ enum { #define SWAP_CLUSTER_MAX 32 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX +/* + * Ratio between the present memory in the zone and the "gap" that + * we're allowing kswapd to shrink in addition to the per-zone high + * wmark, even for zones that already have the high wmark satisfied, + * in order to provide better per-zone lru behavior. We are ok to + * spend not more than 1% of the memory for this zone balancing "gap".
+ */ +#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 + #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 17497d0..0c83530 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2388,6 +2388,7 @@ loop_again: int compaction; struct zone *zone = pgdat->node_zones + i; int nr_slab; + unsigned long balance_gap; if (!populated_zone(zone)) continue; @@ -2404,11 +2405,20 @@ loop_again: mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask); /* - * We put equal pressure on every zone, unless one - * zone has way too many pages free already. + * We put equal pressure on every zone, unless + * one zone has way too many pages free + * already. The "too many pages" is defined + * as the high wmark plus a "gap" where the + * gap is either the low watermark or 1% + * of the zone, whichever is smaller. */ + balance_gap = min(low_wmark_pages(zone), + (zone->present_pages + + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / + KSWAPD_ZONE_BALANCE_GAP_RATIO); if (!zone_watermark_ok_safe(zone, order, - 8*high_wmark_pages(zone), end_zone, 0)) + high_wmark_pages(zone) + balance_gap, + end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 9:15 ` Mel Gorman @ 2011-02-23 11:41 ` Arthur Marsh 2011-02-23 13:50 ` Clemens Ladisch 2011-02-23 17:01 ` Mel Gorman 0 siblings, 2 replies; 31+ messages in thread From: Arthur Marsh @ 2011-02-23 11:41 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrea Arcangeli, Clemens Ladisch, alsa-user, linux-kernel Mel Gorman wrote, on 23/02/11 19:45: > On Tue, Feb 22, 2011 at 08:43:25PM +0100, Andrea Arcangeli wrote: >> On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: >>> OK, these patches applied together against upstream didn't cause a crash >>> but I did observe: >>> >>> significant slowdowns of MIDI playback (moreso than in previous cases, >>> and with less than 20 Meg of swap file in use); >>> > > Can this be replicated with on-board audio hardware or is there something > specific about the card? e.g. those typically driven by snd-intel8x0 or > snd_hda_intel This card (Soundblaster Audigy 2 ZS - SB0350) has an on-board wavetable synthesiser. To perform a test with sound hardware included on a motherboard, one would need to connect the PC to an external synthesiser to play the MIDI signal from the sound hardware on the motherboard. To quote an earlier posting from Clemens Ladisch: "The ALSA sequencer uses either the system timer or an HR timer at 1 kHz to deliver MIDI commands (notes); the wavetable driver requires its own interrupts in regular 5.3 ms intervals." The wavetable interrupts are apparently being lost. I am not getting lost MIDI events (i.e. notes going missing or being stuck on), but a slowdown in playback. > > Unfortunately the original mail is a bit light on details of how this was > reproduced and I didn't find a thread with more details.
It looks like it's > simply playing a MIDI file while the system is under load, but it is less > clear what the symptoms are (audio skipping maybe?). I'll start by using > the irqs-off tracer to see whether I can replicate a similar style of issue without > depending on sound. Clemens, would this work to identify the problem without relying on a device such as a sound card with a wavetable synthesiser or external synthesiser receiving MIDI signals from the PC? > >>> kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). >>> >>> If I should try only one of the patches or something else entirely, >>> please let me know. >> >> For Mel: with z1, kswapd used only 0.1-3.9 percent of CPU while he >> loaded other applications. >> > > What's the usage otherwise? > > As that patch has been NAKd by Rik on the grounds it eliminates the "balance > gap" entirely, can you check whether the patch below keeps the CPU usage low > as well, please? Unfortunately, when I loaded KDE 3.5.X, konversation, aptitude -u, and icedove with Mel's patch in the last few minutes, kswapd0 ran up to 17.5 percent CPU usage and the playback using aplaymidi slowed down. Load was around 6 and swap usage had barely started when I noticed the slowdown in MIDI playback. By contrast, with the "z1" patch I could load the PC as above and also start iceweasel with a few dozen tabs open, and play the MIDI file using the Qt application Rosegarden instead of aplaymidi, without any slowdown, even though the swap usage was up around 200 MB and load was over 10.0. > >> We may need a way to put kswapd in all uncompactable mode to solve >> this, logic 3 just trying not to disable the all unreclaimable logic >> seems not enough. I.e. if compact_zone_order doesn't return >> COMPACT_COMPLETE, stop the compaction loop in kswapd.
Then we can put >> back in the COMPACT_MODE_KSWAPD return COMPACT_CONTINUE in >> compact_finished as the caller will throttle it (and it won't run more >> than one scan before putting kswapd to sleep in all uncompactable mode). >> > > Should be possible. I'm not going to develop such a patch now though and > instead have a stab at replicating some of the exhibited problems (high > kswapd CPU usage, long interrupt disabled times etc). > > Can the following patch be tested as a potential replacement for z1 > please? > > ==== CUT HERE ==== > vmscan: kswapd should not free an excessive number of pages when balancing small zones > > When reclaiming for order-0 pages, kswapd requires that all zones be > balanced. Each cycle through balance_pgdat() does background ageing on all > zones if necessary and applies equal pressure on the inactive zone unless > a lot of pages are free already. > > A "lot of free pages" is defined as a "balance gap" above the high watermark > which is currently 7*high_watermark. Historically this was reasonable as > min_free_kbytes was small. However, on systems using huge pages, it is > recommended that min_free_kbytes is higher and it is tuned with hugeadm > --set-recommended-min_free_kbytes. With the introduction of transparent > huge page support, this recommended value is also applied. On X86-64 with > 4G of memory, min_free_kbytes becomes 67584 so one would expect around 68M > of memory to be free. The Normal zone is approximately 35000 pages so under > even normal memory pressure such as copying a large file, it gets exhausted > quickly. As it is getting exhausted, kswapd applies pressure equally to all > zones, including the DMA32 zone. DMA32 is approximately 700,000 pages with > a high watermark of around 23,000 pages. In this situation, kswapd will > reclaim around (23000*8 where 8 is the high watermark + balance gap of 7 * > high watermark) pages or 718M of pages before the zone is ignored. 
What > the user sees is that free memory is far higher than it should be. > > To avoid an excessive number of pages being reclaimed from the larger zones, > this patch explicitly defines the "balance gap" to be either 1% of the zone or the > low watermark for the zone, whichever is smaller. While kswapd will check > all zones to apply pressure, it'll ignore zones that meet the (high_wmark + > balance_gap) watermark. > > To test this, 80G were copied from a partition and the amount of memory > being used was recorded. A comparison of a patched and unpatched kernel > can be seen at > http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps > and shows that kswapd is not reclaiming as much memory with the patch > applied. > > Signed-off-by: Andrea Arcangeli<aarcange@redhat.com> > Signed-off-by: Mel Gorman<mel@csn.ul.ie> > --- > include/linux/swap.h | 9 +++++++++ > mm/vmscan.c | 16 +++++++++++++--- > 2 files changed, 22 insertions(+), 3 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 4d55932..a57c6e7 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -155,6 +155,15 @@ enum { > #define SWAP_CLUSTER_MAX 32 > #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX > > +/* > + * Ratio between the present memory in the zone and the "gap" that > + * we're allowing kswapd to shrink in addition to the per-zone high > + * wmark, even for zones that already have the high wmark satisfied, > + * in order to provide better per-zone lru behavior. We are ok to > + * spend not more than 1% of the memory for this zone balancing "gap".
> + */ > +#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 > + > #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ > #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ > #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 17497d0..0c83530 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2388,6 +2388,7 @@ loop_again: > int compaction; > struct zone *zone = pgdat->node_zones + i; > int nr_slab; > + unsigned long balance_gap; > > if (!populated_zone(zone)) > continue; > @@ -2404,11 +2405,20 @@ loop_again: > mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask); > > /* > - * We put equal pressure on every zone, unless one > - * zone has way too many pages free already. > + * We put equal pressure on every zone, unless > + * one zone has way too many pages free > + * already. The "too many pages" is defined > + * as the high wmark plus a "gap" where the > + * gap is either the low watermark or 1% > + * of the zone, whichever is smaller. > */ > + balance_gap = min(low_wmark_pages(zone), > + (zone->present_pages + > + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / > + KSWAPD_ZONE_BALANCE_GAP_RATIO); > if (!zone_watermark_ok_safe(zone, order, > - 8*high_wmark_pages(zone), end_zone, 0)) > + high_wmark_pages(zone) + balance_gap, > + end_zone, 0)) > shrink_zone(priority, zone,&sc); > reclaim_state->reclaimed_slab = 0; > nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, > Regards, Arthur. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 11:41 ` Arthur Marsh @ 2011-02-23 13:50 ` Clemens Ladisch 2011-02-23 17:01 ` Mel Gorman 1 sibling, 0 replies; 31+ messages in thread From: Clemens Ladisch @ 2011-02-23 13:50 UTC (permalink / raw) To: Arthur Marsh, Mel Gorman; +Cc: Andrea Arcangeli, alsa-user, linux-kernel Arthur Marsh wrote: > Mel Gorman wrote, on 23/02/11 19:45: > > Unfortunately the original mail is a bit light on details of how this was > > reproduced and I didn't find a thread with more details. It looks like it's > > simply playing a MIDI file while the system is under load, but it is less > > clear what the symptoms are (audio skipping maybe?). The synthesizer hardware continues to generate sound even if the computer does not update its settings; it's just that the timings are off. Even a few milliseconds of jitter are audible. > > I'll start by using the irqs-off tracer to see whether I can replicate > > a similar style of issue without depending on sound. > > Clemens, would this work to identify the problem without relying on a > device such as a sound card with a wavetable synthesiser or external > synthesiser receiving MIDI signals from the PC? Yes (assuming that it is indeed delayed interrupts that cause the slowdowns, but I see no other mechanism by which memory compaction could affect the sequencer timer). Any other job with realtime constraints should be similarly affected, but the tracer measures the problem directly at its source. Regards, Clemens ^ permalink raw reply [flat|nested] 31+ messages in thread
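For readers who want to reproduce this kind of measurement themselves, the stock irqsoff latency tracer can record the single worst interrupts-disabled path without any sound hardware. A rough session might look like the following; it assumes a kernel built with CONFIG_IRQSOFF_TRACER, root privileges, and debugfs mounted at /sys/kernel/debug (the workload line is a placeholder, not a command):

```
# cd /sys/kernel/debug/tracing
# echo 0 > tracing_max_latency    # reset the recorded worst case
# echo irqsoff > current_tracer   # trace the longest irqs-off section
# echo 1 > tracing_on
  ... play the MIDI file while generating memory pressure ...
# echo 0 > tracing_on
# cat trace                       # worst irqs-off path and its length in us
```

The trace output names the functions that disabled and re-enabled interrupts, which is how a long-running offender such as a compaction scan can be identified directly at its source.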
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 11:41 ` Arthur Marsh 2011-02-23 13:50 ` Clemens Ladisch @ 2011-02-23 17:01 ` Mel Gorman 2011-02-23 17:40 ` Andrea Arcangeli 1 sibling, 1 reply; 31+ messages in thread From: Mel Gorman @ 2011-02-23 17:01 UTC (permalink / raw) To: Arthur Marsh; +Cc: Andrea Arcangeli, Clemens Ladisch, alsa-user, linux-kernel On Wed, Feb 23, 2011 at 10:11:33PM +1030, Arthur Marsh wrote: > > > Mel Gorman wrote, on 23/02/11 19:45: >> On Tue, Feb 22, 2011 at 08:43:25PM +0100, Andrea Arcangeli wrote: >>> On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: >>>> OK, these patches applied together against upstream didn't cause a crash >>>> but I did observe: >>>> >>>> significant slowdowns of MIDI playback (moreso than in previous cases, >>>> and with less than 20 Meg of swap file in use); >>>> >> >> Can this be replicated with on-board audio hardware or is there something >> specific about the card? e.g. those typically driven by snd-intel8x0 or >> snd_hda_intel > > This card (Soundblaster Audigy 2 ZS - SB0350) has an on-board wavetable > synthesiser. To perform a test with sound hardware included on a > motherboard, one would need to connect the PC to an external synthesiser > to play the MIDI signal from the sound hardware on the motherboard. > Nuts. Nonetheless, I wrote some tools to track the worst IRQ-disabled latencies over time and identify the top offenders. Sure enough, in my own very basic tests, compaction was showing up as disabling IRQs for a massive length of time. It wasn't a common occurrence in my tests, but they are very basic, and it's not necessarily the *only* source of IRQs being disabled for too long; it was just the first one I found. I regret that yet more patches are being blasted around, but I hope the included figures will convince you to run yet-another-test.
I don't know how you are measuring how long IRQs are disabled or when it becomes a problem, but that is what I'm more interested in right now than the kswapd CPU usage, for which I'm still in the process of gathering data to analyse. ==== CUT HERE ==== mm: compaction: Minimise the time IRQs are disabled while isolating free pages compaction_alloc() isolates free pages to be used as migration targets. While it is scanning, IRQs are disabled on the mistaken assumption that the scanning would be short. Analysis showed that IRQs were in fact being disabled for a substantial length of time. A simple test was run using large anonymous mappings with transparent hugepage support enabled to trigger frequent compactions. A monitor sampled what the worst IRQs-off latencies were and a post-processing tool found the following: Event compaction_alloc..compaction_alloc 3014 us count 1 Event compaction_alloc..compaction_alloc 2954 us count 1 Event compaction_alloc..compaction_alloc 1803 us count 1 Event compaction_alloc..compaction_alloc 1303 us count 1 Event compaction_alloc..compaction_alloc 1291 us count 1 Event compaction_alloc..compaction_alloc 911 us count 1 Event compaction_alloc..compaction_alloc 753 us count 1 Event compaction_alloc..compaction_alloc 747 us count 1 Event compaction_alloc..compaction_alloc 610 us count 1 Event split_huge_page..add_to_swap 583 us count 1 Event split_huge_page..add_to_swap 531 us count 1 Event split_huge_page..add_to_swap 262 us count 1 Event split_huge_page..add_to_swap 258 us count 1 Event split_huge_page..add_to_swap 256 us count 1 Event split_huge_page..add_to_swap 252 us count 1 Event split_huge_page..add_to_swap 248 us count 1 Event split_huge_page..add_to_swap 247 us count 2 Event compaction_alloc..compaction_alloc 134 us count 1 Event shrink_inactive_list..shrink_zone 85 us count 1 Event ftrace_module_notify..ftrace_module_notify 80 us count 1 Event shrink_inactive_list..shrink_zone 76 us count 1 Event shrink_inactive_list..shrink_zone 47 us count 1
Event save_args..restore_args 40 us count 1 Event save_args..restore_args 39 us count 1 Event save_args..call_softirq 38 us count 1 Event save_args..call_softirq 36 us count 2 Event save_args..call_softirq 35 us count 2 Event save_args..call_softirq 34 us count 2 Event drain_array..cache_reap 31 us count 1 Event cfq_kick_queue..__blk_run_queue 29 us count 1 Event save_args..restore_args 28 us count 1 Event save_args..call_softirq 28 us count 3 Event save_args..call_softirq 27 us count 2 Event save_args..restore_args 27 us count 1 Event mempool_free_slab..mempool_free_slab 26 us count 1 Event drain_array..cache_reap 26 us count 2 Event save_args..call_softirq 26 us count 10 Event mempool_free_slab..mempool_free_slab 25 us count 1 Event save_args..call_softirq 25 us count 4 Event save_args..call_softirq 23 us count 2 Event ____pagevec_lru_add..__lru_cache_add 2 us count 1 i.e. compaction is disabled IRQs for a prolonged period of time - 3ms in one instance. The full report generated by the tool can be found at http://www.csn.ul.ie/~mel/postings/irqs-disabled-2.6.38-rc6.txt . This patch reduces the time IRQs are disabled by simply disabling IRQs at the last possible minute. 
An updated IRQs-off summary report then looks like; Event shrink_inactive_list..shrink_zone 2075 us count 1 Event shrink_inactive_list..shrink_zone 686 us count 1 Event split_huge_page..add_to_swap 535 us count 1 Event split_huge_page..add_to_swap 269 us count 1 Event split_huge_page..add_to_swap 265 us count 1 Event split_huge_page..add_to_swap 264 us count 1 Event shrink_inactive_list..shrink_zone 261 us count 1 Event split_huge_page..add_to_swap 255 us count 1 Event split_huge_page..add_to_swap 253 us count 1 Event split_huge_page..add_to_swap 252 us count 1 Event split_huge_page..add_to_swap 250 us count 1 Event split_huge_page..add_to_swap 239 us count 1 Event split_huge_page..add_to_swap 237 us count 1 Event shrink_inactive_list..shrink_zone 205 us count 1 Event compact_zone..compact_zone_order 162 us count 1 Event compact_zone..compact_zone_order 136 us count 1 Event compact_zone..compact_zone_order 134 us count 1 Event compact_zone..compact_zone_order 114 us count 1 Event shrink_inactive_list..shrink_zone 114 us count 1 Event compact_zone..compact_zone_order 111 us count 1 Event shrink_inactive_list..shrink_zone 95 us count 1 Event shrink_inactive_list..shrink_zone 91 us count 2 Event shrink_inactive_list..shrink_zone 90 us count 1 Event ftrace_module_notify..ftrace_module_notify 89 us count 1 Event shrink_inactive_list..shrink_zone 89 us count 1 Event shrink_inactive_list..shrink_zone 88 us count 2 Event shrink_inactive_list..shrink_zone 86 us count 1 Event shrink_inactive_list..shrink_zone 84 us count 2 Event shrink_inactive_list..shrink_zone 70 us count 1 Event save_args..restore_args 46 us count 1 Event mempool_free_slab..mempool_free_slab 43 us count 1 Event mempool_free_slab..mempool_free_slab 38 us count 1 Event save_args..restore_args 37 us count 1 Event mempool_free_slab..mempool_free_slab 37 us count 1 Event save_args..call_softirq 36 us count 2 Event save_args..call_softirq 35 us count 1 Event save_args..restore_args 35 us count 1 Event 
mempool_free_slab..mempool_free_slab 34 us count 1 Event save_args..call_softirq 34 us count 1 Event save_args..restore_args 33 us count 1 Event save_args..call_softirq 33 us count 2 Event save_args..call_softirq 32 us count 1 Event save_args..restore_args 30 us count 1 Event save_args..call_softirq 30 us count 1 Event drain_array..cache_reap 30 us count 1 Event save_args..restore_args 29 us count 1 Event scsi_request_fn..process_one_work 29 us count 1 Event save_args..call_softirq 29 us count 1 Event save_args..call_softirq 28 us count 2 Event save_args..call_softirq 27 us count 5 Event drain_array..cache_reap 26 us count 1 Event save_args..call_softirq 26 us count 8 Event save_args..call_softirq 25 us count 6 Event save_args..call_softirq 24 us count 2 Event __wake_up..__wake_up 2 us count 1 A full report is again available at http://www.csn.ul.ie/~mel/postings/irqs-disabled-2.6.38-rc6-compactirq-v1r1.txt . As should be obvious, IRQ-disabled latencies due to compaction are almost eliminated for this particular test. As a bonus, the test also completed far faster. MMTests Statistics: duration vanilla compactirq-v1r1 User/Sys Time Running Test (seconds) 28.63 18.68 Total Elapsed Time (seconds) 162.81 85.83 Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/compaction.c | 17 ++++++++++++----- 1 files changed, 12 insertions(+), 5 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 8be430b8..f47de94 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -155,7 +155,6 @@ static void isolate_freepages(struct zone *zone, * pages on cc->migratepages. We stop searching if the migrate * and free page scanners meet or enough free pages are isolated.
*/ - spin_lock_irqsave(&zone->lock, flags); for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages; pfn -= pageblock_nr_pages) { unsigned long isolated; @@ -178,9 +177,18 @@ static void isolate_freepages(struct zone *zone, if (!suitable_migration_target(page)) continue; - /* Found a block suitable for isolating free pages from */ - isolated = isolate_freepages_block(zone, pfn, freelist); - nr_freepages += isolated; + /* + * Found a block suitable for isolating free pages from. Now + * we disabled interrupts, double check things are ok and + * isolate the pages. This is to minimise the time IRQs + * are disabled + */ + spin_lock_irqsave(&zone->lock, flags); + if (suitable_migration_target(page)) { + isolated = isolate_freepages_block(zone, pfn, freelist); + nr_freepages += isolated; + } + spin_unlock_irqrestore(&zone->lock, flags); /* * Record the highest PFN we isolated pages from. When next @@ -190,7 +198,6 @@ static void isolate_freepages(struct zone *zone, if (isolated) high_pfn = max(high_pfn, pfn); } - spin_unlock_irqrestore(&zone->lock, flags); /* split_free_page does not map the pages */ list_for_each_entry(page, freelist, lru) { ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 17:01 ` Mel Gorman @ 2011-02-23 17:40 ` Andrea Arcangeli 0 siblings, 0 replies; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 17:40 UTC (permalink / raw) To: Mel Gorman; +Cc: Arthur Marsh, Clemens Ladisch, alsa-user, linux-kernel On Wed, Feb 23, 2011 at 05:01:17PM +0000, Mel Gorman wrote: > I regret that yet more patches are being blasted around but I hope the included > figures will convince you to run yet-another-test. I don't know how you are > measuring how long IRQs are being disabled and when it's a problem but it's > what I'm more interested in right now than the kswapd CPU usage which I'm > still in the process of gathering data to analyse. Arthur, I suggest applying this patch (just like the kswapd-high-wmark one) before any of my patches when testing them. It's orthogonal, so it can be applied at the same time. Acked-by: Andrea Arcangeli <aarcange@redhat.com> Thanks, Andrea ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-22 17:47 ` Arthur Marsh 2011-02-22 19:43 ` Andrea Arcangeli @ 2011-02-23 16:24 ` Andrea Arcangeli 2011-02-23 16:36 ` Andrea Arcangeli ` (2 more replies) 1 sibling, 3 replies; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 16:24 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 8134 bytes --] On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: > OK, these patches applied together against upstream didn't cause a crash > but I did observe: > > significant slowdowns of MIDI playback (moreso than in previous cases, > and with less than 20 Meg of swap file in use); > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). > > If I should try only one of the patches or something else entirely, > please let me know. Yes, with irqs off, schedule won't run and need_resched won't get set. So let's try this. If this doesn't fix it, I'll definitely give up on compaction in kswapd, as GFP_ATOMIC will likely not get a huge benefit out of compaction anyway, because 1) GFP_ATOMIC allocations are likely short-lived, and 2) the cost of the compaction scan loop + migration may be higher than the benefit we get from jumbo frames. In that case I'll post a definitive fix that removes compaction from kswapd (but leaves it enabled for direct reclaim for all order sizes, including the kernel stack). I already verified that this solves not just the MIDI latency issue but also the server benchmark issue that I'm dealing with. Then we can think about compaction+kswapd later for 2.6.39, but I think we need to close this kswapd issue in time for 2.6.38. So if the patch below isn't enough, the next patch should be applied.
I'll get those two patches tested under the server load too, and I'll let you know the results when it finishes. Please also apply the attached "kswapd-high-wmark" patch before the one below. ==== Subject: compaction: fix high latencies created by kswapd From: Andrea Arcangeli <aarcange@redhat.com> We need proper spin_unlock_irq/cond_resched handling in the compaction loop to avoid hurting latencies. We must also stop insisting too aggressively on compaction within kswapd if compaction doesn't make progress in every zone. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- include/linux/compaction.h | 6 +++++ mm/compaction.c | 53 +++++++++++++++++++++++++++++++++++++-------- mm/vmscan.c | 39 +++++++++++++++++++-------------- 3 files changed, 73 insertions(+), 25 deletions(-) --- a/mm/compaction.c +++ b/mm/compaction.c @@ -271,9 +271,27 @@ static unsigned long isolate_migratepage } /* Time to isolate some pages for migration */ + cond_resched(); spin_lock_irq(&zone->lru_lock); for (; low_pfn < end_pfn; low_pfn++) { struct page *page; + int unlocked = 0; + + /* give a chance to irqs before checking need_resched() */ + if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { + spin_unlock_irq(&zone->lru_lock); + unlocked = 1; + } + if (need_resched() || spin_is_contended(&zone->lru_lock)) { + if (fatal_signal_pending(current)) + break; + if (!unlocked) + spin_unlock_irq(&zone->lru_lock); + cond_resched(); + spin_lock_irq(&zone->lru_lock); + } else if (unlocked) + spin_lock_irq(&zone->lru_lock); + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; @@ -436,6 +454,28 @@ static int compact_finished(struct zone return COMPACT_CONTINUE; } +static unsigned long compaction_watermark(struct zone *zone, int order) +{ + /* + * Watermarks for order-0 must be met for compaction. Note the 2UL.
+ * This is because during migration, copies of pages need to be + * allocated and for a short time, the footprint is higher + */ + return low_wmark_pages(zone) + (2UL << order); +} + +static int __compaction_need_reclaim(struct zone *zone, int order, + unsigned long watermark) +{ + return !zone_watermark_ok(zone, 0, watermark, 0, 0); +} + +int compaction_need_reclaim(struct zone *zone, int order) +{ + return __compaction_need_reclaim(zone, order, + compaction_watermark(zone, order)); +} + /* * compaction_suitable: Is this suitable to run compaction on this zone now? * Returns @@ -449,21 +489,16 @@ unsigned long compaction_suitable(struct unsigned long watermark; /* - * Watermarks for order-0 must be met for compaction. Note the 2UL. - * This is because during migration, copies of pages need to be - * allocated and for a short time, the footprint is higher - */ - watermark = low_wmark_pages(zone) + (2UL << order); - if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) - return COMPACT_SKIPPED; - - /* * order == -1 is expected when compacting via * /proc/sys/vm/compact_memory */ if (order == -1) return COMPACT_CONTINUE; + watermark = compaction_watermark(zone, order); + if (__compaction_need_reclaim(zone, order, watermark)) + return COMPACT_SKIPPED; + /* * fragmentation index determines if allocation failures are due to * low memory or external fragmentation --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2385,9 +2385,9 @@ loop_again: * cause too much scanning of the lower zones. */ for (i = 0; i <= end_zone; i++) { - int compaction; struct zone *zone = pgdat->node_zones + i; int nr_slab; + int local_order; if (!populated_zone(zone)) continue; @@ -2407,20 +2407,21 @@ loop_again: * We put equal pressure on every zone, unless one * zone has way too many pages free already. 
*/ - if (!zone_watermark_ok_safe(zone, order, - high_wmark_pages(zone), end_zone, 0)) + if (!zone_watermark_ok_safe(zone, 0, + high_wmark_pages(zone), end_zone, 0) || + compaction_need_reclaim(zone, order)) { shrink_zone(priority, zone, &sc); - reclaim_state->reclaimed_slab = 0; - nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - total_scanned += sc.nr_scanned; + reclaim_state->reclaimed_slab = 0; + nr_slab = shrink_slab(sc.nr_scanned, + GFP_KERNEL, + lru_pages); + sc.nr_reclaimed += reclaim_state->reclaimed_slab; + total_scanned += sc.nr_scanned; + } - compaction = 0; + local_order = order; if (order && - zone_watermark_ok(zone, 0, - high_wmark_pages(zone), - end_zone, 0) && + !compaction_need_reclaim(zone, order) && !zone_watermark_ok(zone, order, high_wmark_pages(zone), end_zone, 0)) { @@ -2428,12 +2429,18 @@ loop_again: order, sc.gfp_mask, false, COMPACT_MODE_KSWAPD); - compaction = 1; + /* + * Let kswapd stop immediately even if + * compaction didn't succeed to + * generate "high_wmark_pages" of the + * max order wanted in every zone. + */ + local_order = 0; } if (zone->all_unreclaimable) continue; - if (!compaction && nr_slab == 0 && + if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; /* @@ -2445,7 +2452,7 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, local_order, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; /* @@ -2453,7 +2460,7 @@ loop_again: * means that we have a GFP_ATOMIC allocation * failure risk. Hurry up! 
*/ - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, local_order, min_wmark_pages(zone), end_zone, 0)) has_under_min_watermark_zone = 1; } else { --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -26,6 +26,7 @@ extern int fragmentation_index(struct zo extern unsigned long try_to_compact_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask, bool sync); +extern int compaction_need_reclaim(struct zone *zone, int order); extern unsigned long compaction_suitable(struct zone *zone, int order); extern unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, bool sync, @@ -68,6 +69,11 @@ static inline unsigned long try_to_compa return COMPACT_CONTINUE; } +static inline int compaction_need_reclaim(struct zone *zone, int order) +{ + return 0; +} + static inline unsigned long compaction_suitable(struct zone *zone, int order) { return COMPACT_SKIPPED; [-- Attachment #2: kswapd-high_wmark --] [-- Type: text/plain, Size: 1853 bytes --] Subject: vmscan: kswapd must not free more than high_wmark pages From: Andrea Arcangeli <aarcange@redhat.com> When the min_free_kbytes is set with `hugeadm --set-recommended-min_free_kbytes" or with THP enabled (which runs the equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate anti-frag at full effectiveness automatically at boot) the high wmark of some zone is as high as ~88M. 88M free on a 4G system isn't horrible, but 88M*8 = 704M free on a 4G system is definitely unbearable. This only tends to be visible on 4G systems with tiny over-4g zone where kswapd insists to reach the high wmark on the over-4g zone but doing so it shrunk up to 704M from the normal zone by mistake. 
For the trivial case where kswapd isn't waken until all zones hit the low wmark and there is no concurrency between allocator and kswapd freeing, rotating more the tiny above4g lru than "high-low" despite we only allocated "high-low" cache into it doesn't sound obviously right either. Bigger gap to me looks like will do more harm than good and if we need a real guarantee of balancing we should rotate the allocations across the zones (bigger lru in a zone will require it to be hit more frequently because it'll rotate slower than the other zones, the bias should not even dependent on the zone size but on the lru size). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2408,7 +2408,7 @@ loop_again: * zone has way too many pages free already. */ if (!zone_watermark_ok_safe(zone, order, - 8*high_wmark_pages(zone), end_zone, 0)) + high_wmark_pages(zone), end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 16:24 ` Andrea Arcangeli @ 2011-02-23 16:36 ` Andrea Arcangeli 2011-02-23 16:40 ` Andrea Arcangeli 2011-02-23 16:55 ` Andrea Arcangeli 2011-02-23 17:10 ` Mel Gorman 2 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 16:36 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 5574 bytes --] On Wed, Feb 23, 2011 at 05:24:32PM +0100, Andrea Arcangeli wrote: > On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: > > OK, these patches applied together against upstream didn't cause a crash > > but I did observe: > > > > significant slowdowns of MIDI playback (moreso than in previous cases, > > and with less than 20 Meg of swap file in use); > > > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). > > > > If I should try only one of the patches or something else entirely, > > please let me know. > > Yes, with irq off, schedule won't run and need_resched won't get set. > > So let's try this. > > In case this doesn't fix I definitely give it up with compaction in > kswapd as GFP_ATOMIC will likely not get an huge benefit out of > compaction anyway because 1) the allocations from GFP_ATOMIC are > likely short lived, 2) the cost of the compaction scan loop + > migration may be higher than the benefit we get from jumbo frames > > So if this doesn't fix it, I'll already post a definitive fix that > removes compaction from kswapd (but leaves it enabled for direct > reclaim for all order sizes including kernel stack). I already > verified that this solves not just the midi latency issue but the > other server benchmark that I'm dealing with. 
Then we can think at > compaction+kswapd later for 2.6.39 but I think we need to close this > kswapd issue in time for 2.6.38. So if the below isn't enough the next > patch should be applied. I'll get those two patches tested in the > server load too, and I'll let you know how the results are when it > finishes. > > Please apply also the attached "kswapd-high-wmark" before the below > one. If the previous patch please test the below after the attached patch (as usual). If the previous patch (last attempt for 2.6.38 to add compaction in kswapd) fails this is the way to go for 2.6.38. === Subject: compaction: fix high compaction latencies and remove compaction-kswapd From: Andrea Arcangeli <aarcange@redhat.com> We need to proper spin_unlock_irq/cond_resched in the compaction loop to avoid hurting latencies. We must also stop calling compaction from kswapd as that creates too high load during memory pressure that can't be offseted by the improved performance of hugepage allocations. NOTE: this is not related to THP as all THP allocations uses __GFP_NO_KSWAPD, this is only related to usually small order allocations like the kernel stack that make kswapd go wild with compaction. 
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/compaction.c | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-) --- a/mm/compaction.c +++ b/mm/compaction.c @@ -271,9 +271,27 @@ static unsigned long isolate_migratepage } /* Time to isolate some pages for migration */ + cond_resched(); spin_lock_irq(&zone->lru_lock); for (; low_pfn < end_pfn; low_pfn++) { struct page *page; + int unlocked = 0; + + /* give a chance to irqs before checking need_resched() */ + if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { + spin_unlock_irq(&zone->lru_lock); + unlocked = 1; + } + if (need_resched() || spin_is_contended(&zone->lru_lock)) { + if (fatal_signal_pending(current)) + break; + if (!unlocked) + spin_unlock_irq(&zone->lru_lock); + cond_resched(); + spin_lock_irq(&zone->lru_lock); + } else if (unlocked) + spin_lock_irq(&zone->lru_lock); + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; @@ -397,10 +415,7 @@ static int compact_finished(struct zone return COMPACT_COMPLETE; /* Compaction run is not finished if the watermark is not met */ - if (cc->compact_mode != COMPACT_MODE_KSWAPD) - watermark = low_wmark_pages(zone); - else - watermark = high_wmark_pages(zone); + watermark = low_wmark_pages(zone); watermark += (1 << cc->order); if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0)) @@ -413,15 +428,6 @@ static int compact_finished(struct zone if (cc->order == -1) return COMPACT_CONTINUE; - /* - * Generating only one page of the right order is not enough - * for kswapd, we must continue until we're above the high - * watermark as a pool for high order GFP_ATOMIC allocations - * too. - */ - if (cc->compact_mode == COMPACT_MODE_KSWAPD) - return COMPACT_CONTINUE; - /* Direct compactor: Is a suitable page free? 
*/ for (order = cc->order; order < MAX_ORDER; order++) { /* Job done if page is free of the right migratetype */ @@ -543,8 +549,7 @@ static int compact_zone(struct zone *zon unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, - bool sync, - int compact_mode) + bool sync) { struct compact_control cc = { .nr_freepages = 0, @@ -553,7 +558,6 @@ unsigned long compact_zone_order(struct .migratetype = allocflags_to_migratetype(gfp_mask), .zone = zone, .sync = sync, - .compact_mode = compact_mode, }; INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); @@ -599,8 +603,7 @@ unsigned long try_to_compact_pages(struc nodemask) { int status; - status = compact_zone_order(zone, order, gfp_mask, sync, - COMPACT_MODE_DIRECT_RECLAIM); + status = compact_zone_order(zone, order, gfp_mask, sync); rc = max(status, rc); /* If a normal allocation would succeed, stop compacting */ @@ -631,7 +634,6 @@ static int compact_node(int nid) .nr_freepages = 0, .nr_migratepages = 0, .order = -1, - .compact_mode = COMPACT_MODE_DIRECT_RECLAIM, }; zone = &pgdat->node_zones[zoneid]; [-- Attachment #2: kswapd-high_wmark --] [-- Type: text/plain, Size: 1853 bytes --] Subject: vmscan: kswapd must not free more than high_wmark pages From: Andrea Arcangeli <aarcange@redhat.com> When the min_free_kbytes is set with `hugeadm --set-recommended-min_free_kbytes" or with THP enabled (which runs the equivalent of "hugeadm --set-recommended-min_free_kbytes" to activate anti-frag at full effectiveness automatically at boot) the high wmark of some zone is as high as ~88M. 88M free on a 4G system isn't horrible, but 88M*8 = 704M free on a 4G system is definitely unbearable. This only tends to be visible on 4G systems with tiny over-4g zone where kswapd insists to reach the high wmark on the over-4g zone but doing so it shrunk up to 704M from the normal zone by mistake. 
For the trivial case where kswapd isn't waken until all zones hit the low wmark and there is no concurrency between allocator and kswapd freeing, rotating more the tiny above4g lru than "high-low" despite we only allocated "high-low" cache into it doesn't sound obviously right either. Bigger gap to me looks like will do more harm than good and if we need a real guarantee of balancing we should rotate the allocations across the zones (bigger lru in a zone will require it to be hit more frequently because it'll rotate slower than the other zones, the bias should not even dependent on the zone size but on the lru size). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2408,7 +2408,7 @@ loop_again: * zone has way too many pages free already. */ if (!zone_watermark_ok_safe(zone, order, - 8*high_wmark_pages(zone), end_zone, 0)) + high_wmark_pages(zone), end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 16:36 ` Andrea Arcangeli @ 2011-02-23 16:40 ` Andrea Arcangeli 2011-02-23 16:47 ` Andrea Arcangeli 0 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 16:40 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman On Wed, Feb 23, 2011 at 05:36:39PM +0100, Andrea Arcangeli wrote: > + if (need_resched() || spin_is_contended(&zone->lru_lock)) { > + if (fatal_signal_pending(current)) > + break; I noticed a buglet in this break... need to repost sorry. compaction-no-kswapd-2. === Subject: compaction: fix high compaction latencies and remove compaction-kswapd From: Andrea Arcangeli <aarcange@redhat.com> We need to proper spin_unlock_irq/cond_resched in the compaction loop to avoid hurting latencies. We must also stop calling compaction from kswapd as that creates too high load during memory pressure that can't be offseted by the improved performance of hugepage allocations. NOTE: this is not related to THP as all THP allocations uses __GFP_NO_KSWAPD, this is only related to usually small order allocations like the kernel stack that make kswapd go wild with compaction. 
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/compaction.c | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-) --- a/mm/compaction.c +++ b/mm/compaction.c @@ -271,9 +271,27 @@ static unsigned long isolate_migratepage } /* Time to isolate some pages for migration */ + cond_resched(); spin_lock_irq(&zone->lru_lock); for (; low_pfn < end_pfn; low_pfn++) { struct page *page; + int unlocked = 0; + + /* give a chance to irqs before checking need_resched() */ + if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { + spin_unlock_irq(&zone->lru_lock); + unlocked = 1; + } + if (need_resched() || spin_is_contended(&zone->lru_lock)) { + if (!unlocked) + spin_unlock_irq(&zone->lru_lock); + cond_resched(); + spin_lock_irq(&zone->lru_lock); + if (fatal_signal_pending(current)) + break; + } else if (unlocked) + spin_lock_irq(&zone->lru_lock); + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; @@ -397,10 +415,7 @@ static int compact_finished(struct zone return COMPACT_COMPLETE; /* Compaction run is not finished if the watermark is not met */ - if (cc->compact_mode != COMPACT_MODE_KSWAPD) - watermark = low_wmark_pages(zone); - else - watermark = high_wmark_pages(zone); + watermark = low_wmark_pages(zone); watermark += (1 << cc->order); if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0)) @@ -413,15 +428,6 @@ static int compact_finished(struct zone if (cc->order == -1) return COMPACT_CONTINUE; - /* - * Generating only one page of the right order is not enough - * for kswapd, we must continue until we're above the high - * watermark as a pool for high order GFP_ATOMIC allocations - * too. - */ - if (cc->compact_mode == COMPACT_MODE_KSWAPD) - return COMPACT_CONTINUE; - /* Direct compactor: Is a suitable page free? 
*/ for (order = cc->order; order < MAX_ORDER; order++) { /* Job done if page is free of the right migratetype */ @@ -543,8 +549,7 @@ static int compact_zone(struct zone *zon unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, - bool sync, - int compact_mode) + bool sync) { struct compact_control cc = { .nr_freepages = 0, @@ -553,7 +558,6 @@ unsigned long compact_zone_order(struct .migratetype = allocflags_to_migratetype(gfp_mask), .zone = zone, .sync = sync, - .compact_mode = compact_mode, }; INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); @@ -599,8 +603,7 @@ unsigned long try_to_compact_pages(struc nodemask) { int status; - status = compact_zone_order(zone, order, gfp_mask, sync, - COMPACT_MODE_DIRECT_RECLAIM); + status = compact_zone_order(zone, order, gfp_mask, sync); rc = max(status, rc); /* If a normal allocation would succeed, stop compacting */ @@ -631,7 +634,6 @@ static int compact_node(int nid) .nr_freepages = 0, .nr_migratepages = 0, .order = -1, - .compact_mode = COMPACT_MODE_DIRECT_RECLAIM, }; zone = &pgdat->node_zones[zoneid]; ^ permalink raw reply [flat|nested] 31+ messages in thread
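The "buglet in this break" that motivated compaction-no-kswapd-2 is a lock-balance issue: in the first version, `fatal_signal_pending()` was checked before the unlock/`cond_resched()`/relock sequence, so on the iterations where the periodic drop had already released `lru_lock`, the loop could `break` while not holding the lock, and the unconditional unlock after the loop would then unbalance it. The v2 ordering always breaks with the lock held. A userspace simulation with a hold-depth counter (all names and the always-true `need_resched()` stand-in are assumptions for illustration; a correct exit must leave the final depth at 0):

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_CLUSTER_MAX 32

static int depth;                      /* simulated lru_lock hold depth */
static void lock(void)   { depth++; }
static void unlock(void) { depth--; }

/* Run the loop body of isolate_migratepages() plus the unconditional
 * unlock that follows the loop; return the final lock depth.
 * v2 selects the fixed placement of the fatal-signal check;
 * fatal_at is the pfn at which we pretend a fatal signal is pending. */
static int run(bool v2, unsigned long fatal_at)
{
    unsigned long pfn;

    depth = 0;
    lock();                            /* spin_lock_irq(&zone->lru_lock) */
    for (pfn = 0; pfn < 1000; pfn++) {
        int unlocked = 0;

        /* periodic drop: give irqs a chance every SWAP_CLUSTER_MAX pfns */
        if (!((pfn + 1) % SWAP_CLUSTER_MAX)) {
            unlock();
            unlocked = 1;
        }
        /* pretend need_resched() is always true to exercise this path */
        if (!v2 && pfn >= fatal_at)
            break;                     /* v1: may break while unlocked */
        if (!unlocked)
            unlock();
        /* cond_resched() would run here */
        lock();
        if (v2 && pfn >= fatal_at)
            break;                     /* v2: always breaks holding the lock */
    }
    unlock();                          /* spin_unlock_irq() after the loop */
    return depth;
}
```

Note v1 only misbehaves when the break lands on a drop iteration (pfn + 1 a multiple of SWAP_CLUSTER_MAX); on other iterations both versions balance, which is exactly the kind of intermittent bug that is easy to miss in review.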
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 16:40 ` Andrea Arcangeli @ 2011-02-23 16:47 ` Andrea Arcangeli 0 siblings, 0 replies; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 16:47 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman On Wed, Feb 23, 2011 at 05:40:01PM +0100, Andrea Arcangeli wrote: > I noticed a buglet in this break... need to repost sorry. compaction-no-kswapd-2. and one part of the diff went missing during quilt ref... no luck. Try #3. So here a compaction-no-kswapd-3. Apologies for the flood. === Subject: compaction: fix high compaction latencies and remove compaction-kswapd From: Andrea Arcangeli <aarcange@redhat.com> We need to proper spin_unlock_irq/cond_resched in the compaction loop to avoid hurting latencies. We must also stop calling compaction from kswapd as that creates too high load during memory pressure that can't be offseted by the improved performance of hugepage allocations. NOTE: this is not related to THP as all THP allocations uses __GFP_NO_KSWAPD, this is only related to usually small order allocations like the kernel stack that make kswapd go wild with compaction. 
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/compaction.c | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-) --- a/mm/compaction.c +++ b/mm/compaction.c @@ -271,9 +271,27 @@ static unsigned long isolate_migratepage } /* Time to isolate some pages for migration */ + cond_resched(); spin_lock_irq(&zone->lru_lock); for (; low_pfn < end_pfn; low_pfn++) { struct page *page; + int unlocked = 0; + + /* give a chance to irqs before checking need_resched() */ + if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { + spin_unlock_irq(&zone->lru_lock); + unlocked = 1; + } + if (need_resched() || spin_is_contended(&zone->lru_lock)) { + if (!unlocked) + spin_unlock_irq(&zone->lru_lock); + cond_resched(); + spin_lock_irq(&zone->lru_lock); + if (fatal_signal_pending(current)) + break; + } else if (unlocked) + spin_lock_irq(&zone->lru_lock); + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; @@ -397,10 +415,7 @@ static int compact_finished(struct zone return COMPACT_COMPLETE; /* Compaction run is not finished if the watermark is not met */ - if (cc->compact_mode != COMPACT_MODE_KSWAPD) - watermark = low_wmark_pages(zone); - else - watermark = high_wmark_pages(zone); + watermark = low_wmark_pages(zone); watermark += (1 << cc->order); if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0)) @@ -413,15 +428,6 @@ static int compact_finished(struct zone if (cc->order == -1) return COMPACT_CONTINUE; - /* - * Generating only one page of the right order is not enough - * for kswapd, we must continue until we're above the high - * watermark as a pool for high order GFP_ATOMIC allocations - * too. - */ - if (cc->compact_mode == COMPACT_MODE_KSWAPD) - return COMPACT_CONTINUE; - /* Direct compactor: Is a suitable page free? 
*/ for (order = cc->order; order < MAX_ORDER; order++) { /* Job done if page is free of the right migratetype */ @@ -543,8 +549,7 @@ static int compact_zone(struct zone *zon unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, - bool sync, - int compact_mode) + bool sync) { struct compact_control cc = { .nr_freepages = 0, @@ -553,7 +558,6 @@ unsigned long compact_zone_order(struct .migratetype = allocflags_to_migratetype(gfp_mask), .zone = zone, .sync = sync, - .compact_mode = compact_mode, }; INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); @@ -599,8 +603,7 @@ unsigned long try_to_compact_pages(struc nodemask) { int status; - status = compact_zone_order(zone, order, gfp_mask, sync, - COMPACT_MODE_DIRECT_RECLAIM); + status = compact_zone_order(zone, order, gfp_mask, sync); rc = max(status, rc); /* If a normal allocation would succeed, stop compacting */ @@ -631,7 +634,6 @@ static int compact_node(int nid) .nr_freepages = 0, .nr_migratepages = 0, .order = -1, - .compact_mode = COMPACT_MODE_DIRECT_RECLAIM, }; zone = &pgdat->node_zones[zoneid]; diff --git a/include/linux/compaction.h b/include/linux/compaction.h index dfa2ed4..cc9f7a4 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -11,9 +11,6 @@ /* The full zone was compacted */ #define COMPACT_COMPLETE 3 -#define COMPACT_MODE_DIRECT_RECLAIM 0 -#define COMPACT_MODE_KSWAPD 1 - #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; extern int sysctl_compaction_handler(struct ctl_table *table, int write, @@ -28,8 +25,7 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist, bool sync); extern unsigned long compaction_suitable(struct zone *zone, int order); extern unsigned long compact_zone_order(struct zone *zone, int order, - gfp_t gfp_mask, bool sync, - int compact_mode); + gfp_t gfp_mask, bool sync); /* Do not skip compaction more than 64 times */ #define COMPACT_MAX_DEFER_SHIFT 6 @@ -74,8 +70,7 @@ static inline unsigned long 
compaction_suitable(struct zone *zone, int order) } static inline unsigned long compact_zone_order(struct zone *zone, int order, - gfp_t gfp_mask, bool sync, - int compact_mode) + gfp_t gfp_mask, bool sync) { return COMPACT_CONTINUE; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 17497d0..0e7121d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2385,7 +2385,6 @@ loop_again: * cause too much scanning of the lower zones. */ for (i = 0; i <= end_zone; i++) { - int compaction; struct zone *zone = pgdat->node_zones + i; int nr_slab; @@ -2416,24 +2415,9 @@ loop_again: sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; - compaction = 0; - if (order && - zone_watermark_ok(zone, 0, - high_wmark_pages(zone), - end_zone, 0) && - !zone_watermark_ok(zone, order, - high_wmark_pages(zone), - end_zone, 0)) { - compact_zone_order(zone, - order, - sc.gfp_mask, false, - COMPACT_MODE_KSWAPD); - compaction = 1; - } - if (zone->all_unreclaimable) continue; - if (!compaction && nr_slab == 0 && + if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; /* ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 16:24 ` Andrea Arcangeli 2011-02-23 16:36 ` Andrea Arcangeli @ 2011-02-23 16:55 ` Andrea Arcangeli 2011-02-23 20:07 ` Arthur Marsh 2011-02-23 17:10 ` Mel Gorman 2 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 16:55 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman On Wed, Feb 23, 2011 at 05:24:32PM +0100, Andrea Arcangeli wrote: > On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: > > OK, these patches applied together against upstream didn't cause a crash > > but I did observe: > > > > significant slowdowns of MIDI playback (moreso than in previous cases, > > and with less than 20 Meg of swap file in use); > > > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). > > > > If I should try only one of the patches or something else entirely, > > please let me know. > > Yes, with irq off, schedule won't run and need_resched won't get set. > > So let's try this. > > In case this doesn't fix I definitely give it up with compaction in > kswapd as GFP_ATOMIC will likely not get an huge benefit out of > compaction anyway because 1) the allocations from GFP_ATOMIC are > likely short lived, 2) the cost of the compaction scan loop + > migration may be higher than the benefit we get from jumbo frames > > So if this doesn't fix it, I'll already post a definitive fix that > removes compaction from kswapd (but leaves it enabled for direct > reclaim for all order sizes including kernel stack). I already > verified that this solves not just the midi latency issue but the > other server benchmark that I'm dealing with. Then we can think at > compaction+kswapd later for 2.6.39 but I think we need to close this > kswapd issue in time for 2.6.38. 
So if the below isn't enough the next > patch should be applied. I'll get those two patches tested in the > server load too, and I'll let you know how the results are when it > finishes. > > Please apply also the attached "kswapd-high-wmark" before the below > one. > + if (need_resched() || spin_is_contended(&zone->lru_lock)) { > + if (fatal_signal_pending(current)) > + break; this is compaction-kswapd-2, to fix the buglet same as in the other patch. In short (to avoid confusion) it'll be helpful if you can test compaction-kswapd-2 vs compaction-no-kswapd-3 and let us know if both are ok or if only compaction-no-kswapd-3 is ok. compaction-no-kswapd-3 should help, compaction-kswapd-2 I'm unsure. And as usual I suggest to apply kswapd-high-wmark (attached to previous emails) _before_ applying both patches. === Subject: compaction: fix high latencies created by kswapd From: Andrea Arcangeli <aarcange@redhat.com> We need to proper spin_unlock_irq/cond_resched in the compaction loop to avoid hurting latencies. We must also stop being too aggressive insisting with compaction within kswapd if compaction don't make progress in every zone. 
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- include/linux/compaction.h | 6 +++++ mm/compaction.c | 53 +++++++++++++++++++++++++++++++++++++-------- mm/vmscan.c | 39 +++++++++++++++++++-------------- 3 files changed, 73 insertions(+), 25 deletions(-) --- a/mm/compaction.c +++ b/mm/compaction.c @@ -271,9 +271,27 @@ static unsigned long isolate_migratepage } /* Time to isolate some pages for migration */ + cond_resched(); spin_lock_irq(&zone->lru_lock); for (; low_pfn < end_pfn; low_pfn++) { struct page *page; + int unlocked = 0; + + /* give a chance to irqs before checking need_resched() */ + if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { + spin_unlock_irq(&zone->lru_lock); + unlocked = 1; + } + if (need_resched() || spin_is_contended(&zone->lru_lock)) { + if (!unlocked) + spin_unlock_irq(&zone->lru_lock); + cond_resched(); + spin_lock_irq(&zone->lru_lock); + if (fatal_signal_pending(current)) + break; + } else if (unlocked) + spin_lock_irq(&zone->lru_lock); + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; @@ -436,6 +454,28 @@ static int compact_finished(struct zone return COMPACT_CONTINUE; } +static unsigned long compaction_watermark(struct zone *zone, int order) +{ + /* + * Watermarks for order-0 must be met for compaction. Note the 2UL. + * This is because during migration, copies of pages need to be + * allocated and for a short time, the footprint is higher + */ + return low_wmark_pages(zone) + (2UL << order); +} + +static int __compaction_need_reclaim(struct zone *zone, int order, + unsigned long watermark) +{ + return !zone_watermark_ok(zone, 0, watermark, 0, 0); +} + +int compaction_need_reclaim(struct zone *zone, int order) +{ + return __compaction_need_reclaim(zone, order, + compaction_watermark(zone, order)); +} + /* * compaction_suitable: Is this suitable to run compaction on this zone now? 
* Returns @@ -449,21 +489,16 @@ unsigned long compaction_suitable(struct unsigned long watermark; /* - * Watermarks for order-0 must be met for compaction. Note the 2UL. - * This is because during migration, copies of pages need to be - * allocated and for a short time, the footprint is higher - */ - watermark = low_wmark_pages(zone) + (2UL << order); - if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) - return COMPACT_SKIPPED; - - /* * order == -1 is expected when compacting via * /proc/sys/vm/compact_memory */ if (order == -1) return COMPACT_CONTINUE; + watermark = compaction_watermark(zone, order); + if (__compaction_need_reclaim(zone, order, watermark)) + return COMPACT_SKIPPED; + /* * fragmentation index determines if allocation failures are due to * low memory or external fragmentation --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2385,9 +2385,9 @@ loop_again: * cause too much scanning of the lower zones. */ for (i = 0; i <= end_zone; i++) { - int compaction; struct zone *zone = pgdat->node_zones + i; int nr_slab; + int local_order; if (!populated_zone(zone)) continue; @@ -2407,20 +2407,21 @@ loop_again: * We put equal pressure on every zone, unless one * zone has way too many pages free already. 
*/ - if (!zone_watermark_ok_safe(zone, order, - high_wmark_pages(zone), end_zone, 0)) + if (!zone_watermark_ok_safe(zone, 0, + high_wmark_pages(zone), end_zone, 0) || + compaction_need_reclaim(zone, order)) { shrink_zone(priority, zone, &sc); - reclaim_state->reclaimed_slab = 0; - nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - total_scanned += sc.nr_scanned; + reclaim_state->reclaimed_slab = 0; + nr_slab = shrink_slab(sc.nr_scanned, + GFP_KERNEL, + lru_pages); + sc.nr_reclaimed += reclaim_state->reclaimed_slab; + total_scanned += sc.nr_scanned; + } - compaction = 0; + local_order = order; if (order && - zone_watermark_ok(zone, 0, - high_wmark_pages(zone), - end_zone, 0) && + !compaction_need_reclaim(zone, order) && !zone_watermark_ok(zone, order, high_wmark_pages(zone), end_zone, 0)) { @@ -2428,12 +2429,18 @@ loop_again: order, sc.gfp_mask, false, COMPACT_MODE_KSWAPD); - compaction = 1; + /* + * Let kswapd stop immediately even if + * compaction didn't succeed to + * generate "high_wmark_pages" of the + * max order wanted in every zone. + */ + local_order = 0; } if (zone->all_unreclaimable) continue; - if (!compaction && nr_slab == 0 && + if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; /* @@ -2445,7 +2452,7 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, local_order, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; /* @@ -2453,7 +2460,7 @@ loop_again: * means that we have a GFP_ATOMIC allocation * failure risk. Hurry up! 
*/ - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, local_order, min_wmark_pages(zone), end_zone, 0)) has_under_min_watermark_zone = 1; } else { --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -26,6 +26,7 @@ extern int fragmentation_index(struct zo extern unsigned long try_to_compact_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask, bool sync); +extern int compaction_need_reclaim(struct zone *zone, int order); extern unsigned long compaction_suitable(struct zone *zone, int order); extern unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, bool sync, @@ -68,6 +69,11 @@ static inline unsigned long try_to_compa return COMPACT_CONTINUE; } +static inline int compaction_need_reclaim(struct zone *zone, int order) +{ + return 0; +} + static inline unsigned long compaction_suitable(struct zone *zone, int order) { return COMPACT_SKIPPED; ^ permalink raw reply [flat|nested] 31+ messages in thread
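The `compaction_watermark()` helper that compaction-kswapd-2 factors out is plain arithmetic: order-0 free pages must exceed the low watermark by `2UL << order` because, during migration, both the source and destination copies of each page exist briefly, doubling the transient footprint. A userspace mirror of that computation (LOW_WMARK_PAGES is a made-up stand-in for `low_wmark_pages(zone)`):

```c
#include <assert.h>

/* Hypothetical order-0 low watermark, in pages, for illustration only. */
#define LOW_WMARK_PAGES 1000UL

/* Mirror of the patch's compaction_watermark(): the order-0 free-page
 * level that must be met before compaction is worth attempting.
 * 2UL << order == 2 * (1UL << order), i.e. twice the request size,
 * covering the temporary source+destination copies during migration. */
static unsigned long compaction_watermark(int order)
{
    return LOW_WMARK_PAGES + (2UL << order);
}
```

So for an order-9 request (a 2M hugepage with 4K pages) the slack is 1024 pages, and `compaction_need_reclaim()` in the patch simply asks whether order-0 free memory is still below this level, i.e. whether kswapd should keep reclaiming before compaction can run.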
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 16:55 ` Andrea Arcangeli @ 2011-02-23 20:07 ` Arthur Marsh 2011-02-23 21:25 ` Andrea Arcangeli 0 siblings, 1 reply; 31+ messages in thread From: Arthur Marsh @ 2011-02-23 20:07 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman Andrea Arcangeli wrote, on 24/02/11 03:25: > On Wed, Feb 23, 2011 at 05:24:32PM +0100, Andrea Arcangeli wrote: >> On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: >>> OK, these patches applied together against upstream didn't cause a crash >>> but I did observe: >>> >>> significant slowdowns of MIDI playback (moreso than in previous cases, >>> and with less than 20 Meg of swap file in use); >>> >>> kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). >>> >>> If I should try only one of the patches or something else entirely, >>> please let me know. >> >> Yes, with irq off, schedule won't run and need_resched won't get set. >> >> So let's try this. >> >> In case this doesn't fix I definitely give it up with compaction in >> kswapd as GFP_ATOMIC will likely not get an huge benefit out of >> compaction anyway because 1) the allocations from GFP_ATOMIC are >> likely short lived, 2) the cost of the compaction scan loop + >> migration may be higher than the benefit we get from jumbo frames >> >> So if this doesn't fix it, I'll already post a definitive fix that >> removes compaction from kswapd (but leaves it enabled for direct >> reclaim for all order sizes including kernel stack). I already >> verified that this solves not just the midi latency issue but the >> other server benchmark that I'm dealing with. Then we can think at >> compaction+kswapd later for 2.6.39 but I think we need to close this >> kswapd issue in time for 2.6.38. 
So if the below isn't enough the next >> patch should be applied. I'll get those two patches tested in the >> server load too, and I'll let you know how the results are when it >> finishes. >> >> Please apply also the attached "kswapd-high-wmark" before the below >> one. >> + if (need_resched() || spin_is_contended(&zone->lru_lock)) { >> + if (fatal_signal_pending(current)) >> + break; > > this is compaction-kswapd-2, to fix the same buglet as in the other > patch. > > In short (to avoid confusion) it'll be helpful if you can test > compaction-kswapd-2 vs compaction-no-kswapd-3 and let us know if both > are ok or if only compaction-no-kswapd-3 is ok. compaction-no-kswapd-3 > should help, compaction-kswapd-2 I'm unsure. And as usual I suggest > applying kswapd-high-wmark (attached to previous emails) _before_ > applying both patches. kswapd-high_wmark + compaction-kswapd-2 - kswapd0 CPU up to 11 percent and slightly less pronounced slowdowns of MIDI playback compared to previous patches. kswapd-high_wmark + compaction-no-kswapd-3 - kswapd0 CPU up to 2.3 percent and no noticeable slowdown of MIDI playback. Mel Gorman's mm/compaction.c patch - kswapd0 CPU up to 20 percent and no noticeable slowdown of MIDI playback. Thanks, everyone, for the help with this. Arthur. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 20:07 ` Arthur Marsh @ 2011-02-23 21:25 ` Andrea Arcangeli 2011-02-23 21:55 ` Arthur Marsh 0 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 21:25 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman On Thu, Feb 24, 2011 at 06:37:58AM +1030, Arthur Marsh wrote: > kswapd-high_wmark + compaction-kswapd-2 - kswapd0 CPU up to 11 percent > and slightly less pronounced slowdowns of MIDI playback compared to > previous patches. > > kswapd-high_wmark + compaction-no-kswapd-3 - kswapd0 CPU up to 2.3 > percent and no noticeable slowdown of MIDI playback. > > Mel Gorman's mm/compaction.c patch - kswapd0 CPU up to 20 percent and no > noticeable slowdown of MIDI playback. > > Thanks everyone for the help with this. Ok then I think it's safer to go with compaction-no-kswapd-3. I also created a git tree in case anybody else wants to test in an easier way. git clone --reference linux-2.6.git git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git (or "git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git") cd aa git checkout -f 3d74399aaece29047beba13a81650538af8db67a (compaction-kswapd+compaction_alloc_lowlat) git checkout -f f048d0082bd3c6a1cc3f5e67aa5ef83d1561ec27 (compaction-no-kswapd+rest+compaction_alloc_lowlat) git checkout -f 859e705548f7377b1803b05b2904bae77495fd1e (only compaction_alloc_lowlat) http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=3d74399aaece29047beba13a81650538af8db67a http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=f048d0082bd3c6a1cc3f5e67aa5ef83d1561ec27 http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=859e705548f7377b1803b05b2904bae77495fd1e Arthur, could you give one more spin to the 3d74399aaece29047beba13a81650538af8db67a tree?
(or do you prefer an updated patch compaction-kswapd-3?) I'd like to get 3d74399aaece29047beba13a81650538af8db67a tested again because I did one more modification included in the git version compared to the patch version (nr_slab was not always initialized and it could lead to slightly higher kswapd CPU load than intended). I doubt it will help though (just in case). Mel, what do you think? If f048d0082bd3c6a1cc3f5e67aa5ef83d1561ec27 is still the only one that shows no regression, I think it's safe to apply it to 2.6.38 and revert the compaction in kswapd feature. Then we can add compaction to kswapd later with no hurry. Thanks a lot for the help and quick feedback! Andrea ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 21:25 ` Andrea Arcangeli @ 2011-02-23 21:55 ` Arthur Marsh 2011-02-23 23:59 ` Andrea Arcangeli 0 siblings, 1 reply; 31+ messages in thread From: Arthur Marsh @ 2011-02-23 21:55 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman Andrea Arcangeli wrote, on 24/02/11 07:55: > On Thu, Feb 24, 2011 at 06:37:58AM +1030, Arthur Marsh wrote: >> kswapd-high_wmark + compaction-kswapd-2 - kswapd0 CPU up to 11 percent >> and slightly less pronounced slowdowns of MIDI playback compared to >> previous patches. >> >> kswapd-high_wmark + compaction-no-kswapd-3 - kswapd0 CPU up to 2.3 >> percent and no noticeable slowdown of MIDI playback. >> >> Mel Gorman's mm/compaction.c patch - kswapd0 CPU up to 20 percent and no >> noticeable slowdown of MIDI playback. >> >> Thanks everyone for the help with this. > > Ok then I think it's safer to go with compaction-no-kswapd-3. One more combination I tried: Mel Gorman's mm/compaction.c patch with Andrea Arcangeli's kswapd-high_wmark + compaction-no-kswapd-3 patches - kswapd0 CPU less than 2 percent and no noticeable slowdown of MIDI playback. This may be better performing than compaction-no-kswapd-3 alone but is not conclusive. > > I also created a git tree in case anybody else wants to test in an easier > way.
> > git clone --reference linux-2.6.git git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git > (or "git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git") > cd aa > git checkout -f 3d74399aaece29047beba13a81650538af8db67a > (compaction-kswapd+compaction_alloc_lowlat) > git checkout -f f048d0082bd3c6a1cc3f5e67aa5ef83d1561ec27 > (compaction-no-kswapd+rest+compaction_alloc_lowlat) > git checkout -f 859e705548f7377b1803b05b2904bae77495fd1e (only > compaction_alloc_lowlat) > > http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=3d74399aaece29047beba13a81650538af8db67a > http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=f048d0082bd3c6a1cc3f5e67aa5ef83d1561ec27 > http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=859e705548f7377b1803b05b2904bae77495fd1e > > Arthur could you give one more spin to the > 3d74399aaece29047beba13a81650538af8db67a tree? (or do you prefer an > updated patch compaction-kswapd-3?) If you can send me an updated patch compaction-no-kswapd-3 (I presume that kswapd-high_wmark is still needed) it would be easier for me to apply. > > I'd like to get 3d74399aaece29047beba13a81650538af8db67a tested again > because I did one more modification included in the git version > compared to the patch version (nr_slab was not always initialized and > it could lead to slightly higher kswapd cpu load than intended). I > doubt it will help though (just in case). > > Mel what you think? If f048d0082bd3c6a1cc3f5e67aa5ef83d1561ec27 is > still the only one that shows no regression, I think it's safe to > apply it to 2.6.38 and revert the compaction in kswapd feature. Then > we can add compaction to kswapd later with no hurry. > > Thanks a lot for the help and quick feedback! > Andrea > Arthur. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 21:55 ` Arthur Marsh @ 2011-02-23 23:59 ` Andrea Arcangeli 2011-02-24 1:40 ` Arthur Marsh 0 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 23:59 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman On Thu, Feb 24, 2011 at 08:25:43AM +1030, Arthur Marsh wrote: > One more combination I tried: > > Mel Gorman's mm/compaction.c patch with Andrea Arcangeli's > kswapd-high_wmark + compaction-no-kswapd-3 patches - kswapd0 CPU less > than 2 percent and no noticeable slowdown of MIDI playback. Applying Mel's patch on top should decrease latency more. > If you can send me an updated patch compaction-no-kswapd-3 (I presume > that kswapd-high_wmark is still needed) it would be easier for me to apply. It's a compaction-kswapd-3. It's likely going to work the same as the previous compaction-kswapd-2 (not as good as compaction-no-kswapd). It's better to apply both the kswapd-high_wmark and Mel's patch too (not only this one) during testing. Thanks, Andrea === Subject: compaction: fix high latencies created by kswapd From: Andrea Arcangeli <aarcange@redhat.com> We need proper spin_unlock_irq/cond_resched handling in the compaction loop to avoid hurting latencies. We must also stop insisting too aggressively on compaction within kswapd if compaction doesn't make progress in every zone.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- include/linux/compaction.h | 6 +++++ mm/compaction.c | 53 +++++++++++++++++++++++++++++++++++++-------- mm/vmscan.c | 40 ++++++++++++++++++++------------- 3 files changed, 74 insertions(+), 25 deletions(-) --- a/mm/compaction.c +++ b/mm/compaction.c @@ -278,9 +278,27 @@ static unsigned long isolate_migratepage } /* Time to isolate some pages for migration */ + cond_resched(); spin_lock_irq(&zone->lru_lock); for (; low_pfn < end_pfn; low_pfn++) { struct page *page; + int unlocked = 0; + + /* give a chance to irqs before checking need_resched() */ + if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) { + spin_unlock_irq(&zone->lru_lock); + unlocked = 1; + } + if (need_resched() || spin_is_contended(&zone->lru_lock)) { + if (!unlocked) + spin_unlock_irq(&zone->lru_lock); + cond_resched(); + spin_lock_irq(&zone->lru_lock); + if (fatal_signal_pending(current)) + break; + } else if (unlocked) + spin_lock_irq(&zone->lru_lock); + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; @@ -443,6 +461,28 @@ static int compact_finished(struct zone return COMPACT_CONTINUE; } +static unsigned long compaction_watermark(struct zone *zone, int order) +{ + /* + * Watermarks for order-0 must be met for compaction. Note the 2UL. + * This is because during migration, copies of pages need to be + * allocated and for a short time, the footprint is higher + */ + return low_wmark_pages(zone) + (2UL << order); +} + +static int __compaction_need_reclaim(struct zone *zone, + unsigned long watermark) +{ + return !zone_watermark_ok(zone, 0, watermark, 0, 0); +} + +int compaction_need_reclaim(struct zone *zone, int order) +{ + return __compaction_need_reclaim(zone, + compaction_watermark(zone, order)); +} + /* * compaction_suitable: Is this suitable to run compaction on this zone now? * Returns @@ -456,21 +496,16 @@ unsigned long compaction_suitable(struct unsigned long watermark; /* - * Watermarks for order-0 must be met for compaction. 
Note the 2UL. - * This is because during migration, copies of pages need to be - * allocated and for a short time, the footprint is higher - */ - watermark = low_wmark_pages(zone) + (2UL << order); - if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) - return COMPACT_SKIPPED; - - /* * order == -1 is expected when compacting via * /proc/sys/vm/compact_memory */ if (order == -1) return COMPACT_CONTINUE; + watermark = compaction_watermark(zone, order); + if (__compaction_need_reclaim(zone, watermark)) + return COMPACT_SKIPPED; + /* * fragmentation index determines if allocation failures are due to * low memory or external fragmentation --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2385,9 +2385,9 @@ loop_again: * cause too much scanning of the lower zones. */ for (i = 0; i <= end_zone; i++) { - int compaction; struct zone *zone = pgdat->node_zones + i; int nr_slab; + int local_order; if (!populated_zone(zone)) continue; @@ -2407,20 +2407,22 @@ loop_again: * We put equal pressure on every zone, unless one * zone has way too many pages free already. 
*/ - if (!zone_watermark_ok_safe(zone, order, - high_wmark_pages(zone), end_zone, 0)) + nr_slab = 0; + if (!zone_watermark_ok_safe(zone, 0, + high_wmark_pages(zone), end_zone, 0) || + compaction_need_reclaim(zone, order)) { shrink_zone(priority, zone, &sc); - reclaim_state->reclaimed_slab = 0; - nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); - sc.nr_reclaimed += reclaim_state->reclaimed_slab; - total_scanned += sc.nr_scanned; + reclaim_state->reclaimed_slab = 0; + nr_slab = shrink_slab(sc.nr_scanned, + GFP_KERNEL, + lru_pages); + sc.nr_reclaimed += reclaim_state->reclaimed_slab; + total_scanned += sc.nr_scanned; + } - compaction = 0; + local_order = order; if (order && - zone_watermark_ok(zone, 0, - high_wmark_pages(zone), - end_zone, 0) && + !compaction_need_reclaim(zone, order) && !zone_watermark_ok(zone, order, high_wmark_pages(zone), end_zone, 0)) { @@ -2428,12 +2430,18 @@ loop_again: order, sc.gfp_mask, false, COMPACT_MODE_KSWAPD); - compaction = 1; + /* + * Let kswapd stop immediately even if + * compaction didn't succeed to + * generate "high_wmark_pages" of the + * max order wanted in every zone. + */ + local_order = 0; } if (zone->all_unreclaimable) continue; - if (!compaction && nr_slab == 0 && + if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; /* @@ -2445,7 +2453,7 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, local_order, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; /* @@ -2453,7 +2461,7 @@ loop_again: * means that we have a GFP_ATOMIC allocation * failure risk. Hurry up! 
*/ - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, local_order, min_wmark_pages(zone), end_zone, 0)) has_under_min_watermark_zone = 1; } else { --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -26,6 +26,7 @@ extern int fragmentation_index(struct zo extern unsigned long try_to_compact_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask, bool sync); +extern int compaction_need_reclaim(struct zone *zone, int order); extern unsigned long compaction_suitable(struct zone *zone, int order); extern unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, bool sync, @@ -68,6 +69,11 @@ static inline unsigned long try_to_compa return COMPACT_CONTINUE; } +static inline int compaction_need_reclaim(struct zone *zone, int order) +{ + return 0; +} + static inline unsigned long compaction_suitable(struct zone *zone, int order) { return COMPACT_SKIPPED; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 23:59 ` Andrea Arcangeli @ 2011-02-24 1:40 ` Arthur Marsh 2011-02-24 1:54 ` Andrea Arcangeli 0 siblings, 1 reply; 31+ messages in thread From: Arthur Marsh @ 2011-02-24 1:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman Andrea Arcangeli wrote, on 24/02/11 10:29: > On Thu, Feb 24, 2011 at 08:25:43AM +1030, Arthur Marsh wrote: >> One more combination I tried: >> >> Mel Gorman's mm/compaction.c patch with Andrea Archangeli's >> kswapd-high_wmark + compaction-no-kswapd-3 patches - kswapd0 CPU less >> than 2 percent and no noticable slowdown of MIDI playback. > > Applying Mel's patch on top should decrease latency more. > >> If you can send me an updated patch compaction-no-kswapd-3 (I presume >> that kswapd-high_wmark is still needed) it would be easier for me to apply. > > It's a compaction-kswapd-3. It's likely going to work the same as the > previous compaction-kswapd-2 (not as good as > compaction-no-kswapd). It's better to apply both the kswapd-high_wmark > and Mel's patch too (not only this one) during testing. OK, with kswapd-high_wmark + compaction-kswapd-3 + Mel's patch (with the compaction initialisation fix), MIDI playback is fine. kswapd0 CPU can very occasionally hit equal highest (17 percent was the highest I noticed, but is generally below the top 4-5 processes and less than 10 percent when working, dropping to around 0.3 percent when swap activity has subsided). This was with loading KDE 3.5.10, konversation, aptitude -u, icedove and iceweasel with several dozen tabs in addition to aplaymidi. Regards, Arthur. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-24 1:40 ` Arthur Marsh @ 2011-02-24 1:54 ` Andrea Arcangeli 2011-02-26 6:43 ` Andrea Arcangeli 0 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-24 1:54 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman Hi Arthur, On Thu, Feb 24, 2011 at 12:10:45PM +1030, Arthur Marsh wrote: > OK, with kswapd-high_wmark + compaction-kswapd-3 + Mel's patch (with the > compaction initialisation fix), MIDI playback is fine. > kswapd0 CPU can very occasionally hit equal highest (17 percent was the > highest I noticed, but is generally below the top 4-5 processes and less > than 10 percent when working, dropping to around 0.3 percent when swap > activity has subsided). This was with loading KDE 3.5.10, konversation, > aptitude -u, icedove and iceweasel with several dozen tabs in addition > to aplaymidi. That sounds good... so maybe we don't have to back compaction out of kswapd and the new kswapd+compaction logic is working fine without overloading the system like the older logic did. Ok so tomorrow I'll get all results on these 3 kernels ( compaction-kswapd-3+compaction_alloc_lowlat-2 vs compaction-no-kswapd-3+compaction_alloc_lowlat-2 vs compaction_alloc_lowlat2) on network server load, where throughput is measured in addition to latency. Then we'll have a better picture to decide. If throughput is decreased in any measurable way I still prefer compaction-no-kswapd and to rethink this later without hurry. The network load with jumbo frames from e1000 is very good at stressing compaction+kswapd so there should be no surprises if throughput and latency are ok with the new compaction kswapd logic.
I suggest you keep running with this combination (compaction-kswapd-3 + Mel's patch with the initialization fix that I sent you + kswapd-high_wmark) if you can, to be sure it works fine for you. Thanks a lot for now! Andrea ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-24 1:54 ` Andrea Arcangeli @ 2011-02-26 6:43 ` Andrea Arcangeli 2011-02-27 8:48 ` Arthur Marsh 0 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-26 6:43 UTC (permalink / raw) To: Arthur Marsh; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman On Thu, Feb 24, 2011 at 02:54:05AM +0100, Andrea Arcangeli wrote: > Ok so tomorrow I'll get all results on these 3 kernels ( > compaction-kswapd-3+compaction_alloc_lowlat-2 vs > compaction-no-kswapd-3+compaction_alloc_lowlat-2 vs > compaction_alloc_lowlat2) on network server load, where throughput is > measured in addition to latency. > Then we'll have a better picture to Latency is still lowest with compaction-no-kswapd-3. compaction_alloc still is at the top of the profiling with compaction-kswapd-3+compaction_alloc_lowlat-2. However compaction-kswapd-3 reduced the overhead somewhat but not enough to be as fast as with compaction-no-kswapd-3 (even if much better than before). So we should apply compaction-no-kswapd-3 to 2.6.38 I think. ==== Subject: compaction: fix high compaction latencies and remove compaction-kswapd From: Andrea Arcangeli <aarcange@redhat.com> It's safer to stop calling compaction from kswapd as that creates too high a load during memory pressure that can't be offset by the improved performance of compound allocations. NOTE: this is not related to THP (THP allocations use __GFP_NO_KSWAPD), this is only related to frequent and small order allocations that make kswapd go wild with compaction.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- --- a/mm/compaction.c +++ b/mm/compaction.c @@ -405,10 +423,7 @@ static int compact_finished(struct zone return COMPACT_COMPLETE; /* Compaction run is not finished if the watermark is not met */ - if (cc->compact_mode != COMPACT_MODE_KSWAPD) - watermark = low_wmark_pages(zone); - else - watermark = high_wmark_pages(zone); + watermark = low_wmark_pages(zone); watermark += (1 << cc->order); if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0)) @@ -421,15 +436,6 @@ static int compact_finished(struct zone if (cc->order == -1) return COMPACT_CONTINUE; - /* - * Generating only one page of the right order is not enough - * for kswapd, we must continue until we're above the high - * watermark as a pool for high order GFP_ATOMIC allocations - * too. - */ - if (cc->compact_mode == COMPACT_MODE_KSWAPD) - return COMPACT_CONTINUE; - /* Direct compactor: Is a suitable page free? */ for (order = cc->order; order < MAX_ORDER; order++) { /* Job done if page is free of the right migratetype */ @@ -551,8 +557,7 @@ static int compact_zone(struct zone *zon unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, - bool sync, - int compact_mode) + bool sync) { struct compact_control cc = { .nr_freepages = 0, @@ -561,7 +566,6 @@ unsigned long compact_zone_order(struct .migratetype = allocflags_to_migratetype(gfp_mask), .zone = zone, .sync = sync, - .compact_mode = compact_mode, }; INIT_LIST_HEAD(&cc.freepages); INIT_LIST_HEAD(&cc.migratepages); @@ -607,8 +611,7 @@ unsigned long try_to_compact_pages(struc nodemask) { int status; - status = compact_zone_order(zone, order, gfp_mask, sync, - COMPACT_MODE_DIRECT_RECLAIM); + status = compact_zone_order(zone, order, gfp_mask, sync); rc = max(status, rc); /* If a normal allocation would succeed, stop compacting */ @@ -639,7 +642,6 @@ static int compact_node(int nid) .nr_freepages = 0, .nr_migratepages = 0, .order = -1, - .compact_mode = 
COMPACT_MODE_DIRECT_RECLAIM, }; zone = &pgdat->node_zones[zoneid]; --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -11,9 +11,6 @@ /* The full zone was compacted */ #define COMPACT_COMPLETE 3 -#define COMPACT_MODE_DIRECT_RECLAIM 0 -#define COMPACT_MODE_KSWAPD 1 - #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; extern int sysctl_compaction_handler(struct ctl_table *table, int write, @@ -28,8 +25,7 @@ extern unsigned long try_to_compact_page bool sync); extern unsigned long compaction_suitable(struct zone *zone, int order); extern unsigned long compact_zone_order(struct zone *zone, int order, - gfp_t gfp_mask, bool sync, - int compact_mode); + gfp_t gfp_mask, bool sync); /* Do not skip compaction more than 64 times */ #define COMPACT_MAX_DEFER_SHIFT 6 @@ -74,8 +70,7 @@ static inline unsigned long compaction_s } static inline unsigned long compact_zone_order(struct zone *zone, int order, - gfp_t gfp_mask, bool sync, - int compact_mode) + gfp_t gfp_mask, bool sync) { return COMPACT_CONTINUE; } --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2385,7 +2385,6 @@ loop_again: * cause too much scanning of the lower zones. */ for (i = 0; i <= end_zone; i++) { - int compaction; struct zone *zone = pgdat->node_zones + i; int nr_slab; unsigned long balance_gap; @@ -2426,24 +2425,9 @@ loop_again: sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; - compaction = 0; - if (order && - zone_watermark_ok(zone, 0, - high_wmark_pages(zone), - end_zone, 0) && - !zone_watermark_ok(zone, order, - high_wmark_pages(zone), - end_zone, 0)) { - compact_zone_order(zone, - order, - sc.gfp_mask, false, - COMPACT_MODE_KSWAPD); - compaction = 1; - } - if (zone->all_unreclaimable) continue; - if (!compaction && nr_slab == 0 && + if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; /* ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-26 6:43 ` Andrea Arcangeli @ 2011-02-27 8:48 ` Arthur Marsh 0 siblings, 0 replies; 31+ messages in thread From: Arthur Marsh @ 2011-02-27 8:48 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Clemens Ladisch, alsa-user, linux-kernel, Mel Gorman Andrea Arcangeli wrote, on 26/02/11 17:13: > On Thu, Feb 24, 2011 at 02:54:05AM +0100, Andrea Arcangeli wrote: >> Ok so tomorrow I'll get all results on these 3 kernels ( >> compaction-kswapd-3+compaction_alloc_lowlat-2 vs >> compaction-no-kswapd-3+compaction_alloc_lowlat-2 vs >> compaction_alloc_lowlat2) on network server load, where throughput is >> measured in addition to latency. Then we'll have a better picture to > > Latency is still lowest with compaction-no-kswapd-3. compaction_alloc > still is at the top of the profiling with > compaction-kswapd-3+compaction_alloc_lowlat-2. However > compaction-kswapd-3 reduced the overhead somewhat but not enough to be > as fast as with compaction-no-kswapd-3 (even if much better than > before). > > So we should apply compaction-no-kswapd-3 to 2.6.38 I think. I built a kernel based on the git head of around 0800 UTC Saturday 2011-02-26 plus compaction-no-kswapd-3 (Andrea's) and compaction_alloc_lowlat-2 (Mel's) patches. The latency performance for MIDI playback is excellent and kswapd0 has reported 100 seconds of CPU time in 22 hours of the kernel running. However, I am now battling against the kernel freeing up memory when using icedove/thunderbird. top is reporting 74 MiB resident, 418 MiB virtual, and as I pause typing to watch top output, free RAM has gone up to 50 MiB and I'm stuck waiting for parts of icedove to swap back in so that I can get a response from my input (either scrolling a message when reading it or composing a message, where I am left waiting for the text that I typed to appear).
Normally on this machine with only 384 MiB RAM and (say) 2.6.35-2.6.37 kernels, free RAM will hover around the 5 MiB mark when lots of applications are open. I may wait when switching from one running application to another for it to respond, but once I am using that particular application it is sufficiently responsive not to be annoying. The aggressive reclamation of RAM by the kernel I built around 0800 UTC Saturday is working against the application currently in focus. I will try Andrea's and Mel's patches built against 2.6.38-rc6 over the next several hours to see how that compares. Arthur. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 16:24 ` Andrea Arcangeli 2011-02-23 16:36 ` Andrea Arcangeli 2011-02-23 16:55 ` Andrea Arcangeli @ 2011-02-23 17:10 ` Mel Gorman 2011-02-23 17:27 ` Andrea Arcangeli 2 siblings, 1 reply; 31+ messages in thread From: Mel Gorman @ 2011-02-23 17:10 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Arthur Marsh, Clemens Ladisch, alsa-user, linux-kernel On Wed, Feb 23, 2011 at 05:24:32PM +0100, Andrea Arcangeli wrote: > On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: > > OK, these patches applied together against upstream didn't cause a crash > > but I did observe: > > > > significant slowdowns of MIDI playback (moreso than in previous cases, > > and with less than 20 Meg of swap file in use); > > > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). > > > > If I should try only one of the patches or something else entirely, > > please let me know. > > Yes, with irq off, schedule won't run and need_resched won't get set. > Stepping back a little, how did you determine that isolate_migratepages was the major problem? In my initial tests using the irqsoff tracer (sampled for the duration of the test every few seconds and resetting the max latency each time), compaction_alloc() was a far worse source of problems and isolate_migratepages didn't even register. It might be that I'm not testing on large enough machines though. > So let's try this.
> > In case this doesn't fix I definitely give it up with compaction in > kswapd as GFP_ATOMIC will likely not get a huge benefit out of > compaction anyway because 1) the allocations from GFP_ATOMIC are > likely short lived, 2) the cost of the compaction scan loop + > migration may be higher than the benefit we get from jumbo frames > In another mail, I posted a patch that dealt with compaction_alloc after finding that IRQs were being disabled for millisecond lengths of time. That length of time for IRQs being disabled could account for the performance loss on the network load. Can you test the network load with it applied? > <SNIP> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 2011-02-23 17:10 ` Mel Gorman @ 2011-02-23 17:27 ` Andrea Arcangeli 2011-02-23 17:44 ` Mel Gorman 0 siblings, 1 reply; 31+ messages in thread From: Andrea Arcangeli @ 2011-02-23 17:27 UTC (permalink / raw) To: Mel Gorman; +Cc: Arthur Marsh, Clemens Ladisch, alsa-user, linux-kernel On Wed, Feb 23, 2011 at 05:10:47PM +0000, Mel Gorman wrote: > On Wed, Feb 23, 2011 at 05:24:32PM +0100, Andrea Arcangeli wrote: > > On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote: > > > OK, these patches applied together against upstream didn't cause a crash > > > but I did observe: > > > > > > significant slowdowns of MIDI playback (moreso than in previous cases, > > > and with less than 20 Meg of swap file in use); > > > > > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent). > > > > > > If I should try only one of the patches or something else entirely, > > > please let me know. > > > > Yes, with irq off, schedule won't run and need_resched won't get set. > > > > Stepping back a little, how did you determine that isolate_migratepages was the > major problem? In my initial tests using the irqsoff tracer (sampled for > the duration of the test every few seconds and resetting the max latency > each time), compaction_alloc() was a far worse source of problems and > isolate_migratepages didn't even register. It might be that I'm not testing > on large enough machines though. I think you're right compaction_alloc is a bigger problem. Your patch to isolate_freepages is a must-have and in the right direction. However I think having large areas set as PageBuddy may be common too, the irq latency source in isolate_migratepages I think needs fixing too. We must be guaranteed to release irqs after max N pages (where N is SWAP_CLUSTER_MAX in my last two patches).
> In another mail, I posted a patch that dealt with compaction_alloc after > finding that IRQs were being disabled for millisecond lengths of time. > That length of time for IRQs being disabled could account for the performance > loss on the network load. Can you test the network load with it applied? kswapd was also running at 100% on all CPUs in that test. The z1 that doesn't fix the latency source in compaction but that removes compaction from kswapd (a light/hackish version of compaction-no-kswapd-3 that I just posted) fixes the problem completely for the network load too. So clearly it's not only a problem we can fix in compaction, the irq latency will improve for sure, but we still get an overload from kswapd which is not ok I think. What I am planning to test on the network load is high-wmark+compaction_alloc_lowlat+compaction-kswapd-3 vs high-wmark+compaction_alloc_lowlat+compaction-no-kswapd-2. Is this ok? If you want, I can also test high-wmark+compaction_alloc_lowlat without compaction-kswapd-3/compaction-no-kswapd-2 but I think the irq-latency source in isolate_migratepages in the presence of large PageBuddy regions (after any large application started at boot quits) isn't ok. Also I think having kswapd at 100% CPU load isn't ok. So I doubt we should stop at compaction_alloc_lowlat. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-23 17:27               ` Andrea Arcangeli
@ 2011-02-23 17:44                 ` Mel Gorman
  2011-02-23 18:14                   ` Andrea Arcangeli
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2011-02-23 17:44 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Arthur Marsh, Clemens Ladisch, alsa-user, linux-kernel

On Wed, Feb 23, 2011 at 06:27:34PM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 23, 2011 at 05:10:47PM +0000, Mel Gorman wrote:
> > On Wed, Feb 23, 2011 at 05:24:32PM +0100, Andrea Arcangeli wrote:
> > > On Wed, Feb 23, 2011 at 04:17:44AM +1030, Arthur Marsh wrote:
> > > > OK, these patches applied together against upstream didn't cause a crash
> > > > but I did observe:
> > > >
> > > > significant slowdowns of MIDI playback (more so than in previous cases,
> > > > and with less than 20 Meg of swap file in use);
> > > >
> > > > kswapd0 sharing equal top place in CPU usage at times (e.g. 20 percent).
> > > >
> > > > If I should try only one of the patches or something else entirely,
> > > > please let me know.
> > >
> > > Yes, with irqs off, schedule won't run and need_resched won't get set.
> >
> > Stepping back a little, how did you determine that isolate_migratepages was
> > the major problem? In my initial tests using the irqsoff tracer (sampled for
> > the duration of the test every few seconds and resetting the max latency
> > each time), compaction_alloc() was a far worse source of problems and
> > isolate_migratepages didn't even register. It might be that I'm not testing
> > on large enough machines though.
>
> I think you're right that compaction_alloc is a bigger problem. Your patch
> to isolate_freepages is a must-have and in the right direction.
>

Nice one.

> However I think having large areas set as PageBuddy may be common too,
> the irq latency source in isolate_migratepages I think needs fixing
> too. We must be guaranteed to release irqs after max N pages (where N
> is SWAP_CLUSTER_MAX in my last two patches).
>

Your logic makes sense and I can see why it might not necessarily show
up in my tests. I was simply wondering if you spotted the problem
directly or from looking at the source.

> > In another mail, I posted a patch that dealt with compaction_alloc after
> > finding that IRQs were being disabled for millisecond lengths of time.
> > That length of time for IRQs being disabled could account for the
> > performance loss on the network load. Can you test the network load with
> > it applied?
>
> kswapd was also running at 100% on all CPUs in that test.
>

On the plus side, the patch I posted also reduces kswapd CPU time.
Graphing CPU usage over time, I saw the following;

http://www.csn.ul.ie/~mel/postings/compaction-20110223/kswapdcpu-smooth-hydra.ps

i.e. CPU usage of kswapd is also reduced. The graph is smoothed because
the raw figures are so jagged as to be almost impossible to read. The z1
patches and others could also further reduce it (I haven't measured it yet)
but I thought it was interesting that IRQs being disabled for long periods
also contributed so heavily to kswapd CPU usage.

> The z1 that doesn't fix the latency source in compaction but that
> removes compaction from kswapd (a light/hackish version of
> compaction-no-kswapd-3 that I just posted) fixes the problem
> completely for the network load too.
>

Ok. If necessary we can disable it entirely for this cycle but as I'm
seeing large sources of IRQ-disabled latency in compaction and
shrink_inactive_list, it'd be nice to get that ironed out while the
problem is obvious too.

> So clearly it's not only a problem we can fix in compaction, the irq
> latency will improve for sure, but we still get an overload from
> kswapd which is not ok I think.
>

Indeed not.
> What I am planning to test on the network load is
> high-wmark+compaction_alloc_lowlat+compaction-kswapd-3 vs
> high-wmark+compaction_alloc_lowlat+compaction-no-kswapd-2.
>
> Is this ok?

Sure, to see what the results are. I'm still hoping we can prove the
high-wmark unnecessary due to Rik's naks. His reasoning about the corner
cases it potentially introduces is hard, if not impossible, to disprove.

> If you want I can also test
> high-wmark+compaction_alloc_lowlat without
> compaction-kswapd-3/compaction-no-kswapd-2 but I think the irq-latency
> source in isolate_migratepages in the presence of large PageBuddy regions
> (after any large application started at boot quits) isn't ok.

Can you ditch all these patches in a directory somewhere because I'm
getting confused as to which patch is which exactly :)

> Also I think having kswapd at 100% cpu load isn't ok. So I doubt we
> should stop at compaction_alloc_lowlat.
>

kswapd at 100% CPU is certainly unsuitable, but I'd like to be sure we
are getting it down the right way without reintroducing the problems
this 8*high_wmark check fixed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0
  2011-02-23 17:44                 ` Mel Gorman
@ 2011-02-23 18:14                   ` Andrea Arcangeli
  0 siblings, 0 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2011-02-23 18:14 UTC (permalink / raw)
To: Mel Gorman; +Cc: Arthur Marsh, Clemens Ladisch, alsa-user, linux-kernel

On Wed, Feb 23, 2011 at 05:44:37PM +0000, Mel Gorman wrote:
> Your logic makes sense and I can see why it might not necessarily show
> up in my tests. I was simply wondering if you spotted the problem
> directly or from looking at the source.

I looked at the profiling and then at the source, but compaction_alloc is
on top, so it matches your findings.

This is with z1:

 Samples  % of Total  Cum. Samples  Cum. % of Total  module:function
 --------------------------------------------------------------------------
  177786    6.178        177786        6.178         sunrpc:svc_recv
  128779    4.475        306565       10.654         sunrpc:svc_xprt_enqueue
   80786    2.807        387351       13.462         vmlinux:__d_lookup
   62272    2.164        449623       15.626         ext4:ext4_htree_store_dirent
   55896    1.942        505519       17.569         jbd2:journal_clean_one_cp_list
   43868    1.524        549387       19.093         vmlinux:task_rq_lock
   43572    1.514        592959       20.608         vmlinux:kfree
   37620    1.307        630579       21.915         vmlinux:mwait_idle
   36169    1.257        666748       23.172         vmlinux:schedule
   34037    1.182        700785       24.355         e1000:e1000_clean
   31945    1.110        732730       25.465         vmlinux:find_busiest_group
   31491    1.094        764221       26.560         qla2xxx:qla24xx_intr_handler
   30681    1.066        794902       27.626         vmlinux:_atomic_dec_and_lock
    7425    0.258        xxxxxx       xxxxxx         vmlinux:get_page_from_freelist

This is with the current compaction logic in kswapd:

 Samples  % of Total  Cum. Samples  Cum. % of Total  module:function
 --------------------------------------------------------------------------
 1182928   17.358       1182928       17.358         vmlinux:get_page_from_freelist
  657802    9.652       1840730       27.011         vmlinux:free_pcppages_bulk
  579976    8.510       2420706       35.522         sunrpc:svc_xprt_enqueue
  508953    7.468       2929659       42.991         sunrpc:svc_recv
  490538    7.198       3420197       50.189         vmlinux:compaction_alloc
  188620    2.767       3608817       52.957         vmlinux:tg_shares_up
   97527    1.431       3706344       54.388         vmlinux:__d_lookup
   85670    1.257       3792014       55.646         jbd2:journal_clean_one_cp_list
   71738    1.052       3863752       56.698         vmlinux:mutex_spin_on_owner
   71037    1.042       3934789       57.741         vmlinux:kfree

So clearly your patch may increase performance too (because of less
contention on the spinlock), but it's unlikely to make compaction_alloc go
away from the profiling. This isn't measuring irq latency, just the time
the CPU spent in each function, but the two issues are connected (the more
we call into that function, the higher the probability of running into the
high-latency loop once in a while).

> On the plus side, the patch I posted also reduces kswapd CPU time.
> Graphing CPU usage over time, I saw the following;
>
> http://www.csn.ul.ie/~mel/postings/compaction-20110223/kswapdcpu-smooth-hydra.ps
>
> i.e. CPU usage of kswapd is also reduced. The graph is smoothed because
> the raw figures are so jagged as to be almost impossible to read. The z1
> patches and others could also further reduce it (I haven't measured it yet)
> but I thought it was interesting that IRQs being disabled for long periods
> also contributed so heavily to kswapd CPU usage.

I think lower contention on the heavily used zone lock may have contributed
to decreasing the overall system load if it's a large SMP; I'm not sure why
kswapd usage went down though. No problem, so I will also test a third
kernel with your patch alone.

> Ok.
> If necessary we can disable it entirely for this cycle but as I'm
> seeing large sources of IRQ-disabled latency in compaction and
> shrink_inactive_list, it'd be nice to get that ironed out while the
> problem is obvious too.

Sure. The current kswapd code helps to find any latency issue in
compaction ;). In fact they went totally unnoticed until we enabled it in
kswapd.

> Sure, to see what the results are. I'm still hoping we can prove the
> high-wmark unnecessary due to Rik's naks. His reasoning about the corner
> cases it potentially introduces is hard, if not impossible, to disprove.

In my evaluation, shrinking more on the small lists was worse for overall
zone lru balancing; that's the side effect of that change. But I'm not
against changing it to high+min like he suggested, this was just simpler
for now. I've seen your patch too, and that's ok with me, but because I
don't see exactly the rationale for why it's a problem, I don't like
things that I'm uncertain about and I find the removal of the *8 simpler.

> Can you ditch all these patches in a directory somewhere because I'm
> getting confused as to which patch is which exactly :)

Ok.... Let me finish sending the 3 kernels to test.

> kswapd at 100% CPU is certainly unsuitable but I'd like to be sure we
> are getting it down the right way without reintroducing the problems
> this 8*high_wmark check fixed.

Well, the 8*high was never related to high kswapd load; it simply has the
effect that more memory is free when kswapd stops. It's very quick at
reaching 700m free, then it behaves identically to when only ~100m are
free (like now without the *8).

About kswapd: the current logic is clearly not ok in certain workloads (my
fault), so my attempt at fixing it is compaction-kswapd-3. I think the
primary problem is that kswapd won't stop after the first invocation of
compaction if there's any fragmentation in any zone (it could even be a
tiny dma zone). So this will fix it.
But it'll still cause one compaction invocation for every new order > 0
allocation (no big deal for the dma zone as it's small). If you check the
vmscan.c change of compaction-kswapd-2, I think it has a better chance to
work now. (I also noticed __compaction_need_reclaim doesn't need the "int
order" parameter, but you can ignore it; it's harmless and not worth
fixing until we know if this helps.)

If even this fails, that means calling compaction even a single time for
each kswapd wakeup (in addition to direct compaction) is too much. The
next step would be kswapd.max_order-- until it reaches zero, so compaction
stops being called unless direct compaction is invoked too. But we can try
this later.

compaction-no-kswapd-3 + your compaction_alloc_lowlat should fix the
problem, and it's a good thing kswapd misbehaved so we noticed the latency
issues.

^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2011-02-27 8:48 UTC | newest]
Thread overview: 31+ messages
-- links below jump to the message on this page --
[not found] <g0ia38-jj6.ln1@ppp121-45-136-118.lns11.adl6.internode.on.net>
2011-02-22 7:37 ` [Alsa-user] new source of MIDI playback slow-down identified - 5a03b051ed87e72b959f32a86054e1142ac4cf55 thp: use compaction in kswapd for GFP_ATOMIC order > 0 Clemens Ladisch
2011-02-22 7:46 ` Arthur Marsh
2011-02-22 13:40 ` Andrea Arcangeli
2011-02-22 16:15 ` Andrea Arcangeli
2011-02-22 16:59 ` Mel Gorman
2011-02-22 17:08 ` Andrea Arcangeli
2011-02-22 17:37 ` Mel Gorman
2011-02-22 17:47 ` Arthur Marsh
2011-02-22 19:43 ` Andrea Arcangeli
2011-02-23 9:15 ` Mel Gorman
2011-02-23 11:41 ` Arthur Marsh
2011-02-23 13:50 ` Clemens Ladisch
2011-02-23 17:01 ` Mel Gorman
2011-02-23 17:40 ` Andrea Arcangeli
2011-02-23 16:24 ` Andrea Arcangeli
2011-02-23 16:36 ` Andrea Arcangeli
2011-02-23 16:40 ` Andrea Arcangeli
2011-02-23 16:47 ` Andrea Arcangeli
2011-02-23 16:55 ` Andrea Arcangeli
2011-02-23 20:07 ` Arthur Marsh
2011-02-23 21:25 ` Andrea Arcangeli
2011-02-23 21:55 ` Arthur Marsh
2011-02-23 23:59 ` Andrea Arcangeli
2011-02-24 1:40 ` Arthur Marsh
2011-02-24 1:54 ` Andrea Arcangeli
2011-02-26 6:43 ` Andrea Arcangeli
2011-02-27 8:48 ` Arthur Marsh
2011-02-23 17:10 ` Mel Gorman
2011-02-23 17:27 ` Andrea Arcangeli
2011-02-23 17:44 ` Mel Gorman
2011-02-23 18:14 ` Andrea Arcangeli