Re: [3.6 regression?] THP + migration/compaction livelock (I think)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@suse.de>
To: David Rientjes <rientjes@google.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Marc Duponcheel <marc@offline.be>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [3.6 regression?] THP + migration/compaction livelock (I think)
Date: Wed, 14 Nov 2012 10:01:54 +0000	[thread overview]
Message-ID: <20121114100154.GI8218@suse.de> (raw)
In-Reply-To: <alpine.DEB.2.00.1211131530020.17623@chino.kir.corp.google.com>

On Tue, Nov 13, 2012 at 03:41:02PM -0800, David Rientjes wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
> 
> > It just happened again.
> > 
> > $ grep -E "compact_|thp_" /proc/vmstat
> > compact_blocks_moved 8332448774
> > compact_pages_moved 21831286
> > compact_pagemigrate_failed 211260
> > compact_stall 13484
> > compact_fail 6717
> > compact_success 6755
> > thp_fault_alloc 150665
> > thp_fault_fallback 4270
> > thp_collapse_alloc 19771
> > thp_collapse_alloc_failed 2188
> > thp_split 19600
> > 
> 
> Two of the patches from the list provided at
> http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3 
> kernel:
> 
> 	mm: compaction: abort compaction loop if lock is contended or run too long
> 	mm: compaction: acquire the zone->lock as late as possible
> 
> and all have not made it to the 3.6 stable kernel yet, so would it be 
> possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will 
> indicate that the entire series is a candidate to backport to 3.6.

Thanks David once again.

The full list of compaction-related patches I believe are necessary for
this particular problem are

e64c5237cf6ff474cb2f3f832f48f2b441dd9979 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f4e30fbd97b3c0574d8cac7a83903c9bc7 mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb9b829110cb68c18ea05a56be39f75a4d2 mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402aa044b55c2d30ab0ed9405693ef06fb07c mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e mm: compaction: acquire the zone->lock as late as possible
753341a4b85ff337487b9959c71c529f522004f4 revert "mm: have order > 0 compaction start off where it left"
bb13ffeb9f6bfeb301443994dfbf29f91117dfb3 mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511ab2f8fe2b47585e60da8af7fd213ec877e mm: compaction: Restart compaction from near where it left off
62997027ca5b3d4618198ed8b1aba40b61b1137b mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7e25f96e2c6da925c002badf6f144ddf30 mm: compaction: correct the nr_strict va isolated check for CMA

If we can get confirmation that these fix the problem in 3.6 kernels then
I can backport them to -stable. This fixing a problem where "many processes
stall, all in an isolation-related function". This started happening after
lumpy reclaim was removed because we depended on that to aggressively
reclaim with less compaction. Now compaction is depended upon more.

The full 3.7-rc5 kernel has a different problem on top of this and it's
important the problems do not get conflacted. It has these fixes *but*
GFP_NO_KSWAPD has been removed and there is a patch that scales reclaim
with THP failures that is causing problem. With them, kswapd can get
stuck in a 100% loop where it is neither reclaiming nor reaching its exit
conditions. The correct fix would be to identify why this happens but I
have not got around to it yet. To test with 3.7-rc5 then apply either

1) https://lkml.org/lkml/2012/11/5/308
2) https://lkml.org/lkml/2012/11/12/113

or

1) https://lkml.org/lkml/2012/11/5/308
3) https://lkml.org/lkml/2012/11/12/151

on top of 3.7-rc5. So it's a lot of work but there are three tests I'm
interested in hearing about. The results of each determine what happens
in -stable or mainline

Test 1: 3.6 + the last of commits above	(should fix processes stick in isolate)
Test 2: 3.7-rc5 + (1+2) above (should fix kswapd stuck at 100%)
Test 3: 3.7-rc5 + (1+3) above (should fix kswapd stuck at 100% but better)

Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Mel Gorman <mgorman@suse.de>
To: David Rientjes <rientjes@google.com>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Marc Duponcheel <marc@offline.be>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [3.6 regression?] THP + migration/compaction livelock (I think)
Date: Wed, 14 Nov 2012 10:01:54 +0000	[thread overview]
Message-ID: <20121114100154.GI8218@suse.de> (raw)
In-Reply-To: <alpine.DEB.2.00.1211131530020.17623@chino.kir.corp.google.com>

On Tue, Nov 13, 2012 at 03:41:02PM -0800, David Rientjes wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
> 
> > It just happened again.
> > 
> > $ grep -E "compact_|thp_" /proc/vmstat
> > compact_blocks_moved 8332448774
> > compact_pages_moved 21831286
> > compact_pagemigrate_failed 211260
> > compact_stall 13484
> > compact_fail 6717
> > compact_success 6755
> > thp_fault_alloc 150665
> > thp_fault_fallback 4270
> > thp_collapse_alloc 19771
> > thp_collapse_alloc_failed 2188
> > thp_split 19600
> > 
> 
> Two of the patches from the list provided at
> http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3 
> kernel:
> 
> 	mm: compaction: abort compaction loop if lock is contended or run too long
> 	mm: compaction: acquire the zone->lock as late as possible
> 
> and all have not made it to the 3.6 stable kernel yet, so would it be 
> possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will 
> indicate that the entire series is a candidate to backport to 3.6.

Thanks David once again.

The full list of compaction-related patches I believe are necessary for
this particular problem are

e64c5237cf6ff474cb2f3f832f48f2b441dd9979 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f4e30fbd97b3c0574d8cac7a83903c9bc7 mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb9b829110cb68c18ea05a56be39f75a4d2 mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402aa044b55c2d30ab0ed9405693ef06fb07c mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e mm: compaction: acquire the zone->lock as late as possible
753341a4b85ff337487b9959c71c529f522004f4 revert "mm: have order > 0 compaction start off where it left"
bb13ffeb9f6bfeb301443994dfbf29f91117dfb3 mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511ab2f8fe2b47585e60da8af7fd213ec877e mm: compaction: Restart compaction from near where it left off
62997027ca5b3d4618198ed8b1aba40b61b1137b mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7e25f96e2c6da925c002badf6f144ddf30 mm: compaction: correct the nr_strict va isolated check for CMA

If we can get confirmation that these fix the problem in 3.6 kernels then
I can backport them to -stable. This fixing a problem where "many processes
stall, all in an isolation-related function". This started happening after
lumpy reclaim was removed because we depended on that to aggressively
reclaim with less compaction. Now compaction is depended upon more.

The full 3.7-rc5 kernel has a different problem on top of this and it's
important the problems do not get conflacted. It has these fixes *but*
GFP_NO_KSWAPD has been removed and there is a patch that scales reclaim
with THP failures that is causing problem. With them, kswapd can get
stuck in a 100% loop where it is neither reclaiming nor reaching its exit
conditions. The correct fix would be to identify why this happens but I
have not got around to it yet. To test with 3.7-rc5 then apply either

1) https://lkml.org/lkml/2012/11/5/308
2) https://lkml.org/lkml/2012/11/12/113

or

1) https://lkml.org/lkml/2012/11/5/308
3) https://lkml.org/lkml/2012/11/12/151

on top of 3.7-rc5. So it's a lot of work but there are three tests I'm
interested in hearing about. The results of each determine what happens
in -stable or mainline

Test 1: 3.6 + the last of commits above	(should fix processes stick in isolate)
Test 2: 3.7-rc5 + (1+2) above (should fix kswapd stuck at 100%)
Test 3: 3.7-rc5 + (1+3) above (should fix kswapd stuck at 100% but better)

Thanks.

-- 
Mel Gorman
SUSE Labs

next prev parent reply	other threads:[~2012-11-14 10:01 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-13 22:13 [3.6 regression?] THP + migration/compaction livelock (I think) Andy Lutomirski
2012-11-13 22:13 ` Andy Lutomirski
2012-11-13 23:11 ` David Rientjes
2012-11-13 23:11   ` David Rientjes
2012-11-13 23:25   ` Andy Lutomirski
2012-11-13 23:25     ` Andy Lutomirski
2012-11-13 23:41     ` David Rientjes
2012-11-13 23:41       ` David Rientjes
2012-11-13 23:45       ` Andy Lutomirski
2012-11-13 23:45         ` Andy Lutomirski
2012-11-13 23:54         ` David Rientjes
2012-11-13 23:54           ` David Rientjes
2012-11-14  1:22           ` Marc Duponcheel
2012-11-14  1:22             ` Marc Duponcheel
2012-11-14  1:51             ` David Rientjes
2012-11-14  1:51               ` David Rientjes
2012-11-14 13:21               ` Marc Duponcheel
2012-11-14 13:21                 ` Marc Duponcheel
2012-11-14 10:01       ` Mel Gorman [this message]
2012-11-14 10:01         ` Mel Gorman
2012-11-14 13:29         ` Marc Duponcheel
2012-11-14 13:29           ` Marc Duponcheel
2012-11-14 21:50           ` David Rientjes
2012-11-14 21:50             ` David Rientjes
2012-11-15  1:14             ` Marc Duponcheel
2012-11-15  1:14               ` Marc Duponcheel
2012-11-17  0:18               ` Marc Duponcheel
2012-11-18 22:55                 ` David Rientjes
2012-11-18 22:55                   ` David Rientjes
2012-12-05 19:23                   ` Andy Lutomirski
2012-12-05 19:23                     ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121114100154.GI8218@suse.de \
    --to=mgorman@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@amacapital.net \
    --cc=marc@offline.be \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.