From: Vlastimil Babka <vbabka@suse.cz>
To: "P. Christeas" <xrg@linux.gr>
Cc: linux-mm@kvack.org, Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	lkml <linux-kernel@vger.kernel.org>,
	David Rientjes <rientjes@google.com>,
	Norbert Preining <preining@logic.at>,
	Markus Trippelsdorf <markus@trippelsdorf.de>,
	Pavel Machek <pavel@ucw.cz>
Subject: Re: Early test: hangs in mm/compact.c w. Linus's 12d7aacab56e9ef185c
Date: Sat, 08 Nov 2014 23:18:37 +0100
Message-ID: <545E96BD.5040103@suse.cz>
In-Reply-To: <3443150.6EQzxj6Rt9@xorhgos3.pefnos>

On 11/08/2014 02:11 PM, P. Christeas wrote:
> On Thursday 06 November 2014, Vlastimil Babka wrote:
>>> On Wednesday 05 November 2014, Vlastimil Babka wrote:
>>>> Can you please try the following patch?
>>>> -			compaction_defer_reset(zone, order, false);
>> Oh and did I ask in this thread for /proc/zoneinfo yet? :)
> 
> Using that same kernel[1], got again into a race, gathered a few more data.
> 
> This time, I had 1x "urpmq" process [2] hung at 100% CPU , when "kwin" got 
> apparently blocked (100% CPU, too) trying to resize a GUI window. I suppose 
> the resizing operation would mean heavy memory alloc/free.
> 
> The rest of the system was responsive, I could easily get a console, login, 
> gather the files.. Then, I have *killed* -9 the "urpmq" process, which solved 
> the race and my system is still alive! "kwin" is still running, returned to 
> regular CPU load.
> 
> Attached is traces from SysRq+l (pressed a few times, wanted to "snapshot" the 
> stack) and /proc/zoneinfo + /proc/vmstat
> 
> Bisection is not yet meaningful, IMHO, because I cannot be sure that "good" 
> points are really free from this issue. I'd estimate that each test would take 
> +3days, unless I really find a deterministic way to reproduce the issue .

Hi,

I think I finally found the cause by staring into the code... CCing
people from all 4 separate threads I know of about this issue.
What made the cause hard to find was that the first report I got from
Markus was about isolate_freepages_block() overhead, and later Norbert
reported that reverting a patch for isolate_freepages* helped. But the
actual problem seems to be in the migration scanner: the loop in
isolate_migratepages() exits because the scanners almost meet (they are
within the same pageblock), but they don't truly meet. Therefore
compact_finished() decides to continue, and the next call to
isolate_migratepages() exits immediately again... boom! Still, e14c720efdd7
did make this situation possible, as the free scanner pfn can now point to
the middle of a pageblock.
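
The scenario described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not kernel code:
struct scan_state, migrate_step_prepatch() and calls_until_finished() are
hypothetical names, the pageblock size of 512 pages is assumed, and each
call pretends that a whole pageblock is scanned in one step. It shows that
once the free scanner sits in the middle of the pageblock the migration
scanner is in, the pre-patch restart point never advances and the
"scanners met" check never becomes true.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed pageblock size in pages; the real value depends on the config. */
#define PAGEBLOCK_NR_PAGES 512UL

/* Hypothetical stand-in for the scanner positions in struct compact_control. */
struct scan_state {
	unsigned long migrate_pfn;	/* migration scanner, moves upwards */
	unsigned long free_pfn;		/* free scanner, moves downwards */
};

/* End of the pageblock containing pfn, i.e. ALIGN(pfn + 1, pageblock size). */
static unsigned long pageblock_end(unsigned long pfn)
{
	return (pfn + PAGEBLOCK_NR_PAGES) & ~(PAGEBLOCK_NR_PAGES - 1);
}

/* Same comparison that compact_finished() uses to detect met scanners. */
static bool scanners_met(const struct scan_state *s)
{
	return s->free_pfn <= s->migrate_pfn;
}

/* Simplified model of one pre-patch isolate_migratepages() call. */
static void migrate_step_prepatch(struct scan_state *s)
{
	unsigned long low_pfn = s->migrate_pfn;
	unsigned long end_pfn = pageblock_end(low_pfn);

	assert(!scanners_met(s));	/* only called while compaction runs */

	/* Only whole pageblocks are scanned, never crossing the free scanner. */
	if (end_pfn <= s->free_pfn)
		low_pfn = end_pfn;	/* pretend the block was scanned fully */

	/* Pre-patch restart point: unchanged if no pageblock was scanned. */
	s->migrate_pfn = low_pfn;
}

/* Number of calls until the scanners meet, or -1 after max_calls attempts. */
static int calls_until_finished(struct scan_state s, int max_calls)
{
	for (int calls = 0; calls < max_calls; calls++) {
		if (scanners_met(&s))
			return calls;
		migrate_step_prepatch(&s);
	}
	return -1;
}
```

With migrate_pfn = 1024 and free_pfn = 1300 the model gives up after any
number of attempts, because the pageblock ends at 1536, beyond the free
scanner. With a pageblock-aligned free_pfn such as 1536 the scanners meet
normally.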

So I hope the attached patch will fix the soft-lockup issues in
compact_zone. Please apply on 3.18-rc3 or later without any other reverts,
and test. It probably won't help Markus and his isolate_freepages_block()
overhead though...

Thanks,
Vlastimil

------8<------

From fbf8eb0bcd2897090312e23da6a31bad9cc6b337 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Sat, 8 Nov 2014 22:20:43 +0100
Subject: [PATCH] mm, compaction: prevent endless loop in migrate scanner

---
 mm/compaction.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ec74cf0..1b7a1be 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1029,8 +1029,12 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	}
 
 	acct_isolated(zone, cc);
-	/* Record where migration scanner will be restarted */
-	cc->migrate_pfn = low_pfn;
+	/* 
+	 * Record where migration scanner will be restarted. If we end up in
+	 * the same pageblock as the free scanner, make the scanners fully
+	 * meet so that compact_finished() terminates compaction.
+	 */
+	cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
 
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
-- 
2.1.2
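
To check the effect of the changed assignment in isolation, the ternary
from the patch can be pulled out into a standalone helper. This is a
sketch only: migrate_restart_pfn() and compaction_finished() are
hypothetical names that do not exist in mm/compaction.c, and the pfn
values used with them are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the patched assignment:
 *   cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
 *
 * low_pfn  - where the migration scanner stopped
 * end_pfn  - end of the pageblock the scanner was about to scan
 * free_pfn - current position of the free scanner
 */
static unsigned long migrate_restart_pfn(unsigned long low_pfn,
					 unsigned long end_pfn,
					 unsigned long free_pfn)
{
	assert(low_pfn <= end_pfn);

	/* Pageblock lies fully below the free scanner: restart where we stopped. */
	if (end_pfn <= free_pfn)
		return low_pfn;

	/* Same pageblock as the free scanner: make the scanners fully meet. */
	return free_pfn;
}

/* Same comparison that compact_finished() uses to detect met scanners. */
static bool compaction_finished(unsigned long migrate_pfn,
				unsigned long free_pfn)
{
	return free_pfn <= migrate_pfn;
}
```

For example, with low_pfn = 1024, end_pfn = 1536 and free_pfn = 1300 the
helper returns 1300, so the finished check succeeds. With low_pfn = 600,
end_pfn = 1024 and free_pfn = 2048 it returns 600 and compaction continues
as before.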


