Re: [LSF/MM TOPIC ATTEND] - Andrea Arcangeli

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrea Arcangeli <aarcange@redhat.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>, linux-mm@kvack.org
Subject: Re: [LSF/MM TOPIC ATTEND]
Date: Thu, 15 Jan 2015 21:58:43 +0100	[thread overview]
Message-ID: <20150115205843.GS6103@redhat.com> (raw)
In-Reply-To: <20150115140654.GG7000@dhcp22.suse.cz>

Hi Michal,

On Thu, Jan 15, 2015 at 03:06:54PM +0100, Michal Hocko wrote:
> 
> > http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=a0fcf2323b2e4cffd750c1abc1d2c138acdefcc8
> 
> I am not sure about this one because TIF_MEMDIE is there to give an
> access to memory reserves. GFP_NOFAIL shouldn't mean the same because
> then it would be much harder to "guarantee" that the reserves wouldn't
> be depleted completely. So I do not like this much. Besides that I think
> that GFP_NOFAIL allocation blocking OOM victim is a plain bug.
> grow_dev_page is relying on GFP_NOFAIL but I am wondering whether ext4
> can do something to pre-allocate so that it doesn't have to call it.

Well this is just the longstanding GFP_NOFAIL livelock, it always
existed deep down in the buffer header allocation even before
GFP_NOFAIL existed. GFP_NOFAIL just generalized the livelocking
concept.

There's no proper fix for that other than to teach the filesystem to
deal with allocation errors and remove GFP_NOFAIL (in this case
__GFP_NOFAIL was set:

 #0 get_page_from_freelist (gfp_mask=0x20858, nodemask=0x0
  <irq_stack_union>, order=0x0, zonelist=0xffff88007fffc100, hi
  gh_zoneidx=0x2, alloc_flags=0xc0, preferred_zone=0xffff88007fffa840,
  classzone_idx=classzone_idx@entry=0x1,
  migratetype=migratetype@entry=0x2) at mm/page_alloc.c:1953

gfp_mask=0x20858  & 0x800u = 0x800

If we're OOM and GFP_NOFAIL actually fails to allocate memory, this
patch simply tries to mitigate the potential livelock by giving it a
chance to use the memory reserves that are normally used only for high
priority allocations.

If __GFP_NOFAIL hits OOM I think it's fair to say it is very high
priority (more high priority than a GFP_ATOMIC or something that can
fail totally gracefully).

So the above second patch looks quite safe to me conceptually as well:
at least we put those last 50M of ram to good use instead of
livelocking while 50M are still free.

> > http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=798b7f9d549664f8c0007c6416a2568eedd75d6a
> 
> I think this should be fixed in the filesystem rather than paper over
> it.

No doubt, this third patch basically undoes the fix in the first
patch. This third patch makes __GFP_FS again not failing if invoked in
kernel thread context where TIF_MEMDIE cannot ever be set.

However there was no way I could run without this third patch on my
own production systems with a potential ext4 mounting itself readonly
on me during OOM killing (as result of the livelock fix in the first
patch).

Ideally the ext4 developer should reverse this third patch (which must
be keep separated from the first patch exactly for this reason) and
start an OOM killing loop to reproduce and fix this so then we can
reverse the third patch.

In short:

1) first patch makes !__GFP_FS not equivalent to __GFP_NOFAIL anymore
   (when invoked by kernel threads where TIF_MEMDIE cannot be set)

2) second patch deals with a genuine __GFP_NOFAIL livelock using the
   memory reserves (this is orthogonal with 1)

3) third patch undoes 1 and uses the memory reserves for !__GFP_FS too
   like patch 2 used them to mitigate the genuine __GFP_NOFAIL
   deadlock.  Undoing patch 1 is needed because patch 1 causes ext4 to
   remount itself readonly and complain about metadata corruption.

I later tested further the ext4 trouble after applying only 1 and it
seems ext4 thinks it's corrupted, but e2fsck -f shows it's actually
clean. So it's probably an in-memory issue only, but still having ext4
remounting itself readonly during a OOM killing isn't exactly
acceptable or graceful (until it is fixed). Hence the reason for 3.

Of course it took a long time before the trouble with patch 1 seen the
light, in fact first I hit the genuine __GFP_NOFAIL deadlock fixed by
2 before I could ever hit the ext4 error paths.

Let me know what you'd like me to submit, I just don't think
submitting only the first patch as you suggested is safe idea.

I also think allowing __GFP_NOFAIL to access the emergency reserves is
ok if __GFP_NOFAIL is hitting an OOM condition (what else could be
more urgent than succeeding a potentially livelocking __GFP_NOFAIL?).

I think the combination of the 3 patches is safe and in practice it
solves all OOM related livelocks I run into. It also allows ext4
developers to trivially (git reverse #ofpatch3) to fix their bugs and
then we can reverse the third patch upstream so !__GFP_FS allocations
from kernel threads becomes theoretically safe too. As opposed
__GFP_NOFAIL is never theoretically safe but that's much harder to fix
than the already existing ext4 error paths that aren't using
__GFP_NOFAIL but that haven't been properly exercised, simply because
they couldn't be exercised without patch 1 applied (kernel thread
allocations without __GFP_FS set cannot fail currently and making them
fail by applying patch1, exercises those untested error paths for the
first time).

Thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2015-01-15 20:58 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-06 16:14 [LSF/MM TOPIC ATTEND] Michal Hocko
2015-01-06 23:27 ` Greg Thelen
2015-01-07 14:28   ` Michal Hocko
2015-01-07 18:54     ` Greg Thelen
2015-01-07 19:00     ` Greg Thelen
2015-01-14 21:27     ` Andrea Arcangeli
2015-01-15 14:06       ` Michal Hocko
2015-01-15 20:58         ` Andrea Arcangeli [this message]
2015-01-07  8:58 ` Vladimir Davydov
2015-01-07 14:38   ` Michal Hocko
2015-01-08  8:33     ` Vladimir Davydov
2015-01-08  9:09       ` Michal Hocko
2015-02-02  8:37 ` [LSF/MM TOPIC ATTEND] - THP benefits Vlastimil Babka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150115205843.GS6103@redhat.com \
    --to=aarcange@redhat.com \
    --cc=gthelen@google.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).