From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michel Lespinasse <walken@google.com>,
Sasha Levin <sasha.levin@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Dave Jones <davej@redhat.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Vlastimil Babka <vbabka@suse.cz>, Bob Liu <lliubbo@gmail.com>
Subject: Re: mm: kernel BUG at mm/huge_memory.c:1829!
Date: Thu, 10 Apr 2014 18:27:50 +0200 [thread overview]
Message-ID: <20140410162750.GD2749@redhat.com> (raw)
In-Reply-To: <20140410134436.GA25933@node.dhcp.inet.fi>
Hi,
On Thu, Apr 10, 2014 at 04:44:36PM +0300, Kirill A. Shutemov wrote:
> Okay, below is my attempt to fix the bug. I'm not entirely sure it's
> correct. Andrea, could you take a look?
The possibility the interval tree implicitly broke the walk order of
the anon_vma list didn't cross my mind, that's very good catch!
Breakage of the rmap walk order definitely can explain that BUG_ON in
split_huge_page that signals a pte was missed by the rmap walk.
Because this bug only fired on split_huge_page I guess you assumed I
introduced this dependency on order with THP. But it's not actually
the case, there was a place in the VM that already depended on perfect
rmap walk. This is all about providing a perfect rmap walk. A perfect
rmap walk is one where missing a pte is fatal. That other place that
already existed before THP is migrate.c.
The other thing that could break the perfect rmap_walk in additon to a
wrong rmap_walk order, is the exec path where we do an mremap without
a vma covering the destination range (no vma means, no possible
perfect rmap_walk as we need the vma to reach the pte) but that was
handled by other means (see the invalid_migration_vma in migrate.c and
the other checks for is_vma_temporary_stack in huge_memory.c, THP
didn't need to handle it but migrate.c had to). Places using
is_vma_temporary_stack can tell you the cases where a perfect rmap
walk is required and I'm not aware of other locations other than these
two.
split_huge_page might be more pedantic in making sure a pte wasn't
missed (I haven't checked in detail to tell how migrate.c would behave
in such case, split_huge_page just BUG_ON).
So I doubt making a local fix to huge_memory.c is enough, at least
migrate.c (i.e. rmap_walk_anon) should be handled too somehow.
While I'm positive the breakge of rmap_walk order explains the BUG_ON
with trinity (as your forking testcase also shows), I'm quite
uncomfortable to depend on comparison on atomic mapcount that is an
atomic for a reason, to know if the order went wrong and we shall
repeat the loop because fork created a pte we missed.
The commit message doesn't explain how do you guarantee mapcount
cannot change from under us in a racey way. If we go with this fix,
I'd suggest to explain that crucial point about the safety of the
page->mapcount comparison in that code path, in the commit message. It
may be safe by other means! I'm not saying it's definitely not safe,
but it's at least not obviously safe as it looks like the atomic
mapcount could change while it is being read, and there was no obvious
explaination of how it is safe despite it is not stable.
The other downside is that an infinite loop in kernel mode with no
debug message printed, would make this less debuggable too if a real
functional bug hits (not as result of race because of rmap walk
ordering breakage).
I assume the interval tree ordering cannot be fixed, but I'd recommend
to look closer into that possibility too before ruling it out.
Thanks!
Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-04-10 16:32 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-08 14:37 mm: kernel BUG at mm/huge_memory.c:1829! Sasha Levin
2014-04-10 8:45 ` Bob Liu
2014-04-10 14:37 ` Dave Jones
2014-04-10 14:41 ` Sasha Levin
2014-04-10 10:25 ` Kirill A. Shutemov
2014-04-10 13:44 ` Kirill A. Shutemov
2014-04-10 16:27 ` Andrea Arcangeli [this message]
2014-04-14 14:42 ` Kirill A. Shutemov
2014-04-17 0:31 ` Andrea Arcangeli
2014-04-10 11:25 ` Kirill A. Shutemov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140410162750.GD2749@redhat.com \
--to=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=davej@redhat.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lliubbo@gmail.com \
--cc=sasha.levin@oracle.com \
--cc=vbabka@suse.cz \
--cc=walken@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).